HowTo: Building a Personal Index for Data Discovery

The “deceptively simple” nature of the Web (a Document Network layered atop the Internet) starts and ends with the power unleashed by the follow-your-nose exploration and discovery pattern enabled by hyperlinks (specifically, HTTP URIs).

Unfortunately, following your nose and sharing links also creates a common problem: an inability to find or recall relevant information with the same degree of simplicity. That is why services from the following companies are vital to the Web experience:

  • Google – for finding documents
  • Amazon – for finding documents about consumer products
  • Facebook – for finding out what family, friends, and acquaintances are up to via a collection of documents
  • LinkedIn – for finding out what professional contacts are up to via a collection of documents
  • Twitter – a combination of all of the above via micro-documents (or snippets)

Every example above is ultimately about a solution that’s driven by indexing keywords and phrases contained in documents identified by hyperlinks.
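
To make the idea concrete, here is a minimal sketch of keyword indexing over documents identified by hyperlinks. It is only an illustration of the general technique (an inverted index), not how any of the services above are actually built; the example.org URLs and their text are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each keyword to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of documents containing every query keyword."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical documents, keyed by their hyperlinks (HTTP URIs).
documents = {
    "https://example.org/a": "Linked Data and the Semantic Web",
    "https://example.org/b": "Building a personal index of web documents",
    "https://example.org/c": "Faceted browsing over linked data",
}

index = build_index(documents)
print(search(index, "linked data"))
```

A lookup returns only the documents that contain all of the keywords, which is the basic behavior a personal index needs before anything richer is layered on top.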

What’s the Problem?

Google, Amazon, Facebook, LinkedIn, Twitter, and others maintain general-purpose indexes that – despite their best efforts – are suboptimal for personal use in solving the perennial challenge of finding relevant information that accurately satisfies individual needs.

What’s the Solution?

The best solution is to build your own indexes that optionally incorporate content discovered via the more general indexes.

The only fundamental requirement is a document storage location for which you have read-write privileges. Examples include storage provided by:

To simplify matters further, you can exploit the end-user productivity benefits provided by tools such as:

How do I build a Personal Index?

Perform one of the following actions:

  • Visit a page of interest and then invoke the Sponger Middleware
  • Use OSDS to save an RDF-based description of a document-of-interest to your personal data space
  • Use a Faceted Browser or Query Tool to look in your index whenever you need to find something
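
For a sense of what an RDF-based description of a document-of-interest might look like, here is a minimal sketch that emits RDF-Turtle for a page and its main entity. It does not reproduce what OSDS or the Sponger Middleware generate; the `#this` naming convention, the helper names, and the example.org values are assumptions made for illustration, and the URL is assumed to have no fragment of its own.

```python
import re
from datetime import date

PREFIXES = """\
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def turtle_iri(iri):
    """Wrap an IRI in angle brackets, rejecting characters Turtle forbids."""
    if re.search(r'[\x00-\x20<>"{}|^`\\]', iri):
        raise ValueError(f"not a valid IRI for Turtle: {iri!r}")
    return f"<{iri}>"


def describe_document(url, title, comment, created):
    """Return a Turtle description of a page and its main entity."""
    page = turtle_iri(url)
    post = turtle_iri(url + "#this")  # assumed naming convention for the entity
    return (
        f"{PREFIXES}\n"
        f"{page} schema:mainEntity {post} .\n"
        f"{post} rdfs:label {turtle_literal(title)} ;\n"
        f"    rdfs:comment {turtle_literal(comment)} ;\n"
        f'    dcterms:created "{created.isoformat()}"^^xsd:date .\n'
    )


# Hypothetical page of interest.
print(describe_document(
    "https://example.org/posts/42",
    'A post about "Linked Data"',
    "Notes on building a personal index.",
    date(2019, 5, 3),
))
```

The resulting text can be saved as a `.ttl` document in any storage location where you have read-write privileges.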

Showcase Examples

We have a GitHub Repository for RDF-Turtle Docs generated from the Weekly Google Spreadsheets via OpenRefine. (JSON docs used for this transformation are also part of the repo.)

We have also sponged these documents using our URIBurner Service, thereby creating a rich index oriented towards our needs, i.e., a Knowledge Graph comprising references to Tweets and Blog Posts:

Faceted Browsing Pages

SPARQL Queries

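# Virtuoso Sponger pragmas: fetch data on demand if it is not already
# stored locally, using the resources bound to ?url as retrieval targets.
DEFINE get:soft "soft"
DEFINE input:grab-var "url"

PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?url ?post ?postTitle ?postComment ?date
WHERE {
    # URLs listed in the weekly showcase document
    GRAPH <https://github.com/OpenLinkSoftware/lod-refine-showcases/raw/master/Week-Ending-2019-05-03.ttl> {
        ?s foaf:topic ?topic .
        ?topic schema:url ?url .
    }

    # Descriptions of the posts found at those URLs, in any named graph
    GRAPH ?g {
        ?url schema:mainEntity ?post .
        ?post rdfs:label ?postTitle ;
              rdfs:comment ?postComment ;
              dcterms:created ?date .
    }
}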
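
A query like the one above can also be sent to a SPARQL endpoint programmatically. The following is a minimal sketch using the standard `query` parameter of the SPARQL Protocol; the endpoint URL is an assumption (substitute the endpoint of your own data space), the query is a shortened version of the one above, and `fetch_results` is a hypothetical helper that is defined here but not called.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request, urlopen

# Assumed endpoint; replace with the SPARQL endpoint of your own data space.
ENDPOINT = "https://linkeddata.uriburner.com/sparql"

# Shortened version of the showcase query: list the URLs in one weekly document.
QUERY = """\
PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?url
WHERE {
    GRAPH <https://github.com/OpenLinkSoftware/lod-refine-showcases/raw/master/Week-Ending-2019-05-03.ttl> {
        ?s foaf:topic ?topic .
        ?topic schema:url ?url .
    }
}
"""


def build_request_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query percent-encoded."""
    return f"{endpoint}?{urlencode({'query': query})}"


def fetch_results(endpoint, query):
    """Send the query and return the parsed JSON results (needs network access)."""
    request = Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return json.load(response)


url = build_request_url(ENDPOINT, QUERY)
# Decoding the URL again recovers the original query text unchanged.
print(parse_qs(urlsplit(url).query)["query"][0] == QUERY)
```

Requesting results as JSON through the `Accept` header keeps the sketch independent of any one server's extra parameters.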

  • Crawling Tweets – Post URIs resolve to basic Entity Description Pages
  • Crawling Tweets – Post URIs resolve to basic Entity Description Pages that include Faceted Browsing functionality