HowTo: Building a Personal Index for Data Discovery

kidehen · April 25, 2019, 8:06pm

The “deceptively simple” nature of the Web (a Document Network layered atop the Internet) starts and ends with the power unleashed by the follow-your-nose exploration and discovery pattern enabled by hyperlinks (specifically, HTTP URIs).

Unfortunately, following your nose and sharing links also creates a common problem: finding or recalling relevant information is rarely as simple as discovering it was. That is why services from the following companies are vital to the Web experience:

Google – for finding documents
Amazon – for finding documents about consumer products
Facebook – for finding out what family, friends, and acquaintances are up to via a collection of documents
LinkedIn – for finding out what professional contacts are up to via a collection of documents
Twitter – a combination of all of the above via micro-documents (or snippets)

Every example above is ultimately about a solution that’s driven by indexing keywords and phrases contained in documents identified by hyperlinks.
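
To make the indexing idea concrete, here is a minimal sketch of an inverted index that maps keywords to the hyperlinks of the documents containing them. It is an illustration only and not how any of the services above are implemented; the example.org URLs, the sample text, and the function names build_index and lookup are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase keyword to the set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def lookup(index, phrase):
    """Return the URLs of documents that contain every word in the phrase."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, keyed by the hyperlinks that identify them.
documents = {
    "https://example.org/a": "Linked Data and knowledge graphs",
    "https://example.org/b": "Personal data spaces for Linked Data",
}
index = build_index(documents)
matches = lookup(index, "linked data")
```

Here lookup(index, "linked data") returns both URLs, while lookup(index, "knowledge") returns only the first. A personal index works the same way, except that you decide which documents go into it.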

What’s the Problem?

Google, Amazon, Facebook, LinkedIn, Twitter, and others maintain general-purpose indexes that – despite their best efforts – are suboptimal for personal use in solving the perennial challenge of finding relevant information that accurately satisfies individual needs.

What’s the Solution?

The best solution is to build your own indexes that optionally incorporate content discovered via the more general indexes.

The only fundamental requirement is a document storage location to which you have read-write access. Examples include storage provided by:

GitHub Repositories
Google Drive, Amazon S3, OneDrive, Dropbox, Box, etc.
LDP-compliant Data Spaces (e.g., Solid Pods, ODS-Briefcase, Virtuoso’s native LDP, others)
WebDAV-compliant Data Spaces (e.g., ODS-Briefcase, Virtuoso native WebDAV, others)

To simplify matters further, you can exploit the end-user productivity benefits provided by tools such as:

OpenLink Data Explorer (ODE)
OpenLink Structured Data Sniffer (OSDS)
OpenLink Structured Data Editor (OSDE)
URIBurner Service (or your own Virtuoso instance with the Linked Data Middleware Module enabled)
Faceted Browsing Service
SPARQL Query Tool
OpenRefine

How do I build a Personal Index?

Populate your index using either of the first two actions below, then consult it using the third:

Visit a page of interest and then invoke the Sponger Middleware
Use OSDS to save an RDF-based description of a document-of-interest to your personal data space
Use a Faceted Browser or Query Tool to look in your index whenever you need to find something
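
As a rough illustration of what an RDF-based description of a document might contain, the sketch below writes a small Turtle document by hand. It is not the output format of OSDS or the Sponger, which will be richer; the "#post" fragment, the sample URL and values, and the function names are assumptions made for the example. The terms (schema:mainEntity, rdfs:label, rdfs:comment, dcterms:created) are the ones used by the query later in this post.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping characters that need it."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def describe_document(url, title, comment, created):
    """Return a Turtle document describing the page at `url`.

    Assumes `url` has no fragment of its own; the "#post" identifier
    for the main entity is a hypothetical naming choice.
    """
    post = f"{url}#post"
    lines = [
        "@prefix schema: <http://schema.org/> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{url}> schema:mainEntity <{post}> .",
        f"<{post}> rdfs:label {turtle_literal(title)} ;",
        f"    rdfs:comment {turtle_literal(comment)} ;",
        f"    dcterms:created {turtle_literal(created)} .",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical page of interest.
doc = describe_document(
    "https://example.org/post/1",
    "A post worth remembering",
    'Explains the "follow-your-nose" pattern.',
    "2019-04-25",
)
```

Saving the resulting text as a .ttl file in any of the storage locations listed earlier gives you a document that can later be loaded and queried.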

Showcase Examples

We have a GitHub Repository for RDF-Turtle Docs generated from the Weekly Google Spreadsheets via OpenRefine. (JSON docs used for this transformation are also part of the repo.)

We have also sponged these documents using our URIBurner Service, thereby creating a rich index oriented towards our needs, i.e., a Knowledge Graph comprising references to Tweets and Blog Posts:

Faceted Browsing Pages

SPARQL Queries

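# Virtuoso-specific pragmas: fetch the named graph if it is not already cached,
# and have the Sponger dereference the values bound to ?url.
DEFINE get:soft "soft"
DEFINE input:grab-var "url"

PREFIX schema: <http://schema.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?url ?post ?postTitle ?postComment ?date
WHERE {
    # URLs of the topics listed in the weekly Turtle document
    GRAPH <https://github.com/OpenLinkSoftware/lod-refine-showcases/raw/master/Week-Ending-2019-05-03.ttl> {
        ?s foaf:topic ?topic .
        ?topic schema:url ?url .
    }

    # Descriptions of the posts at those URLs, from whichever graph holds them
    GRAPH ?g {
        ?url schema:mainEntity ?post .
        ?post rdfs:label ?postTitle ;
              rdfs:comment ?postComment ;
              dcterms:created ?date .
    }
}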
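
A query like the one above can also be sent to a SPARQL endpoint over HTTP rather than pasted into a query tool. The sketch below only builds the request URL, using the "query" parameter defined by the SPARQL Protocol; it does not send anything. The endpoint address is an assumption about where the URIBurner service listens, the short sample query is a placeholder, and sparql_get_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint address; substitute your own Virtuoso instance if you run one.
ENDPOINT = "https://linkeddata.uriburner.com/sparql"


def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query text."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Placeholder query; paste the full query text here instead.
query = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
url = sparql_get_url(ENDPOINT, query)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the result format is normally negotiated through the Accept request header.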

Crawling Tweets – Post URIs resolve to basic Entity Description Pages
Crawling Tweets – Post URIs resolve to basic Entity Description Pages that include Faceted Browsing functionality