Question about how the LDP/SPARQL "magic" is Implemented

markwilkinson · April 7, 2020, 8:22pm

Hi all,

I’m curious about how you get your LDP to be SPARQLable. Here’s what I believe is true, but please fix my thinking

LDP is primarily a document store. The LDP infrastructure lets me automate the browsing of that document store.
Virtuoso, God bless them, makes it possible to SPARQL over RDF documents that are POSTed or PUT into that store (as far as I can tell, this only works for Turtle? I have never successfully been able to SPARQL a JSON-LD document)

And here’s where my real questions begin

it seems to be possible to PATCH a document. This is very cool! … But that’s not a “normal” behavior for a document store!
every document, and every container, becomes a named graph. Love it!
when I create SPARQL queries that span multiple documents, they will often time out UNLESS I split the query into separate pieces that use independent GRAPH ?g clauses… If I do that, the query response is almost instant! These are not complicated queries… If all of the triples were in a single triplestore, the query would answer almost immediately…
All of this leads me to believe that the documents in the LDP server are being consumed at query-time, rather than being constantly indexed in the back-end triplestore. Is that true?
this belief is further supported by the fact that I can only request a document in the format that it was originally POSTed (e.g. I cannot request an ntriples representation of a document that was POSTed in Turtle)… So it’s clear that you’re not absorbing that document into the triplestore, and then dynamically generating it based on its named graph… The document seems (behaviorally) to be a “physical” thing…

This is only important because it becomes necessary to know, a priori, what the data structure is, before you construct the query - i.e. I need to know which triples are in which graph. This is frustrating, though not a total blocker in most cases.

And if Virtuoso is storing documents, then what is the PATCH doing? Is it modifying the document? Or are the patches being applied at the time I download it from its URL?

So many questions

Cheers guys!

M

kidehen · April 7, 2020, 11:12pm

LDP is one of the HTTP-based interfaces to the Virtuoso hosted File System, alongside WebDAV.

kidehen · April 7, 2020, 11:18pm

Thanks!
Yes it enables the use of SPARQL via HTTP for performing Read-Write operations on RDF documents in the Virtuoso-hosted File System. In the case of LDP, operations occur over HTTP where the PATCH Method can be used to perform INSERT, UPDATE, or DELETE operations using SPARQL 1.1 statements.

PATCH is more granular than PUT or POST, and its use of SPARQL for granular operations is what manifests as Turtle specificity. Remember, the body of a SPARQL Query is really RDF-Turtle plus the support of variables.
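
As a rough illustration of the PATCH usage described above, here is a minimal Python sketch that builds (but does not send) a PATCH request whose body is a SPARQL 1.1 Update. The document URL, the example triple, and the helper name `build_patch_request` are hypothetical. `application/sparql-update` is the media type registered for SPARQL 1.1 Update; whether a given server expects exactly this header is an assumption to check against your own instance.

```python
import urllib.request


def build_patch_request(doc_url: str, sparql_update: str) -> urllib.request.Request:
    """Build (without sending) an HTTP PATCH request carrying a SPARQL 1.1 Update."""
    return urllib.request.Request(
        doc_url,
        data=sparql_update.encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/sparql-update"},
    )


# Hypothetical document URL and triple, for illustration only.
DOC_URL = "http://example.org/ldp/container/doc1.ttl"
UPDATE = """\
PREFIX ex: <http://example.org/ns#>
INSERT DATA { <http://example.org/ldp/container/doc1.ttl#this> ex:label "hello" . }
"""

request = build_patch_request(DOC_URL, UPDATE)
# urllib.request.urlopen(request) would actually send it; omitted in this sketch.
```

Note how the part between the braces of `INSERT DATA` is plain Turtle-style triples, which matches the point about SPARQL bodies being RDF-Turtle plus variables.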

kidehen · April 7, 2020, 11:24pm

Well, in the case of RDF you have documents comprising sentences, just like the real world. PATCH allows you to add and delete sentences. It also gives you update capability by combining SPARQL INSERTs for the new stuff and DELETEs of whatever is to be replaced in a single block.

The pattern is usual, but unusual with regard to mainstream programming expectations, since the RDF mode of operation aligns tightly with the file create, publish, and delete pattern.
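
To make the "update as DELETE plus INSERT in a single block" idea concrete, here is a small Python sketch that composes such a SPARQL 1.1 Update string. The helper name `build_replace_update` and the example IRIs are hypothetical, and the string would still need to be sent as the body of a PATCH. The `OPTIONAL` in the `WHERE` clause is a design choice of this sketch so the insert still happens when there is no old value to delete.

```python
def build_replace_update(subject: str, predicate: str, new_literal: str) -> str:
    """Return a SPARQL 1.1 Update that swaps the old value(s) of one
    predicate on one subject for a new plain-string literal."""
    # Escape characters that would otherwise break a double-quoted literal.
    escaped = (
        new_literal.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return (
        f"DELETE {{ <{subject}> <{predicate}> ?old . }}\n"
        f'INSERT {{ <{subject}> <{predicate}> "{escaped}" . }}\n'
        f"WHERE  {{ OPTIONAL {{ <{subject}> <{predicate}> ?old . }} }}"
    )


# Hypothetical IRIs, for illustration only.
update = build_replace_update(
    "http://example.org/ldp/doc1.ttl#this",
    "http://example.org/ns#label",
    "new label",
)
print(update)
```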

kidehen · April 7, 2020, 11:39pm

markwilkinson:

when I create SPARQL queries that span multiple documents, they will often time out UNLESS I split the query into separate pieces that use independent GRAPH ?g clauses… If I do that, the query response is almost instant! These are not complicated queries… If all of the triples were in a single triplestore, the query would answer almost immediately…

All of this leads me to believe that the documents in the LDP server are being consumed at query-time, rather than being constantly indexed in the back-end triplestore. Is that true?

this belief is further supported by the fact that I can only request a document in the format that it was originally POSTed (e.g. I cannot request an ntriples representation of a document that was POSTed in Turtle)… So it’s clear that you’re not absorbing that document into the triplestore, and then dynamically generating it based on its named graph… The document seems (behaviorally) to be a “physical” thing…

Fundamentally, GRAPH ?g provides better hints for the query optimizer.
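
For readers following along, the query shape being discussed (each group of triple patterns scoped by its own GRAPH variable) might be sketched as below. This is only an illustration of the query form: the helper name `scope_patterns_by_graph`, the `?g1`/`?g2` variable naming, and the example predicates are hypothetical, and no claim is made here about how much it helps on any particular instance.

```python
def scope_patterns_by_graph(pattern_groups: list[str]) -> str:
    """Wrap each group of triple patterns in its own GRAPH ?gN { ... } clause."""
    clauses = [
        f"  GRAPH ?g{i} {{ {group.strip()} }}"
        for i, group in enumerate(pattern_groups, start=1)
    ]
    return "SELECT * WHERE {\n" + "\n".join(clauses) + "\n}"


# Two pattern groups that are assumed to live in different documents.
query = scope_patterns_by_graph([
    "?person <http://example.org/ns#worksOn> ?project .",
    "?project <http://example.org/ns#title> ?title .",
])
print(query)
```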

Document content is constantly being reindexed, subject to what is set in the Virtuoso Scheduler, i.e., Text Indexing is on by default and invoked periodically.

The content of RDF documents (specifically literal objects) is indexed, as per what I described above.

PATCH is performing a SPARQL INSERT, UPDATE, or DELETE operation on the triples (which have literal objects) associated with a Named Graph (which can be a 1:1 mapping re a document uploaded to the File System or Sponged via the Sponger Middleware Module).

You are, in a sense, operating on a sentence (one triple), a paragraph (relations or collections of triples that have a common predicate), or a page (a named graph).
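
The sentence / paragraph / page analogy can be pictured with a toy in-memory model. This is purely a sketch of the granularity levels, not of how Virtuoso stores anything; the `Quad` type, the helper names, and the sample data are all made up for illustration.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    graph: str      # named graph, i.e. the "page"
    subject: str
    predicate: str
    obj: str


# Made-up sample data: two "pages" (named graphs).
QUADS = [
    Quad("doc1", "alice", "name", "Alice"),
    Quad("doc1", "alice", "knows", "bob"),
    Quad("doc1", "bob", "name", "Bob"),
    Quad("doc2", "carol", "name", "Carol"),
]


def page(quads: list[Quad], graph: str) -> list[Quad]:
    """All triples in one named graph."""
    return [q for q in quads if q.graph == graph]


def paragraph(quads: list[Quad], graph: str, predicate: str) -> list[Quad]:
    """Triples within one named graph that share a predicate."""
    return [q for q in page(quads, graph) if q.predicate == predicate]


sentence = QUADS[0]                            # a single triple
names_in_doc1 = paragraph(QUADS, "doc1", "name")
whole_doc1 = page(QUADS, "doc1")
```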