Virtuoso (RDF-store) underlying source code for caching [techniques] and query processing

Pouya · February 14, 2020, 7:50pm

I am a computer science research student who is interested in DBMS with a specific focus on RDF-stores. I’ve recently used Virtuoso (built from GitHub as a columnar triple-store) to run some SPARQL queries against an RDF dump in my local machine. Thanks for providing such a fantastic DBMS. In this regard, I have some questions:

How SPARQL queries are executed using Virtuoso? It means does Virtuoso columnar translate SPARQL to SQL (e.g., similar to this paper’s explanation: http://ceur-ws.org/Vol-301/Paper_5_Erling.pdf) and then run SQL against the underlying relational DBMS or not? (this link is broken: “How Virtuoso uses Relational Technology in its RDF Triple Store and SPARQL implementation” http://vos.openlinksw.com/owiki/wiki/VOS/VOSRDFWP)
Which part of the source code is related to the implementation of the caching techniques? Does Virtuoso cache the query results as well?
Apache Jena contains a way to extract SPARQL syntax called SSE (link: Apache Jena - SPARQL S-Expressions (or "SPARQL Syntax Expressions")). Did you implement something like SSE for VIrtuoso in C/C++? If yes, which part of the code contains that?

I hope a few words from this community as answers of these questions help me to learn more about the excellency of Virtuoso.

Thanks very much for taking time to read this post.

imikhailov · February 14, 2020, 11:33pm

Hello Pouya,

1 . Yes, Virtuoso first translates SPARQL to SQL and then executes the SQL. If you connect via isql you can read the intermediate SQL by writing

set blobs on;
select sparql_to_sql_text ('text_of_your_sparql_query_here') as x long varchar;

(the text of sparql query should not contain leading SPARQL keyword and if it contains single quotes then they should be repeated twice, like inside any other SQL string constant)

The query execution plan and SPARQL to SQL text can also be obtained as detailed here.

If you connect via SPARQL web service endpoint then you can mark the
“Generate SPARQL compilation report (instead of executing the query)”
checkbox to read the SQL text among other diagnostics as detailed here.

2 . Virtuoso does not cache any results; of course it caches recently accessed parts of source data in tables and it caches the compiled queries but not the results.

3 . For the sake of speed SPARQL is parsed into internal tree-like representation and other compiler’s internal data at the same time, there’s no point in time when there’s a “nice to process” syntax tree.

hwilliams · February 14, 2020, 11:57pm

The broken link above has been fixed …

Pouya · February 15, 2020, 12:33am

Hi Ivan,

Thanks very much for your reply. I have two additional questions as well:

This link (http://docs.openlinksw.com/virtuoso/opttechniques/) presents Virtuoso’s optimization techniques. I can see that the implementation of hash joins has been taken into account. However, I could not see any information about merge join. Does Virtuoso use merge join? (as far as I know, for sparql queries with subject-to-object join (e.g., ?s ?p1 ?o . ?o ?p2 ?o2), merge join may perform better)
Unfortunately, I am not experienced enough to easily explore a large codebase. Would you please give me some hints that which parts of the Virtuoso’s source code (GitHub - openlink/virtuoso-opensource: Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform) is related to the:
a) caching (e.g., query plans)
and
b) the query execution engine (the part that I think is responsible for converting SQL to C code to execute it).

Thanks very much and I’m sorry if my questions are diverse.

imikhailov · February 15, 2020, 7:52am

Virtuoso uses loop joins (with optional inline checks for conditions on fields other than leading parts of keys), hash joins, bloom filters when possible, and it uses im-memory merges for some special cases such as joins of ID streams of different words in search for phrase in free-text index. There are also some special tricks for joins on matching beginnings of keys (this might look like a merge join but that’s actually a loop join that just starts and stops a search in right side where merge sort would start and stop the iteration on its right side).There’s a special logc to deal with the “extreme case” of merge joins — a self-join on primary key; this is recognized by SQL compiler to remvoe the join at all and, moreover, there’s an SQL extension called BREAKUP to select more than one result row from one data row, this results in better execution plans on some queries over RDF Views.

2a. Caching of queries is per-connection and depending on Virtuoso version it may contain a logic to detect series of queries that differ in values of parameters only and to conver dynamic queries to parametrized, cutting the size of query cache and the number of recompilations. Each connection stucture, namely client_connection_t , has member cli_text_to_query that is pointer to a hastable whose keys are texts of queries and values are compiled queries of type query_t. Some queries may obsolete due to schema changes and the like, the flag qr_to_recompile is set if the query should be recompiled before the next invocation. Grep for these names to see the (numerous) files where the related logic resides. (“Here be dragons”!)

2b. SQL is not compiled to C. Instead it is turned into mix of bytecode to be interpreted (code_vec_t — code vectors) and table sources (table_source_t) of various types. The good portion of libsrc/Wi directory is about the SQL compiler and the runtime; exceptions are
— CLI*.c that are client-side;
— bif*.c that are collections of Built In Functions (c functions that can be called from SQL);
— sql_code*.c that contain texts of Virtuoso/PL things to be compiled and SQL statements to be executed at startup;
— sparql*.c that is sparql compiler that is somewhat “at top of” SQL runtime, it is entirely not used if not invoked explicitly with SPARQL keyword in SQL text, and even if invoked it just replaces the SPARQL text with equivalent SQL text and let the SQL compielr and runtime do the rest.

The rule of thumb is that if you can write your new logic in Virtuoso/PL then do that in Virtuoso/PL; if you need an ultimate performance for some portion of that logc then write a new BIF(s) to implement the fast logic in C rather than interpret it in Virtuoso/PL; adding any new features to the SQL runtime costs an absolute fortune and should be avoided by all means.

An additonal rationale for writing your things on Virtuoso/PL or Virtuoso/PL + BIFs is that you can ship your Virtuoso/PL and your BIFs as “100% your own” plugin (or ship your Virtuoso/PL as “100% your own” VAD) whereas your patched Virtuoso is still GPL v2. Plugins and VADs for Virtuoso Open Source will work with commercial Virtuosos of same version (and in may cases they will stay compatible with newer versions), and when a plugin is non compatible with a newer version then it usually can be just recompiled with the header files of a newer version without any other changes.

Pouya · February 15, 2020, 8:27am

Thanks very much Ivan. You’re answer is comprehensive and informative.