Sparql slow if more object value specified

anatta · October 26, 2020, 11:13am

Hi, I am using virtuoso 7 open source edition,

This query below is fast, it return the result instantly,

 select *
                 WHERE {
 
                  graph <http://localhost:8890/graph> {
 
                       ?catAtt qq:catId        ?catId;
                               qq:caDataType   ?caDataType;
                               qq:showInView   ?showInview;
                               qq:valFormat    ?valFormatKey;
                               qq:multiple     ?multiple; 
                               qq:position     ?position; 
                               qq:link         ?link; 
                               qq:catAttName   ?catAttName; 
                               qq:setting      ?setting; 
                               qq:flag         ?flag;
                               qq:unit         ?caUnit.
        
                     }  
                   }
                
                   LIMIT 20

This query is so slow, took 10 seconds for the result, the only difference is the ?catId is replaced with literal value 1. I thought this should be faster because it’s more specific.

 select *
                 WHERE {
 
                  graph <http://localhost:8890/graph> {
 
                       ?catAtt qq:catId      1;
                               qq:caDataType ?caDataType;
                               qq:showInView   ?showInview;
                               qq:valFormat    ?valFormatKey;
                               qq:multiple     ?multiple; 
                               qq:position     ?position; 
                               qq:link         ?link; 
                               qq:catAttName   ?catAttName; 
                               qq:setting      ?setting; 
                               qq:flag         ?flag;
                               qq:unit         ?caUnit.
        
                     }  
                   }
                
                 
                   LIMIT 20

Why is it so? I am still new to SPARQL and triplestore, seems like I have to do some indexing on the object?

I have tried creating the GOPS index with, I don’t understand about the Partition part, is using O correct? What does that partition mean? Creating this index has improved the execution for the second query to 4 seconds, but still too slow for a DBMS.

CREATE COLUMN INDEX RDF_QUAD_GOPS
  ON RDF_QUAD (G, O, P, S)
  PARTITION (O VARCHAR (-1, 0hexffff));

Btw, does Virtuoso store triplets in the regular RDBMS table? in the table DB.DBA.RDF_QUAD? So virtuoso does not have a native storage of graph? Or what I don’t understand?

Thanks

hwilliams · October 26, 2020, 1:23pm

What is the specific version and gitid of the Virtuoso 7 server being used? This can be obtained from the output of running the command ./virtuoso-t -? from the bin directory of your Virtuoso installation?

The second query checking for a specific catId value and taking a longer time to run is probably because it is resulting in a Full Table Scan on the RDF_QUAD table which, depending on the number of triples in the database, can take significantly longer. This can be determined by obtaining a query compilation plan, showing how the SPARQL query is being prepared for execution against the database, as detailed in Generate SPARQL Query Compilation Report from a Virtuoso /sparql Endpoint.

Rather than checking for the catID value in the triple pattern you can try adding a SPARQL filter clause to the first query to check for the specific value required which should eliminate the need for the Full Table Scan and hence run much faster, i.e., filter (?catId = 1), See:

select *
                 WHERE {
 
                  graph <http://localhost:8890/graph> {
 
                       ?catAtt qq:caDataType   ?caDataType;
                               qq:showInView   ?showInview;
                               qq:valFormat    ?valFormatKey;
                               qq:multiple     ?multiple; 
                               qq:position     ?position; 
                               qq:link         ?link; 
                               qq:catAttName   ?catAttName; 
                               qq:setting      ?setting; 
                               qq:flag         ?flag;
                               qq:unit         ?caUnit;
                               qq:catId        ?catId.
                               filter (?catId = 1)
 
                     }  
                   }

                   LIMIT 20

Generally, there is no need to create additional indexes on the RDF_QUAD table, as the 5 default indexes (2 Full & 3 partial) suffice for most uses with few exceptions, as indicated in the Virtuoso RDF Performance Tuning Guide, thus I would suggest deleting the index you created.

The Virtuoso RDF Quad Store is built on top of an Object Relational Database system (which Virtuoso is at its core), as detailed in the Implementing a SPARQL compliant RDF Triple Store using a SQL-ORDBMS white paper.

anatta · October 26, 2020, 2:35pm

Hi, thanks for the help. I am using Windows. I downloaded from https://udomain.dl.sourceforge.net/project/virtuoso/virtuoso/7.2.5/2018_08_28_Virtuoso_OpenSource_Server_7.2.x64.exe

C:\Program Files\OpenLink Software\Virtuoso OpenSource 7.2\bin> .\virtuoso-t.exe -?
Virtuoso Open Source Edition (multi threaded)
Version 07.20.0.3229-threads as of Aug 27 2018 (17c4ba1d5)
Compiled for Win64 (x86_64-generic-win-64)
Copyright (C) 1998-2018 OpenLink Software

When I used your FILTER query, the query has reduced from 8 seconds to 4 seconds, there’s improvement, but still pretty slow.

Previously I only have 4 indexes, after adding my own then it become 5 indexes, the first one was mine. So it should have 5 indexes by default ?

Here is the explain output for the Filter query you suggested, which still takes 4 seconds.

Fyi, my RDF data was generated from MySQL using d2rq, then I imported the RDF into Virtuoso.

hwilliams · October 26, 2020, 6:15pm

Your screenshot is missing the PRIMARY KEY INDEX on the table itself which is the main PSOG index, and visible if you view the RDF_QUAD table Definition in the conductor and select the Show SQL option to see the DDL command for the table creation:

Screenshot 2020-10-26 at 18.08.51

Can you also provide the explain output for the first query that runs fast for comparison?

How many triples were imported into Virtuoso from the MySQL D2RQ dump?

What is the output of running the “status();” command from isql against your Virtuoso instance, to show the status of the server at that point?

anatta · October 27, 2020, 5:23am

Hi, I see the main primary Index now, I have dropped my GOPS index.

EXPLAIN for fast query
https://pastebin.pl/view/fffaacd9

EXPLAIN for slow query (3 to 4 seconds) with ?catAtt qq:catId 1;
https://pastebin.pl/view/46de4705

Btw, now my initial query by specifying the value for ?catId 1 that took 8 seconds now only 3 seconds. Which become slightly faster than using FILTER, it’s strange. I know first time query is always slower than the second time, all the time I tested the query multiple times.

I used the “statistic” tab to count the triplet in “graph” that was imported from MySQL D2RQ; here it is, I deleted the graph URL because this forum does not allow me to post more than 2 links.

this:Dataset a void:Dataset ; 
 rdfs:seeAlso <url/graph> ; 
 rdfs:label "" ; 
 void:sparqlEndpoint <url/sparql> ; 
 void:triples 108759 ; 
 void:classes 7 ; 
 void:entities 9282 ; 
 void:distinctSubjects 9291 ; 
 void:properties 211 ; 
 void:distinctObjects 49384 .

status() output

Query result:
REPORT
VARCHAR
OpenLink Virtuoso Server
Version 07.20.3229-threads for Win64 as of Aug 27 2018
Started on: 2020-10-27 12:07 GMT+7

Database Status:
File size 299892736, 36608 pages, 28352 free.
20000 buffers, 2457 used, 218 dirty 0 wired down, repl age 0 0 w. io 0 w/crsr.
Disk Usage: 2472 reads avg 0 msec, 0% r 0% w last 0 s, 139 writes flush 0 MB/,
77 read ahead, batch = 21. Autocompact 0 in 0 out, 0% saved.
Gate: 305 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = virtuoso.trx, 6310 bytes
7690 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 0 connects, max 0 concurrent
RPC: 1 calls, 0 pending, 1 max until now, 0 queued, 0 burst reads (0%), 0 second 14M large, 64M max
Checkpoint Remap 304 pages, 0 mapped back. 0 s atomic time.
DB master 36608 total 28352 free 304 remap 3 mapped back
temp 256 total 251 free

Lock Status: 0 deadlocks of which 0 2r1w, 0 waits,
Currently 1 threads running 0 threads waiting 0 threads in vdb.
Pending:

Running Statements:
Time (msec) Text

Hash indexes

Is there a way to turn on display for execution time for query? Currently, I am manually counting the time.

anatta · October 27, 2020, 7:13am

Can I ask a separate question, is triplestore suitable to store day to day programming data in RDF form? I have a lot of linked data which is suitable for triplestore, but what about application data such as user login info, customer, orders are they suitable to be stored in triplestore? I would prefer that I could so that I only have to deal with one type of database to reduce complexity but I been reading a lot saying that triplestore is only suitable for linked data and inferencing. Because for the same query triplestore seems to be slower than regular RDBM. This is discussed here semantics - Are RDF triple Stores, suitable for everyday programming? - Stack Overflow But it’s 10 years, ago, I don’t know if things has changed since.

I know Virtuoso is multi-model that support RDBMS, but I prefer to store in triplestore so that I only maintain one system and talk in one language, SPARQL

Let me know if it’s better to start a new thread for this.

anatta · October 27, 2020, 4:26pm

Hi, I have a new finding for the slow query with literal value specified, when that ?catAttName is removed from the query, the query is almost instant fast! ?catAttName is a string field, why querying string field slow it down?

 select *
                 WHERE {
 
                  graph <http://localhost:8890/graph> {
 
                       ?catAtt qq:catId        1;
                               qq:caDataType   ?caDataType;
                               qq:showInView   ?showInview;
                               qq:valFormat    ?valFormatKey;
                               qq:multiple     ?multiple; 
                               qq:position     ?position; 
                               qq:link         ?link; 
                               qq:setting      ?setting; 
                               qq:flag         ?flag;
                               qq:unit         ?caUnit.
        
                     }  
                   }
                 
                   LIMIT 20

EXPLAIN output (combine the URL below), it’s kind of annoying that I cannot post more than 2 links, then now i cannot post this pastebin domain, so just delete the space in between.

fast query - Pastebin [link fixed by OpenLink staff]

UPDATE: When I import the same RDF into GraphDB by Ontotext, the same query return instantly, took just 0.1 seconds, clearly something is not right with Virtuoso.

hwilliams · October 28, 2020, 12:19am

That is an interesting observation, as it would appear the query is for some reason generating a bad query plan, in some way related to the inclusion of the ?catAtt qq:catId 1; and qq:catAttName ?catAttName; triple patterns in the query.

We are analysing the query plans provided to try and determine the cause …

anatta · November 3, 2020, 9:39am

Hi, I suppose this won’t be fixed anytime soon right? I am guessing it will take months/year before a new version released? I would have to use other triplestore like GraphDb or Blazegraph then.

imitko · November 6, 2020, 7:14pm

Hi,
Please make explain of the slow query and send us the report so we can see what happens. The sparql endpoint have a checkbox to make explain, this results in a report shows:

translation to sql
sparql optimisations
sql execution plan

anatta · November 7, 2020, 4:02am

Hi, please scroll up, I already provided both EXPLAIN for slow and fast query

Mitko_Iliev · November 18, 2020, 6:03pm

Hi anatta,

The pastebin link do not working for me, gives me timeout. Please send us the raw output of explain and profile. The way to do either of them is to use ISQL tool and do:

set explain on;
do the query
set explain off;

same for profile

wkr,
mitko

TallTed · November 18, 2020, 11:50pm

@anatta – If the community.openlinksw.com server rejects your file attachment for any reason, or it says your inline pasted text is too large, you can email the output (optimally as attached text files) to support@openlinksw.com, referencing this thread, and we’ll connect the dots.

anatta · November 24, 2020, 8:29am

@Mitko_Iliev @TallTed

Hi, I have emailed to support@openlinksw.com, your forum doesn’t allow me to include more than 2 links. As there are many graph links in the EXPLAIN.

After many days I didn’t use Virtuoso, somehow I cannot start the service anymore, can help me on this? I can reinstall but will it deleted my config and database? So I can only provide the EXPLAIN query.

hwilliams · November 24, 2020, 10:52am

Why could you not simply zip up the Explain files together and attach them to a post as I have done ?

Explain-Fast-Slow-Queries.zip (7.2 KB)

hwilliams · November 24, 2020, 10:56am

In what way is the Virtuoso service not starting? What is in the virtuoso.log file from the start attempts?

Might the machine have been shutdown unexpectedly, leaving the virtuoso.lck in the database directory, which will then prevent it from starting ?

You can reinstall and the database file (virtuoso.db) should remain in place (but best make a copy of it), as it should not be uninstalled or overwritten as the installer does not install a default database, but reinstallation should NOT be required …

anatta · November 24, 2020, 12:39pm

@hwilliams

Hi, indeed that virtuoso.lck was the issue, deleting it, solve it.

I don’t have the option to upload file; the only option that I have for upload is the image. If I tried to upload file via that “Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, heic, heif).”

Could it be you have more privilege than me?

This is what I saw. If I directly copy paste a file, it will also show the error like above.

anatta · November 24, 2020, 12:49pm

When I try this in iSQL, why this error ?

set explain on;


SPARQL

select *
                 WHERE {
 
                  graph <http://localhost:8890/graph> {
 
                       ?catAtt qq:catId        ?catId;
                               qq:caDataType   ?caDataType;
                               qq:showInView   ?showInview;
                               qq:valFormat    ?valFormatKey;
                               qq:multiple     ?multiple; 
                               qq:position     ?position; 
                               qq:link         ?link; 
                               qq:catAttName   ?catAttName; 
                               qq:setting      ?setting; 
                               qq:flag         ?flag;
                               qq:unit         ?caUnit.
        
                     }  
                   }
                
                   LIMIT 20

set explain off;

hwilliams · November 24, 2020, 3:00pm

You run the commands from the command line isql tool and NOT from the Conductor Interactive SQL UI which does not support the set... commands, see:

http://vos.openlinksw.com/owiki/wiki/VOS/VirtTipsAndTricksAnalyzingSPARQLQuery

anatta · November 24, 2020, 3:05pm

Hi, I am not using the command line, in fact i don’t know how to execute with command line yet.

Here is the screenshot of the conductor UI.

Or is it the limitation of the open source version ?