Performance impact of using multiple FROM clauses

mboudet · April 8, 2022, 1:16pm

Hello,

I’m working on an application that sends user-customized queries to our Virtuoso server (OS edition).
The user selects one or more entity types, and the query is sent only to the graphs containing these entities, using FROM clauses.

(The aim is both to target the relevant graphs and to manage user permissions from the application.)

However, it seems that using FROM clauses is significantly slower than not specifying any graphs (which, as I understand it, means the query is sent to all available graphs). I did not notice the problem before because I only used small graphs (so the graph aggregation was quick, I think).

As an example, here is a query (automatically generated, hence the variable names):

PREFIX : <http://askomics.org/data/>
PREFIX askomics: <http://askomics.org/internal/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX faldo: <http://biohackathon.org/resource/faldo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?Metabolomics_Study1_Label ?Experimentation1_Label ?Concentration_AU1_Label ?Genotype36_uri

FROM <urn:sparql:askomics:1_:brassimet_racines_cau_concentrationau_0.ttl_1649313602>
FROM <urn:sparql:askomics:1_:brassimet_feuilles_cau_concentrationau_0.ttl_1649299093>
FROM <urn:sparql:askomics:1_:brassimet_racines_ms_meteroutputfile_0.ttl_1649317171>
FROM <urn:sparql:askomics:1_:brassimet_feuilles_ms_meteroutputfile_0.ttl_1649313588>
FROM <urn:sparql:askomics:1_:collection-probiodiv.owl_1649317105>

WHERE {
    ?Metabolomics_Study1_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_experimentation> ?Experimentation6_uri .
    ?Experimentation6_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_concentration_au> ?Concentration_AU13_uri .
    ?Genotype36_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_value> ?Concentration_AU13_uri .
    ?Metabolomics_Study1_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#MetabolomicsStudy> .
    ?Metabolomics_Study1_uri rdfs:label ?Metabolomics_Study1_Label .
    ?Experimentation6_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Experimentation> .
    ?Experimentation6_uri rdfs:label ?Experimentation1_Label .
    ?Concentration_AU13_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#ConcentrationAU> .
    ?Concentration_AU13_uri rdfs:label ?Concentration_AU1_Label .
    ?Genotype36_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Genotype> .


    VALUES ?Genotype36_uri { <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Aviso> } .
}

This query takes 6 minutes. If I remove all FROM clauses, it takes 10 seconds.
Is there any reason for the time difference? I’d rather select the graphs instead of querying all, if possible.

Thanks!

hwilliams · April 8, 2022, 8:33pm

Does the following amendment to the query, which uses the GRAPH keyword to invoke the graph index, enable the query to run faster?

PREFIX : <http://askomics.org/data/>
PREFIX askomics: <http://askomics.org/internal/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX faldo: <http://biohackathon.org/resource/faldo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?Metabolomics_Study1_Label ?Experimentation1_Label ?Concentration_AU1_Label ?Genotype36_uri

FROM <urn:sparql:askomics:1_:brassimet_racines_cau_concentrationau_0.ttl_1649313602>
FROM <urn:sparql:askomics:1_:brassimet_feuilles_cau_concentrationau_0.ttl_1649299093>
FROM <urn:sparql:askomics:1_:brassimet_racines_ms_meteroutputfile_0.ttl_1649317171>
FROM <urn:sparql:askomics:1_:brassimet_feuilles_ms_meteroutputfile_0.ttl_1649313588>
FROM <urn:sparql:askomics:1_:collection-probiodiv.owl_1649317105>

WHERE 
{
GRAPH ?g 
    {
    ?Metabolomics_Study1_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_experimentation> ?Experimentation6_uri .
    ?Experimentation6_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_concentration_au> ?Concentration_AU13_uri .
    ?Genotype36_uri <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#has_value> ?Concentration_AU13_uri .
    ?Metabolomics_Study1_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#MetabolomicsStudy> .
    ?Metabolomics_Study1_uri rdfs:label ?Metabolomics_Study1_Label .
    ?Experimentation6_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Experimentation> .
    ?Experimentation6_uri rdfs:label ?Experimentation1_Label .
    ?Concentration_AU13_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#ConcentrationAU> .
    ?Concentration_AU13_uri rdfs:label ?Concentration_AU1_Label .
    ?Genotype36_uri rdf:type <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Genotype> .

    VALUES ?Genotype36_uri { <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#Aviso> } .
    }
}

mboudet · April 11, 2022, 1:19pm

Sadly, our graphs are separated by entity type, so adding a GRAPH clause returns no results. We need the graph aggregation for the query to return something.

Two of the graphs (the ones with ‘concentration’ in the name) are ‘big’ (4 and 7 million triples). The others are at most a few thousand triples. We did not notice this issue until now because we were using smaller graphs. I suppose the join is very costly in this case, but I don’t really understand why it would be faster to not specify any graph.

hwilliams · April 12, 2022, 11:47am

Yes, but what was the response time with the revised query I provided?

mboudet · April 12, 2022, 12:00pm

Sorry about that. It took roughly 3 seconds.

hwilliams · April 12, 2022, 3:48pm

So that is even faster than when no graphs are specified and the default of ALL graphs is used, because the GRAPH ?g specification in the WHERE clause instructs the compiler to use the graph index when performing the query.

mboudet · April 13, 2022, 8:40am

Seems like it. Is there any way to make this work with the way our graphs are set up?

hwilliams · April 13, 2022, 11:51am

Not sure what you are asking, as my revised query runs against your database with your graphs as they are set up, with only the GRAPH ?g specification added to the WHERE clause of your original query.

mboudet · April 13, 2022, 12:28pm

Well, as I mentioned, the new query (with the GRAPH clause) returns no results, so I cannot use it.
Sorry if that wasn’t clear beforehand.
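
For reference, a single GRAPH ?g block requires every triple pattern inside it to match within one and the same graph, which cannot happen when each entity type lives in its own graph. One possible middle ground, sketched here under standard SPARQL 1.1 dataset semantics, is to declare the graphs with FROM NAMED and give each triple pattern its own graph variable, so patterns can match in different graphs while the query stays restricted to the selected ones. The p2m2: prefix and the variables ?g1 to ?g10 are introduced for this sketch only and are not part of the original queries. Whether Virtuoso runs this form faster than the FROM version has not been measured in this thread.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Shorthand introduced for this sketch only; the original queries spell these IRIs out.
PREFIX p2m2: <https://p2m2.github.io/resource/ontologies/2022/2/p2m2-ontology-3#>

SELECT DISTINCT ?Metabolomics_Study1_Label ?Experimentation1_Label ?Concentration_AU1_Label ?Genotype36_uri

# FROM NAMED (rather than FROM) exposes the graphs to GRAPH patterns.
FROM NAMED <urn:sparql:askomics:1_:brassimet_racines_cau_concentrationau_0.ttl_1649313602>
FROM NAMED <urn:sparql:askomics:1_:brassimet_feuilles_cau_concentrationau_0.ttl_1649299093>
FROM NAMED <urn:sparql:askomics:1_:brassimet_racines_ms_meteroutputfile_0.ttl_1649317171>
FROM NAMED <urn:sparql:askomics:1_:brassimet_feuilles_ms_meteroutputfile_0.ttl_1649313588>
FROM NAMED <urn:sparql:askomics:1_:collection-probiodiv.owl_1649317105>

WHERE {
    VALUES ?Genotype36_uri { p2m2:Aviso }

    # One graph variable per triple pattern: each pattern may match in a
    # different named graph, and the shared ?..._uri variables do the join.
    GRAPH ?g1  { ?Metabolomics_Study1_uri p2m2:has_experimentation ?Experimentation6_uri . }
    GRAPH ?g2  { ?Experimentation6_uri p2m2:has_concentration_au ?Concentration_AU13_uri . }
    GRAPH ?g3  { ?Genotype36_uri p2m2:has_value ?Concentration_AU13_uri . }
    GRAPH ?g4  { ?Metabolomics_Study1_uri rdf:type p2m2:MetabolomicsStudy . }
    GRAPH ?g5  { ?Metabolomics_Study1_uri rdfs:label ?Metabolomics_Study1_Label . }
    GRAPH ?g6  { ?Experimentation6_uri rdf:type p2m2:Experimentation . }
    GRAPH ?g7  { ?Experimentation6_uri rdfs:label ?Experimentation1_Label . }
    GRAPH ?g8  { ?Concentration_AU13_uri rdf:type p2m2:ConcentrationAU . }
    GRAPH ?g9  { ?Concentration_AU13_uri rdfs:label ?Concentration_AU1_Label . }
    GRAPH ?g10 { ?Genotype36_uri rdf:type p2m2:Genotype . }
}
```

Since the graph variables are not projected and the result is DISTINCT, this should return the same rows as the original FROM query. Patterns known to live in the same graph could share one GRAPH block, but that depends on how the data is split across the graphs, which is not detailed here.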

kidehen · May 19, 2022, 10:53pm

Please share your query example. Even better, if possible, share your SPARQL Query Service endpoint URL.

This will aid our ability to assist you.

Kingsley