SUBQUERY limit when connected through isql (and identified as DBA)

Hi all,

I am having issues that seem clearly related to SPARQL query limits, and I am not sure whether this is a bug or a “feature”…

Context

The dataset I am working on is DBnary, available at http://kaiko.getalp.org/, but I use it through a direct connection with isql (as the dba user). Reproducing this would mean loading the full dataset, which is possible (but will take several hours); you can contact me for this.

Preamble

I am aware that there is a limit on the public SPARQL endpoint (10000 rows, plus a certain budget of time and effort for complex queries), but such limits should not apply when connected through the isql command on the Linux server (and authenticated as dba). To illustrate this, the query:

SPARQL SELECT * WHERE {?sle ontolex:canonicalForm ?tle} ;

will return 10000 rows if executed on the public SPARQL endpoint (OpenLink Virtuoso SPARQL Query Editor) and 15511921 rows if executed directly on the server using the isql command line. So far, so good!

The problem

I want to execute a huge UPDATE query every time I reload the DBnary data into a fresh server. I do not really care how long it takes, as I run it behind the scenes when publishing a new version of the data.

SPARQL INSERT
    { GRAPH <http://kaiko.getalp.org/dbnary/vartrans> {?sle vartrans:translatableAs ?tle} }
WHERE {
    { SELECT (sample(?sle) as ?sle), (sample(?le) as ?tle) WHERE {
      ?trans
        a dbnary:Translation ;
        dbnary:isTranslationOf ?sle ;
        dbnary:targetLanguage ?lg ;
        dbnary:writtenForm ?wf.
      ?sle a ontolex:LexicalEntry;
        lexinfo:partOfSpeech ?pos.
      ?le a ontolex:LexicalEntry;
        dcterms:language ?lg;
        rdfs:label ?wf;
        lexinfo:partOfSpeech ?pos.
      FILTER (REGEX(STR(?le), "^http://kaiko.getalp.org/dbnary/.../[^_]")) .
      } GROUP BY ?trans
        HAVING (COUNT(*) = 1)
    }
};

When I run this query in isql, authenticated as dba, it seems to succeed, but when I count the inserted relations in the target graph, I get only 10001 rows, while I should have millions…
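
For reference, this count can be obtained with a simple query over the target graph (graph and predicate as in the INSERT above):

SPARQL SELECT count(*) FROM <http://kaiko.getalp.org/dbnary/vartrans> WHERE {?sle vartrans:translatableAs ?tle};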

My understanding is that the inner query is limited to 10000 rows (or is it 10001?) and the outer query has no way of knowing that the results are partial.

My question is: why is there such a limit when I am authenticated and want an exhaustive answer, regardless of the time it will take?

Is this a bug or what?

Gilles,

As an additional note:

  1. It used to work about one or two years ago. I did not have time to investigate further at the time, but the behaviour changed in some previous Virtuoso version.
  2. The server runs the latest Virtuoso GitHub develop version (compiled from GitHub sources last night).
    The SPARQL footer is:
    Virtuoso version 07.20.3236 (5f6bea725) on Linux (x86_64-pc-linux-gnu) Single Server Edition (125 GB total memory, 28 GB memory in use)

Also:
When executed independently, the inner query returns 3159415 rows in 174115 msec.

I was going to ask how many rows the inner SELECT query returns…

If I run the following simplified query, with an inner SELECT and an INSERT of 15K triples, it is not truncated to 10001 against a recent VOS build I have:

SQL> SPARQL INSERT     { GRAPH <http://kaiko.getalp.org/dbnary/vartrans> {?s ?p ?o} } WHERE {     {         SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 15000     } };

Done. -- 3678 msec.
SQL> SPARQL SELECT count(*) FROM <http://kaiko.getalp.org/dbnary/vartrans> WHERE {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

15000

1 Rows. -- 2 msec.
SQL> status();
REPORT
VARCHAR
_______________________________________________________________________________

OpenLink Virtuoso  Server
Version 07.20.3236-pthreads for Linux as of Mar 17 2023 (a4b4dfa07)
Started on: 2023-03-31 12:59 GMT+0

Does that query truncate the triples when run against your database?

Strangely enough, that is indeed the case on my instance:

SQL> SPARQL DROP SILENT GRAPH <http://kaiko.getalp.org/dbnary/vartrans>;

Done. -- 198 msec.
SQL> SPARQL SELECT count(*) FROM <http://kaiko.getalp.org/dbnary/vartrans> WHERE {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

0

1 Rows. -- 20 msec.
SQL> SPARQL INSERT     { GRAPH <http://kaiko.getalp.org/dbnary/vartrans> {?s ?p ?o} } WHERE {     {         SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 15000     } };

Done. -- 117 msec.
SQL> SPARQL SELECT count(*) FROM <http://kaiko.getalp.org/dbnary/vartrans> WHERE {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

10001

1 Rows. -- 25 msec.
SQL>

Regards,

Gilles,

Forgot to give the status:

SQL> status();
REPORT
VARCHAR
_______________________________________________________________________________

OpenLink Virtuoso  Server
Version 07.20.3236-pthreads for Linux as of Mar 30 2023 (5f6bea725)
Started on: 2023-03-31 10:00 GMT+2

Database Status:
  File size 1398800384, 2792192 pages, 1204927 free.
  5450000 buffers, 1224694 used, 1557 dirty 0 wired down, repl age 0 0 w. io 0 w/crsr.
  Disk Usage: 1224739 reads avg 0 msec, 0% r 0% w last  0 s, 78796 writes flush      432.9 MB/s,
    7097 read ahead, batch = 166.  Autocompact 1218 in 1044 out, 14% saved col ac: 20274 in 2% saved.
Gate:  9071 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = /data/virtuoso//db/virtuoso.trx, 67690 bytes
1564503 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 3 connects, max 1 concurrent
RPC: 188104 calls, 1 pending, 1 max until now, 0 queued, 0 burst reads (0%), 0 second 0M large, 1464M max
Checkpoint Remap 21888 pages, 0 mapped back. 3 s atomic time.
    DB master 2792192 total 1204927 free 21888 remap 819 mapped back
   temp  4096 total 4091 free

Lock Status: 0 deadlocks of which 0 2r1w, 4 waits,
   Currently 1 threads running 0 threads waiting 0 threads in vdb.
Pending:

Client 1111:7:-8:  Account: dba, 3241 bytes in, 64224 bytes out, 1 stmts.
PID: 1193109, OS: unix, Application: unknown, IP#: 127.0.0.1
Transaction status: PENDING, 1 threads.
Locks:


Running Statements:
 Time (msec) Text
          94 status()


Hash indexes


37 Rows. -- 95 msec.
SQL>

What is the ResultSetMaxRows setting in the [SPARQL] section of your virtuoso.ini config file?

I am finding that this setting seems to control the number of rows allowed to be inserted: when I set mine to 5000, only 5001 rows/triples are inserted:

SQL> SPARQL INSERT     { GRAPH <http://kaiko.getalp.org/dbnary/vartrans1> {?s ?p ?o} } WHERE {     {         SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 15000     } };

Done. -- 81 msec.
SQL> SPARQL SELECT count(*) FROM <http://kaiko.getalp.org/dbnary/vartrans1> WHERE {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

5001

1 Rows. -- 9 msec.
SQL>

So I suspect yours is set to 10000?

Indeed, here is the [SPARQL] section:

[SPARQL]
;ExternalQuerySource            = 1
;ExternalXsltSource             = 1
;DefaultGraph                     = http://localhost:8890/dataspace
;ImmutableGraphs                = http://localhost:8890/dataspace
ResultSetMaxRows           = 10000
MaxQueryCostEstimationTime = 400        ; in seconds
MaxQueryExecutionTime      = 60 ; in seconds
DefaultQuery               = select distinct ?Concept where {[] a ?Concept} LIMIT 100
DeferInferenceRulesInit    = 0  ; controls inference rules loading
MaxMemInUse                     = 0  ; limits the amount of memory for construct dict (0=unlimited)
;LabelInferenceName             = facets        ; Only needed when using the Faceted Browser
;PingService                    = http://rpc.pingthesemanticweb.com/
ShortenLongURIs            = 1

I thought this setting controlled the maximum row limit on the PUBLIC SPARQL endpoint. It does come into play when I run the inner query on the public endpoint, but not in isql.

Should I understand that it is not the inner SELECT that returns 10000 rows, but the INSERT that is limited?

Anyway, when going through isql as the dba user, shouldn’t this limit be ineffective?

Yes, I didn’t realise ResultSetMaxRows affected SPARQL INSERT queries as well, but in speaking with development, they indicate this has been the case since 2009. An INSERT query performs a CONSTRUCT query under the covers, which is restricted by the ResultSetMaxRows setting; this limits the resources required when inserting large graphs or numbers of triples and prevents the server from possibly running out of resources.
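
If that explanation holds, the truncation should be reproducible with an explicit CONSTRUCT as well; a (hedged) check along the lines of the 15K demo above would be:

SPARQL CONSTRUCT { ?s ?p ?o } WHERE { { SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 15000 } };

With ResultSetMaxRows = 10000, one would expect this to return about 10000 rows rather than 15000.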

You indicated in a previous post:

What Virtuoso version was in use when this worked? As said, this behaviour has been in place since 2009, when the version would have been 5.x.

Sorry, I cannot tell which version it was; I have always used the develop version from GitHub, compiled on Linux. I am almost sure it worked back then, but I may have been fooled by my checks if I did not count the inserted triples (though I’m pretty sure it was more than this limit).

But that is not my main concern. My concern is how I can make it work… (see next answer)

I need to understand what is to be expected and how to overcome such a limit.

My understanding:

From the documentation (6.1.1 Database), I read:

[SPARQL]

The SPARQL section sets parameters and limits for the SPARQL query protocol web service. This section should stay commented out as long as SPARQL is not in use. Section RDF Data Access and Data Management contains a detailed description of this functionality.

Here, I understood that these parameters are supposed to act on the SPARQL ENDPOINT (web service), not on the processing of SPARQL inlined in SQL (which is what I am executing through the isql program). So my expectation was that these limits apply to the server’s public SPARQL endpoint and do not apply to other, authenticated kinds of connections (isql or JDBC/ODBC) where SPARQL is used as a query language and not as a protocol. → Can you confirm that the fact that this limit does affect authenticated connections is indeed an intended feature?

  • ResultSetMaxRows = number. This setting is used to limit the number of rows in the result. The effective limit will be the lowest of this setting, the SPARQL query’s LIMIT clause value (if present), and the SPARQL endpoint request URI &maxrows parameter value (if present).
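
For example, using the numbers from this thread: with ResultSetMaxRows = 10000, a query with LIMIT 15000 is effectively capped at min(10000, 15000) = 10000 rows, while a query with LIMIT 5000 returns at most 5000. The transcripts above, where a LIMIT 15000 insert yields 10001 triples, match this behaviour up to an apparent off-by-one in the insert path.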

So, this means that there is NO WAY to avoid this limit (0 does not mean “no limit” here), and I don’t know how I can be sure the dataset will never exceed whatever value I write here.

→ Is there any way to avoid this limit? (What is the range of this number anyway? Should I use the max int or max long value, or some other “physical” limit of the server?)

BTW, there is a problem in the doc:

MaxMemInUse = 0 . Maximum amount of memory that is allowed for temporary storing parts of a SPARQL query result. Default is zero for no limit. ResultSetMaxRows is maximum length of a SPARQL query result. Default is zero for no limit. These two options may be useful for protecting public web service endpoints against DOS attacks…

Either there is an invalid repetition, or the value 0 is possible for ResultSetMaxRows to indicate no limit? Which is it?

Solving my problem

Scenario 1: I would like to write the original SPARQL query and have it run without worrying about incomplete results due to any limit intended for DoS protection, provided that I am correctly authenticated as the dba (admin) user.

Scenario 2: As a workaround for this specific query, I could set up the ini with NO LIMIT, since I load and index the data offline. I can run the query offline after loading and indexing, then deploy the resulting DB online with a deployment ini that has these limits. → This will be OK, but it won’t let me access the deployed data through isql, JDBC or ODBC without taking care of the limits.
→ To say it another way: if I want unrestricted access to the data, I need to set up one server for the public endpoint and another for the private unrestricted endpoint… correct? (This means doubling the disk use of the service…)

I must say that scenario 1 would be my preference, i.e. publish the data publicly/read-only, AND allow authenticated users/processes to query, update, infer, etc. without any limit.

→ ALSO: what kind of connection would allow me to efficiently insert all these triples after executing the inner SPARQL query: JDBC, Jena, Sesame, ODBC? (I must confess I tried to quickly create a stored procedure for this, but it was difficult, as I could not find a proper reference manual for it…)

If you comment out ResultSetMaxRows and MaxMemInUse, or set them to 0, then there will be no restrictions on the result set size or insert query size, which is perhaps how they were set when you say it worked. But in the case of inserts of large numbers of triples, this will consume significant memory and will not be performant.
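
For the offline profile of your scenario 2, that would look like this in the [SPARQL] section of virtuoso.ini (assuming the instance is not publicly reachable while the limits are lifted):

[SPARQL]
ResultSetMaxRows           = 0  ; 0 = no limit
MaxMemInUse                = 0  ; 0 = unlimited memory for the construct dict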

This is why inserts should be done in batches of about 5K triples, which can be done by looping over the result set with LIMIT and OFFSET until the entire result set has been processed. This can also be done with a stored procedure if you have the expertise; see the sketch below.
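
A minimal sketch of such a Virtuoso/PL procedure, assuming the total row count is known in advance (e.g. from a prior SPARQL COUNT) and using a simplified ?s ?p ?o pattern in place of your real inner query; the procedure name, graph IRI and batch size are illustrative:

CREATE PROCEDURE BATCH_INSERT_SKETCH (IN total_rows INTEGER, IN batch_size INTEGER)
{
  declare off integer;
  off := 0;
  while (off < total_rows)
    {
      -- one batched INSERT; each batch must stay below ResultSetMaxRows,
      -- and the ORDER BY gives a stable paging order across batches
      exec (sprintf (
        'SPARQL INSERT { GRAPH <http://kaiko.getalp.org/dbnary/vartrans> { ?s ?p ?o } }
         WHERE { { SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o LIMIT %d OFFSET %d } }',
        batch_size, off));
      commit work;  -- keep each batch in its own transaction
      off := off + batch_size;
    }
}
;

-- e.g., for the 3159415-row inner query reported above, in 5K batches:
-- BATCH_INSERT_SKETCH (3159415, 5000);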

Note we also have Jena and RDF4J sample programs showing how RDF data can be bulk loaded into Virtuoso in chunks.

We are going to introduce a new INI file parameter to control insert query sizes, so that they are not dependent on ResultSetMaxRows, which is confusing; there is also a need for separate result set and insert/update query size limits.