Virtuoso Horizontal Scaling for ETL

ccolonna · June 28, 2021, 3:34pm

Hi,

I need to perform an ETL procedure on a KG to index data. Process consists of two queries. First query extracts 3mln IDs, X at a time with OFFSET (ideal X would be 4000 as index ideal batch load is 4000 resources at a time). Second query should iterate on those retrieved X instances (concurrently) and launch another query to get all data relative to a single resource. An then load to index.

To explain clearly:

QUERY 1:

SELECT ?resource WHERE {
    ...
} LIMIT 4000 OFFSET 4000

QUERY 2:

SELECT ?attr1 ?attr2 ?attr3 ... WHERE {
   ?resource :has ?attr1 
   ...
}

Reason why i didn’t get directly data with query 1 is that to fetch them I require OPTIONAL and SUBQUERIES and SPARQL endpoint goes in timeout.

Actually virtuoso is too slow to respond to the 4K concurrent queries. Obviously I believe because a faulty approach of mine

I wonder if you can suggest any soultion to optimize this.

I thought to:

horizontal scale virtuoso instances reading data (as dataset is more than 60GB I hope all instances can point to same data as I’m out of resources in testing phase, replicating it wouldn’t be possible) and distribute the concurrent queries to cluster?
Increase number of concurrent request that can be handled by a single agent, are there some virtuoso settings for vertical scaling in this sense ?
UNION some of the 4K query, hoping relaxing virtuoso load i would get a better throughput.

How does this approach sound to you? Can someone kindly point me to some solutions or docs helping in increasing throughput? Actually i would need weeks to index all data.

I’m working with this docker image virtuoso to test and I need to close a working prototype before moving to enterprise virtuoso or cloud.

Thank you very much for your help!

kidehen · June 28, 2021, 3:38pm

Hi @ccolonna,

Is possible to provide a little more information about your Virtuoso Database? For instance, the following:

How many named graphs are in use?
What’s the total triple count
How much memory is available to Virtuoso?

/cc @hwilliams @TallTed

ccolonna · June 28, 2021, 4:07pm

Hi @kidehen , thank you very much for help!
Authoritative source for data is Virtuoso SPARQL Query Editor . Total triple count is 700mln. I’m not aware of named graphs count as this query goes in timeout:

SELECT  DISTINCT ?g 
   WHERE  { GRAPH ?g {?s ?p ?o} } 
ORDER BY  ?g

About RAM the server has 8G .

Italian public administartion fellows are topscorer in handle this . See the paragraph about INPS for a laugh Number of cases worldwide tops 1 million as UK death toll surges by another 569 .

Unfortunately I cannot modify settings on live endpoint as I’m not the administrator.
What i thought was to dump the triples ( 6GB zipped with gz) and create a local optimized environment. My machine has 16GB RAM and fast nvme.

In a local machine i didn’t need to impose per user limitations for example.

kidehen · June 28, 2021, 4:17pm

What happens when you execute that same query using the Conductor Admin UI?

Also, what happens when you try to execute.

SELECT  DISTINCT ?g 
   WHERE  { GRAPH ?g {?s a ?o} } 
ORDER BY  ?g

/cc @hwilliams @ted

ccolonna · June 28, 2021, 4:30pm

I have no access to conductor sorry.

With this one I got response in 20s.
GRAPH total count is 61.

ccolonna · June 29, 2021, 7:22am

Thank you @kidehen, so I better defined my scenario,
I need a setup where virtuoso is configured to serve a single user with maximum number of concurrent connections.
Virtuoso will host possibly a single dataset (700mln triples and 61 named graphs)

Single user would be a machine launching queries to extract and index data from the RDF dataset.

I would test this setup on a 16GB machine with NVME 3500mb/sec in reading.

I would like to set a configuration to maximize the throughput. Extracting process consists of two query:

query1) Extract K resource identifiers (with OFFSET)
query2) Foreach identifier run a query to extract related resource data

actually query1 is able to return 4000 identifiers in 1s,
while concurrents 4000s query2 bottlenecked to 2minute to complete.

I wonder can I somehow increase throughput grouping 4000 concurrent queries with a UNION?

Definitively I thougth to:

tune virtuoso to give its better performance with this scenario ( what configuration params should I modify? )
If possible simplify life to virtuoso reducing 4000K concurrent queries to a lower order of magnitude?

Do you have any ideas?

I really appreciate you for your help!

Greetings

hwilliams · June 29, 2021, 12:05pm

@ccolonna: Are you seeking to setup an instance of the Virtuoso SPARQL Endpoint locally on the indicate 16GB machine with NVME 3500mb/sec ?

If so, see the Virtuoso RDF performance tuning guide for details on how to configure for your environment for best performance.

The MaxClientConnections param of the virtuoso.ini config file should also be increased in both the [Parameters] and [HTTP Server] sections, the former controlling the maximum number of SQL threads Virtuoso is allowed to use, and the latter the number of HTTP threads are allocated for the web server on startup. Using one source you could start by setting both to say 1000.

There are also a number of vectored execution INI file params that can be tuned for improved parallel processing of queries based on the available processors if the defaults do not suffice.

TallTed · June 29, 2021, 3:58pm

@ccolonna — I understand that you’re trying to extract the entire content of an existing endpoint, to load into your own experimental instance(s). I strongly recommend that you contact that endpoint’s administrators, and try to get a dump of the data you want to load – as you’re now stuck working around their endpoint’s settings, and you may be inadvertently conducting a DOS attack with your queries.

Whether or not you continue with your brute-force extraction, in addition to the notes from @kidehen and @hwilliams, please note that whenever you use LIMIT and OFFSET to page through a query’s large result set, you must also include an ORDER BY clause to be sure you get correct results in total. Without that ORDER BY clause, you may find your aggregated result set, constructed from those 4000 row pages, includes duplicate and/or missing rows.

If you do get in contact with the existing instance’s admins, please suggest that they contact us about tuning and otherwise improving their instance’s settings — to benefit all their users, by taking full advantage of the host resources they intend Virtuoso to use. They will also benefit significantly by updating their Virtuoso from the existing version 7.2.2, built 2015-12-10, to a current build of version 7.2.6.1, built 2021-06-22.

TallTed · June 29, 2021, 5:10pm

@ccolonna /cc @kidehen @hwilliams

Also… By increasing the timeout set in the SPARQL query form, you can get a complete list of named graphs in that instance, the number of which is 228 (some of which are Virtuoso system graphs).

ccolonna · June 30, 2021, 7:26am

Thank you @TallTed @hwilliams kidehen for your savvy words.

@TallTed
About ORDER BY, I will certainly follow your advice.
About the dump, I already have it, administrators made it public along with the endpoint.
It is split into 42 [file].nt.gz files.

I tried to load it inside virtuoso with this procedure:

SQL> ld_dir ('/path/to/files', '*.n3.gz', '[mygraph]');
SQL> rdf_loader_run();

Unfortunately, after running an hour and a half, procedure stopped and the already loaded triples gets deleted. 200mln of 700mln triples were created until then.
100GB of free disk space were saturated during this operation.

I wonder if this is the correct procedure or there is a faster and more reliable one to load big RDF dataset.

As I’m in contact with SPARQL endpoint admins, I will certainly suggest to upgrade the outdated build with a newer one, and possibly to contact you savvy guys to make things work properly!

hwilliams · June 30, 2021, 9:54am

@ccolonna:

The Virtuoso RDF Bulk Loader is the recommended method to optimally load large datasets into Virtuoso.

When you say the “procedure stopped”, is the Virtuoso instance still running or has the database stopped?

Did you perform the steps in the Virtuoso RDF performance tuning guide to configure your Virtuoso instance before the load attempt? Normally, when a bulk load stops/hangs, it is an indication of the memory buffers having run out during the load. Thus, can you provide the output of running the status(); command at the point the load stops (before stopping the Virtuoso process) so we can see the state of the database at that point?

Note: Given that you have multiple (42) dataset files, you should also run multiple (number of cores ÷ 2.5) rdf_loader_run(); processes for maximum platform utilisation, hence fastest load time, as detailed in the bulk loader documentation.