Struggling with parallel uploads and queries to Virtuoso

Hi Virtuoso Community!

We’re trying out Virtuoso OS, currently running version 07.20.3236 built from stable/7. Our goal is to use Virtuoso as a triple store, in which edge devices upload their graphs to Virtuoso and another application queries the data through SPARQL SELECTs. Every edge device has its own named graph in the triple store and only interacts with triples in that graph. The edge devices use the RDF4J provider to upload triples in the following way (a consolidated sketch of the whole sequence follows the list):

  1. Begin a transaction with isolation level READ_UNCOMMITTED:
this.getRepositoryConnection().begin(IsolationLevels.READ_UNCOMMITTED);
  2. Clear its named graph:
IRI context = this.getRepositoryConnection().getValueFactory().createIRI(namedGraphURI);
this.getRepositoryConnection().clear(context);
  3. Add the new triples in RDF/XML format to the named graph, in batches with up to 100 calls to add():
Reader graphReader = new StringReader(graph);
this.getRepositoryConnection().add(graphReader, namedGraphURI, RDFFormat.RDFXML, context);
  4. Commit:
this.getRepositoryConnection().commit();
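
Put together, our upload step looks roughly like the following (a simplified, self-contained sketch of the steps above; repository stands in for however we obtain the RDF4J Repository for Virtuoso, and error handling/rollback is left out):

import java.io.Reader;
import java.io.StringReader;

import org.eclipse.rdf4j.common.transaction.IsolationLevels; // org.eclipse.rdf4j.IsolationLevels in older RDF4J versions
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;

public class EdgeDeviceUpload {

    // Clear-and-reload of one device's named graph, all inside a single transaction.
    public static void uploadGraph(Repository repository, String namedGraphURI, String graph) throws Exception {
        try (RepositoryConnection conn = repository.getConnection()) {
            // 1. begin a transaction with isolation level READ_UNCOMMITTED
            conn.begin(IsolationLevels.READ_UNCOMMITTED);

            // 2. clear the device's own named graph
            IRI context = conn.getValueFactory().createIRI(namedGraphURI);
            conn.clear(context);

            // 3. add the new triples (RDF/XML) to the same named graph;
            //    in our real code this is done in batches of up to 100 add() calls
            Reader graphReader = new StringReader(graph);
            conn.add(graphReader, namedGraphURI, RDFFormat.RDFXML, context);

            // 4. commit the clear and the full re-upload in one go
            conn.commit();
        }
    }
}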

In total, about 70 edge devices will upload roughly 28 million triples, spread across an equal number of named graphs (one per device). Each edge device can upload anywhere from a few hundred to 2 million triples, meaning some named graphs contain far more triples than others.

We’re facing some difficulties with this setup in Virtuoso that I was hoping to get some input on. Virtuoso doesn’t seem to be able to handle the transactions described above when multiple edge devices try to connect and upload in parallel. It seems like only a few edge devices are allowed to begin a transaction, and the others simply wait, seemingly indefinitely, or at least for hours. The log states “Many lock waits” and “Locks are held for a long time”.

So the question is: why do they have to wait? Since each device only accesses its own named graph, they shouldn’t interfere with each other, should they? Have we misconfigured something? Is there any way for us to optimize or solve this in Virtuoso?

Any help/feedback/tips is appreciated!

// Joel

Have you performance-tuned your Virtuoso instance to make optimal use of the machine resources?

We have a number of RDF4J sample programs available, showing best practices for how transactional and bulk-load applications can be written in RDF4J targeting Virtuoso.

We have tuned the config for performance, at least to my knowledge. The server runs on an instance with 32 GB of memory, so the memory options in virtuoso.ini are set to:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
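
For reference, those values live in the [Parameters] section of our virtuoso.ini and follow the sizing comments in the shipped configuration template for a 32 GB machine:

[Parameters]
;; buffer sizing for a machine with 32 GB RAM
;; (each buffer caches one 8 KB database page)
NumberOfBuffers  = 2720000
MaxDirtyBuffers  = 2000000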

I don’t really understand whether the RDF Index Scheme is something we need to change.

Performance or not, the real question as we see it is why only one connection can add data at a time. Why are all the other connections waiting for one to finish uploading? Can we do something about these waits/locks?

What does the output of the “status();” command, run from the Virtuoso “isql” command-line tool, report?

You should never have to change the RDF Index Scheme; doing so is not recommended, as it could result in unexpected behaviour.

As said previously, you should review the available RDF4J sample programs, which show best practices for how transactional and bulk-load applications with multiple concurrent threads/connections can be written in RDF4J targeting Virtuoso.

Even though you may be writing to separate graphs, in Virtuoso all RDF data is written to the RDF_QUAD table and a few other related relational tables, so waits and locks are an inevitable consequence of highly concurrent access to the database.

I will study the RDF4J samples and see if there is something we can improve.

status(); gives the following output:

OpenLink Virtuoso Server
 Version 07.20.3236-pthreads for Linux as of Mar 10 2023 (795af34a7)
 Started on: 2023-03-13 15:21 GMT+0
 
 Database Status:
  File size 641728512, 78336 pages, 44028 free.
  2720000 buffers, 21615 used, 5128 dirty 0 wired down, repl age 0 0 w. io 1 w/crsr.
  Disk Usage: 21088 reads avg 0 msec, 0% r 0% w last 82 s, 2189 writes flush 0 MB/s,
  172 read ahead, batch = 102. Autocompact 6 in 4 out, 28% saved.
 Gate: 352 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
 Log = /opt/virtuoso/var/lib/virtuoso/db/virtuoso.trx, 264657275 bytes
 29141 pages have been changed since last backup (in checkpoint state)
 Current backup timestamp: 0x0000-0x00-0x00
 Last backup date: unknown
 Clients: 21 connects, max 8 concurrent
 RPC: 15200 calls, 7 pending, 8 max until now, 0 queued, 0 burst reads (0%), 0 second 19M large, 22M max
 Checkpoint Remap 0 pages, 0 mapped back. 0 s atomic time.
  DB master 78336 total 44028 free 0 remap 0 mapped back
  temp 768 total 763 free
 
 Lock Status: 0 deadlocks of which 0 2r1w, 25 waits,
  Currently 8 threads running 7 threads waiting 0 threads in vdb.
 Pending:
  128: IER 14
  96: IER 14
  92: IER 14
  88: IER 14
  84: IER 14
  80: IER 14
  76: IER 14
  72: IER 14
  68: IER 14
  64: IER 14
  60: IER 14
  56: IER 14
  52: IER 14
  48: IER 14
  44: IER 14
  40: IER 14
  36: IER 14
  32: IER 14
  28: IER 14
  24: IER 14
  20: IER 14
  16: IER 14
  12: IER 14
  8: IER 14
  4: IER 14
  97: IER 14
  93: IER 14
  89: IER 14
  85: IER 14
  81: IER 14
  77: IER 14
  73: IER 14
  69: IER 14
  65: IER 14
  61: IER 14
  57: IER 14
  53: IER 14
  49: IER 14
  45: IER 14
  41: IER 14
  37: IER 14
  33: IER 14
  29: IER 14
  25: IER 14
  21: IER 14
  17: IER 14
  13: IER 14
  9: IER 14
  5: IER 14
  1: IER 14
  94: IER 14
  90: IER 14
  86: IER 14
  82: IER 14
  78: IER 14
  74: IER 14
  70: IER 14
  66: IER 14
  62: IER 14
  58: IER 14
  54: IER 14
  50: IER 14
  46: IER 14
  42: IER 14
  38: IER 14
  34: IER 14
  30: IER 14
  26: IER 14
  22: IER 14
  18: IER 14
  14: IER 14
  10: IER 14
  6: IER 14
  2: IER 14
  95: IER 14
  91: IER 14
  87: IER 14
No. of rows in result: 100

Does that mean Virtuoso can’t be used in a setup that requires high concurrency?

Of course it can; that is one of Virtuoso’s main strengths! My point is that waits and locks are an inevitable consequence of highly concurrent, transactional access to ANY database management system, whether you are aware of it or not, which is why relational databases have transaction isolation levels, concurrency modes, etc. for different application use cases, to minimise locks and optimise performance.

Alright, that makes sense. So if high concurrency is one of the selling points of Virtuoso, we must be doing something wrong. Do you know how we can optimize Virtuoso to handle these kinds of parallel writes better?

I am not an RDF4J programmer; the RDF4J sample programs were created to provide guidance on how best to write applications using the Virtuoso RDF4J provider.

I understand, but this is not a question about how to use RDF4J. The question is why Virtuoso can’t handle multiple inserts, deletes and updates in parallel. I understand that waits and locks are inevitable to some extent, but in Virtuoso the whole table seems to lock when only certain rows are accessed.

Below is a new result from status();. From what I understand (which is far from everything, hence why I’m posting here to get some help), we have clients connected to Virtuoso that are trying to upload their data. The 5 threads should be 1 server thread plus one thread for each client? What are the clients waiting for? They use different named graphs, so their data should not interfere. Why do they have to wait? Is Virtuoso simply locking the entire table each time data is inserted or removed?

You previously stated that Virtuoso can handle concurrency. Why can’t these clients access Virtuoso concurrently?

 OpenLink Virtuoso Server
 Version 07.20.3236-pthreads for Linux as of Mar 10 2023 (795af34a7)
 Started on: 2023-03-13 15:21 GMT+0
 
 Database Status:
  File size 645922816, 78848 pages, 42509 free.
  2720000 buffers, 25998 used, 4336 dirty 0 wired down, repl age 0 0 w. io 2 w/crsr.
  Disk Usage: 24430 reads avg 0 msec, 0% r 0% w last 11807 s, 198497 writes flush 211 MB/s,
  230 read ahead, batch = 89. Autocompact 8642 in 7355 out, 14% saved col ac: 45800 in 7% saved.
 Gate: 397 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
 Log = /opt/virtuoso/var/lib/virtuoso/db/virtuoso.trx, 198749350 bytes
 30713 pages have been changed since last backup (in checkpoint state)
 Current backup timestamp: 0x0000-0x00-0x00
 Last backup date: unknown
 Clients: 284 connects, max 10 concurrent
 RPC: 243244 calls, 3 pending, 10 max until now, 0 queued, 0 burst reads (0%), 39 second 16M large, 43M max
 Checkpoint Remap 2000 pages, 0 mapped back. 7 s atomic time.
  DB master 78848 total 42509 free 2000 remap 748 mapped back
  temp 1536 total 1530 free
 
 Lock Status: 0 deadlocks of which 0 2r1w, 343 waits,
  Currently 5 threads running 4 threads waiting 0 threads in vdb.
 Pending:
  128: IER INTERNAL 272
  52: IER 272
  48: IER 272
  44: IER 272
  40: IER 272
  36: IER 272
  32: IER 272
  28: IER 272
  24: IER 272
  20: IER 272
  16: IER 272
  12: IER 272
  8: IER 272
  4: IER 272
  53: IER 272
  49: IER 272
  45: IER 272
  41: IER 272
  37: IER 272
  33: IER 272
  29: IER 272
  25: IER 272
  21: IER 272
  17: IER 272
  13: IER 272
  9: IER 272
  5: IER 272
  1: IER 272 Waiting: INTERNAL
  54: IER 272
  50: IER 272
  46: IER 272
  42: IER 272
  38: IER 272
  34: IER 272
  30: IER 272
  26: IER 272
  22: IER 272
  18: IER 272
  14: IER 272
  10: IER 272
  6: IER 272
  2: IER 272
  55: IER 272
  51: IER 272
  47: IER 272
  43: IER 272
  39: IER 272
  35: IER 272
  31: IER 272
  27: IER 272
  23: IER 272
  19: IER 272
  15: IER 272
  11: IER 272
  7: IER 272
  3: IER 272
  34304: IER 272
  174: IER 272
  23590: IER 272
  249: IER 272
  34343: IER 272
  221: IER 272
  38473: IER 272
  251: IER 272
  48749: IER 272
  213: IER 272
  34433: IER 272
  107: IER 272
  38565: IER 272
  72: IER 272
  68: IER 272
  64: IER 272
  60: IER 272
  56: IER 272
  52: IER 272
  48: IER 272
  44: IER 272
No. of rows in result: 100

I’ve looked into the RDF4J samples, and the only main difference I can see between the samples that use transactions and how we do it is that in the samples, connection.clear(context) is done before a transaction is initiated, while we do it at the beginning of a transaction. Maybe a colleague of yours can answer whether that has a big impact on concurrency?

The RDF4J sample applications we published are all about “best practices” for bulk-loading data into Virtuoso using RDF4J.

Your sequence appears to be:

  1. start transaction
  2. delete graph
  3. insert new data to graph
  4. commit transaction

Our sequence is as follows, which is a fundamental best practice for these kinds of operations:

  1. start transaction
  2. delete graph
  3. commit
  4. insert new triples in chunks (e.g., 5000 triples)
  5. commit transaction after each bulk insert chunk

This is so as not to overload the transaction list in the database by trying to insert up to 2 million triples in a single transaction, as your indicated graph sizes imply could happen.
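
In RDF4J terms, that pattern looks roughly like the following (an illustrative sketch only, not code from the samples; it assumes the new data has already been parsed into an RDF4J Model, and the 5000-statement chunk size is just the example figure from above):

import java.util.ArrayList;
import java.util.List;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;

public class ChunkedGraphReload {

    private static final int CHUNK_SIZE = 5000; // example figure; tune to taste

    public static void reloadGraph(Repository repository, String namedGraphURI, Model newData) {
        try (RepositoryConnection conn = repository.getConnection()) {
            IRI context = conn.getValueFactory().createIRI(namedGraphURI);

            // 1-3. delete the graph in its own short transaction and commit
            conn.begin();
            conn.clear(context);
            conn.commit();

            // 4-5. insert the new triples in chunks, committing after each chunk,
            //      so no single transaction holds locks for the whole upload
            List<Statement> chunk = new ArrayList<>(CHUNK_SIZE);
            for (Statement st : newData) {
                chunk.add(st);
                if (chunk.size() == CHUNK_SIZE) {
                    conn.begin();
                    conn.add(chunk, context);
                    conn.commit();
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                conn.begin();
                conn.add(chunk, context);
                conn.commit();
            }
        }
    }
}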

That makes sense; larger transactions probably take more time and require more locks. But if we move the graph delete before the transaction that uploads the triples, won’t that leave a window where the data doesn’t exist in the database and can’t be queried with a SELECT?

That assumes you are performing a bulk load operation against Virtuoso via RDF4J.
Are you trying to bulk load data or simply have the database online while performing CRUD operations?

We’re not trying to bulk upload. Just as you say, we want to perform CRUD operations and have as close to 100% uptime as possible. The data will be updated multiple times a day, either through the larger transactions as described in this post, or by small update queries to change values on a few triples. The data should also be accessible at any time.

Okay, so under what condition do you need to insert a million or so triples if it isn’t a bulk operation?

I’m not sure how to answer that question, so I’ll describe what we’re trying to achieve a bit more in detail and hope that answers your question.

The only condition is that the data has to be accessible at all times, hence the transaction. As described in the first post, each edge device uploads a description of itself in RDF/XML format. The upload occurs on a fixed interval, currently every 3 hours. The specific time the scheduled upload occurs varies from edge device to edge device in order to ease the load on Virtuoso. Still, it is inevitable that a few of the edge devices will want to perform this upload at the same time. If we can improve performance, we would like to set the upload interval to be as short as possible, whatever that turns out to be, meaning more edge devices would access Virtuoso concurrently.

The RDF/XML data the edge devices upload can change from time to time, and we currently have very limited ways of tracking what changes have occurred.

If any major changes have been detected, the edge device uploads its description to Virtuoso. Since we can’t keep track of exactly what has changed, each upload begins with a clear() to remove all of the device’s previous data, followed by an upload of its new description.

If no major changes have been detected, the edge device skips the upload described above and just performs SPARQL update queries to update some values we know always change.
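
For those small updates we do something along these lines (a simplified, hypothetical example; the ex: vocabulary and the ex:lastSeen property are made up for illustration, not our real data):

import org.eclipse.rdf4j.query.Update;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;

public class SmallValueUpdate {

    // Replace a single known-to-change value inside the device's own named graph.
    public static void updateLastSeen(Repository repository, String namedGraphURI,
                                      String deviceURI, String timestamp) {
        String sparql =
            "PREFIX ex: <http://example.org/vocab#>\n" +                        // hypothetical vocabulary
            "WITH <" + namedGraphURI + ">\n" +                                  // the device's named graph
            "DELETE { <" + deviceURI + "> ex:lastSeen ?old }\n" +
            "INSERT { <" + deviceURI + "> ex:lastSeen \"" + timestamp + "\" }\n" +
            "WHERE  { OPTIONAL { <" + deviceURI + "> ex:lastSeen ?old } }";

        try (RepositoryConnection conn = repository.getConnection()) {
            Update update = conn.prepareUpdate(sparql);
            update.execute();
        }
    }
}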

Hope that answers your question. If not, let me know and maybe I can answer it in a better way.

I’ve done some more testing and I’m fairly sure Virtuoso only lets one thread, meaning one connection, access the table at a time. So with this setup we have no concurrency at all.

Any suggestions to how we can improve this would be very appreciated!

That’s never been the case in the history of the product.

I’ll see if we can make time to provide you with an example that demonstrably refutes that assertion.