We’re trying out Virtuoso OS, currently running version 07.20.3236 built from stable/7. Our goal is to use Virtuoso as a triple store: edge devices upload their graphs to Virtuoso, and another application queries the data through SPARQL SELECTs. Every edge device has its own named graph in the triple store and interacts only with triples in its own named graph. The edge devices have implemented the RDF4J provider to upload triples in the following way:
begin a transaction with isolation level READ_UNCOMMITTED,
IRI context = this.getRepositoryConnection().getValueFactory().createIRI(namedGraphURI);
this.getRepositoryConnection().clear(context);
add new triples in RDF/XML format to the named graph, in batches with up to 100 calls to add(),
Reader graphReader = new StringReader(graph);
this.getRepositoryConnection().add(graphReader, namedGraphURI, RDFFormat.RDFXML, context);
and commit:
this.getRepositoryConnection().commit();
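Put together, one upload cycle looks roughly like this (a minimal sketch of the steps above; getRepositoryConnection(), namedGraphURI and graph come from our surrounding application code, and note that the IsolationLevels import path differs between RDF4J 3.x, where it is org.eclipse.rdf4j.IsolationLevels, and 4.x):
import java.io.Reader;
import java.io.StringReader;
import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.rio.RDFFormat;

// One upload cycle: clear the device's named graph and reload it,
// all inside a single READ_UNCOMMITTED transaction.
RepositoryConnection conn = this.getRepositoryConnection();
conn.begin(IsolationLevels.READ_UNCOMMITTED);
IRI context = conn.getValueFactory().createIRI(namedGraphURI);
conn.clear(context);                               // drop the old graph contents
Reader graphReader = new StringReader(graph);      // graph: the RDF/XML payload
conn.add(graphReader, namedGraphURI, RDFFormat.RDFXML, context);
conn.commit();                                     // single large commit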
In total, the edge devices will upload about 28 million triples from roughly 70 edge devices, spread across an equal number of named graphs (70). Each edge device can upload anywhere from a few hundred to 2 million triples, so some named graphs contain far more triples than others.
We’re facing some difficulties with this setup in Virtuoso that I was hoping to get some input on. Virtuoso doesn’t seem to be able to handle the transactions described above when multiple edge devices try to connect to Virtuoso and upload in parallel. It seems like only a few edge devices are allowed to begin a transaction, and the others simply wait, seemingly indefinitely, or at least for hours. The log states “Many lock waits” and “Locks are held for a long time”.
So the question is: why do they have to wait? Since they all access only their own named graphs, they shouldn’t interfere with each other, should they? Have we misconfigured something? Is there any way for us to optimize/solve this in Virtuoso?
Have you Performance Tuned your Virtuoso instance to make optimal use of the machine resources?
We have a number of RDF4J Sample programs available, showing best practices on how transactional and bulk-load applications can be written in RDF4J targeting Virtuoso.
We have tuned the config for performance, at least to my knowledge. The server runs on an instance with 32 GB memory, so the memory options are set to:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
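(These values live in the [Parameters] section of our virtuoso.ini; as far as I can tell they match the guidance in the shipped configuration file for a machine with 32 GB of RAM, each buffer being an 8 KB database page:)
[Parameters]
NumberOfBuffers  = 2720000   ; roughly 21 GB of page buffers
MaxDirtyBuffers  = 2000000   ; flush threshold, about 3/4 of NumberOfBuffers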
I don’t really understand if the RDF Index Scheme is something we need to change.
Performance aside, the real question as we see it is why only one connection can add data at a time. Why are all the other connections waiting for one to finish uploading? Can we do something about these waits/locks?
What does the output of the “status();” command, run from the Virtuoso “isql” command line tool, report?
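For example, connecting as the dba user (assuming the default listener port 1111; <password> stands in for the actual password):
$ isql 1111 dba <password>
SQL> status();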
You should never have to change the RDF Index Scheme, and doing so is not recommended, as it could result in unexpected behaviour.
As said previously, you should review the RDF4J Sample programs available, which show best practices on how transactional and bulk-load applications with multiple concurrent threads/connections can be written in RDF4J targeting Virtuoso.
Even though you may be writing to separate graphs, in Virtuoso all RDF data is written to the RDF_QUAD table and a few other related relational tables, thus waits and locks are an inevitable consequence of highly concurrent access to the database.
Of course it can, which is one of Virtuoso’s main strengths! My point is that waits and locks are an inevitable consequence of highly concurrent, transactional access to ANY database management system, whether you are aware of it or not, and this is why relational databases have transaction isolation levels, concurrency modes, etc. for different application use cases, to minimise locks and optimise performance.
Alright, that makes sense. So if high concurrency is one of the selling points of Virtuoso, we must be doing something wrong. Do you know how we can optimize Virtuoso to handle these kinds of parallel writes better?
I am not an RDF4J programmer; the RDF4J Sample programs were created to provide guidance on how best to write applications using the Virtuoso RDF4J provider.
I understand, but this is not a question about how to use RDF4J. The question is why Virtuoso can’t handle multiple inserts, deletes, and updates in parallel. I understand that waits and locks are inevitable to some extent, but in Virtuoso, the whole table seems to lock when only certain rows are accessed.
Below is a new result from status();. From what I understand (which is far from everything, hence why I’m posting here to get some help), we have clients connected to Virtuoso that are trying to upload their data. Are the 5 threads 1 server thread plus one thread per client? What are the clients waiting for? They use different named graphs, so their data should not interfere. Why do they have to wait? Is Virtuoso simply locking the entire table each time data is inserted/removed?
You previously stated that Virtuoso can handle concurrency. Why can’t these clients access Virtuoso concurrently?
OpenLink Virtuoso Server
Version 07.20.3236-pthreads for Linux as of Mar 10 2023 (795af34a7)
Started on: 2023-03-13 15:21 GMT+0
Database Status:
File size 645922816, 78848 pages, 42509 free.
2720000 buffers, 25998 used, 4336 dirty 0 wired down, repl age 0 0 w. io 2 w/crsr.
Disk Usage: 24430 reads avg 0 msec, 0% r 0% w last 11807 s, 198497 writes flush 211 MB/s,
230 read ahead, batch = 89. Autocompact 8642 in 7355 out, 14% saved col ac: 45800 in 7% saved.
Gate: 397 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = /opt/virtuoso/var/lib/virtuoso/db/virtuoso.trx, 198749350 bytes
30713 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 284 connects, max 10 concurrent
RPC: 243244 calls, 3 pending, 10 max until now, 0 queued, 0 burst reads (0%), 39 second 16M large, 43M max
Checkpoint Remap 2000 pages, 0 mapped back. 7 s atomic time.
DB master 78848 total 42509 free 2000 remap 748 mapped back
temp 1536 total 1530 free
Lock Status: 0 deadlocks of which 0 2r1w, 343 waits,
Currently 5 threads running 4 threads waiting 0 threads in vdb.
Pending:
128: IER INTERNAL 272
52: IER 272
48: IER 272
44: IER 272
40: IER 272
36: IER 272
32: IER 272
28: IER 272
24: IER 272
20: IER 272
16: IER 272
12: IER 272
8: IER 272
4: IER 272
53: IER 272
49: IER 272
45: IER 272
41: IER 272
37: IER 272
33: IER 272
29: IER 272
25: IER 272
21: IER 272
17: IER 272
13: IER 272
9: IER 272
5: IER 272
1: IER 272 Waiting: INTERNAL
54: IER 272
50: IER 272
46: IER 272
42: IER 272
38: IER 272
34: IER 272
30: IER 272
26: IER 272
22: IER 272
18: IER 272
14: IER 272
10: IER 272
6: IER 272
2: IER 272
55: IER 272
51: IER 272
47: IER 272
43: IER 272
39: IER 272
35: IER 272
31: IER 272
27: IER 272
23: IER 272
19: IER 272
15: IER 272
11: IER 272
7: IER 272
3: IER 272
34304: IER 272
174: IER 272
23590: IER 272
249: IER 272
34343: IER 272
221: IER 272
38473: IER 272
251: IER 272
48749: IER 272
213: IER 272
34433: IER 272
107: IER 272
38565: IER 272
72: IER 272
68: IER 272
64: IER 272
60: IER 272
56: IER 272
52: IER 272
48: IER 272
44: IER 272
No. of rows in result: 100
I’ve looked into the RDF4J samples, and the only main difference I can see between the samples that use transactions and how we do it is that in the samples, the connection.clear(context) is done before a transaction is initiated, while we do it at the beginning of a transaction. Maybe a colleague of yours can answer whether that has a big significance for concurrency?
The RDF4J Sample applications we published are all about “best practices” for bulk-loading data into Virtuoso using RDF4J.
Your sequence appears to be:
start transaction
delete graph
insert new data to graph
commit transaction
Our sequence is as follows, and is a fundamental best practice for these kinds of operations:
start transaction
delete graph
commit
insert new triples in chunks (e.g., 5000 triples)
commit transaction after each bulk insert chunk
This is so as not to overload the transaction list in the database by trying to insert up to 2 million triples in one transaction, as is implied by your indicated possible graph sizes.
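In RDF4J terms, a minimal sketch of that pattern might look like the following (the 5,000-triple chunk size is the example figure above; the method and variable names are illustrative, not taken from the published samples):
import java.util.List;
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.repository.RepositoryConnection;

// Replace a graph using several short transactions instead of one large one.
// conn is an open connection to Virtuoso; statements holds the parsed triples.
void replaceGraphInChunks(RepositoryConnection conn, IRI context,
                          List<Statement> statements) {
    final int CHUNK = 5000;          // bulk insert chunk size

    conn.begin();
    conn.clear(context);             // delete graph
    conn.commit();                   // commit the delete on its own

    for (int i = 0; i < statements.size(); i += CHUNK) {
        int end = Math.min(i + CHUNK, statements.size());
        conn.begin();
        conn.add(statements.subList(i, end), context);  // insert one chunk
        conn.commit();               // commit after each chunk
    }
}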
That makes sense; larger transactions probably take more time and require more locks. But if we move the graph delete before the transaction that uploads triples, won’t that leave a window where the data doesn’t exist in the database and can’t be queried with a SELECT?
That assumes you are performing a bulk load operation against Virtuoso via RDF4J.
Are you trying to bulk load data or simply have the database online while performing CRUD operations?
We’re not trying to bulk upload. Just as you say, we want to perform CRUD operations and have as close to 100% uptime as possible. The data will be updated multiple times a day, either through the larger transactions as described in this post, or by small update queries to change values on a few triples. The data should also be accessible at any time.
I’m not sure how to answer that question, so I’ll describe what we’re trying to achieve a bit more in detail and hope that answers your question.
The only condition is that the data has to be accessible at all times, hence the transaction. As described in the first post, each edge device uploads a description of itself in RDF/XML format. The upload occurs at a set interval, currently every 3 hours. The specific time the scheduled upload occurs varies from edge device to edge device in order to ease the load on Virtuoso. Still, it is inevitable that a few of the edge devices will want to perform this upload at the same time. If we can improve performance, we would like to set the upload interval to as often as possible, whatever that could be, meaning more edge devices would access Virtuoso concurrently.
This RDF/XML data the edge devices upload can change from time to time, and we currently have very limited ways of tracking what changes have occurred.
If any major changes have been detected, the edge device will upload its description to Virtuoso. Since we can’t keep track of what has changed, each upload begins with a clear() to remove all of the device’s previous data, and then uploads a fresh description of itself.
If no major changes have been detected, the edge device will skip the upload described above and just perform SPARQL update queries to update some values we know always change.
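For illustration, such a small update might look like the sketch below (the graph IRI and the ex:lastSeen predicate are made-up placeholders, not our actual vocabulary):
import org.eclipse.rdf4j.repository.RepositoryConnection;

// Hypothetical example: rewrite one frequently-changing value inside the
// device's own named graph, without touching the rest of the graph.
String update =
    "PREFIX ex: <http://example.org/> " +
    "WITH <http://example.org/graphs/device-42> " +
    "DELETE { ?d ex:lastSeen ?old } " +
    "INSERT { ?d ex:lastSeen \"2023-03-13T15:21:00Z\" } " +
    "WHERE  { ?d a ex:EdgeDevice . OPTIONAL { ?d ex:lastSeen ?old } }";
conn.prepareUpdate(update).execute();   // conn: an open RepositoryConnection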
Hope that answers your question. If not, let me know and maybe I can answer it in a better way.
I’ve done some more testing and I’m fairly sure Virtuoso only lets one thread, meaning one connection, access the table at a time. So with this setup we have no concurrency at all.
Any suggestions on how we can improve this would be much appreciated!