Hi,
I have some questions about the steadily increasing memory usage of the Virtuoso Open Source database. We are currently running a Virtuoso instance (version 7.2.13) via Docker on a machine with 16GB of memory and 1TB of disk space.
We are trying to load a dataset containing approximately 2.5 billion triples. Based on guidance from this forum, we split the dataset into multiple smaller files (each containing 500,000 triples) and use the RDF Bulk Loader to import the data. Specifically, we import five files at a time, load them, and then perform a checkpoint.
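For clarity, the batching described above can be sketched roughly as follows. This is a minimal illustration rather than our exact script: the directory, file names, and graph IRI are hypothetical placeholders, and the generated statements (`ld_dir`, `rdf_loader_run`, `checkpoint`) are meant to be run through isql.

```python
from typing import Iterator, List


def batches(files: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` file names."""
    for start in range(0, len(files), size):
        yield files[start:start + size]


def loader_script(directory: str, files: List[str], graph: str, size: int = 5) -> str:
    """Build the SQL statements for a batched bulk load.

    For each batch: register the files with ld_dir, run the loader,
    then issue a checkpoint. Nothing is executed here; the function
    only returns text to feed to isql.
    """
    lines = []
    for batch in batches(files, size):
        for name in batch:
            # The directory must be listed in DirsAllowed in virtuoso.ini.
            lines.append(f"ld_dir('{directory}', '{name}', '{graph}');")
        lines.append("rdf_loader_run();")
        lines.append("checkpoint;")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file names and graph IRI, for illustration only.
    example_files = [f"part_{i:05d}.nt" for i in range(7)]
    print(loader_script("/data/import", example_files, "http://example.org/graph"))
```

With seven files and a batch size of five, this produces two load-and-checkpoint rounds, the second covering the remaining two files.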
However, during the loading process, we observe that memory usage increases steadily until Virtuoso crashes. I’ve read that for every billion triples, 10GB of memory is needed, so we plan to use a larger machine for the full dataset.
Currently, after loading a smaller part of the dataset, around 10GB of memory is allocated. After a restart, the memory allocation drops back to 5-6GB. This leads me to wonder about the following:
- Is this memory allocation necessary because Virtuoso loads the data into memory and updates its indices?
- Why is memory allocation lower once you restart the database?
- Will the memory allocation decrease once this process is completed? My understanding was that performing a checkpoint would write the data to disk and reduce memory usage.
- Should we expect memory usage to grow linearly with the size of the dataset, requiring significantly more memory as we add data?
My main concern is that as we add more data, the memory requirements might increase indefinitely, necessitating machines with much more memory. Any insights into Virtuoso’s memory behavior during bulk loading and indexing would be greatly appreciated.
To give you some more information on our setup, this is the output of the status() function:
Database Status:
File size 1910505472, 757504 pages, 185883 free.
1360000 buffers, 490379 used, 29402 dirty 0 wired down, repl age 0 0 w. io 0 w/crsr.
Disk Usage: 321544 reads avg 0 msec, 0% r 0% w last 1402 s, 3690998 writes flush 333.8 MB/s,
794 read ahead, batch = 366. Autocompact 306977 in 221205 out, 27% saved col ac: 2265643 in 8% saved.
Gate: 7176 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = ../database/virtuoso.trx, 1120 bytes
486545 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 93 connects, max 60 concurrent
RPC: 7389 calls, 3 pending, 5 max until now, 0 queued, 0 burst reads (0%), 0 second 0M large, 601M max
Checkpoint Remap 66443 pages, 0 mapped back. 18 s atomic time.
DB master 757504 total 185883 free 66443 remap 10882 mapped back
temp 768 total 763 free
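To put the buffer counts above into bytes, here is a rough conversion. It assumes each buffer holds one 8 KiB database page; that page size is an assumption on my part and not something the status() output states directly.

```python
# Assumption: one buffer corresponds to one 8 KiB database page.
PAGE_BYTES = 8 * 1024


def pages_to_gib(pages: int) -> float:
    """Convert a page or buffer count to GiB under the 8 KiB assumption."""
    return pages * PAGE_BYTES / 2**30


# Counts copied from the status() output above.
counts = {
    "configured buffers": 1_360_000,
    "used buffers": 490_379,
    "dirty buffers": 29_402,
}

if __name__ == "__main__":
    for label, pages in counts.items():
        print(f"{label}: {pages_to_gib(pages):.2f} GiB")
```

If the assumption holds, the configured buffer pool alone corresponds to roughly 10.4 GiB once fully populated, which is in the same range as the ~10GB we see allocated, while the buffers in use at the time of this snapshot come to about 3.7 GiB.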
Thank you in advance for your help!