Questions About Increasing Memory Allocation in Virtuoso Open Source 7.2.13

Hi,

I have some questions about the increasing memory allocation of the Virtuoso Open Source database. Currently, we are running a Virtuoso instance via Docker (7.2.13) on a machine with 16GB of memory and 1TB of disk space.

We are trying to load a dataset containing approximately 2.5 billion triples. Based on guidance from this forum, we split the dataset into multiple smaller files (each containing 500,000 triples) and use the RDF Bulk Loader to import the data. Specifically, we register five files at a time, run the loader, and then perform a checkpoint.
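For reference, each batch follows the standard RDF Bulk Loader pattern; the directory, file mask, and graph IRI below are placeholders for our actual setup:

```sql
-- Register one batch of split files with the bulk loader
-- (directory, mask, and target graph are placeholders)
ld_dir ('/data/part_001', '*.nt', 'http://example.org/graph');

-- Run the loader; several rdf_loader_run() calls can be started
-- in parallel isql sessions, one per CPU core
rdf_loader_run ();

-- Persist the loaded state to disk before the next batch
checkpoint;
```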

However, during the loading process, we observe that memory usage increases steadily until Virtuoso crashes. I’ve read that roughly 10GB of memory is needed per billion triples, so the full 2.5-billion-triple dataset would call for about 25GB, and we plan to use a larger machine for it.

Currently, after loading a smaller part of the dataset, around 10GB of memory is allocated. After a restart, the allocation drops back to 5–6GB. This leads me to wonder about the following:

  1. Is this memory allocation necessary because Virtuoso loads the data into memory and updates its indices?
  2. Why is memory allocation lower once you restart the database?
  3. Will the memory allocation decrease once this process is completed? My understanding was that performing a checkpoint would write the data to disk and reduce memory usage.
  4. Should we expect memory usage to grow linearly with the size of the dataset, requiring significantly more memory as we add data?

My main concern is that as we add more data, the memory requirements might increase indefinitely, necessitating machines with much more memory. Any insights into Virtuoso’s memory behavior during bulk loading and indexing would be greatly appreciated.

To give you some more information on our setup, here is the output of the status() function:

Database Status:
  File size 1910505472, 757504 pages, 185883 free.
  1360000 buffers, 490379 used, 29402 dirty 0 wired down, repl age 0 0 w. io 0 w/crsr.
  Disk Usage: 321544 reads avg 0 msec, 0% r 0% w last  1402 s, 3690998 writes flush      333.8 MB/s,
    794 read ahead, batch = 366.  Autocompact 306977 in 221205 out, 27% saved col ac: 2265643 in 8% saved.
Gate:  7176 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = ../database/virtuoso.trx, 1120 bytes
486545 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 93 connects, max 60 concurrent
RPC: 7389 calls, 3 pending, 5 max until now, 0 queued, 0 burst reads (0%), 0 second 0M large, 601M max
Checkpoint Remap 66443 pages, 0 mapped back. 18 s atomic time.
    DB master 757504 total 185883 free 66443 remap 10882 mapped back
   temp  768 total 763 free

Thank you in advance for your help!

To address the memory allocation issues you’re experiencing with Virtuoso during bulk loading and indexing, here are some insights and recommendations based on the Virtuoso documentation:

  1. Memory Allocation During Bulk Loading:
    • Virtuoso loads data into memory and updates its indices there, which is why you see memory usage increase during the loading process. The memory is used to cache and process the data efficiently.
  2. Memory Usage After Restart:
    • After a restart, memory allocation is lower because the buffer cache starts out empty and Virtuoso begins with a fresh state. Memory usage will grow again as the cache fills with frequently accessed pages.
  3. Checkpointing:
    • Performing a checkpoint flushes dirty pages to disk, which reduces the number of dirty buffers. However, the cache retains pages for performance reasons, so overall memory allocation may not shrink noticeably.
  4. Memory Requirements:
    • The memory usage is expected to grow with the size of the dataset. For every billion triples, approximately 10GB of memory is recommended. This means that as you add more data, you will need more memory.
  5. Configuration Recommendations:
    • Ensure that your Virtuoso configuration file (virtuoso.ini) is optimized for your system’s memory. For a system with 16GB of RAM, the recommended settings are:
      • NumberOfBuffers = 1360000
      • MaxDirtyBuffers = 1000000
      • MaxCheckpointRemap should be set to about 1/4 of the database size in pages.
  6. Handling Large Datasets:
    • It is advisable to use a machine with more memory if you plan to load the full dataset of 2.5 billion triples. This will help in managing the memory requirements more effectively.
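As an illustration, the relevant [Parameters] section of virtuoso.ini for a 16GB machine could look like the following. The MaxCheckpointRemap value here is derived from the 757504-page database size shown in the status() output above and should be recomputed for your actual database:

```
[Parameters]
NumberOfBuffers    = 1360000   ; page buffers recommended for a 16GB RAM host
MaxDirtyBuffers    = 1000000   ; dirty-buffer ceiling before flushing
; ~1/4 of the database size in pages (757504 / 4 ≈ 189000 here)
MaxCheckpointRemap = 189000
```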

By following these guidelines and adjusting your system’s configuration, you should be able to manage the memory allocation more effectively during the bulk loading and indexing processes in Virtuoso.

See also the Public Virtuoso Instance Analysis – INI Files and DB Metadata spreadsheet, which details the configurations of various publicly available instances.