memory consumption when loading large triple files

I’m loading a large Turtle file with almost 7 billion triples in it.

As the file loads, Virtuoso takes up more and more memory, much more than
it should. I had configured it for 24GB of memory on a machine with 32GB,
but eventually it consumed all the available virtual memory and crashed.
Near the end it was using around 97% of main memory, and kswapd0 was
consuming about 25% of a processing unit.

Is it expected that Virtuoso needs this extra memory when loading a large
Turtle file, or is this a bug of some sort?


@pfps: What load method is being used? I presume you are using the Virtuoso RDF Bulk Loader?

Can you confirm that the dataset being loaded is a single file containing 7 billion triples, and if so, what is its size?

Generally it is recommended that large files be split into multiple smaller files of about 1 to 2 GB each and bulk loaded with multiple rdf_loader_run() processes, as described in the documentation.
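For reference, the documented bulk-load sequence run from an isql session looks roughly like this (the directory, file mask, and graph IRI below are placeholders for your own values):

```sql
-- Register every Turtle file in the directory for loading.
-- The directory must be listed in DirsAllowed in virtuoso.ini.
ld_dir ('/data/split', '*.ttl', 'http://example.org/graph');

-- Run the loader. For concurrent loading, start one of these
-- per available core, each in its own isql session.
rdf_loader_run ();

-- Make the loaded state durable once all loaders have finished.
checkpoint;
```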

Also, if not already done, please review the Virtuoso RDF Performance Tuning guide to configure your instance for optimum performance.

Yes. All I did was call ld_dir for the single file and then rdf_loader_run(). Nothing else.

The uncompressed file is 327GB, in Turtle syntax. Splitting Turtle files into pieces is not trivial.

I did take a look at the tuning guide. I adjusted the sizes for the amount of memory on the machine, and when that caused a crash I tried smaller sizes, but that run also ran out of memory.

Does the file contain a lot of blank nodes?

I don’t know how many blank nodes. I’m assuming some, but not billions.

A single 327GB file is definitely too large and needs to be split; most large RDF dataset providers split their dumps, or provide a script for doing so. Such a large file will consume significant memory just being loaded for processing. Is the dataset being loaded publicly available for download, or is it an internal dataset not for public use?

A quick Google search turns up a link to a git project for splitting large RDF dataset files. We also once created a procedure for splitting the Uniprot RDF/XML dataset files, but we don't have one for TTL files.

Thus, as said, please split the dataset file for loading. You can then load multiple files concurrently with multiple rdf_loader_run() processes, depending on the number of available cores, as indicated in the Bulk Loader documentation, for optimum platform utilisation and hence the best load performance.

Please provide the output of running the Virtuoso status(); command so we can see the basic configuration of the server. Also, how many cores and how much memory are available on the server in use, which I assume is Linux-based?

I am loading Wikidata - the big version, not the truthy one. (I had to make two patches to allow non-terrestrial geo-literals to be loaded.) Because the file appears not to have labelled blank nodes and has blank lines between entities, I have split it into ten pieces and am now loading them with a single rdf_loader_run() call.
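A split under those conditions can be done with a short streaming script. This is a sketch, assuming (as described above) that @prefix/@base directives occur only at the top of the file, records are separated by blank lines, and no labelled blank node spans two records; the function name and file-naming scheme are my own:

```python
def split_turtle(src, out_prefix, n_parts):
    """Stream-split a Turtle file into n_parts pieces, one blank-line-
    delimited record at a time, repeating the prefix header in each part.
    Assumes prefixes appear only at the top and no labelled blank node
    spans two records."""
    outs = [open(f"{out_prefix}{i:02d}.ttl", "w") for i in range(n_parts)]
    idx = 0            # output file receiving the current record
    in_header = True
    with open(src) as f:
        for line in f:
            s = line.strip()
            if in_header:
                if s.startswith("@prefix") or s.startswith("@base"):
                    for o in outs:      # repeat the header in every part
                        o.write(line)
                    continue
                if not s:               # blank lines within/after the header
                    continue
                in_header = False
                for o in outs:          # separate header from first record
                    o.write("\n")
            outs[idx].write(line)
            if not s:                   # a blank line ends the record
                idx = (idx + 1) % n_parts
    for o in outs:
        o.close()
```

Round-robin distribution keeps the output files roughly equal in size without a second pass over the 327GB input.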

I’m doing this on a four-core (and four-thread) machine with 32GB of main memory, running Fedora 29. Having concurrent loads doesn’t make sense on this machine, I think. One load is sufficient to use most of the available computing power.

However, Virtuoso still grows in memory size and does not shrink when a file has been completely loaded. It is on track to run out of memory sometime tonight.

Here is the output of status();. I reduced the number of buffers in the hope that this would allow the load to complete, but that didn't work.

OpenLink Virtuoso  Server
Version 07.20.3230-pthreads for Linux as of Nov 16 2018 
Started on: 2018-11-19 13:12 GMT-5
Database Status:
  File size 115343360, 15742720 pages, 3695763 free.
  1360000 buffers, 1350569 used, 982956 dirty 20530 wired down, repl age 20277301 277 w. io 0 w/crsr.
  Disk Usage: 164040987 reads avg 0 msec, 0% r 0% w last  0 s, 170668406 writes flush      259.1 MB,
    4070219 read ahead, batch = 30.  Autocompact 15806277 in 12270280 out, 22% saved col ac: 94862572 in 9% saved.
Gate:  5272543 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap. 
Log = /home/virtuoso/var/lib/virtuoso/db/virtuoso.trx, 484860187 bytes
8280467 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 3 connects, max 3 concurrent
RPC: 35 calls, 2 pending, 2 max until now, 0 queued, 2 burst reads (5%), 0 second 394M large, 2293M max
Checkpoint Remap 0 pages, 0 mapped back. 363 s atomic time.
    DB master 15742720 total 3697208 free 0 remap 0 mapped back
   temp  256 total 251 free
Lock Status: 0 deadlocks of which 0 2r1w, 0 waits,
   Currently 4 threads running 0 threads waiting 0 threads in vdb.
Client 1111:2:  Account: dba, 1294 bytes in, 11197 bytes out, 1 stmts.
PID: 16852, OS: unix, Application: unknown, IP#:
Transaction status: PENDING, 0 threads.
Client 1111:3:  Account: dba, 206 bytes in, 289 bytes out, 1 stmts.
PID: 24434, OS: unix, Application: unknown, IP#:
Transaction status: PENDING, 1 threads.
Client 1111:1:  Account: dba, 969 bytes in, 3179 bytes out, 1 stmts.
PID: 9111, OS: unix, Application: unknown, IP#:
Transaction status: PENDING, 1 threads.
Running Statements:
 Time (msec) Text
    22921077 rdf_loader_run()
        1126 status()
Hash indexes

@pfps: Actually, the problem is evident from your first post and confirmed by the status() output you provided, which shows 1360000 buffers, 1350569 used, i.e., all the allocated memory buffers are in use during the load. At that point data has to be swapped to and from disk, which kills load times; the load would take a very long time.

As a rule of thumb, we recommend about 10GB of RAM per billion triples (depending on how well they compress in the database) for Virtuoso to host them in memory for optimum performance. Thus, given the indicated dataset size of 7 billion triples, at least 70GB of RAM would be required, while you have only 32GB. Use of SSD drives would mitigate the need for RAM, as SSD access times are much faster than those of normal disk drives, but performance will still deteriorate.

What is the URL to the Wikidata dataset you are loading?

We have a 900+ million triple Wikidata dataset loaded into the DBpedia instance we host for the community; that dataset consisted of about 140 bz2-compressed N-Triples split files.

I am quite aware that this is not an ideal machine for loading Wikidata. However, the machine that I normally use for the purpose is not currently available, so I was trying to load Wikidata on the machine that I had. The machine is not quite so unsuitable as one might expect, as it has a 1TB NVMe SSD that holds the Virtuoso data and log files.

But all that is not really relevant. The question I am asking is: why is Virtuoso using so much extra memory? Virtuoso's total memory footprint increases by about 0.75GB per hour even when it is already using all its buffer space. Is this a memory leak in Virtuoso? If not, what is the memory being used for?

The URL for what I loaded is but that is a time-varying file. I think that content is now in

During dataset load, additional memory is required for the creation of the 5 (2 full and 3 partial) indexes on the RDF_QUAD table, which improve performance for typical RDF query workloads.
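For context, the default RDF index scheme described in the Virtuoso RDF Performance Tuning documentation is, roughly, the PSOG primary key plus one further full index and three partial indexes (sketched here from the documentation; check your own installation's scheme with isql):

```sql
-- Default RDF_QUAD index scheme: PSOG primary key plus
-- one full (POGS) and three partial (SP, OP, GS) indexes.
CREATE TABLE DB.DBA.RDF_QUAD (
  G IRI_ID_8, S IRI_ID_8, P IRI_ID_8, O ANY,
  PRIMARY KEY (P, S, O, G));
CREATE BITMAP INDEX RDF_QUAD_POGS ON DB.DBA.RDF_QUAD (P, O, G, S);
CREATE DISTINCT NO PRIMARY KEY REF BITMAP INDEX RDF_QUAD_SP ON DB.DBA.RDF_QUAD (S, P);
CREATE DISTINCT NO PRIMARY KEY REF INDEX RDF_QUAD_OP ON DB.DBA.RDF_QUAD (O, P);
CREATE DISTINCT NO PRIMARY KEY REF BITMAP INDEX RDF_QUAD_GS ON DB.DBA.RDF_QUAD (G, S);
```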

Sure, indexes have to be enlarged as more data enters the KB. But shouldn't those indexes live in the buffers, so that Virtuoso's memory usage does not expand without apparent bound?

@pfps: The NumberOfBuffers setting in the INI file controls the amount of RAM used by Virtuoso to cache database pages, i.e., to host the database working set in memory; it is separate from memory used during data load or query execution.
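For illustration, the stock virtuoso.ini ships with commented presets along these lines (values taken from the shipped sample file; note that your status() output's 1360000 buffers matches the 16GB preset):

```ini
[Parameters]
; Presets from the shipped sample virtuoso.ini:
;   8GB RAM  -> NumberOfBuffers = 680000,  MaxDirtyBuffers = 500000
;   16GB RAM -> NumberOfBuffers = 1360000, MaxDirtyBuffers = 1000000
;   32GB RAM -> NumberOfBuffers = 2720000, MaxDirtyBuffers = 2000000
NumberOfBuffers = 1360000
MaxDirtyBuffers = 1000000
```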

We have a Virtuoso Memory Analysis based on INI Entries online spreadsheet that details the memory requirements of publicly accessible Virtuoso instances hosted by OpenLink and others.

What I am seeing doesn’t seem to correspond to what is in the table.

What I am seeing is that Virtuoso's virtual (and effective!) memory usage grows without bound as more and more triples are added. Instead of being the buffer size plus some small amount, Virtuoso's memory usage during loading grows to 1.5 or 2 times the size of all the buffers, or even more, when the buffers are sized for 16 or 32GB of main memory. Even on a machine with 128GB of main memory and 10900000 buffers, I am seeing memory growth of over 15GB when loading 7 billion triples.

I don’t know what that memory growth is supporting, if anything.