Loading Wikidata into Virtuoso (Open Source or Enterprise Edition)

Introduction

Here’s a simple guide covering how we constructed a Wikidata instance deployed using Virtuoso. Once the data is loaded, you end up with the following:

  1. SPARQL Query Services Endpoint
  2. Faceted Search & Browsing Service Endpoint

Hardware

Item    Value
CPU     2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores   24
Memory  378 GB
SSD     4x Crucial M4 SSD 500 GB

Virtuoso INI File Settings

Item             Value
NumberOfBuffers  30,000,000

After the database had been loaded and running for a few days, we determined it was only using around 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could reduce the memory footprint further without too much performance loss.
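For reference, these settings live in the [Parameters] section of virtuoso.ini. The sketch below reflects the lowered value; the MaxDirtyBuffers line is our assumption (roughly three quarters of NumberOfBuffers, per the usual guideline) and was not stated above.

[Parameters]
NumberOfBuffers  = 20000000   ; page buffers available for disk caching
MaxDirtyBuffers  = 15000000   ; assumption: ~3/4 of NumberOfBuffers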

Preparing the datasets

We downloaded the Wikidata dumps dated approximately 2020-03-01.

Filename               Size (bytes)
latest-all.nt.bz2      103,207,309,995
latest-lexemes.nt.bz2  229,470,813
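The dumps can typically be fetched from the Wikimedia dumps server; the commands below assume the standard dumps.wikimedia.org layout and are not part of the original post.

$ # assumption: standard location of the latest Wikidata N-Triples dumps
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2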

Although the file latest-all.nt.bz2, which is 96 GB compressed, could in theory be loaded with a single loader, we decided to split it into smaller fragments. This allows us to run multiple loaders in parallel and therefore use multiple CPU cores to load the data into the database.

We used a small Perl script to split the file into 3765 parts of about 37MB each. Since decompressing a bz2 file is not a task that can be performed multi-threaded, this unfortunately took several days.

As there are some known issues with the GeoSPARQL data in this dataset, we then split out the GeoSPARQL triples into a separate set of dumps to be processed on their own, which, using 10 parallel processes to speed up the work, only took about 2 hours.
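The post does not show how this separation was done; a minimal sketch, assuming the split parts are gzip-compressed N-Triples and that the GeoSPARQL triples can be recognized by the geosparql#wktLiteral datatype IRI used in the Wikidata dumps, could look like this (the file names are hypothetical):

$ # assumption: separate the wktLiteral triples of one gzipped part (hypothetical file names)
$ zcat wikidata.part.aa.nt.gz | grep    'geosparql#wktLiteral' | gzip > geo.part.aa.nt.gz
$ zcat wikidata.part.aa.nt.gz | grep -v 'geosparql#wktLiteral' | gzip > plain.part.aa.nt.gz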

Bulkload Statistics

Description                       Value
Number of parallel loaders        8
Duration                          10 hours
Total number of triples           11,894,354,985
Average triples/second            331,574
Average CPU usage                 1,881%
Number of database pages in use   48,503,816 pages

Post-load tasks:

Description                       Value
Setting up prefixes and labels    2 min
FCT calculating labels            10 hours
Freetext index                    1 day (as a background task)
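For context, Virtuoso's bulk loader works in two steps: the split files are registered with ld_dir() and then consumed by one or more rdf_loader_run() calls. The sketch below assumes the parts were placed in /data/wikidata (a hypothetical path) and that this directory is listed in DirsAllowed in virtuoso.ini.

SQL> -- register all split parts against the main graph (path and file mask are assumptions)
SQL> ld_dir('/data/wikidata', 'wikidata.part.*.nt.gz', 'http://www.wikidata.org/');
SQL> -- start one loader; in practice several are started in parallel (see the end of this thread)
SQL> rdf_loader_run();
SQL> -- make the loaded data durable
SQL> checkpoint;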

Graphs and triple count

The main graphs are:

Graph                              Description                             Triple Count
http://www.wikidata.org/           the triples from latest-all.nt.bz2      11,857,528,152
http://www.wikidata.org/lexemes/   the triples from latest-lexemes.nt.bz2  42,641,432
urn:wikidata:labels                the Wikidata property labels            836,131
http://wikiba.se/ontology-1.0.owl  the Wikidata ontology                   280

We added the following linksets to the database:

Graph                                             Description                        Triple Count
http://yago.r2.enst.fr/data/yago4/en/2019-12-10/  the YAGO 4 knowledge graph         199,575,815
urn:wikidata:dbpedia:schema                       the DBpedia linkset (schema form)  27,897,734
urn:wikidata:dbpedia:about                        the DBpedia linkset                6,974,651
urn:kbpedia:concepts_linkage:inferrence_extended  the KBpedia linkset                1,152,051

Wikidata VOID graph

Description                  Value
Graph name                   http://www.wikidata.org/void/
SPARQL endpoint              http://wikidata.demo.openlinksw.com/sparql/
Number of triples            11,857,528,152
Number of classes            925
Number of entities           1,311,652,842
Number of distinct subjects  1,314,371,674
Number of properties         32,647
Number of distinct objects   1,987,044,958
Number of owl:sameAs links   2,717,989
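These statistics can be queried from the VOID graph itself; a minimal sketch of such a query, assuming the graph uses the standard VoID vocabulary (http://rdfs.org/ns/void#), would be:

# assumption: the VOID graph describes the dataset with standard void:* properties
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples ?classes ?properties
FROM <http://www.wikidata.org/void/>
WHERE {
  ?dataset a void:Dataset ;
           void:triples    ?triples ;
           void:classes    ?classes ;
           void:properties ?properties .
}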

Thanks for this post!

We used a small Perl script to split the file into 3765 parts of about 37MB each. Since decompressing a bz2 file is not a task that can be performed multi-threaded, this unfortunately took several days.

Have you tried this command? I think it could split the whole file quite a bit faster.

zcat latest-all.nt.gz | split - -l 100000 --filter='gzip > $FILE.gz' wikidata.part.

Number of parallel loaders 8

Does it mean you run rdf_loader_run() & eight times in the same Virtuoso console?

Best,

The problem with the split command above is that the --filter='' argument is a relatively new enhancement to GNU coreutils, which is not available on all systems, not even on modern macOS versions.

So I have been using a small but portable Perl script that uses a Perl open pipe to create multiple processes that run on separate CPU cores. It is not as fast as the split --filter='' command, as it does not use extra background threads to speed up processing, but it works on any system that has Perl installed.

The reason it took so long was that I was using bzip2 -2 to re-compress the parts, to save some space on the filesystem and also for transferring the parts between systems.
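The script itself was not posted; a minimal sketch of the approach described above (streaming the decompressed dump from stdin and piping fixed-size chunks to a gzip child process via open) might look like the following. The chunk size and file names, including split_parts.pl, are assumptions chosen to match the split examples below.

#!/usr/bin/perl
# Sketch: split N-Triples read from stdin into gzip-compressed parts of
# 3,000,000 lines each (assumed chunk size, matching the split -l examples).
# Each gzip child is a separate process, so compression runs on a different
# core than the bzip2 decompressor feeding stdin.
use strict;
use warnings;

my $lines_per_part = 3_000_000;
my ($part, $count, $out) = (0, 0, undef);

while (my $line = <STDIN>) {
    if ($count == 0) {
        my $name = sprintf("wikidata.part.%04d.nt.gz", $part++);
        open($out, '|-', "gzip > $name") or die "cannot start gzip: $!";
    }
    print {$out} $line;
    if (++$count == $lines_per_part) {
        close($out);              # waits for this part's gzip child
        $count = 0;
    }
}
close($out) if $count;            # flush the final partial part

Used analogously to the split commands:

$ bzip2 -cd latest-all.nt.bz2 | perl split_parts.pl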

Using gzip:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='gzip > $FILE.nt.gz' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     33884731 2021-08-19 16:36:53.352384590 +0200 wikidata.part.aa.nt.gz
-rw-rw-r-- 1 pkleef pkleef     35233218 2021-08-19 16:37:05.860493306 +0200 wikidata.part.ab.nt.gz
...

As you can see, it takes about 12 seconds per file.

Using bzip2:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='bzip2 -2  > $FILE.nt.bz2' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     29658392 2021-08-19 16:18:16.258764512 +0200 wikidata.part.aa.nt.bz2
-rw-rw-r-- 1 pkleef pkleef     31235116 2021-08-19 16:19:26.615360342 +0200 wikidata.part.ab.nt.bz2

As you can see, bzip2 -2 takes about 1 minute 10 seconds per part.

As stated in the document above the total number of triples was 11,857,528,152.

Using a line count of 3,000,000 this means the split would create 3953 separate compressed files.

With gzip this would take 3953 * 12 seconds = 47,436 seconds ≈ 13 hours of processing time.

With bzip2 this would take 3953 * 70 seconds = 276,710 seconds ≈ 77 hours ≈ 3 days of processing time.

While bzip2 takes a lot more computing power, it would save about 4MB per file, which amounts to roughly 4MB * 3953 ≈ 15GB of disk space.

Additionally, the parts would also take slightly less time to load, as higher compression means more triples per read in the bulk loader. However, the savings in bulk load time would be much smaller than the 64-hour difference in split times.

The conclusion is that the next time we create a new Wikidata instance, we will use gzip instead of bzip2.

Correct. When using the isql tool you can do:

isql 1111
SQL> rdf_loader_run() &
SQL> rdf_loader_run() &
.....
SQL> rdf_loader_run() &

SQL> wait

But you can also do:

$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
...
$ isql 1111 EXEC='rdf_loader_run()' &

$ wait

We normally recommend using about half the number of CPU cores in your system for the best possible parallelism during the bulk load.
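As an illustration of that guideline (our sketch, not from the original post), the loader count can be derived from the core count and the loaders started from the shell:

$ # assumption: use half the cores reported by nproc as the number of loaders
$ N=$(( $(nproc) / 2 ))
$ for i in $(seq $N); do isql 1111 EXEC='rdf_loader_run()' & done
$ wait
$ isql 1111 EXEC='checkpoint'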