Loading Wikidata into Virtuoso (Open Source or Enterprise Edition)

Introduction

Here’s a simple guide covering how we constructed a Wikidata instance deployed using Virtuoso. Once the data is loaded, you end up with the setup described in the following sections.

Hardware

| Item | Value |
| --- | --- |
| CPU | 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz |
| Cores | 24 |
| Memory | 378 GB |
| SSD | 4x Crucial M4 SSD 500 GB |

Virtuoso INI File Settings

| Item | Value |
| --- | --- |
| NumberOfBuffers | 30,000,000 |

After the database had been loaded and running for a few days, we determined it was only using around 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could reduce the memory footprint further without too much performance loss.
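For reference, the relevant virtuoso.ini fragment would look something like the following. The MaxDirtyBuffers value is an assumption based on the usual recommendation of roughly three quarters of NumberOfBuffers; it is not taken from the table above.

[Parameters]
; final setting after observing actual buffer usage
NumberOfBuffers  = 20000000
; assumed companion setting (~3/4 of NumberOfBuffers), not stated above
MaxDirtyBuffers  = 15000000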

Preparing the datasets

We downloaded the Wikidata dumps from about 2020-03-01.

| Filename | Size (bytes) |
| --- | --- |
| latest-all.nt.bz2 | 103,207,309,995 |
| latest-lexemes.nt.bz2 | 229,470,813 |
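The exact download location is not given above, but these files are normally published by Wikimedia on the dumps site, for example:

$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2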

Although the file latest-all.nt.bz2, which is 96 GB compressed, could in theory be loaded with a single loader, we decided to split it into smaller fragments. This allowed us to run multiple loaders in parallel and thereby use multiple CPU cores to load the data into the database.

We used a small Perl script to split the file into 3,765 parts of about 37 MB each. Since decompressing a bz2 file cannot be done multi-threaded, this unfortunately took several days.

As there are some known issues with Geodata in this dataset, we then split the Geodata triples out into a separate set of dumps to be processed on their own; using 10 parallel processes to speed up the work, this took only about 2 hours.
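For illustration only (the exact filter we used is not shown here), geometry triples can be diverted into separate dumps by matching on the GeoSPARQL WKT literal datatype that Wikidata uses for coordinate values; the file and directory names below are assumptions:

# hypothetical sketch: split each part into geo and non-geo triples
# Wikidata coordinates are typed as <http://www.opengis.net/ont/geosparql#wktLiteral>
$ mkdir -p geo main
$ for f in wikidata.part.*.nt.gz ; do
    zcat "$f" | grep    'geosparql#wktLiteral' | gzip > "geo/$f"  &
    zcat "$f" | grep -v 'geosparql#wktLiteral' | gzip > "main/$f" &
    wait
  done

Running several such loops over disjoint subsets of the parts gives the roughly 10-way parallelism mentioned above.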

Bulk load Statistics

| Description | Value |
| --- | --- |
| Number of parallel loaders | 8 |
| Duration | 10 hours |
| Total number of triples | 11,894,354,985 |
| Average triples/second | 331,574 |
| Average CPU usage | 1,881% |
| Number of database pages in use | 48,503,816 |

The post-load tasks took the following additional time:

| Description | Value |
| --- | --- |
| Setting up prefixes and labels | 2 min |
| FCT calculating labels | 10 hours |
| Freetext index | 1 day (as a background task) |
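For readers looking for the actual commands: the load was performed with the standard Virtuoso RDF bulk loader, driven from isql. The sketch below is a minimal outline; the directory path and file mask are placeholders, while ld_dir(), rdf_loader_run(), checkpoint, and RDF_OBJ_FT_RULE_ADD() are the standard Virtuoso calls.

SQL> -- register the split parts with the bulk loader (path and mask are placeholders)
SQL> ld_dir ('/data/wikidata/parts', 'wikidata.part.*.nt.gz', 'http://www.wikidata.org/');
SQL> -- the lexemes dump is registered the same way, targeting http://www.wikidata.org/lexemes/
SQL> -- start a loader; run 8 of these in parallel sessions, as shown later in this thread
SQL> rdf_loader_run ();
SQL> -- once all loaders have finished, make the load durable
SQL> checkpoint;
SQL> -- enable the RDF free-text index over object values (built as a background task)
SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');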

Graphs and triple counts

The main graphs are:

| Graph | Description | Triple Count |
| --- | --- | --- |
| http://www.wikidata.org/ | the triples from latest-all.nt.bz2 | 11,857,528,152 |
| http://www.wikidata.org/lexemes/ | the triples from latest-lexemes.nt.bz2 | 42,641,432 |
| urn:wikidata:labels | the Wikidata property labels | 836,131 |
| http://wikiba.se/ontology-1.0.owl | the Wikidata ontology | 280 |
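These per-graph counts can be reproduced with a simple aggregate query; this is shown only as a generic illustration (not necessarily how the table was produced), and note that on a database of this size it is an expensive scan:

SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)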

We added the following link sets to the database:

| Graph | Description | Triple Count |
| --- | --- | --- |
| http://yago.r2.enst.fr/data/yago4/en/2019-12-10/ | the Yago knowledge graph | 199,575,815 |
| urn:wikidata:dbpedia:schema | the DBpedia link set (schema form) | 27,897,734 |
| urn:wikidata:dbpedia:about | the DBpedia link set | 6,974,651 |
| urn:kbpedia:concepts_linkage:inferrence_extended | the KBpedia link set | 1,152,051 |

Wikidata VoID graph

| Description | Value |
| --- | --- |
| graph name | http://www.wikidata.org/void/ |
| SPARQL endpoint | http://wikidata.demo.openlinksw.com/sparql/ |
| number of triples | 11,857,528,152 |
| number of classes | 925 |
| number of entities | 1,311,652,842 |
| number of distinct subjects | 1,314,371,674 |
| number of properties | 32,647 |
| number of distinct objects | 1,987,044,958 |
| number of owl:sameAs links | 2,717,989 |
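The VoID graph itself is small, so its contents can simply be listed with a query like the following against the SPARQL endpoint above:

SELECT ?s ?p ?o
FROM <http://www.wikidata.org/void/>
WHERE { ?s ?p ?o }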


Thanks for this post!

> We used a small Perl script to split the file into 3765 parts each about 37MB in size. Since decompressing a bz2 file is not a task that can be performed multi threaded, this unfortunately took several days.

Have you tried this command? I think it could split the whole file much faster.

zcat latest-all.nt.gz | split - -l 100000 --filter='gzip > $FILE.gz' wikidata.part.

> Number of parallel loaders 8

Does that mean you run rdf_loader_run() & eight times in the same Virtuoso console?

Best,

The problem with the split command above is that the --filter="" argument is a relatively new enhancement to the GNU coreutils, which is not available on all systems, not even on modern macOS versions.

So I have been using a small but portable Perl script that uses a Perl open pipe to create multiple processes that run on separate CPU cores. It is not as fast as the split --filter='' command, as it does not use extra background threads to speed up processing, but it works on all systems that have Perl installed.
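The script itself is not included in this thread; the following is only a minimal sketch of the open-pipe idea described above, with the chunk size and naming scheme chosen arbitrarily:

#!/usr/bin/perl
# minimal sketch only -- not the original script
# usage: bzip2 -cd latest-all.nt.bz2 | perl split-parts.pl
use strict;
use warnings;

my $lines_per_part = 3_000_000;   # chunk size, matching the line count discussed below
my ($fh, $part, $count) = (undef, 0, 0);

while (my $line = <STDIN>) {
    if ($count == 0) {
        # each part is piped through its own gzip process, which runs on another core
        my $name = sprintf 'wikidata.part.%04d.nt.gz', $part++;
        open $fh, '|-', "gzip > $name" or die "cannot start gzip: $!";
    }
    print {$fh} $line;
    if (++$count >= $lines_per_part) {
        close $fh;
        $count = 0;
    }
}
close $fh if $count;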

The reason it took so long was that I was using bzip2 -2 to re-compress the parts to save some space on the filesystem, and also for transferring the parts between systems.

Using gzip:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='gzip > $FILE.nt.gz' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     33884731 2021-08-19 16:36:53.352384590 +0200 wikidata.part.aa.nt.gz
-rw-rw-r-- 1 pkleef pkleef     35233218 2021-08-19 16:37:05.860493306 +0200 wikidata.part.ab.nt.gz
...

As you can see, it takes about 12 seconds per file.

Using bzip2:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='bzip2 -2  > $FILE.nt.bz2' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     29658392 2021-08-19 16:18:16.258764512 +0200 wikidata.part.aa.nt.bz2
-rw-rw-r-- 1 pkleef pkleef     31235116 2021-08-19 16:19:26.615360342 +0200 wikidata.part.ab.nt.bz2

As you can see, bzip2 -2 takes about 70 seconds per part.

As stated in the document above, the total number of triples was 11,857,528,152.

Using a line count of 3,000,000, this means the split would create 3953 separate compressed files.

With gzip this would take 3953 * 12 seconds = 47436 seconds ≈ 13 hours processing time.

With bzip2 this would take 3953 * 70 seconds = 276710 seconds ≈ 77 hours ≈ 3.2 days processing time.

While bzip2 takes a lot more computing power, it would save about 4 MB per file, which means a total of 4 MB * 3953 ≈ 15 GB of disk space.

Additionally, the bzip2 parts would take slightly less time to load, as higher compression means more triples per read in the bulk loader. However, that saving in bulk-load time would be much smaller than the 64-hour difference in split times.

The conclusion is that the next time we create a new Wikidata instance, we will use gzip instead of bzip2.

Correct. When using the isql tool, you can do:

isql 1111
SQL> rdf_loader_run() &
SQL> rdf_loader_run() &
.....
SQL> rdf_loader_run() &

SQL> wait

But you can also do:

$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
...
$ isql 1111 EXEC='rdf_loader_run()' &

$ wait

We normally recommend using about half the number of CPU cores in your system as the number of parallel loaders, for the best possible parallelism during the bulk load.
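For example, with the 8 loaders used above, a hypothetical shell one-liner starting them all and waiting for completion would be:

$ for i in $(seq 1 8); do isql 1111 EXEC='rdf_loader_run()' & done; wait
$ isql 1111 EXEC='checkpoint;'

The final checkpoint is the usual last step after a bulk load, making the loaded data durable.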

Sorry to ask, but how should I load the Wikidata dump into our Virtuoso instance? I’m still confused about loading Wikidata because there are no commands to follow in this post.


You would need to download the latest Wikidata RDF dataset file which, being a single 100+ GB file, needs to be split into multiple dataset files for parallel loading, as indicated in the posts prior to yours. The Virtuoso RDF Bulk Loader can then be used for optimal parallel loading of the dataset files into Virtuoso.

Note we also have a preloaded Wikidata Snapshot (Virtuoso PAGO) EBS-backed EC2 AMI available for use. It currently holds an older dataset snapshot, and we are in the process of updating it to the latest Wikidata dataset, which should be available in the coming weeks …

Thank you for documenting this. I downloaded a Wikidata dump and am in the process of splitting it up into smaller files using split with --filter as people in the thread suggested. (My Linux distribution tends to have very recent packages, so that’s not an issue.)

One thing I am curious about is the utility of splitting the dump into smaller files. While it makes sense that having smaller files would allow you to import the dataset in parallel, how much time are you really saving if you are bottlenecked on single-threaded file decompression?

In other words, is the combined time spent decompressing, splitting, and then importing less than simply sending the whole file into the database, seeing as you are bottlenecked anyway? I am assuming the answer is “yes,” but if so, then how?


Correct, especially if you have a multi-core machine hosting Virtuoso on which multiple rdf_loader_run() RDF Bulk Loader processes can be run, so that many of the split dataset files can be loaded in parallel for maximum platform utilisation and hence load rate.


Do you have recommendations for keeping the dataset up to date? There’s a Wikimedia-developed updater script that takes diffs of Wikidata and submits update queries to a SPARQL endpoint, but it’s probably optimized for Blazegraph.

We are considering implementing live updates to our Virtuoso-hosted Wikidata instance, which is to be scheduled at some point. What is the Wikimedia-developed updater script you refer to?