Loading Wikidata into Virtuoso (Open Source or Enterprise Edition)

Introduction

Here’s a simple guide covering how we constructed a Wikidata instance deployed using Virtuoso. Once the data is loaded, you end up with the following:

  1. SPARQL Query Services Endpoint
  2. Faceted Search & Browsing Service Endpoint

Hardware

Item    Value
CPU     2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores   24
Memory  378 GB
SSD     4x Crucial M4 SSD 500 GB

Virtuoso INI File Settings

Item             Value
NumberOfBuffers  30,000,000

After the database had been loaded and running for a few days, we determined it was only using around 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could reduce the memory footprint further without too much performance loss.
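For reference, these settings live in the [Parameters] section of virtuoso.ini. The sketch below reflects the lowered value; the MaxDirtyBuffers line is our assumption (roughly three quarters of NumberOfBuffers, per the usual guideline) and was not stated above.

[Parameters]
NumberOfBuffers  = 20000000   ; page buffers available for disk caching
MaxDirtyBuffers  = 15000000   ; assumption: ~3/4 of NumberOfBuffers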

Preparing the datasets

We downloaded the Wikidata dumps dated approximately 2020-03-01.

Filename               Size (bytes)
latest-all.nt.bz2      103,207,309,995
latest-lexemes.nt.bz2  229,470,813
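The dumps can typically be fetched from the Wikimedia dumps server; the commands below assume the standard dumps.wikimedia.org layout and are not part of the original post.

$ # assumption: standard location of the latest Wikidata N-Triples dumps
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.nt.bz2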

Although the file latest-all.nt.bz2, which is 96 GB compressed, could in theory be loaded with a single loader, we decided to split it into smaller fragments. This allows us to run multiple loaders in parallel and therefore use multiple CPU cores to load the data into the database.

We used a small Perl script to split the file into 3765 parts of about 37MB each. Since decompressing a bz2 file is not a task that can be performed multi-threaded, this unfortunately took several days.

As there are some known issues with the GeoSPARQL data in this dataset, we then split out the GeoSPARQL triples into a separate set of dumps to be processed on their own, which, using 10 parallel processes to speed up the work, only took about 2 hours.
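The post does not show how this separation was done; a minimal sketch, assuming the split parts are gzip-compressed N-Triples and that the GeoSPARQL triples can be recognized by the geosparql#wktLiteral datatype IRI used in the Wikidata dumps, could look like this (the file names are hypothetical):

$ # assumption: separate the wktLiteral triples of one gzipped part (hypothetical file names)
$ zcat wikidata.part.aa.nt.gz | grep    'geosparql#wktLiteral' | gzip > geo.part.aa.nt.gz
$ zcat wikidata.part.aa.nt.gz | grep -v 'geosparql#wktLiteral' | gzip > plain.part.aa.nt.gz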

Bulkload Statistics

Description                       Value
Number of parallel loaders        8
Duration                          10 hours
Total number of triples           11,894,354,985
Average triples/second            331,574
Average CPU usage                 1,881%
Number of database pages in use   48,503,816 pages

Post-load tasks:

Description                       Value
Setting up prefixes and labels    2 min
FCT calculating labels            10 hours
Freetext index                    1 day (as a background task)
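For context, Virtuoso's bulk loader works in two steps: the split files are registered with ld_dir() and then consumed by one or more rdf_loader_run() calls. The sketch below assumes the parts were placed in /data/wikidata (a hypothetical path) and that this directory is listed in DirsAllowed in virtuoso.ini.

SQL> -- register all split parts against the main graph (path and file mask are assumptions)
SQL> ld_dir('/data/wikidata', 'wikidata.part.*.nt.gz', 'http://www.wikidata.org/');
SQL> -- start one loader; in practice several are started in parallel (see the end of this thread)
SQL> rdf_loader_run();
SQL> -- make the loaded data durable
SQL> checkpoint;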

Graphs and triple count

The main graphs are:

Graph                              Description                             Triple Count
http://www.wikidata.org/           the triples from latest-all.nt.bz2      11,857,528,152
http://www.wikidata.org/lexemes/   the triples from latest-lexemes.nt.bz2  42,641,432
urn:wikidata:labels                the Wikidata property labels            836,131
http://wikiba.se/ontology-1.0.owl  the Wikidata ontology                   280

We added the following linksets to the database:

Graph                                             Description                        Triple Count
http://yago.r2.enst.fr/data/yago4/en/2019-12-10/  the YAGO 4 knowledge graph         199,575,815
urn:wikidata:dbpedia:schema                       the DBpedia linkset (schema form)  27,897,734
urn:wikidata:dbpedia:about                        the DBpedia linkset                6,974,651
urn:kbpedia:concepts_linkage:inferrence_extended  the KBpedia linkset                1,152,051

Wikidata VOID graph

Description                  Value
Graph name                   http://www.wikidata.org/void/
SPARQL endpoint              http://wikidata.demo.openlinksw.com/sparql/
Number of triples            11,857,528,152
Number of classes            925
Number of entities           1,311,652,842
Number of distinct subjects  1,314,371,674
Number of properties         32,647
Number of distinct objects   1,987,044,958
Number of owl:sameAs links   2,717,989
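These statistics can be queried from the VOID graph itself; a minimal sketch of such a query, assuming the graph uses the standard VoID vocabulary (http://rdfs.org/ns/void#), would be:

# assumption: the VOID graph describes the dataset with standard void:* properties
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples ?classes ?properties
FROM <http://www.wikidata.org/void/>
WHERE {
  ?dataset a void:Dataset ;
           void:triples    ?triples ;
           void:classes    ?classes ;
           void:properties ?properties .
}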

Thanks for this post!

We used a small Perl script to split the file into 3765 parts of about 37MB each. Since decompressing a bz2 file is not a task that can be performed multi-threaded, this unfortunately took several days.

Have you tried this command? I think it could split the whole file quite a bit faster.

zcat latest-all.nt.gz | split - -l 100000 --filter='gzip > $FILE.gz' wikidata.part.

Number of parallel loaders 8

Does it mean you run rdf_loader_run() & eight times in the same Virtuoso console?

Best,

The problem with the split command above is that the --filter='' argument is a relatively new enhancement to GNU coreutils, which is not available on all systems, not even on modern macOS versions.

So I have been using a small but portable Perl script that uses a Perl open pipe to create multiple processes that run on separate CPU cores. It is not as fast as the split --filter='' command, as it does not use extra background threads to speed up processing, but it works on any system that has Perl installed.

The reason it took so long was that I was using bzip2 -2 to re-compress the parts, to save some space on the filesystem and also for transferring the parts between systems.
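The script itself was not posted; a minimal sketch of the approach described above (streaming the decompressed dump from stdin and piping fixed-size chunks to a gzip child process via open) might look like the following. The chunk size and file names, including split_parts.pl, are assumptions chosen to match the split examples below.

#!/usr/bin/perl
# Sketch: split N-Triples read from stdin into gzip-compressed parts of
# 3,000,000 lines each (assumed chunk size, matching the split -l examples).
# Each gzip child is a separate process, so compression runs on a different
# core than the bzip2 decompressor feeding stdin.
use strict;
use warnings;

my $lines_per_part = 3_000_000;
my ($part, $count, $out) = (0, 0, undef);

while (my $line = <STDIN>) {
    if ($count == 0) {
        my $name = sprintf("wikidata.part.%04d.nt.gz", $part++);
        open($out, '|-', "gzip > $name") or die "cannot start gzip: $!";
    }
    print {$out} $line;
    if (++$count == $lines_per_part) {
        close($out);              # waits for this part's gzip child
        $count = 0;
    }
}
close($out) if $count;            # flush the final partial part

Used analogously to the split commands:

$ bzip2 -cd latest-all.nt.bz2 | perl split_parts.pl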

Using gzip:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='gzip > $FILE.nt.gz' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     33884731 2021-08-19 16:36:53.352384590 +0200 wikidata.part.aa.nt.gz
-rw-rw-r-- 1 pkleef pkleef     35233218 2021-08-19 16:37:05.860493306 +0200 wikidata.part.ab.nt.gz
...

As you can see, it takes about 12 seconds per file.

Using bzip2:

$ bzip2 -cd latest-all.nt.bz2 | split - -l 3000000 --filter='bzip2 -2  > $FILE.nt.bz2' wikidata.part.

$ ls --full
-rw-rw-r-- 1 pkleef pkleef     29658392 2021-08-19 16:18:16.258764512 +0200 wikidata.part.aa.nt.bz2
-rw-rw-r-- 1 pkleef pkleef     31235116 2021-08-19 16:19:26.615360342 +0200 wikidata.part.ab.nt.bz2

As you can see, bzip2 -2 takes about 1 minute 10 seconds per part.

As stated in the document above the total number of triples was 11,857,528,152.

Using a line count of 3,000,000 this means the split would create 3953 separate compressed files.

With gzip this would take 3953 * 12 seconds = 47,436 seconds ≈ 13 hours of processing time.

With bzip2 this would take 3953 * 70 seconds = 276,710 seconds ≈ 77 hours ≈ 3 days of processing time.

While bzip2 takes a lot more computing power, it would save about 4MB per file, which amounts to roughly 4MB * 3953 ≈ 15GB of disk space.

Additionally, the parts would also take slightly less time to load, as higher compression means more triples per read in the bulk loader. However, the savings in bulk load time would be much smaller than the 64-hour difference in split times.

The conclusion is that the next time we create a new Wikidata instance, we will use gzip instead of bzip2.

Correct. When using the isql tool you can do:

isql 1111
SQL> rdf_loader_run() &
SQL> rdf_loader_run() &
.....
SQL> rdf_loader_run() &

SQL> wait

But you can also do:

$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
$ isql 1111 EXEC='rdf_loader_run()' &
...
$ isql 1111 EXEC='rdf_loader_run()' &

$ wait

We normally recommend using about half the number of CPU cores in your system for the best possible parallelism during the bulk load.
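As an illustration of that guideline (our sketch, not from the original post), the loader count can be derived from the core count and the loaders started from the shell:

$ # assumption: use half the cores reported by nproc as the number of loaders
$ N=$(( $(nproc) / 2 ))
$ for i in $(seq $N); do isql 1111 EXEC='rdf_loader_run()' & done
$ wait
$ isql 1111 EXEC='checkpoint'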