Loading the Wikidata dataset 2022/12 into Virtuoso Open Source

Copyright © 2023 OpenLink Software


Here’s a simple guide covering how we constructed a Wikidata instance using the latest Virtuoso Open Source Edition:

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version as of Oct 19 2022 (64e6ecd39)
Compiled for Linux (x86_64-generic_glibc25-linux-gnu)
Copyright (C) 1998-2022 OpenLink Software

We use the following configuration for a single host machine.

Item Value
CPU 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores 24
Memory 378 GB
SSD 4x Crucial M4 SSD 500 GB

This system is shared by several other demo databases, so we could not use all available resources on it.

Cloud cost estimates

Although it is hard to compare the various cloud offerings directly, we used vendor-supplied cost calculators to come up with some estimates, based on the following minimal machine specification:

  • dedicated machine for 1 year without upfront costs
  • 128 GiB memory
  • 16 cores or more
  • 512 GB SSD for the database
  • 3 TB outgoing internet traffic (based on our DBpedia statistics)

vendor machine type memory vCPUs monthly machine monthly disk monthly network monthly total
Amazon r5a.4xlarge 128 GiB 16 $479.61 $55.96 $276.48 $812.05
Google e2highmem-16 128 GiB 16 $594.55 $95.74 $255.00 $945.30
Azure D32a 128 GiB 32 $769.16 $38.40 $252.30 $1,060.06

Note that all prices in this table are per month.

Virtuoso INI File Settings

We updated the following settings in the virtuoso.ini before loading the data.

In the [Parameters] section:

Item Value
NumberOfBuffers 30,000,000

After the database was loaded and running for a few days, we determined it was only using around 18 million buffers instead of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80 GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could further reduce the memory footprint without too much performance loss.
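For reference, the relevant fragment of virtuoso.ini would look roughly like this. The MaxDirtyBuffers line is our addition here (it is not stated in the original post); it is conventionally set to about 3/4 of NumberOfBuffers:

```ini
[Parameters]
; database pages held in memory for disk buffering
NumberOfBuffers  = 20000000
; assumption: commonly recommended at ~3/4 of NumberOfBuffers
MaxDirtyBuffers  = 15000000
```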

In the [SPARQL] section:

Item Value
LabelInferenceName facets

Recently we added support for automatic label inferencing during the bulk-load, which uses additional threads to process the labels. This replaces the manual generation of FCT labels using the urilbl_ac_init_db() function which took around 10 hours during the previous load.
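In virtuoso.ini this setting corresponds to:

```ini
[SPARQL]
; derive the FCT facet labels automatically during the bulk load
LabelInferenceName = facets
```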

Preparing the datasets

We downloaded the Wikidata dumps from around 2022-12-28.

Filename Size (bytes)
latest-all.nt.gz 200,051,013,962
latest-lexemes.nt.gz 900,282,630

Although the file latest-all.nt.gz, which is 186 GB compressed, could in theory be loaded with a single loader, we decided to split this file into smaller fragments. This allowed us to run multiple loaders in parallel, and thereby use multiple CPU cores to load the data into the database.

Initially, we tried to use the munch.sh script from the wikidata-query-tools tree to convert the data from NTriples to Turtle, and write them in chunks of 50,000 entities into separate files that can be loaded. However, this Java-based application would have taken several days to finish, as it is a single-threaded program.

Eventually we used a small Perl script that used 3 separate processes to split the file into 5639 parts, each about 42 MB in size. This took a total of 11 hours and 35 minutes.

As there are some known issues with geodata in this dataset, we removed the offending triples during the splitting process.
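Because N-Triples is line-oriented (one complete triple per line), a splitter only needs to count lines and can drop bad triples on the fly, unlike Turtle, which carries multi-line parser state. A minimal single-process sketch of the approach is below; the regular expression for the malformed geodata is a hypothetical stand-in, and the chunk size is tiny for illustration (the actual job used a 3-process Perl script):

```python
import gzip
import re
from pathlib import Path

# Hypothetical stand-in for the malformed-geodata pattern; the real
# filter used during the split is not shown in the original post.
BAD_GEO = re.compile(r'Point\(\s*\)')

def split_ntriples(src: str, out_dir: str, lines_per_part: int = 3_000_000) -> int:
    """Split a gzipped N-Triples dump into gzipped parts, dropping bad lines.

    Returns the number of part files written. Splitting on line boundaries
    always yields valid N-Triples fragments.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    part, count, dst = 0, 0, None
    with gzip.open(src, "rt", encoding="utf-8") as fin:
        for line in fin:
            if BAD_GEO.search(line):
                continue  # drop known-bad triples during the split
            if dst is None or count >= lines_per_part:
                if dst:
                    dst.close()
                dst = gzip.open(out / f"part-{part:05d}.nt.gz", "wt",
                                encoding="utf-8")
                part, count = part + 1, 0
            dst.write(line)
            count += 1
    if dst:
        dst.close()
    return part
```

Running several such processes over byte ranges of the input is what brings the wall-clock time down, as the Perl version did.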

Bulk load Statistics

Description Value
Number of parallel loaders 6
Duration 13 hours 30 minutes
Total number of triples 17,888,903,520
Average triples/second 367,523
Average CPU percentage 1,792%
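The parallel load follows the standard Virtuoso RDF bulk loader pattern: register the split files once, then start one rdf_loader_run() per isql session (six in this case). A sketch, where /data/wikidata/parts is an assumed directory:

```sql
-- register all split files for the target graph (run once)
ld_dir ('/data/wikidata/parts', 'part-*.nt.gz', 'http://www.wikidata.org/');

-- run this in each of the 6 parallel isql sessions
rdf_loader_run ();

-- after all loaders have finished, make the load durable
checkpoint;
```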

Additional tasks

Description Value
Setting up prefixes and labels 2 min
Loading ontologies 2 min
Loading KBpedia 1 min
FCT calculating labels calculated during the bulkload
Freetext index 24 hours
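The free-text index over RDF object values can be enabled with Virtuoso's documented rule-table call; a sketch, applying the rule to all graphs:

```sql
-- index string objects in all graphs for free-text search
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');
-- synchronize the text index with already-loaded data
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
```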

Graphs and triple counts

Main graphs

Graph Description Triple Count
http://www.wikidata.org/ the Wikidata dataset 17,742,853,944
http://www.wikidata.org/lexemes/ the Wikidata Lexemes dataset 141,960,664
urn:wikidata:labels calculated Wikidata labels 1,278,609

Ontology graphs

Graph Description Triple Count
http://wikiba.se/ontology-1.0.owl the Wikidata base ontology 288
https://schema.org/version/latest/schemaorg-current-http.nt the http://schema.org/ ontology 16,248

Knowledge Base graphs

Graph Description Triple Count
urn:kbpedia:2.50:reference:concepts:linkage:inferrence:extended the KBpedia extended graph 1,175,147
urn:kbpedia:2.50:reference:concepts:linkage the KBpedia linkage graph 850,711
urn:kbpedia:2.50:reference:concepts the KBpedia concepts graph 761,092

Link sets

At this point, we did not add any other link sets to the database.

Wikidata VoID graph

Description Value
graph name http://www.wikidata.org/void/
SPARQL endpoint https://wikidata.demo.openlinksw.com/sparql/
number of triples 17,742,853,944
number of classes 1,339
number of entities 1,884,546,246
number of distinct subjects 1,888,457,596
number of properties 48,841
number of distinct objects 3,255,942,590
number of owl:sameAs links 3,910,462




I am using the SPARQL interface to get statistical information about this dataset.
Unfortunately, I received timeouts on some queries that are important to me, and I would like to know whether you have this information pre-calculated and could provide it.

  1. How many statements have no references?
  2. How many statements do not have qualifiers?
  3. How many statements have references but no qualifiers?
  4. How many statements have qualifiers but no references?
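For reference, a query of the first kind might look like the sketch below, which counts statement nodes (everything carrying a wikibase:rank) that lack any prov:wasDerivedFrom reference. On a dataset of this size such an aggregate must scan billions of statement nodes, which is the likely cause of the timeouts; the exact shape may need adjusting to the loaded data model:

```sparql
PREFIX prov:     <http://www.w3.org/ns/prov#>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT (COUNT(DISTINCT ?statement) AS ?cnt)
WHERE {
  ?statement wikibase:rank ?rank .
  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref }
}
```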

Thank you in advance.


What are the queries being run, presumably against the OpenLink Wikidata SPARQL endpoint, that are timing out? Please provide the query URLs so that they can be clicked on directly to see the problem against the endpoint and view the queries.