Loading the Wikidata 2024/04 dataset into Virtuoso Open Source

Copyright © 2024 OpenLink Software

Introduction

This guide covers deploying a Wikidata instance on the latest Virtuoso Open Source Edition:

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.13-dev.3239-pthreads as of May  5 2024 (77554e474d)
Compiled for Linux (x86_64-centos_6-linux-gnu)
Copyright (C) 1998-2024 OpenLink Software

Once loaded, the following URLs will be functional:

Hardware

We use the following configuration for a single host machine.

Item Value
CPU 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores 24
Memory 378 GB
SSD 4x Crucial M4 SSD 500 GB

This system is shared by several other demo databases, so we could not use all available resources on it.

Cloud cost estimates

Cloud offerings are hard to compare like for like, so we used each vendor's cost calculator to estimate monthly costs for the following minimal machine specification:

  • dedicated machine for 1 year without upfront costs
  • 128 GiB memory
  • 16 cores or more
  • 1 TB SSD for the database
  • 3 TB outgoing internet traffic (based on our DBpedia statistics)

Vendor Machine type Memory vCPUs Monthly machine Monthly disk Monthly network Monthly total
Amazon r6g.4xlarge 128 GiB 16 $406.17 $81.92 $276.48 $764.57
Google e2-highmem-16 128 GiB 16 $506.68 $220.49 $727.17
Azure D32a 128 GiB 32 $659.34 $76.80 $252.30 $988.44
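
As a quick sanity check on the figures above, the sketch below re-adds the monthly components for each vendor and annualizes them. The Google row lists only one amount besides the machine cost, and the table does not say what it covers, so it is kept as a single unlabelled component. The dictionary layout and function name are illustrative only.

```python
from decimal import Decimal

# Monthly cost components (USD) copied from the table above.
# The Google row shows one amount besides the machine cost; the table does
# not say whether it covers disk, network, or both, so it stays unlabelled.
COMPONENTS = {
    "Amazon": ["406.17", "81.92", "276.48"],
    "Google": ["506.68", "220.49"],
    "Azure": ["659.34", "76.80", "252.30"],
}

def monthly_total(vendor: str) -> Decimal:
    """Sum the listed monthly components for one vendor."""
    return sum((Decimal(x) for x in COMPONENTS[vendor]), Decimal("0"))

for vendor in COMPONENTS:
    total = monthly_total(vendor)
    print(f"{vendor}: ${total} per month, ${total * 12} per year")
```

Decimal arithmetic is used so that the sums match the table to the cent, with no floating-point rounding noise.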

Virtuoso INI File Settings

We updated the following settings in the virtuoso.ini before loading the data.

In the [Parameters] section:

Item Value
NumberOfBuffers 30,000,000

After the database had been loaded and running for a few days, we found it was using only around 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80 GB of memory in use for disk buffering. Because the database is spread over multiple SSDs, we could reduce the memory footprint further without much performance loss.

In the [SPARQL] section:

Item Value
LabelInferenceName facets

We recently added support for automatic label inferencing during the bulk load, which uses additional threads to process the labels. This replaces the manual generation of FCT labels with the urilbl_ac_init_db() function, which took around 10 hours during the previous load.

Preparing the datasets

We downloaded the Wikidata dumps from around 2024-04-26.

Filename Size (bytes)
latest-all.nt.gz 222,980,592,521
latest-lexemes.nt.gz 1,098,482,182

Although the file latest-all.nt.gz, which is 223 GB compressed, could in theory be loaded with a single loader, we decided to split this file into smaller fragments. This allowed us to run multiple loaders in parallel, and thereby use multiple CPU cores to load the data into the database.

We used a small Perl script, running 3 separate processes, to split the files into 6,254 parts of about 42 MB each. This took a little over 10 hours.
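
The splitting step can be sketched as follows. This is not the Perl script used for the load: it is a single-process Python illustration, and the function name, output naming scheme, and lines-per-part parameter are assumptions. It relies on N-Triples being line-oriented, so the dump can be cut at any line boundary and each part remains a valid N-Triples file.

```python
import gzip
import os

def split_ntriples(src_path: str, out_dir: str, lines_per_part: int) -> list[str]:
    """Split a gzipped N-Triples file into gzipped parts on line boundaries."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []   # paths of the parts written so far
    out = None   # currently open output file, if any
    with gzip.open(src_path, "rb") as src:
        for i, line in enumerate(src):
            # Start a new part every lines_per_part lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(parts):05d}.nt.gz")
                out = gzip.open(path, "wb")
                parts.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

A call such as `split_ntriples("latest-all.nt.gz", "parts", 3_000_000)` would write `parts/part-00000.nt.gz` onwards. The lines-per-part value here is hypothetical and would need tuning to reach a given part size.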

The current version of Virtuoso Open Source supports the special geodata encodings used by Wikidata, so we no longer have to filter them out separately and can bulk-load all the data at once.
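
Running several loaders in parallel might look like the sketch below. It assumes the standard Virtuoso bulk loader workflow, in which files are first registered with `ld_dir()` and each `rdf_loader_run()` call then acts as one loader. The commands actually used for this load are not shown in this guide. The `isql` binary name, port 1111, the `dba` credentials, and the helper names are placeholders.

```python
import subprocess

def isql_command(sql: str, port: int = 1111, user: str = "dba",
                 password: str = "dba", isql: str = "isql") -> list[str]:
    """Build an argv list that runs one SQL statement through Virtuoso's isql."""
    return [isql, str(port), user, password, f"exec={sql}"]

def run_parallel_loaders(n_loaders: int, data_dir: str, graph: str) -> None:
    """Register files, run n_loaders bulk loaders concurrently, then checkpoint."""
    # Register every matching file in the bulk loader's load list.
    register = f"ld_dir('{data_dir}', '*.nt.gz', '{graph}');"
    subprocess.run(isql_command(register), check=True)
    # Each rdf_loader_run() takes not-yet-loaded files from the shared list.
    procs = [subprocess.Popen(isql_command("rdf_loader_run();"))
             for _ in range(n_loaders)]
    for p in procs:
        p.wait()
    # Make the loaded data durable.
    subprocess.run(isql_command("checkpoint;"), check=True)
```

For a setup like the one described here, this could be called as `run_parallel_loaders(6, "/data/wikidata/parts", "http://www.wikidata.org/")`, where the directory path is hypothetical and must be listed under DirsAllowed in virtuoso.ini.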

Bulk load Statistics

Description                 Value
Number of parallel loaders  6
Duration                    15 hours 50 minutes
Total number of triples     19,783,556,955
Average triples/second      346,769
Average CPU percentage      1,599%
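
The average rate can be cross-checked against the other figures in this table. The sketch below divides the triple count by the duration as stated (rounded to the minute) and also works back from the reported rate to the duration it implies. The variable names are illustrative.

```python
total_triples = 19_783_556_955
reported_rate = 346_769               # triples/second, as reported
duration_s = 15 * 3600 + 50 * 60      # 15 h 50 min = 57,000 s

# Rate implied by the duration rounded to the minute.
rate_from_duration = total_triples / duration_s
# Duration implied by the reported average rate.
implied_duration_s = total_triples / reported_rate

print(f"rate from rounded duration: {rate_from_duration:,.0f} triples/s")
print(f"duration implied by reported rate: {implied_duration_s:,.0f} s")
```

The two results differ by well under a minute of load time, which is consistent with the duration having been rounded to whole minutes.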

Additional tasks

Description                     Value
Setting up prefixes and labels  2 min
Loading ontologies              2 min
Calculating FCT labels          done during the bulk load
Free-text index                 24 hours

Graphs and triple counts

Main graphs

Graph                             Description                   Triple Count
http://www.wikidata.org/          the Wikidata dataset          19,524,435,178
http://www.wikidata.org/lexemes/  the Wikidata Lexemes dataset  169,677,210
urn:wikidata:labels               calculated Wikidata labels    1,454,421

Ontology graphs

Graph                                                        Description                      Triple Count
http://wikiba.se/ontology-1.0.owl                            the Wikidata base ontology       251
https://schema.org/version/latest/schemaorg-current-http.nt  the http://schema.org/ ontology  16,593

Knowledge Base graphs

At this time we have not loaded any third-party knowledge graphs, such as KBpedia.

Link sets

Graph                        Description             Triple Count
urn:wikidata:dbpedia:about   DBpedia owl:sameAs      33,860,047
urn:wikidata:dbpedia:schema  DBpedia schema:Article  54,106,768
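
The per-graph counts listed so far can be added up and compared with the bulk-load total of 19,783,556,955 triples. The sketch below does only that arithmetic. The dictionary is a transcription of the main, ontology, and link-set graph tables, and the guide does not say which graphs account for the small remainder.

```python
# Triple counts per graph, transcribed from the tables in this guide.
GRAPH_COUNTS = {
    "http://www.wikidata.org/": 19_524_435_178,
    "http://www.wikidata.org/lexemes/": 169_677_210,
    "urn:wikidata:labels": 1_454_421,
    "http://wikiba.se/ontology-1.0.owl": 251,
    "https://schema.org/version/latest/schemaorg-current-http.nt": 16_593,
    "urn:wikidata:dbpedia:about": 33_860_047,
    "urn:wikidata:dbpedia:schema": 54_106_768,
}
BULK_LOAD_TOTAL = 19_783_556_955  # total reported in the bulk load statistics

listed = sum(GRAPH_COUNTS.values())
print(f"sum of listed graphs: {listed:,}")
print(f"not covered by the listed graphs: {BULK_LOAD_TOTAL - listed:,}")
```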

Wikidata VoID graph

Description                  Value
graph name                   http://www.wikidata.org/void/
SPARQL endpoint              https://wikidata.demo.openlinksw.com/sparql/
number of triples            19,524,435,179
number of classes            1,601
number of entities           2,049,500,774
number of distinct subjects  2,053,812,504
number of properties         54,902
number of distinct objects   3,640,746,976
number of owl:sameAs links   4,310,820
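
Figures like these can be spot-checked against the SPARQL endpoint listed above. The sketch below builds a standard SPARQL Protocol GET request that counts the triples in one named graph. The helper names are illustrative, and whether the public endpoint answers a count over the largest graph within its time limits is not something this guide states.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://wikidata.demo.openlinksw.com/sparql/"

def count_query(graph: str) -> str:
    """Return a SPARQL query counting the triples in one named graph."""
    return (f"SELECT (COUNT(*) AS ?triples) "
            f"WHERE {{ GRAPH <{graph}> {{ ?s ?p ?o }} }}")

def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})

def fetch_count(graph: str) -> int:
    """Run the count query against the endpoint (needs network access)."""
    with urllib.request.urlopen(build_request(count_query(graph))) as resp:
        data = json.load(resp)
    return int(data["results"]["bindings"][0]["triples"]["value"])
```

Starting with a small graph, for example `fetch_count("http://wikiba.se/ontology-1.0.owl")`, keeps the query cheap before trying the multi-billion-triple graphs.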

Related Links