Loading the Wikidata 2024/04 dataset into Virtuoso Open Source
Copyright © 2024 OpenLink Software
Introduction
Here’s a simple guide covering deployment of a Wikidata instance using the latest Virtuoso Open Source Edition:
Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.13-dev.3239-pthreads as of May 5 2024 (77554e474d)
Compiled for Linux (x86_64-centos_6-linux-gnu)
Copyright (C) 1998-2024 OpenLink Software
Once loaded, the following URLs will be functional:
Hardware
We use the following configuration for a single host machine.
Item | Value |
---|---|
CPU | 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz |
Cores | 24 |
Memory | 378 GB |
SSD | 4x Crucial M4 SSD 500 GB |
This system is shared by several other demo databases, so we could not use all available resources on it.
Cloud cost estimates
Cloud offerings are hard to compare directly, so we used each vendor's own cost calculator to estimate monthly costs, based on the following minimal machine specifications:
- dedicated machine for 1 year without upfront costs
- 128 GiB memory
- 16 cores or more
- 1 TB SSD for the database
- 3 TB outgoing internet traffic (based on our DBpedia statistics)
vendor | machine type | memory | vCPUs | monthly machine | monthly disk | monthly network | monthly total |
---|---|---|---|---|---|---|---|
Amazon | r6g.4xlarge | 128 GiB | 16 | $406.17 | $81.92 | $276.48 | $764.57 |
Google | e2-highmem-16 | 128 GiB | 16 | $506.68 | – | $220.49 | $727.17 |
Azure | D32a | 128 GiB | 32 | $659.34 | $76.80 | $252.30 | $988.44 |
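The monthly totals in the table are the sum of the machine, disk, and network columns. As a quick sanity check, here is a minimal Python sketch that recomputes them from the figures above. It assumes the disk cell shown as "–" in the e2-highmem-16 row contributes nothing to that row's total; the dictionary and function names are illustrative only.

```python
from decimal import Decimal

# Monthly figures in USD, copied from the table above.
# Assumption: the "–" disk cell in the e2-highmem-16 row counts as 0.
ESTIMATES = {
    "r6g.4xlarge": ("406.17", "81.92", "276.48"),
    "e2-highmem-16": ("506.68", "0", "220.49"),
    "D32a": ("659.34", "76.80", "252.30"),
}


def monthly_total(machine, disk, network):
    """Add the three monthly cost components exactly, using Decimal."""
    return Decimal(machine) + Decimal(disk) + Decimal(network)


totals = {name: monthly_total(*parts) for name, parts in ESTIMATES.items()}

for name, total in totals.items():
    print(f"{name}: ${total} per month")
```

Decimal is used rather than float so that the cent values add up without binary rounding noise.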
Virtuoso INI File Settings
We updated the following settings in the virtuoso.ini before loading the data.
In the [Parameters] section:
Item | Value |
---|---|
NumberOfBuffers | 30,000,000 |
After the database was loaded and running for a few days, we determined it was only using around 18 million buffers instead of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80GB memory in use for disk buffering. As we are using multiple SSD drives for the database, we could further reduce the memory footprint without too much performance loss.
In the [SPARQL] section:
Item | Value |
---|---|
LabelInferenceName | facets |
Recently we added support for automatic label inferencing during the bulk load, which uses additional threads to process the labels. This replaces the manual generation of FCT labels using the urilbl_ac_init_db() function, which took around 10 hours during the previous load.
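The two settings above can be edited by hand. For scripted deployments, the following Python sketch shows one possible way to apply them with the standard configparser module. It is not part of the procedure described here: the helper name apply_settings and the sample INI text are hypothetical, and the sketch assumes the values are written as plain integers without thousands separators. Note that configparser drops comments when it writes the file back, so keep a copy of the original virtuoso.ini.

```python
import configparser
import io


def apply_settings(ini_text, settings):
    """Return ini_text with the given {section: {key: value}} entries applied."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # preserve key case, e.g. NumberOfBuffers
    parser.read_string(ini_text)
    for section, values in settings.items():
        if not parser.has_section(section):
            parser.add_section(section)
        for key, value in values.items():
            parser.set(section, key, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical starting point; a real virtuoso.ini has many more entries.
SAMPLE_INI = """\
[Parameters]
NumberOfBuffers = 10000

[SPARQL]
ResultSetMaxRows = 10000
"""

updated = apply_settings(
    SAMPLE_INI,
    {
        "Parameters": {"NumberOfBuffers": 30000000},
        "SPARQL": {"LabelInferenceName": "facets"},
    },
)
print(updated)
```

The server has to be restarted for changes in virtuoso.ini to take effect.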
Preparing the datasets
We downloaded the Wikidata dumps from around 2024-04-26.
Filename | Size (bytes) |
---|---|
latest-all.nt.gz | 222,980,592,521 |
latest-lexemes.nt.gz | 1,098,482,182 |
Although the file latest-all.nt.gz, which is 223 GB compressed, could in theory be loaded with a single loader, we decided to split this file into smaller fragments. This allowed us to run multiple loaders in parallel, and thereby use multiple CPU cores to load the data into the database.
We used a small Perl script that ran 3 separate processes to split the files into 6254 parts, each about 42 MB in size. This took about 10 hours in total.
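The Perl script itself is not shown here. As an illustration of the general idea only, the Python sketch below splits a gzipped N-Triples file into smaller gzipped parts. Because N-Triples stores one triple per line, cutting on line boundaries keeps every part a valid file. The function name, the part-file naming scheme, and the choice to split by line count rather than by target size are all assumptions of this sketch, and it runs as a single process rather than three.

```python
import gzip
import os


def split_nt_gz(src_path, out_dir, lines_per_part):
    """Split a gzipped N-Triples file into gzipped parts on line boundaries.

    Returns the list of part file paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    try:
        with gzip.open(src_path, "rb") as src:
            for line_no, line in enumerate(src):
                # Start a new part every lines_per_part lines.
                if line_no % lines_per_part == 0:
                    if out is not None:
                        out.close()
                    name = f"part-{len(part_paths):06d}.nt.gz"
                    path = os.path.join(out_dir, name)
                    out = gzip.open(path, "wb")
                    part_paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

A call such as split_nt_gz("latest-all.nt.gz", "parts", 2_000_000) would write the parts into a parts directory; the line count is a placeholder to be tuned until the parts reach the desired size.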
The current version of Virtuoso Open Source supports the special geodata encodings used by Wikidata, so we no longer have to filter them out separately and can bulk load all the data at once.
Bulk load Statistics
Description | Value |
---|---|
Number of parallel loaders | 6 |
Duration | 15 hours 50 minutes |
Total number of triples | 19,783,556,955 |
Average triples/second | 346,769 |
Average CPU percentage | 1,599% |
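The average rate in the table follows from the triple count and the load duration. The short Python sketch below redoes that arithmetic from the rounded figures above; the variable names are illustrative, and reading the CPU percentage as "100% per fully busy core" is an assumption based on how tools such as top usually report it.

```python
# Figures copied from the bulk load statistics table.
total_triples = 19_783_556_955
duration_seconds = 15 * 3600 + 50 * 60  # 15 hours 50 minutes
parallel_loaders = 6
average_cpu_percent = 1599

# Overall and per-loader throughput.
triples_per_second = total_triples / duration_seconds
triples_per_second_per_loader = triples_per_second / parallel_loaders

# Assumption: 100% corresponds to one fully busy core.
busy_cores = average_cpu_percent / 100

print(f"{triples_per_second:,.0f} triples/s overall")
print(f"{triples_per_second_per_loader:,.0f} triples/s per loader")
print(f"about {busy_cores:.0f} cores busy on average")
```

With the duration rounded to whole minutes, this gives a rate slightly above the reported 346,769 triples per second; the reported value corresponds to a load that ran a little under a minute longer than 15 hours 50 minutes.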
Additional tasks
Description | Value |
---|---|
Setting up prefixes and labels | 2 min |
Loading ontologies | 2 min |
FCT calculating labels | calculated during the bulk load |
Freetext index | 24 hours |
Graphs and triple counts
Main graphs
Graph | Description | Triple Count |
---|---|---|
http://www.wikidata.org/ | the Wikidata dataset | 19,524,435,178 |
http://www.wikidata.org/lexemes/ | the Wikidata Lexemes dataset | 169,677,210 |
urn:wikidata:labels | calculated Wikidata labels | 1,454,421 |
Ontology graphs
Graph | Description | Triple Count |
---|---|---|
http://wikiba.se/ontology-1.0.owl | the Wikidata base ontology | 251 |
https://schema.org/version/latest/schemaorg-current-http.nt | the http://schema.org/ ontology | 16,593 |
Knowledge Base graphs
At this time we have not loaded any third-party knowledge graphs such as KBpedia.
Link sets
Graph | Description | Triple Count |
---|---|---|
urn:wikidata:dbpedia:about | DBpedia owl:sameAs | 33,860,047 |
urn:wikidata:dbpedia:schema | DBpedia schema:Article | 54,106,768 |
Wikidata VoID graph
Description | Value |
---|---|
graph name | http://www.wikidata.org/void/ |
SPARQL endpoint | https://wikidata.demo.openlinksw.com/sparql/ |
number of triples | 19,524,435,179 |
number of classes | 1,601 |
number of entities | 2,049,500,774 |
number of distinct subjects | 2,053,812,504 |
number of properties | 54,902 |
number of distinct objects | 3,640,746,976 |
number of owl:sameAs links | 4,310,820 |
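The statistics above can in principle be read back from the VoID graph through the SPARQL endpoint. The Python sketch below only builds such a request using the standard SPARQL protocol (a GET with a query parameter) and does not send it. The endpoint and graph name are taken from the table; the assumption that the graph records its triple count with the standard void:triples property, and the helper name build_request, are not confirmed by this article.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://wikidata.demo.openlinksw.com/sparql/"
VOID_GRAPH = "http://www.wikidata.org/void/"

# Assumption: the VoID graph describes datasets with void:triples.
QUERY = f"""
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples
FROM <{VOID_GRAPH}>
WHERE {{ ?dataset void:triples ?triples }}
"""


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would execute the query against the live endpoint and return the result set as JSON.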
Related Links
- Wikidata
- Wikidata Dumps — including details about the NTriples based Dataset
- Wikidata VoID graph
- Wikidata Lexemes VoID graph
- OpenLink Virtuoso
- Virtuoso Faceted Browser
- Loading DBpedia into Virtuoso (Open Source or Enterprise Edition)
- Loading the Wikidata dataset 2022/12 into Virtuoso Open Source