Introduction
This guide covers how we built a Wikidata instance deployed on Virtuoso, and what the database looks like once loaded:
Hardware
Item | Value |
---|---|
CPU | 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz |
Cores | 24 |
Memory | 378 GB |
SSD | 4x Crucial M4 SSD 500 GB |
Virtuoso INI File Settings
Item | Value |
---|---|
NumberOfBuffers | 30,000,000 |
After the database had been loaded and running for a few days, we found it was using only about 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80 GB of memory in use for disk buffering. Because the database is spread across multiple SSD drives, we could reduce the memory footprint further without much performance loss.
Preparing the datasets
We downloaded the Wikidata dumps from about 2020-03-01.
Filename | Size (bytes) |
---|---|
latest-all.nt.bz2 | 103,207,309,995 |
latest-lexemes.nt.bz2 | 229,470,813 |
Although the file latest-all.nt.bz2, which is 96 GB compressed, could in theory be loaded with a single loader, we decided to split it into smaller fragments. This allowed us to run multiple loaders in parallel and thereby use multiple CPU cores to load the data into the database.
We used a small Perl script to split the file into 3765 parts, each about 37 MB in size. Since decompressing a bz2 file cannot be done multi-threaded, this unfortunately took several days.
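The splitting step can be sketched as follows. This is a minimal Python illustration, not the Perl script we used; the function name `split_ntriples`, the output naming scheme, and the `lines_per_part` default are all assumptions made for the example. It relies on the fact that N-Triples stores one triple per line, so cutting on line boundaries keeps each part independently loadable.

```python
import bz2
from pathlib import Path


def split_ntriples(src, out_dir, lines_per_part=1_000_000):
    """Stream a bz2-compressed N-Triples file into numbered bz2 parts.

    Hypothetical helper: the part size and file names are illustrative.
    Returns the list of part paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    # The input is read as a single sequential stream, one line at a time.
    with bz2.open(src, "rb") as fin:
        for i, line in enumerate(fin):
            if i % lines_per_part == 0:
                # Start a new part every lines_per_part lines.
                if out is not None:
                    out.close()
                path = out_dir / f"part-{len(parts):05d}.nt.bz2"
                parts.append(path)
                out = bz2.open(path, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

The input is decompressed sequentially here as well, so a sketch like this does not avoid the single-threaded cost described above; it only produces fragments that can later be loaded in parallel.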
As there are some known issues with Geodata in this dataset, we then split out the Geodata triples to a different set of dumps to be processed separately, which only took about 2 hours using 10 parallel processes to speed up the work.
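A filter of this kind might look like the sketch below. It assumes that a "Geodata triple" is any line whose object is a GeoSPARQL WKT literal, which is how Wikidata serializes coordinates; the exact criterion we used is not stated here, and the names `WKT_MARKER` and `split_geodata` are hypothetical.

```python
import bz2

# Assumption: a Geodata triple is one whose object carries this datatype.
WKT_MARKER = b"^^<http://www.opengis.net/ont/geosparql#wktLiteral>"


def split_geodata(src, main_dst, geo_dst):
    """Copy WKT-literal triples to geo_dst and every other line to main_dst.

    All three files are bz2-compressed N-Triples.
    Returns a (main_count, geo_count) tuple.
    """
    main_count = geo_count = 0
    with bz2.open(src, "rb") as fin, \
         bz2.open(main_dst, "wb") as fmain, \
         bz2.open(geo_dst, "wb") as fgeo:
        for line in fin:
            if WKT_MARKER in line:
                fgeo.write(line)
                geo_count += 1
            else:
                fmain.write(line)
                main_count += 1
    return main_count, geo_count
```

Because each fragment is processed independently, a function like this can be mapped over the parts with a pool of worker processes, for example via `multiprocessing.Pool(10)`.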
Bulk load Statistics
Description | Value |
---|---|
Number of parallel loaders | 8 |
Duration | 10 hours |
Total number of triples | 11,894,354,985 |
Average triples/second | 331,574 |
Average CPU/second | 1,881% |
Number of Database pages in use | 48,503,816 pages |
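As a sanity check, the figures in this table can be related to each other with a little arithmetic. The sketch below assumes that "Average CPU/second" of 1,881% means roughly 18.8 cores busy on average, and takes the 24 cores from the hardware table; the function name `load_summary` is hypothetical.

```python
def load_summary(total_triples, triples_per_second, loaders, cpu_percent, cores):
    """Derive a few secondary figures from the bulk load statistics."""
    seconds = total_triples / triples_per_second
    busy_cores = cpu_percent / 100  # assumption: 100% equals one busy core
    return {
        "hours": seconds / 3600,
        "triples_per_second_per_loader": triples_per_second / loaders,
        "busy_cores": busy_cores,
        "cpu_utilisation": busy_cores / cores,
    }


summary = load_summary(
    total_triples=11_894_354_985,
    triples_per_second=331_574,
    loaders=8,
    cpu_percent=1_881,
    cores=24,
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The derived duration comes out at about 9.96 hours, which agrees with the 10 hours reported above.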
Post-load task | Duration |
---|---|
Setting up prefixes and labels | 2 min |
FCT calculating labels | 10 hours |
Freetext index | 1 day as a background task |
Graphs and triple counts
The main graphs are:
Graph | Description | Triple Count |
---|---|---|
http://www.wikidata.org/ | the triples from latest-all.nt.bz2 | 11,857,528,152 |
http://www.wikidata.org/lexemes/ | the triples from latest-lexemes.nt.bz2 | 42,641,432 |
urn:wikidata:labels | the Wikidata property labels | 836,131 |
http://wikiba.se/ontology-1.0.owl | the Wikidata ontology | 280 |
We added the following link sets to the database:
Graph | Description | Triple Count |
---|---|---|
http://yago.r2.enst.fr/data/yago4/en/2019-12-10/ | the Yago knowledge graph | 199,575,815 |
urn:wikidata:dbpedia:schema | the DBpedia link set (schema form) | 27,897,734 |
urn:wikidata:dbpedia:about | the DBpedia link set | 6,974,651 |
urn:kbpedia:concepts_linkage:inferrence_extended | the KBpedia link set | 1,152,051 |
Wikidata VoID graph
Description | Value |
---|---|
graph name | http://www.wikidata.org/void/ |
SPARQL endpoint | http://wikidata.demo.openlinksw.com/sparql/ |
number of triples | 11,857,528,152 |
number of classes | 925 |
number of entities | 1,311,652,842 |
number of distinct subjects | 1,314,371,674 |
number of properties | 32,647 |
number of distinct objects | 1,987,044,958 |
number of owl:sameAs links | 2,717,989 |
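The VoID graph can be inspected through the SPARQL endpoint listed above. The following is a minimal sketch using only the Python standard library and the standard SPARQL protocol (`query` parameter plus an `Accept` header). The helper names `build_request` and `fetch_bindings` are hypothetical, and the sketch assumes the endpoint is still reachable at that address.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://wikidata.demo.openlinksw.com/sparql/"
VOID_GRAPH = "http://www.wikidata.org/void/"


def build_request(endpoint, graph, limit=50):
    """Build a GET request that lists up to `limit` triples from one graph."""
    query = (
        f"SELECT ?s ?p ?o FROM <{graph}> "
        f"WHERE {{ ?s ?p ?o }} LIMIT {int(limit)}"
    )
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def fetch_bindings(request, timeout=30):
    """Send the request and return the list of result bindings."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `fetch_bindings(build_request(ENDPOINT, VOID_GRAPH))` would return the first rows of the VoID description, each a dictionary with `s`, `p`, and `o` entries. Keeping the `LIMIT` small is deliberate, since unbounded queries against a store of this size may time out.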