Here’s a simple guide that covers how we built a Wikidata instance deployed using Virtuoso. Once loaded, you end up with the following:
|CPU||2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz|
|SSD||4x Crucial M4 SSD 500 GB|
After the database had been loaded and running for a few days, we determined it was only using around 18 million buffers instead of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This is equivalent to about 80 GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could further reduce the memory footprint without too much performance loss.
We downloaded the Wikidata dumps from about 2020-03-01.
Although the file latest-all.nt.bz2, which is 96 GB compressed, could in theory be loaded with a single loader, we decided to split this file into smaller fragments. This lets us run multiple loaders in parallel and therefore use multiple CPU cores to load the data into the database.
We used a small Perl script to split the file into 3765 parts, each about 37 MB in size. Since decompressing a bz2 file is not a task that can be performed multithreaded, this unfortunately took several days.
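The Perl script itself is not shown here. As a rough illustration of the idea, the following Python sketch streams a compressed N-Triples dump and writes it out as numbered, compressed parts. It relies on the fact that N-Triples holds one triple per line, so cutting on line boundaries never breaks a triple. The function name `split_ntriples`, the `part-NNNNN.nt.bz2` naming scheme, and the choice to split by line count rather than by size are assumptions made for this example, not what we actually ran.

```python
import bz2
from pathlib import Path


def split_ntriples(src, out_dir, lines_per_part=1_000_000, prefix="part"):
    """Split a bz2-compressed N-Triples file into smaller bz2 parts.

    Hypothetical helper: names and defaults are illustrative only.
    Returns the list of part paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    count = 0
    try:
        # Decompression is a single sequential stream, so this loop is
        # the slow, single-threaded step described above.
        with bz2.open(src, "rt", encoding="utf-8") as fin:
            for line in fin:
                if out is None:
                    path = out_dir / f"{prefix}-{len(paths):05d}.nt.bz2"
                    out = bz2.open(path, "wt", encoding="utf-8")
                    paths.append(path)
                out.write(line)
                count += 1
                if count >= lines_per_part:
                    out.close()
                    out = None
                    count = 0
    finally:
        # Close the last, possibly partial, part.
        if out is not None:
            out.close()
    return paths
```

Called as `split_ntriples("latest-all.nt.bz2", "parts")`, this would produce a directory of independent files that separate loaders can pick up in parallel.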
As there are some known issues with geosparql data in this dataset, we then split out the geosparql triples into a separate set of dumps to be processed separately. This only took about 2 hours, using 10 parallel processes to speed up the work.
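A minimal sketch of this filtering step might look as follows. It assumes that a geosparql triple can be recognised by the GeoSPARQL namespace `http://www.opengis.net/ont/geosparql#` appearing anywhere on the line (for example in a `wktLiteral` datatype), which is a simplification. The function names and the `.geo` / `.rest` output naming are hypothetical and only serve the example.

```python
import bz2
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# Assumption: any line mentioning this namespace counts as a geosparql triple.
GEO_MARKER = "http://www.opengis.net/ont/geosparql#"


def split_out_geosparql(src):
    """Write geosparql triples and all other triples of one part to two files.

    For an input named X.nt.bz2 this writes X.geo.nt.bz2 and X.rest.nt.bz2
    next to it and returns (geo_count, rest_count).
    """
    src = Path(src)
    suffix = ".nt.bz2"
    base = src.name[:-len(suffix)] if src.name.endswith(suffix) else src.name
    geo_dst = src.with_name(base + ".geo.nt.bz2")
    rest_dst = src.with_name(base + ".rest.nt.bz2")
    n_geo = n_rest = 0
    with bz2.open(src, "rt", encoding="utf-8") as fin, \
         bz2.open(geo_dst, "wt", encoding="utf-8") as geo, \
         bz2.open(rest_dst, "wt", encoding="utf-8") as rest:
        for line in fin:
            if GEO_MARKER in line:
                geo.write(line)
                n_geo += 1
            else:
                rest.write(line)
                n_rest += 1
    return n_geo, n_rest


def split_all(parts, workers=10):
    """Process the parts in parallel, one part per worker process at a time."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(split_out_geosparql, parts))
```

Because the dump has already been cut into independent parts, each part can be decompressed and filtered by its own process, which is why this step parallelises well while the initial split does not.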
|Number of parallel loaders||8|
|Total number of triples||11,894,354,985|
|Number of Database pages in use||48,503,816 pages|
|Setting up prefixes and labels||2 min|
|FCT calculating labels||10 hours|
|Freetext index||1 day as a background task|
The main graphs are:
|http://www.wikidata.org/||the triples from latest-all.nt.bz2||11,857,528,152|
|http://www.wikidata.org/lexemes/||the triples from latest-lexemes.nt.bz2||42,641,432|
|urn:wikidata:labels||the wikidata property labels||836,131|
|http://wikiba.se/ontology-1.0.owl||the wikidata ontology||280|
We added the following linksets to the database:
|http://yago.r2.enst.fr/data/yago4/en/2019-12-10/||the Yago knowledge graph||199,575,815|
|urn:wikidata:dbpedia:schema||the dbpedia linkset (schema form)||27,897,734|
|urn:wikidata:dbpedia:about||the dbpedia linkset||6,974,651|
|urn:kbpedia:concepts_linkage:inferrence_extended||the kbpedia linkset||1,152,051|
|number of triples||11,857,528,152|
|number of classes||925|
|number of entities||1,311,652,842|
|number of distinct subjects||1,314,371,674|
|number of properties||32,647|
|number of distinct objects||1,987,044,958|
|number of owl:sameAs links||2,717,989|