Loading the Wikidata Truthy dataset into Virtuoso Open Source Edition

Copyright © 2022 OpenLink Software

Introduction

Here’s a simple guide covering how we constructed a Wikidata Truthy instance deployed on the latest Virtuoso Open Source Edition:

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.8.3235-pthreads as of Oct 19 2022 (64e6ecd39)
Compiled for Linux (x86_64-generic_glibc25-linux-gnu)
Copyright (C) 1998-2022 OpenLink Software

Naturally, once loaded, you end up with the live instance described in the rest of this post.

Hardware

We use the following configuration for a single host machine.

| Item | Value |
|------|-------|
| CPU | 2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz |
| Cores | 24 |
| Memory | 378 GB |
| SSD | 4x Crucial M4 SSD 500 GB |

This system is shared by several other demo databases, so we could not use all available resources on it.

Cloud cost estimates

Although it is hard to compare the various cloud offerings directly, we used vendor-supplied cost calculators to come up with some estimates, based on the following minimal machine specification:

  • dedicated machine for 1 year without upfront costs
  • 128 GiB memory
  • 16 cores or more
  • 512 GB SSD for the database
  • 3 TB outgoing internet traffic (based on our DBpedia statistics)

| Vendor | Machine type | Memory | vCPUs | Monthly machine | Monthly disk | Monthly network | Monthly total |
|--------|---------------|--------|-------|-----------------|--------------|-----------------|---------------|
| Amazon | r5a.4xlarge | 128 GiB | 16 | $479.61 | $55.96 | $276.48 | $812.05 |
| Google | e2-highmem-16 | 128 GiB | 16 | $594.55 | $95.74 | $255.00 | $945.30 |
| Azure | D32a | 128 GiB | 32 | $769.16 | $38.40 | $252.30 | $1,060.06 |

Note that all prices in this table are per month.

Virtuoso INI File Settings

We updated the following settings in the virtuoso.ini before loading the data.

In the [Parameters] section:

| Item | Value |
|------|-------|
| NumberOfBuffers | 30,000,000 |

After the database had been loaded and running for a few days, we determined it was only using around 18 million buffers of the 30 million allocated, so we lowered the final NumberOfBuffers setting to 20,000,000 pages. This corresponds to roughly 80 GB of memory in use for disk buffering. As we are using multiple SSD drives for the database, we could reduce the memory footprint further without too much performance loss.
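For reference, the corresponding [Parameters] excerpt in virtuoso.ini would look roughly like the sketch below. The NumberOfBuffers value mirrors the final setting described above; the MaxDirtyBuffers line is an assumption, following the common guideline of roughly three quarters of NumberOfBuffers, and is not a value taken from this deployment.

```ini
[Parameters]
; 30,000,000 buffers were allocated for the initial bulk load;
; lowered to 20,000,000 after observing actual usage
NumberOfBuffers    = 20000000
; assumption: commonly set to about 3/4 of NumberOfBuffers
MaxDirtyBuffers    = 15000000
```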

In the [SPARQL] section:

| Item | Value |
|------|-------|
| LabelInferenceName | facets |

Recently we added support for automatic label inferencing during the bulk load, which uses additional threads to process the labels. This replaces the manual generation of FCT labels using the urilbl_ac_init_db() function, which took around 10 hours during the previous load.
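The matching [SPARQL] section entry is a single line; a sketch mirroring the table above:

```ini
[SPARQL]
; compute Faceted Browser labels on extra threads during the bulk load
LabelInferenceName = facets
```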

Preparing the datasets

We downloaded the Wikidata Truthy dumps from around 2022-10-26.

| Filename | Size (bytes) |
|----------|--------------|
| latest-truthy.nt.bz2 | 34,868,477,684 |
| latest-lexemes.nt.bz2 | 638,730,291 |

Although the file latest-truthy.nt.bz2, which is 32 GB compressed, could in theory be loaded with a single loader, we decided to split it into smaller fragments. This allowed us to run multiple loaders in parallel, and thereby use multiple CPU cores to load the data into the database.

Initially, we tried to use the munch.sh script from the wikidata-query-tools tree to convert the data from NTriples to Turtle and write it out in chunks of 50,000 entities into separate files that can be loaded. However, this Java-based application would have taken several days to finish, as it is a single-threaded program.

Eventually we used a small Perl script that used 3 separate processes to split the file into 1844 parts, each about 42 MB in size. This took a total of 2 hours and 50 minutes.

As there are some known issues with geodata in this dataset, we removed the offending pattern from the dump; using 10 parallel processes to speed up the work, this took only about 2 hours.

Bulk load Statistics

| Description | Value |
|-------------|-------|
| Number of parallel loaders | 6 |
| Duration | 5.5 hours |
| Total number of triples | 7,414,588,664 |
| Average triples/second | 390,482 |
| Average CPU percentage | 2,012% |
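The load itself follows Virtuoso's standard bulk loader: the split files are registered with ld_dir() and then several rdf_loader_run() sessions are run in parallel. A minimal sketch from isql, assuming the split files live in /data/wikidata/split (a hypothetical path, which must also be listed in DirsAllowed):

```sql
-- register every split file for loading into the target graph
ld_dir ('/data/wikidata/split', '*.nt', 'http://www.wikidata.org/');

-- run this in 6 separate isql sessions to get 6 parallel loaders
rdf_loader_run ();

-- after all loaders have finished, make the load durable
checkpoint;
```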

Additional tasks

| Description | Value |
|-------------|-------|
| Setting up prefixes and labels | 2 min |
| Loading ontologies | 2 min |
| Loading KBpedia | 1 min |
| Loading Yago4 (full) | 5 hours |
| FCT calculating labels | calculated during the bulkload |
| Freetext index | 1 day as a background task |
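For the free-text index row above, the usual Virtuoso approach is to register a free-text rule over the quad store and let the indexer catch up in the background. A sketch, assuming the index is wanted over all graphs; the exact rules used for this instance may differ:

```sql
-- index literal objects across all graphs in the quad store
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');

-- incrementally bring the text index up to date in the background
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
```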

Note that the Yago 4 dataset took almost as long to load as the whole Wikidata Truthy dataset. This is because we did not split the larger Yago 4 dataset files into smaller segments, which caused part of the bulk load to run on a single core rather than parallelizing the effort over many cores.

Graphs and triple counts

Main graphs

| Graph | Description | Triple Count |
|-------|-------------|--------------|
| http://www.wikidata.org/ | the Wikidata Truthy dataset | 7,276,558,699 |
| http://www.wikidata.org/lexemes/ | the Wikidata Lexemes dataset | 136,806,150 |
| urn:wikidata:labels | calculated Wikidata labels | 1,223,815 |
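Counts like these can be reproduced from isql with a straightforward aggregate; a sketch for the main graph (note that counting billions of triples is an expensive query to run against a live endpoint):

```sql
-- count the triples in one of the graphs listed above
SPARQL
SELECT (COUNT(*) AS ?triples)
FROM <http://www.wikidata.org/>
WHERE { ?s ?p ?o };
```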

Ontology graphs

| Graph | Description | Triple Count |
|-------|-------------|--------------|
| http://wikiba.se/ontology-1.0.owl | the Wikidata base ontology | 288 |
| https://schema.org/version/latest/schemaorg-current-http.nt | the http://schema.org/ ontology | 16,248 |

Knowledge Base graphs

| Graph | Description | Triple Count |
|-------|-------------|--------------|
| https://yago-knowledge.org/data/yago4/full/2020-02-24/ | the full Yago4 knowledge graph | 2,489,858,799 |
| urn:kbpedia:2.50:reference:concepts:linkage:inferrence:extended | the KBpedia extended graph | 1,175,147 |
| urn:kbpedia:2.50:reference:concepts:linkage | the KBpedia linkage graph | 850,711 |
| urn:kbpedia:2.50:reference:concepts | the KBpedia concepts graph | 761,092 |

Link sets

At this point, we did not add any other link sets to the database.

Wikidata VoID graph

| Description | Value |
|-------------|-------|
| Graph name | http://www.wikidata.org/void/ |
| SPARQL endpoint | https://wikidata-truthy.demo.openlinksw.com/sparql/ |
| Number of triples | 7,276,558,699 |
| Number of classes | 1,141 |
| Number of entities | 202,805,726 |
| Number of distinct subjects | 206,667,972 |
| Number of properties | 10,204 |
| Number of distinct objects | 1,532,124,772 |
| Number of owl:sameAs links | 3,862,241 |
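Since these statistics are stored as ordinary triples in the VoID graph, they can be read back from the endpoint itself; a sketch, assuming the statistics were published using the standard void: vocabulary terms:

```sql
-- read recorded dataset statistics back from the VoID graph
SPARQL
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples ?entities ?properties
FROM <http://www.wikidata.org/void/>
WHERE
  {
    ?dataset  void:triples     ?triples    .
    ?dataset  void:entities    ?entities   .
    ?dataset  void:properties  ?properties .
  };
```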
