Loading full Wikidata latest TTL dump into VOS

Hi.

We are trying to load the full Wikidata (latest TTL dump) into a Virtuoso Open-Source (VOS) instance (version 7.2.5.1).

We encountered some problems when loading the first batch of files. The latest-all.ttl file was split into 82 smaller files of 200M triples each (around 7GB per file). When we started the bulk upload, the process terminated with the following errors:

/data/VOS/bulk-loading/latest-all_from_1_to_200000000.ttl

**42000 RDFGE: RDF box with a geometry RDF type and a non-geometry content**

/data/VOS/bulk-loading/latest-all_from_200000001_to_400000000.ttl

**37000 [Vectorized Turtle loader] SP029: TURTLE RDF loader, line 1: Undefined namespace prefix at skos:prefLabel**

/data/VOS/bulk-loading/latest-all_from_400000001_to_600000000.ttl

**37000 [Vectorized Turtle loader] SP029: TURTLE RDF loader, line 1: syntax error**

/data/VOS/bulk-loading/latest-all_from_600000001_to_800000000.ttl

**37000 [Vectorized Turtle loader] SP029: TURTLE RDF loader, line 1: syntax error**

/data/VOS/bulk-loading/latest-all_from_800000001_to_1000000000.ttl

**37000 [Vectorized Turtle loader] SP029: TURTLE RDF loader, line 1: syntax error**
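
For context, here is a minimal sketch of the kind of chunking described above. It is illustrative only: the file names, the chunk size, and the detail of copying the @prefix header into every chunk are assumptions of this sketch, not necessarily the exact script we ran.

```python
#!/usr/bin/env python3
"""Illustrative sketch: split a large Turtle dump into chunks, copying the
@prefix header into every chunk and cutting only at statement boundaries."""

import gzip

SRC = "latest-all.ttl.gz"        # assumption: the gzipped Wikidata dump
LINES_PER_CHUNK = 50_000_000     # illustrative target, not an exact triple count

def open_chunk(index, header):
    out = open(f"latest-all.part{index:03d}.ttl", "w", encoding="utf-8")
    out.writelines(header)       # each chunk must declare its own prefixes
    return out

with gzip.open(SRC, "rt", encoding="utf-8") as src:
    header, line = [], src.readline()
    # Collect the @prefix/@base declarations at the top of the dump.
    while line and (line.startswith(("@prefix", "@base", "PREFIX", "BASE"))
                    or not line.strip()):
        header.append(line)
        line = src.readline()

    idx, written = 1, 0
    out = open_chunk(idx, header)
    while line:
        out.write(line)
        written += 1
        # Start a new chunk only after a line that terminates a subject
        # block (" ."), so multi-line statements are never split apart.
        if written >= LINES_PER_CHUNK and line.rstrip().endswith("."):
            out.close()
            idx, written = idx + 1, 0
            out = open_chunk(idx, header)
        line = src.readline()
    out.close()
```

The resulting files are then registered and loaded with the standard bulk loader from isql (ld_dir() followed by rdf_loader_run()).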

We found the following description on Stack Overflow from Peter F. Patel-Schneider, in which he indicates that there is a bug in Virtuoso related to the handling of geo-coordinates.

Peter goes on to explain that one would need to change the VOS source code to correct this problem.

We hoped this problem might have been fixed in a newer version of VOS, but our understanding is that the latest release is 7.2.5.1 (from 2018-08-15), which is the version we are using. We have not found any further information about a fix for this bug; the closest reference we found is an evaluation of Virtuoso as an alternative to Blazegraph for Wikidata.

Questions:

  • Is Peter’s description on Stack Overflow correct?
  • If so, Peter mentions that “if one is loading the complete Wikidata dump one needs a machine with at least 256GB of main memory (maybe even at least 512GB)”. Would we really need 512GB of main memory to run Wikidata in VOS, or how much main memory would be necessary to run it smoothly?

Having a local copy of Wikidata on our premises will help us greatly in several ongoing research projects.

We appreciate your kind attention and assistance. Looking forward to your reply.

The issue reported in the Stack Overflow post is known and is detailed in GitHub issue #295, for which a fix is not currently available.

Users have gotten around the issue with the problem CRSes by removing them from the dataset, which is what we do with the instance we host at wikidata.demo.openlinksw.com.
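
For anyone wanting to do similar pre-filtering themselves, here is a rough sketch of one way to do it (the exact filter applied to the hosted instance may differ). It drops any WKT literal that embeds an explicit CRS IRI, assuming the usual Wikidata shape `"<crs IRI> Point(...)"^^geo:wktLiteral`, and repairs the Turtle block terminator when the dropped line was the last one in its subject block.

```python
#!/usr/bin/env python3
"""Rough sketch: drop triples whose WKT literal embeds an explicit CRS IRI
(e.g. coordinates on other celestial bodies), reading Turtle from stdin and
writing the filtered Turtle to stdout."""

import re
import sys

# Assumed literal shape:
# "<http://www.wikidata.org/entity/Q405> Point(23.4 5.6)"^^geo:wktLiteral
CRS_WKT = re.compile(r'"<[^">]+>\s*Point\([^)]*\)"\^\^geo:wktLiteral')

prev = None  # one-line buffer so the block terminator can be repaired

for line in sys.stdin:
    if CRS_WKT.search(line):
        # Drop this triple.  If it closed its subject block (" ."), the
        # previous line has to close the block instead (" ;" -> " .").
        # (Object lists ending in "," are not handled by this sketch.)
        if line.rstrip().endswith(".") and prev is not None:
            prev = re.sub(r";\s*$", ".\n", prev)
        continue
    if prev is not None:
        sys.stdout.write(prev)
    prev = line

if prev is not None:
    sys.stdout.write(prev)
```

Each chunk can be piped through a filter like this before the files are registered with the bulk loader.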

In terms of memory requirements, the dataset is about 12 billion triples, and for best performance we generally recommend about 10GB of RAM per billion triples (depending on how well the dataset compresses) so that the dataset can be hosted in memory; thus about 120GB of RAM should suffice. If you have SSDs or other such fast storage devices, this can mitigate the need to hold the entire dataset in memory.
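
As a back-of-the-envelope illustration of those numbers (the 2/3-of-RAM and 8KB-per-buffer figures below are the usual rules of thumb from the stock virtuoso.ini comments, not exact requirements):

```python
# Rough virtuoso.ini sizing from the figures above: ~12 billion triples at
# ~10 GB of RAM per billion triples, with NumberOfBuffers set to roughly
# 2/3 of that RAM in 8 KB pages and MaxDirtyBuffers at about 3/4 of it.

TRIPLES_BILLIONS = 12     # approximate size of the full Wikidata dump
GB_PER_BILLION = 10       # rule of thumb quoted above

ram_gb = TRIPLES_BILLIONS * GB_PER_BILLION              # ~120 GB working set
number_of_buffers = int(ram_gb * (2 / 3) * 2**30 / 8192)
max_dirty_buffers = int(number_of_buffers * 3 / 4)

print(f"Target RAM       : ~{ram_gb} GB")
print(f"NumberOfBuffers  = {number_of_buffers}")
print(f"MaxDirtyBuffers  = {max_dirty_buffers}")
```

The exact NumberOfBuffers/MaxDirtyBuffers pairs suggested for common RAM sizes are listed in the comments of the shipped virtuoso.ini, which should be treated as the reference.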