Loading DBpedia into Virtuoso (Open Source or Enterprise Edition)

Introduction

This is a simple guide to creating, from scratch, a DBpedia instance deployed using the OpenLink Virtuoso Knowledge Graph data store. Once the data is loaded, the instance provides the equivalent of the following live public services hosted by OpenLink Software:

  1. Linked Data Description Pages
  2. SPARQL Query Services Endpoint
  3. Faceted Search & Browsing Service Endpoint

Requirements

Virtuoso Instance Type Options

Hardware Requirements

Memory

For best performance, a machine with at least 64GB of available memory is recommended, so that Virtuoso's compact, key-compressed database working set can be held in memory.

Disk Storage

An SSD or similar storage device is also recommended.

CPUs

For highly concurrent usage, the more CPUs that are available, the more concurrent requests can be processed, which improves performance. Virtuoso is a highly multithreaded DBMS server that is well suited to parallel execution of queries from a large number of concurrent clients.

Database Initialization Configuration Settings (i.e., INI File)

In the URIQA section, set the DefaultHost key to the cname (hostname) and portno (HTTP port number) to be used by your Virtuoso instance. This setting is important with regard to the Dynamic IRIs generated as part of the Linked Data Deployment aspect of DBpedia hosting.

Given a 64GB machine, here are additional settings to be applied to the initialization (i.e., virtuoso.ini) file:

[Parameters]
.
.
.
NumberOfBuffers          = 5450000
MaxDirtyBuffers          = 4000000
.
.
.
VectorSize               = 10000
.
.
.
[URIQA]
DefaultHost              = {cname}:{portno}
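
On a machine with a different amount of memory, the two buffer settings need adjusting. The following is only a rough sketch: it assumes the values scale linearly from the 64GB figures above, which is an assumption of this example and not an official formula, and the scale_buffers helper is hypothetical. Check the result against the Virtuoso performance tuning documentation before using it.

```shell
# Hypothetical helper: scale the 64GB buffer settings shown above to another RAM size.
# Assumption: linear scaling with RAM, which is only an approximation.
scale_buffers() {
    ram_gb=$1
    echo "NumberOfBuffers          = $(( 5450000 * ram_gb / 64 ))"
    echo "MaxDirtyBuffers          = $(( 4000000 * ram_gb / 64 ))"
}
```

For example, scale_buffers 32 prints candidate values for a 32GB machine.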

If you are using a Linux machine, take note of the kernel's swappiness setting, which controls how readily the kernel moves memory pages to swap rather than keeping them in RAM. It is recommended that this parameter be changed from its default value of 60 to something closer to 10, as detailed in the Virtuoso performance tuning swappiness documentation.
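
A minimal sketch of checking the current value on Linux. The check_swappiness helper is hypothetical; it reads /proc/sys/vm/swappiness by default, and the optional file argument exists only so the check can be tried against a sample file.

```shell
# Report the current swappiness and whether it exceeds the suggested value of 10.
# $1 (optional): file to read the value from; defaults to the kernel's setting.
check_swappiness() {
    file=${1:-/proc/sys/vm/swappiness}
    current=$(cat "$file")
    if [ "$current" -gt 10 ]; then
        echo "vm.swappiness is $current; consider: sudo sysctl -w vm.swappiness=10"
    else
        echo "vm.swappiness is $current; no change needed"
    fi
}
```

A value set with sysctl -w lasts only until the next reboot; to keep it, a line such as vm.swappiness = 10 is typically added to /etc/sysctl.conf.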

Virtuoso Add-On Packages (VADs)

Install the Virtuoso DBpedia and Faceted Browser VADs using either of the following:

  • the “Package” menu item within the Virtuoso Conductor Admin Interface via the System Admin -> Packages tab
  • Virtuoso’s iSQL command-line, using the following sequence:
vad_install ('dbpedia_dav.vad', 0);
vad_install ('fct_dav.vad', 0);
vad_list_packages();

Download DBpedia datasets

Choose the directory into which the DBpedia dataset files are to be downloaded, and ensure that this directory is listed in the DirsAllowed key of the Parameters section of the virtuoso.ini configuration file. This is an important security feature: the Virtuoso server will only read files from directories listed there.
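
Before downloading, it can help to confirm that the chosen directory really is listed. A minimal sketch, assuming DirsAllowed is written on a single line as a comma-separated list; the dir_is_allowed helper is hypothetical and does a plain text comparison, so it does not resolve relative paths or symlinks.

```shell
# Succeeds if directory $2 appears in the DirsAllowed list of the INI file $1.
# Plain textual match: the path must be written exactly as it is in the INI file.
dir_is_allowed() {
    ini=$1
    dir=$2
    grep -E '^[[:space:]]*DirsAllowed[[:space:]]*=' "$ini" \
        | cut -d= -f2- \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fxq "$dir"
}
```

For example, dir_is_allowed virtuoso.ini /path/to/files && echo "listed" prints "listed" only when that exact path is present. Virtuoso reads the INI file at startup, so the server needs a restart after DirsAllowed is changed.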

Download the dataset via the DBpedia Databus as follows, using curl:

  1. Query dataset availability:
query=$(curl -H "Accept:text/sparql" https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-12)
  2. Download dataset files:
files=$(curl -H "Accept: text/csv" --data-urlencode "query=${query}" https://databus.dbpedia.org/repo/sparql | tail -n+2 | sed 's/"//g')
while IFS= read -r file ; do wget "$file"; done <<< "$files"

Bulk load the DBpedia datasets into Virtuoso

Now that the DBpedia datasets have been downloaded, use the Virtuoso RDF Bulk Loader to load this data to your Virtuoso instance.

Here are the steps for completing this procedure:

  1. Load the list of downloaded DBpedia dataset files into the Virtuoso load_list table using:
ld_dir ('/path/to/files', '*.ttl.bz2', 'http://dbpedia.org');
  2. Check the load_list table to ensure the list of dataset files was registered successfully:
select * from DB.DBA.LOAD_LIST;
  3. Run the rdf_loader_run() function to start loading the datasets. (Note that multiple rdf_loader_run() processes can be run in parallel, depending on the number of available cores, to maximise platform utilisation and speed up the loading task):
rdf_loader_run();
  4. Once rdf_loader_run() is complete, check DB.DBA.LOAD_LIST to confirm all datasets were loaded successfully. This is indicated by an ll_state value of 2 and an ll_error value of NULL. The following query lists any dataset files that produced an error during the bulk load:
select * from DB.DBA.LOAD_LIST where ll_error IS NOT NULL;
  5. Finally, a database checkpoint MUST be run once the dataset loading process is complete; this is required for final committal of the data to the Virtuoso DBMS:
checkpoint;
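
The loading and checkpoint steps above can also be driven from the shell. This is a sketch under stated assumptions: isql is on the PATH, the server listens on SQL port 1111, and the dba credentials shown are placeholders to be replaced. The run_bulk_load helper is hypothetical; it uses the exec= argument form of isql.

```shell
# Start N rdf_loader_run() sessions in parallel, wait for them all, then checkpoint.
# Assumptions: isql is on PATH; the port and credentials below are placeholders.
ISQL_PORT=${ISQL_PORT:-1111}
ISQL_USER=${ISQL_USER:-dba}
ISQL_PASS=${ISQL_PASS:-dba}

run_bulk_load() {
    loaders=$1
    i=0
    while [ "$i" -lt "$loaders" ]; do
        isql "$ISQL_PORT" "$ISQL_USER" "$ISQL_PASS" exec="rdf_loader_run();" &
        i=$(( i + 1 ))
    done
    wait    # block until every loader session has finished
    isql "$ISQL_PORT" "$ISQL_USER" "$ISQL_PASS" exec="checkpoint;"
}
```

For example, run_bulk_load 4 starts four loader sessions. Choose the number to suit the available cores, and run the ll_error query afterwards to confirm that nothing failed.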

Post Installation Steps

The following commands need to be run to set up the Text Indexing required by the Faceted Search Engine and Data Browser:

VT_INC_INDEX_DB_DBA_RDF_OBJ ();
urilbl_ac_init_db ();
s_rank ();

For Geospatial indexing, the following needs to be run:

RDF_GEO_FILL ();

Verify Installation

A successful installation can be verified by:

  1. Accessing a sample DBpedia Linked Data Description Page, for example http://{cname}:{portno}/resource/London

  2. Accessing the SPARQL Query Services endpoint via http://{cname}:{portno}/sparql

  3. Accessing the Faceted Search and Browsing Service via http://{cname}:{portno}/fct
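
The same three checks can be scripted with curl. A minimal sketch: the service_urls and check_services helpers are hypothetical, and the host and port passed to them stand in for the {cname} and {portno} values configured in the URIQA section.

```shell
# Print the three service URLs for a given host and port.
service_urls() {
    host=$1
    port=$2
    for path in /resource/London /sparql /fct; do
        echo "http://${host}:${port}${path}"
    done
}

# Print the final HTTP status code for each URL, following redirects.
# Needs curl and a running Virtuoso instance.
check_services() {
    service_urls "$1" "$2" | while IFS= read -r url; do
        code=$(curl -s -o /dev/null -L -w '%{http_code}' "$url")
        echo "$code $url"
    done
}
```

For example, check_services localhost 8890 (assuming an instance on that host and port) prints one line per service; a 200 on each line suggests the pages are being served.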

Is the text indexing step (in particular VT_INC_INDEX_DB_DBA_RDF_OBJ) supposed to take much longer than the loading step? I loaded all of dbpedia on my machine in about 12 hours, but the command VT_INC_INDEX_DB_DBA_RDF_OBJ has been running for about two days with no end in sight. Is this to be expected? (I’m running this on a 10 core 32GB machine.)

It should not take so long to run the VT_INC_INDEX_DB_DBA_RDF_OBJ() procedure.

What is the output of running the status(); command from isql while the procedure is still running?