Loading DBpedia into Virtuoso (Open Source or Enterprise Edition)
Introduction
This is a simple guide to creating, from scratch, a DBpedia instance deployed on the OpenLink Virtuoso Knowledge Graph data store. Once the data is loaded, you will have the equivalent of the following live public services hosted by OpenLink Software:
- Linked Data Description Pages
- SPARQL Query Services Endpoint
- Faceted Search & Browsing Service Endpoint
Requirements
Virtuoso Instance Type Options
- Virtuoso Open Source Edition via source code, prebuilt binary, or Docker container.
- Enterprise Edition, which is available as a native standalone installer for Linux, Windows and macOS from the Virtuoso Onboarding page; Docker Container Image from Docker Hub; PAGO & BYOL Virtual Machines for the Microsoft Azure Cloud; or PAGO & BYOL Virtual Machines for the Amazon AWS Cloud.
Hardware Requirements
Memory
For best performance, a machine with at least 64GB of available memory is recommended, so that the Virtuoso deployment's compact, key-compressed database working set can be held in memory.
Disk Storage
An SSD or similar storage device is also recommended.
CPUs
For high concurrent usage, the more CPUs are available, the more concurrent requests can be processed, which improves performance. Virtuoso is a highly multithreaded DBMS server that is well suited to parallel execution of queries from a large number of concurrent clients.
Database Initialization Configuration Settings (i.e., INI File)
Via the URIQA section, set the DefaultHost key-value to the cname (hostname) and HTTP portno (port number) to be used by your Virtuoso instance. This setting is important with regards to Dynamic IRIs generated as part of the Linked Data Deployment aspect of DBpedia hosting.
Given a 64GB machine, here are additional settings to be applied to the initialization (i.e., virtuoso.ini) file:
[Parameters]
.
.
.
NumberOfBuffers = 5450000
MaxDirtyBuffers = 4000000
.
.
.
VectorSize = 10000
.
.
.
[URIQA]
DefaultHost = {cname}:{portno}
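As a quick sanity check, the edited values can be read back from the file before the server is restarted. The following is a minimal sketch and not part of Virtuoso itself: ini_get is a hypothetical helper name, and it assumes a plain "key = value" layout inside bracketed sections, as in the fragment above.

```shell
# ini_get FILE SECTION KEY
# Prints the value of KEY inside [SECTION], or nothing if it is not found.
# Hypothetical helper for illustration; handles simple "key = value" lines only.
ini_get() {
  awk -v section="$2" -v key="$3" '
    # A line starting with "[" opens a new section; remember its name.
    /^[ \t]*\[/ {
      gsub(/^[ \t]*\[|\][ \t]*$/, "")
      current = $0
      next
    }
    current == section {
      pos = index($0, "=")          # split on the first "=" only
      if (pos == 0) next            # skip lines that are not key = value
      k = substr($0, 1, pos - 1)
      v = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key) { print v; exit }
    }
  ' "$1"
}
```

For example, `ini_get /path/to/virtuoso.ini Parameters NumberOfBuffers` (the path is a placeholder) prints the configured buffer count, and `ini_get /path/to/virtuoso.ini URIQA DefaultHost` prints the host and port pair.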
If you are using a Linux machine, take note of the kernel swappiness setting, which controls how much the kernel favors swap over RAM. It is recommended that this parameter be changed from its default value of 60 to something closer to 10, as detailed in the Virtuoso performance tuning swappiness documentation.
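The current swappiness value can be inspected from the shell. This is a minimal sketch, assuming a Linux system that exposes /proc/sys/vm/swappiness; swappiness_ok is a hypothetical helper name, and the ceiling of 10 simply mirrors the recommendation above. The script only prints suggestions and changes nothing.

```shell
# Succeeds when the value in $1 is at or below the ceiling in $2 (default 10).
swappiness_ok() {
  [ "$1" -le "${2:-10}" ]
}

# The proc file exists on Linux only; elsewhere this leaves the value empty.
current=$(cat /proc/sys/vm/swappiness 2>/dev/null || true)

if [ -z "$current" ]; then
  echo "swappiness is not available on this system"
elif swappiness_ok "$current"; then
  echo "vm.swappiness is $current (no change needed)"
else
  echo "vm.swappiness is $current; to lower it on the running system:"
  echo "  sudo sysctl -w vm.swappiness=10"
  echo "and add 'vm.swappiness = 10' to /etc/sysctl.conf to keep it across reboots"
fi
```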
Virtuoso Add-On Packages (VADs)
Install the Virtuoso DBpedia and Faceted Browser VADs using either of the following:
- the “Package” menu item within the Virtuoso Conductor Admin Interface via the System Admin -> Packages tab
- Virtuoso’s iSQL command-line using the following sequence:
vad_install ('dbpedia_dav.vad', 0);
vad_install ('fct_dav.vad', 0);
vad_list_packages();
Download DBpedia datasets
Choose the location where the DBpedia dataset files are to be downloaded, ensuring that this directory (or folder) is listed in the value of the DirsAllowed key of the virtuoso.ini configuration file. This is an important security feature that tells the Virtuoso server which file system locations a live instance is allowed to access.
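Before downloading, it can help to confirm that the chosen directory is covered by the DirsAllowed value. The sketch below assumes that DirsAllowed is a comma-separated list of directories and that a directory is covered when one entry is the directory itself or one of its parents; the server's exact matching rules may differ, so treat this as a rough pre-check. dir_allowed is a hypothetical helper name.

```shell
# dir_allowed "DIRS_ALLOWED_VALUE" TARGET_DIR
# Succeeds if TARGET_DIR equals, or lies beneath, one comma-separated entry.
# Assumption: plain path-prefix matching, which approximates the server's check.
dir_allowed() {
  local rest="$1," target="${2%/}" entry
  while [ -n "$rest" ]; do
    entry="${rest%%,*}"        # take the text before the first comma
    rest="${rest#*,}"          # keep the remainder for the next pass
    # Trim leading and trailing whitespace around the entry.
    entry="${entry#"${entry%%[![:space:]]*}"}"
    entry="${entry%"${entry##*[![:space:]]}"}"
    [ -z "$entry" ] && continue
    entry="${entry%/}"         # ignore a trailing slash
    case "$target/" in
      "$entry"/*) return 0 ;;
    esac
  done
  return 1
}
```

For example, `dir_allowed "., ../vad, /data/dbpedia" /data/dbpedia/dumps` succeeds, while `dir_allowed "., ../vad" /data/dbpedia` fails; the paths here are placeholders.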
Download the dataset via the DBpedia Databus as follows, using curl:
- Query dataset availability
query=$(curl -H "Accept:text/sparql" https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-12)
- Download dataset files:
files=$(curl -H "Accept: text/csv" --data-urlencode "query=${query}" https://databus.dbpedia.org/repo/sparql | tail -n+2 | sed 's/"//g')
while IFS= read -r file ; do wget "$file"; done <<< "$files"
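The download loop above can be made somewhat more defensive. The following is a sketch under a few assumptions: wget is installed, the CSV response has a single header row, and responses may carry carriage returns or blank lines that should be dropped. csv_to_urls and download_all are hypothetical helper names.

```shell
# Reads CSV query results on stdin and prints one bare URL per line:
# drops the header row, strips quotes and carriage returns, skips blank lines.
csv_to_urls() {
  tail -n +2 | tr -d '"\r' | sed '/^[[:space:]]*$/d'
}

# download_all DEST_DIR
# Reads URLs on stdin and fetches each into DEST_DIR.
# -c resumes partial files on a re-run, -P sets the target directory.
download_all() {
  local dest="$1" url
  mkdir -p "$dest"
  while IFS= read -r url; do
    wget -c -P "$dest" "$url"
  done
}
```

With the query variable from the previous step, the two helpers could be chained as `curl -H "Accept: text/csv" --data-urlencode "query=${query}" https://databus.dbpedia.org/repo/sparql | csv_to_urls | download_all /path/to/files`, where /path/to/files stands for the directory listed in DirsAllowed.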
Bulk load the DBpedia datasets into Virtuoso
Now that the DBpedia datasets have been downloaded, use the Virtuoso RDF Bulk Loader to load this data into your Virtuoso instance.
Here are the steps for completing this procedure:
- Load the list of downloaded DBpedia dataset files into the Virtuoso load_list table using:
ld_dir ('/path/to/files', '*.ttl.bz2', 'http://dbpedia.org');
- Check the load_list table to ensure the list of dataset files was loaded successfully:
select * from load_list;
- Run the rdf_loader_run() function to commence loading of the datasets. (Note that multiple rdf_loader_run() processes can be run, depending on the number of available cores, for maximum platform utilisation to speed up the dataset loading task):
rdf_loader_run();
- Once rdf_loader_run() is complete, you can check the DB.DBA.load_list table to confirm all datasets were loaded successfully. This is indicated by an ll_state value of 2 and an ll_error value of NULL. The following query can be used to check if any errors occurred when loading dataset files during the bulk load:
select * from DB.DBA.LOAD_LIST where ll_error IS NOT NULL;
- Finally, a database checkpoint MUST be run once the dataset loading process is complete (for example, with the checkpoint; statement in iSQL); this is required for final committal of data to the Virtuoso DBMS
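The loading and checkpoint steps above can also be driven from the shell. This is a minimal sketch with several assumptions: the Virtuoso command-line client is on the PATH as isql (on some installations it is named isql-v) and accepts an exec= argument, the server listens on SQL port 1111, a dba account is used, and ld_dir has already been run. run_sql and run_loaders are hypothetical helper names, and the number of loaders is left to the caller.

```shell
# Connection settings; override through the environment as needed.
# The values shown are assumed defaults, not verified for your installation.
ISQL="${ISQL:-isql}"
VIRT_PORT="${VIRT_PORT:-1111}"
VIRT_USER="${VIRT_USER:-dba}"
VIRT_PASS="${VIRT_PASS:-dba}"

# Runs a single SQL statement through the command-line client.
run_sql() {
  "$ISQL" "$VIRT_PORT" "$VIRT_USER" "$VIRT_PASS" exec="$1"
}

# run_loaders N
# Starts N rdf_loader_run() sessions in the background, waits for all of
# them to finish, then issues the mandatory checkpoint.
run_loaders() {
  local n="$1" i=0
  while [ "$i" -lt "$n" ]; do
    run_sql "rdf_loader_run();" &
    i=$((i + 1))
  done
  wait
  run_sql "checkpoint;"
}
```

Calling `run_loaders 4` would start four loader sessions and checkpoint once they have all returned. Passing the password on the command line is convenient for a sketch but visible to other local users, so a production script may prefer another mechanism.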
Post Installation Steps
The following commands need to be run to set up the Text Indexing required by the Faceted Search Engine and Data Browser:
VT_INC_INDEX_DB_DBA_RDF_OBJ ();
urilbl_ac_init_db();
s_rank();
For Geospatial indexing, the following needs to be run:
RDF_GEO_FILL ();
Verify Installation
A successful installation can be verified by:
- Accessing a sample DBpedia Linked Data Description Page, for example by going to http://{cname}:{portno}/resource/London
- Accessing the SPARQL Query Services endpoint via http://{cname}:{portno}/sparql
- Accessing the Faceted Search and Browsing Service via http://{cname}:{portno}/fct
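The same three checks can be run from the command line. The sketch below assumes curl is installed and treats any 2xx or 3xx HTTP status as success, since a resource URL may answer with a redirect to its description page. status_ok and verify_endpoints are hypothetical helper names.

```shell
# Succeeds for any 2xx or 3xx HTTP status code.
status_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_endpoints BASE_URL
# Requests the three service paths and reports the status of each.
# Returns non-zero if any of them does not answer with 2xx or 3xx.
verify_endpoints() {
  local base="$1" path code failed=0
  for path in /resource/London /sparql /fct; do
    # -w prints only the status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "${base}${path}")
    if status_ok "$code"; then
      echo "OK   ${code} ${base}${path}"
    else
      echo "FAIL ${code} ${base}${path}"
      failed=1
    fi
  done
  return "$failed"
}
```

It would be invoked as `verify_endpoints "http://{cname}:{portno}"`, with the placeholders replaced by the DefaultHost values set earlier.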