How to Bulk Load Data into a Virtuoso Docker Instance
Copyright © 2022 OpenLink Software <support@openlinksw.com>
Introduction
Both the OpenLink Virtuoso Commercial docker image (`openlink/virtuoso-commercial-8`) and the OpenLink Virtuoso Open Source docker image (`openlink/virtuoso-opensource-7`) allow users to run a combination of shell scripts and SQL scripts to initialize a new database.
This example shows how a Virtuoso Open Source Docker instance, started by docker-compose, can bulk load initial RDF data into the quad store when initializing a new database.
It has been tested on both Ubuntu 18.04 (x86_64) and macOS Big Sur 11.6 (x86_64 and Apple Silicon).
Most modern Linux distributions provide Docker packages as part of their repository.
For Apple macOS and Microsoft Windows, Docker installers can be downloaded from the Docker website.
Note: Installing software like `git`, Docker, and Docker Compose is left as an exercise for the reader.
Downloading and running the example
The source code for this example can be cloned from its repository on GitHub using the following command:
$ git clone https://github.com/openlink/vos-docker-bulkload-example
The example is started using the following commands:
$ cd vos-docker-bulkload-example
$ docker-compose pull
$ docker-compose up
Once Virtuoso has started, you can use a browser to connect to the local SPARQL endpoint:
http://localhost:8890/sparql
Cut and paste the following query and press the "Execute" button.

```sql
SELECT * from <urn:bulkload:test1> WHERE { ?s ?p ?o }
```
which should give the following result:
| s | p | o |
|---|---|---|
| s1 | p1 | This is example 1 (uncompressed) |
| s3 | p3 | This is example 3 (gzip) |
| s4 | p4 | This is example 4 (xz) |
Next, cut and paste the following query, and press the "Execute" button.

```sql
SELECT * from <urn:bulkload:test2> WHERE { ?s ?p ?o }
```
which should give the following result:
| s | p | o |
|---|---|---|
| s2 | p2 | This is example 2 (bzip2) |
To stop the example, just press CTRL-C in the docker-compose window.
Finally, to clean the example, run the following command:
$ docker-compose rm
Explanation
The `docker-compose.yml` script
This is the `docker-compose.yml` script we are using in this example:
```yaml
version: "3.3"
services:
  virtuoso_db:
    image: openlink/virtuoso-opensource-7
    volumes:
      - ./data:/database/data
      - ./scripts:/opt/virtuoso-opensource/initdb.d
    environment:
      - DBA_PASSWORD=dba
    ports:
      - "1111:1111"
      - "8890:8890"
```
The `docker-compose` program uses this information to run the following steps:
- If it does not find a version of the `openlink/virtuoso-opensource-7` image in your local docker cache, download the docker image from Docker Hub. Note: As you may have an older image in your cache, you may want to run `docker-compose pull` first, to make sure you have the absolute latest version of the image.
- Create an instance of this docker image.
- Mount the local `data` directory on the host OS as `/database/data` in the docker instance.
- Mount the local `scripts` directory on the host OS as `/opt/virtuoso-opensource/initdb.d` in the docker instance.
- Set the `dba` password to something trivial.
- Expose the standard network ports `1111` and `8890`.
- Run the registered startup script for this image.
The `./data` directory
This directory contains the initial data that we want to load into the database.
```
./data
├── README.md
├── example1.nt
├── example2.nt.bz2
├── example2.nt.graph
├── example3.nt.gz
├── global.graph
└── subdir
    └── example4.nt.xz
```
The `docker-compose.yml` script mounts this data directory below the `/database` directory, as `/database/data`.
Since the `database` directory is in the `DirsAllowed` setting in the `[Parameters]` section of the `virtuoso.ini` configuration file, we do not have to make modifications to that file.
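For reference, the relevant fragment of `virtuoso.ini` looks something like the sketch below. The exact directory list shipped in the docker image may differ; the point is only that `/database`, and therefore everything mounted below it, is already permitted:

```ini
[Parameters]
DirsAllowed = ., ../vad, /database
```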
Notes on the Bulk Loader
The Virtuoso bulk loader can automatically handle compressed (`.gz`, `.bz2`, and `.xz`) files and will choose the appropriate decompression function to read the content from the file.
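This means the data files can be prepared with the stock compression tools and dropped into `./data` as-is. A minimal sketch (the triple below is illustrative; `xz -k` works the same way as the two commands shown):

```shell
# Create a tiny N-Triples file and compressed copies of it, the way the
# files in ./data were prepared. -k keeps the original file around.
printf '<urn:s1> <urn:p1> "This is an example" .\n' > example.nt
gzip  -k example.nt   # produces example.nt.gz
bzip2 -k example.nt   # produces example.nt.bz2
```

The bulk loader picks the decompressor from the file suffix, so no manual decompression step is needed before loading.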
It then chooses an appropriate parser for the data based on the suffix of the file:
- N-Quads when using the `.nq` or `.n4` suffix
- TriG when using the `.trig` suffix
- RDF/XML when using the `.xml`, `.owl`, `.rdf`, or `.rdfs` suffix
- Turtle when using the `.ttl` or `.nt` suffix
While the `ld_dir_all()` function allows the operator to provide a graph name, it is much simpler for the data directory to contain hints on the graph names to use, especially when a number of files need to be loaded, whether into one or several named graphs.
In this example, the file `example2.nt.bz2` needs to be loaded into a different named graph than the other data files, so we create an `example2.nt.graph` file which contains the graph name for this data file. Note that although the data file has the `.bz2` extension, the graph file does not.
The other two data files in this directory, `example1.nt` and `example3.nt.gz`, do not have their own graph hint file. In this case, the bulk loader sees there is a `global.graph` file in the same directory, and uses its contents for these two files.
Finally, `example4.nt.xz` also uses the information from `global.graph`, as subdirectories automatically inherit the graph name from their parent directory.
If the `global.graph` file is not present, the `graph` argument of the `ld_dir_all()` function is used.
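The graph hint files are plain text files containing a single graph IRI. A sketch of how the hint files in this example could have been created:

```shell
# global.graph covers every file without its own hint, including files in
# subdirectories; example2.nt.graph overrides the graph for example2.nt.bz2 only
mkdir -p data/subdir
printf 'urn:bulkload:test1\n' > data/global.graph
printf 'urn:bulkload:test2\n' > data/example2.nt.graph
```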
The `./scripts` directory
This directory can contain a mix of shell (`.sh`) and Virtuoso PL (`.sql`) scripts that can perform functions such as:
- Installing additional Ubuntu packages
- Loading data from remote locations such as Amazon S3 buckets, Google Drive, or other locations
- Bulk loading data into the Virtuoso database
- Installing additional `VAD` packages into the database
- Adding new Virtuoso users
- Granting permissions to Virtuoso users
- Regenerating free-text indexes or other initial data
The scripts are run only once, during the initial database creation; subsequent restarts of the docker image will not cause these scripts to be re-run.
The scripts are run in alphabetical order, so we suggest prefixing the script names with sequence numbers to make the ordering explicit and obvious.
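For example, a scripts directory could be laid out as below. The `10-bulkload.sql` name comes from this example; the other two names are hypothetical placeholders to illustrate the numbering:

```shell
# Sequence-numbered scripts: the numeric prefix fixes the run order
mkdir -p scripts
touch scripts/10-bulkload.sql scripts/05-packages.sh scripts/20-users.sql
ls scripts   # alphabetical, so 05-... runs first and 20-... runs last
```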
For security purposes, Virtuoso runs the `.sql` scripts in a special mode, and will not respond to connections on its SQL (`1111`) or HTTP (`8890`) ports until these scripts complete.
At the end of each `.sql` script, Virtuoso automatically performs a `checkpoint`, to make sure the changes are fully written back to the database. This is very important for scripts that use the bulk loader function `rdf_loader_run()` or manually change the ACID mode of the database for any reason.
After all the initialization scripts have run to completion, Virtuoso will be started normally, and start listening to requests on its SQL (`1111`) and HTTP (`8890`) ports.
The `10-bulkload.sql` script
```sql
--
-- Copyright (C) 2022 OpenLink Software
--

--
-- Add all files that end in .nt
--
ld_dir_all ('data', '*.nt', 'no-graph-1');

--
-- Add all files that end in .bz2, .gz, or .xz, to show that the Virtuoso bulk loader
-- can load compressed files without manual decompression
--
ld_dir_all ('data', '*.bz2', 'no-graph-3');
ld_dir_all ('data', '*.gz', 'no-graph-2');
ld_dir_all ('data', '*.xz', 'no-graph-4');

--
-- Now load all of the files found above into the database
--
rdf_loader_run();

--
-- End of script
--
```
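After the load, the state of each file registered by `ld_dir_all()` can be inspected in the `DB.DBA.load_list` table; a sketch of a query that could be appended to the script, or run later from `isql`:

```sql
--
-- Inspect the bulk load result: ll_state is 2 for files that were loaded,
-- and ll_error is non-NULL for files that failed to parse
--
SELECT ll_file, ll_graph, ll_state, ll_error
  FROM DB.DBA.load_list;
```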