How to Bulk Load Data into a Virtuoso Docker Instance

Copyright © 2022 OpenLink Software support@openlinksw.com

Introduction

Both the OpenLink Virtuoso Commercial docker image (openlink/virtuoso-commercial-8) and the OpenLink Virtuoso Open Source docker image (openlink/virtuoso-opensource-7) allow users to run a combination of shell scripts and SQL scripts to initialize a new database.

This example shows how a Virtuoso Open Source Docker instance, started by docker-compose, can bulk load initial RDF data into the quad store when initializing a new database.

It has been tested on both Ubuntu 18.04 (x86_64) and macOS Big Sur 11.6 (x86_64 and Apple Silicon).

Most modern Linux distributions provide Docker packages as part of their repository.

For Apple macOS and Microsoft Windows, Docker installers can be downloaded from the Docker website.

Note: Installing software like git, Docker, and Docker Compose is left as an exercise for the reader.

Downloading and running the example

The source code for this example can be cloned from its repository on GitHub using the following command:

$ git clone https://github.com/openlink/vos-docker-bulkload-example

The example is started using the following commands:

$ cd vos-docker-bulkload-example
$ docker-compose pull
$ docker-compose up

Once Virtuoso has started, you can use a browser to connect to the local SPARQL endpoint:

http://localhost:8890/sparql

Cut and paste the following query and press the ‘Execute’ button.

SELECT * FROM <urn:bulkload:test1> WHERE { ?s ?p ?o }

which should give the following result:

s p o
s1 p1 This is example 1 (uncompressed)
s3 p3 This is example 3 (gzip)
s4 p4 This is example 4 (xz)

Next, cut and paste the following query, and press the ‘Execute’ button.

SELECT * FROM <urn:bulkload:test2> WHERE { ?s ?p ?o }

which should give the following result:

s p o
s2 p2 This is example 2 (bzip2)

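The same checks can also be run from the command line inside the container, assuming the isql client shipped in the openlink/virtuoso-opensource-7 image is on its PATH (the dba password is the one set in docker-compose.yml):

$ docker-compose exec virtuoso_db isql 1111 dba dba
SQL> SPARQL SELECT * FROM <urn:bulkload:test1> WHERE { ?s ?p ?o };
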
To stop the example, just press CTRL-C in the docker-compose window.

Finally, to clean up the example, run the following command:

$ docker-compose rm

Explanation

The docker-compose.yml script

This is the docker-compose.yml script we are using in this example:

version: "3.3"
services:
  virtuoso_db:
    image: openlink/virtuoso-opensource-7
    volumes:
      - ./data:/database/data
      - ./scripts:/opt/virtuoso-opensource/initdb.d
    environment:
      - DBA_PASSWORD=dba
    ports:
      - "1111:1111"
      - "8890:8890"

The docker-compose program uses this information to perform the following steps (a roughly equivalent docker run command is sketched after this list):

  1. If it does not find a version of the openlink/virtuoso-opensource-7 image in your local docker cache, download the docker image from Docker Hub. Note: As you may have an older image in your cache, you may want to run docker-compose pull first, to make sure you have the absolute latest version of the image.
  2. Create an instance of this docker image.
  3. Mount the local data directory on the host OS, as /database/data in the docker instance.
  4. Mount the local scripts directory on the host OS, as /opt/virtuoso-opensource/initdb.d in the docker instance.
  5. Set the dba password to something trivial.
  6. Expose the standard network ports 1111 and 8890.
  7. Run the registered startup script for this image.

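For reference, a docker run command that performs roughly the same steps might look like the following sketch (the mounts, password, and ports simply mirror the docker-compose.yml above; the container name is arbitrary):

$ docker run --name virtuoso_db \
    -v "$PWD/data:/database/data" \
    -v "$PWD/scripts:/opt/virtuoso-opensource/initdb.d" \
    -e DBA_PASSWORD=dba \
    -p 1111:1111 -p 8890:8890 \
    openlink/virtuoso-opensource-7
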
The ./data directory

This directory contains the initial data that we want to load into the database.

./data
├── README.md
├── example1.nt
├── example2.nt.bz2
├── example2.nt.graph
├── example3.nt.gz
├── global.graph
└── subdir
    └── example4.nt.xz

The docker-compose.yml script mounts this data directory below the /database directory, as /database/data.

Since the database directory is included in the DirsAllowed setting in the [Parameters] section of the virtuoso.ini configuration file, we do not need to modify that file.
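
If you want to confirm this at runtime, the effective setting can be read back through SQL, assuming the standard cfg_item_value() and virtuoso_ini_path() built-ins:

-- Show the directories the server is allowed to read files from
select cfg_item_value (virtuoso_ini_path (), 'Parameters', 'DirsAllowed');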

Notes on the Bulk Loader

The Virtuoso bulk loader can automatically handle compressed (.gz, .bz2, and .xz) files and will choose the appropriate decompression function to read the content from the file.

It then chooses an appropriate parser for the data based on the suffix of the file:

  • N-Quads when using the .nq or .n4 suffix
  • TriG when using the .trig suffix
  • RDF/XML when using the .xml, .owl, .rdf, or .rdfs suffix
  • Turtle (including N-Triples) when using the .ttl or .nt suffix

While the ld_dir_all() function allows the operator to provide a graph name, it is much simpler for the data directory to contain hints about the graph names to use, especially when a number of files need to be loaded, whether into one or several named graphs.

In this example, the file example2.nt.bz2 needs to be loaded into a different named graph than the other data files, so we create an example2.nt.graph file which contains the graph name for this data file. Note that although the data file has the .bz2 extension, the graph file does not.

The other two data files in this directory, example1.nt and example3.nt.gz, do not have their own graph hint file. In this case, the bulk loader sees there is a global.graph file in the same directory, and uses its contents for these two files.
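
A graph hint file simply contains the IRI of the target graph on a single line. Given the query results shown earlier, the two hint files in this example would look something like this:

data/example2.nt.graph:
urn:bulkload:test2

data/global.graph:
urn:bulkload:test1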

Finally, example4.nt.xz in the subdir directory also uses the information from global.graph, as subdirectories automatically inherit the graph hints from their parent directory.

If a file has neither its own .graph file nor a global.graph file in its directory, the graph argument of the ld_dir_all() function is used.
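
Once the bulk load has run, you can check which graph each file was assigned to, and whether any errors occurred, by querying the bulk loader's registry table, DB.DBA.load_list (an ll_state value of 2 indicates the file has been loaded):

select ll_file, ll_graph, ll_state, ll_error
  from DB.DBA.load_list;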

The ./scripts directory

This directory can contain a mix of shell (.sh) and Virtuoso PL (.sql) scripts that can perform tasks such as:

  • Installing additional Ubuntu packages
  • Loading data from remote locations such as Amazon S3 buckets, Google Drive, or other locations
  • Bulk loading data into the Virtuoso database
  • Installing additional VAD packages into the database
  • Adding new Virtuoso users (see the sketch after this list)
  • Granting permissions to Virtuoso users
  • Regenerating free-text indexes or other initial data

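As an illustration, a second init script, say 20-create-user.sql, covering the user-related items above might look like the following sketch (the user name and password are placeholders):

--
--  Create an additional SQL user and allow it to update the quad store
--
DB.DBA.USER_CREATE ('demo', 'demo-password')
;

GRANT SPARQL_UPDATE TO "demo"
;

Because the name sorts after 10-bulkload.sql, this script would run after the bulk load (see the note on ordering below).
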
The scripts are run only once, during the initial database creation; subsequent restarts of the docker instance will not cause these scripts to be re-run.

The scripts are run in alphabetical order, so we suggest prefixing the script names with sequence numbers to make the ordering explicit and obvious.

For security purposes, Virtuoso will run the .sql scripts in a special mode, and will not respond to connections on its SQL (1111) and/or HTTP (8890) ports until these scripts complete.

At the end of each .sql script, Virtuoso automatically performs a checkpoint, to make sure the changes are fully written back to the database. This is very important for scripts that use the bulk loader function rdf_loader_run() or manually change the ACID mode of the database for any reason.

After all the initialization scripts have run to completion, Virtuoso is started normally and begins listening for requests on its SQL (1111) and HTTP (8890) ports.

The 10-bulkload.sql script

--
--  Copyright (C) 2022 OpenLink Software
--

--
--  Add all files that end in .nt
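--  (the 'no-graph-*' names passed to ld_dir_all() in this script are only fallback
--   graph names; the .graph hint files described above take precedence)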
--
ld_dir_all ('data', '*.nt', 'no-graph-1')
;

--
--  Add all files that end in .bz2, .gz, or .xz, to show that the Virtuoso bulk loader 
--  can load compressed files without manual decompression
--
ld_dir_all ('data', '*.bz2', 'no-graph-3')
;

ld_dir_all ('data', '*.gz', 'no-graph-2')
;

ld_dir_all ('data', '*.xz', 'no-graph-4')
;

--
--  Now load all of the files found above into the database
--
rdf_loader_run()
;

--
--  End of script
--
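
For larger datasets, the single rdf_loader_run() call above can be replaced by several calls running in parallel, typically scaled to the number of available CPU cores. A minimal sketch, assuming the init scripts are executed through isql, which provides the '&' background operator and wait_for_children used in the Virtuoso bulk loading documentation:

--
--  Run three loaders in parallel, then wait for all of them to finish
--
rdf_loader_run() &
rdf_loader_run() &
rdf_loader_run() &

wait_for_children
;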