Producing RDF dumps of Virtuoso Quad-store hosted RDF model data

What

How to export RDF model data from Virtuoso’s Quad Store in NQuad or TTL formats.

Why

Every DBMS needs to offer a mechanism for bulk export and import of data.

When exporting RDF model data from Virtuoso’s Quad Store, having the ability to retain and reflect Named Graph IRI based data partitioning provides significant value to a variety of application profiles.

Virtuoso supports dumping and reloading graph model data (e.g., RDF) as well as relational data (e.g., SQL); the latter is discussed elsewhere.

How

Virtuoso provides a number of procedures for dumping RDF Quad Store data to RDF dataset dumps, as detailed below.

RDF_DUMP_GRAPH()

The RDF_DUMP_GRAPH procedure can be used to dump a named graph in the Virtuoso Quad Store to RDF dataset files in TTL format.

Procedure Signature and Parameters

The procedure signature is:

RDF_DUMP_GRAPH (IN srcgraph VARCHAR, IN out_file VARCHAR, IN file_length_limit INTEGER := 1000000000)

  • IN srcgraph VARCHAR – source graph
  • IN out_file VARCHAR – output file
  • IN file_length_limit INTEGER – maximum length of dump files

Usage Example

Call the RDF_DUMP_GRAPH procedure with appropriate arguments:

$ pwd 
/opt/virtuoso/database

$ grep DirsAllowed virtuoso.ini
DirsAllowed              = ., ../vad,

$ /opt/virtuoso/bin/isql 1111
Connected to OpenLink Virtuoso
Driver: 08.03.3323 OpenLink Virtuoso ODBC Driver
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL> RDF_DUMP_GRAPH ('http://daas.openlinksw.com/data#', './data_', 1000000000); 
Done. -- 1438 msec.

As a result, a dump of the graph http://daas.openlinksw.com/data# will be found in the files data_000001.ttl, data_000002.ttl, and so on (located in your Virtuoso database directory):

$ ls
data_000001.ttl
data_000002.ttl
....
data_000001.ttl.graph

RDF_DUMP_NQUADS

The RDF_DUMP_NQUADS procedure leverages SPARQL to dump ALL graphs, excluding the internal predefined "virtrdf:" graph.

Procedure Signature and Parameters

The procedure signature is:

RDF_DUMP_NQUADS (IN dir VARCHAR := 'dumps', IN start_from INT := 1, IN file_length_limit INTEGER := 100000000, IN comp INT := 1)

  • IN dir VARCHAR – folder where the dumps will be stored. Note: The dump directory must be included in the DirsAllowed parameter of the Virtuoso configuration file (e.g., virtuoso.ini), or the Virtuoso server will not be able to create or access the data files.
  • IN start_from INT – number at which output file numbering starts
  • IN file_length_limit INTEGER – maximum length of dump files
  • IN comp INTEGER – when set to 0, the dump files are not gzip-compressed. Defaults to 1.
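Before running a dump, it can help to see exactly which directories the server is allowed to write to. The following is a minimal sketch, not part of Virtuoso itself: list_dirs_allowed is a hypothetical helper name, and it assumes the DirsAllowed setting sits on a single line of the INI file, as in the grep output shown earlier.

```shell
# Hypothetical helper (not shipped with Virtuoso): print each DirsAllowed
# entry of the INI file given as $1 on its own line, so the dump directory
# can be checked by eye. Assumes the setting fits on one line.
list_dirs_allowed() {
  grep -E '^[[:space:]]*DirsAllowed[[:space:]]*=' "$1" \
    | head -n 1 \
    | cut -d= -f2- \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v '^$'
}

# Example call, using the path from the session above:
#   list_dirs_allowed /opt/virtuoso/database/virtuoso.ini
```

Relative entries such as "." are presumably resolved against the directory the server was started in, so compare them with the location of the intended dump directory rather than with the literal string passed as dir.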

Usage Example

This example demonstrates calling the RDF_DUMP_NQUADS procedure to dump all graphs to a series of compressed NQuad dumps, each with an uncompressed length of up to 10 MB (./dumps/output000001.nq.gz):

SQL> RDF_DUMP_NQUADS ('dumps', 1, 10000000, 1);

As a result, a dataset dump of ALL the graphs in the Virtuoso Quad Store can be found in the dumps directory (located in your Virtuoso database directory):

$ ls -ltr
total 12740
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-000001.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 132 Aug 24 10:33 rdf-dump-000002.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 118 Aug 24 10:33 rdf-dump-000003.nq.gz
.
.
.
-rw-r--r-- 1 ubuntu ubuntu 113 Aug 24 10:33 rdf-dump-003138.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 110 Aug 24 10:33 rdf-dump-003139.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-003140.nq.gz
$
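A quick sanity check after a dump is to count the statements that were written. The sketch below relies on the N-Quads format carrying one statement per line; count_dumped_quads is a hypothetical helper name, and skipping blank and "#" comment lines is a precaution rather than a statement about what Virtuoso emits.

```shell
# Hypothetical helper: print the total number of N-Quads statements found in
# all *.nq.gz files of the directory given as $1. Blank lines and lines that
# start with '#' are not counted.
count_dumped_quads() {
  dump_dir=$1
  total=0
  for f in "$dump_dir"/*.nq.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # grep -c exits non-zero when the count is 0, hence the '|| true'.
    n=$(gzip -dc "$f" | grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' || true)
    total=$((total + n))
  done
  echo "$total"
}

# Example call, from the Virtuoso database directory:
#   count_dumped_quads dumps
```

The result can be compared with a quad count taken from the source instance, keeping in mind that the internal graph excluded from the dump is not part of the files.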

RDF_DUMP_NQUADS_MT

RDF_DUMP_NQUADS_MT is a newer NQuad dump procedure that enables multi-threaded dumping, which is useful when there are many graphs to be dumped. This function also ensures that all of a graph's NQuads are stored in the same file, so that its blank node values are loaded at the same time and stay consistent.

The function signature and parameters are:

RDF_DUMP_NQUADS_MT (IN n_threads INTEGER, IN dir VARCHAR := 'dumps', IN file_length_limit INTEGER := 100000000, IN comp INTEGER := 1, IN fix INTEGER := 1)

  • IN n_threads INTEGER - number of threads used for the dump, typically equal to the number of available CPUs
  • IN dir VARCHAR – folder where the dumps will be stored. Note: The dump directory must be included in the DirsAllowed parameter of the Virtuoso configuration file (e.g., virtuoso.ini), or the Virtuoso server will not be able to create or access the data files.
  • IN file_length_limit INTEGER – maximum length of dump files
  • IN comp INTEGER – when set to 0, the dump files are not gzip-compressed. Defaults to 1.
  • IN fix INTEGER - internal fix enabled by default
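Since the thread count is typically matched to the CPU count, the call can be generated rather than typed by hand. This is a minimal sketch under stated assumptions: mt_dump_sql is a hypothetical helper name, the script is assumed to run on the Virtuoso server host, and the last three arguments simply repeat the defaults from the signature above.

```shell
# Hypothetical helper: print an RDF_DUMP_NQUADS_MT call for isql.
#   $1 = number of threads, $2 = dump directory (must be in DirsAllowed)
# file_length_limit, comp and fix are left at their documented defaults.
mt_dump_sql() {
  printf "RDF_DUMP_NQUADS_MT (%d, '%s', 100000000, 1, 1);\n" "$1" "$2"
}

# Example call: use the number of online CPUs as the thread count and feed
# the statement to isql on its standard input.
#   mt_dump_sql "$(getconf _NPROCESSORS_ONLN)" dumps | /opt/virtuoso/bin/isql 1111
```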

Usage Example

This example demonstrates calling the RDF_DUMP_NQUADS_MT procedure with 4 threads to dump all graphs to a series of compressed NQuad dumps, each with an uncompressed length of up to 100 MB (./dumps/output000001.nq.gz):

SQL> RDF_DUMP_NQUADS_MT(4, 'dumps', 100000000, 1, 1);

As a result, a dataset dump of ALL the graphs in the Virtuoso Quad Store can be found in the dumps directory (located in your Virtuoso database directory):

$ ls -ltr
total 12740
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-000001.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 132 Aug 24 10:33 rdf-dump-000002.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 118 Aug 24 10:33 rdf-dump-000003.nq.gz
.
.
.
-rw-r--r-- 1 ubuntu ubuntu 113 Aug 24 10:33 rdf-dump-003138.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 110 Aug 24 10:33 rdf-dump-003139.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-003140.nq.gz
$

RDF_DUMP_NQUADS_MT2

RDF_DUMP_NQUADS_MT is subject to a 2M (million) limit on the size of a vector(), which can be hit when dumping an RDF store with a large number of graphs. In that case, the RDF_DUMP_NQUADS_MT2() procedure can be used to get around this limit, although it does run slower. The function signature and parameters are:

RDF_DUMP_NQUADS_MT2 (IN n_threads INTEGER, IN n_per_slice INTEGER, IN dir VARCHAR := 'dumps', IN file_length_limit INTEGER := 100000000, IN comp INTEGER := 1, IN fix INTEGER := 1)

  • IN n_threads INTEGER - number of threads used for the dump, typically equal to the number of available CPUs
  • IN n_per_slice INTEGER - maximum number of graphs per dataset file
  • IN dir VARCHAR – folder where the dumps will be stored. Note: The dump directory must be included in the DirsAllowed parameter of the Virtuoso configuration file (e.g., virtuoso.ini), or the Virtuoso server will not be able to create or access the data files.
  • IN file_length_limit INTEGER – maximum length of dump files
  • IN comp INTEGER – when set to 0, the dump files are not gzip-compressed. Defaults to 1.
  • IN fix INTEGER - internal fix enabled by default

The parameters are the same as for RDF_DUMP_NQUADS_MT(), except that RDF_DUMP_NQUADS_MT2() takes a second argument, n_per_slice, indicating the maximum number of graphs that can be grouped together in the same dataset file.

Usage Example

This example demonstrates calling the RDF_DUMP_NQUADS_MT2 procedure with 4 threads and at most 100 graphs per dataset file to dump all graphs to a series of compressed NQuad dumps, each with an uncompressed length of up to 100 MB (./dumps/output000001.nq.gz):

SQL> RDF_DUMP_NQUADS_MT2(4, 100, 'dumps', 100000000, 1, 1);

As a result, a dataset dump of ALL the graphs in the Virtuoso Quad Store can be found in the dumps directory (located in your Virtuoso database directory):

$ ls -ltr
total 12740
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-000001.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 132 Aug 24 10:33 rdf-dump-000002.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 118 Aug 24 10:33 rdf-dump-000003.nq.gz
.
.
.
-rw-r--r-- 1 ubuntu ubuntu 113 Aug 24 10:33 rdf-dump-003138.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 110 Aug 24 10:33 rdf-dump-003139.nq.gz
-rw-r--r-- 1 ubuntu ubuntu 119 Aug 24 10:33 rdf-dump-003140.nq.gz
$

Note: The RDF_DUMP_NQUADS_MT... functions are available by default only in Virtuoso 8.x. With Virtuoso 7 (open source or commercial), the dump_nquads_mt.sql.zip (1.5 KB) script can be loaded manually to make them available.

Load datasets

Dumped dataset files can then be loaded into another Virtuoso instance using the Virtuoso RDF Bulk Loader process.
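As a rough sketch of that reload step, the helper below prints the statements usually associated with the bulk loader (ld_dir to register files, rdf_loader_run to load them, checkpoint to persist the result). The name bulk_load_sql and the example graph IRI are hypothetical; for NQuad files the graph argument is understood to act only as a fallback, and the directory must also be listed in DirsAllowed on the target instance.

```shell
# Hypothetical helper: print bulk-loader statements for isql.
#   $1 = directory holding the dump files (as seen by the target server)
#   $2 = file mask, e.g. '*.nq.gz'
#   $3 = fallback graph IRI (illustrative value in the example below)
# Arguments are inserted as-is, so they must not contain single quotes.
bulk_load_sql() {
  printf "ld_dir ('%s', '%s', '%s');\n" "$1" "$2" "$3"
  printf 'rdf_loader_run ();\n'
  printf 'checkpoint;\n'
}

# Example call, feeding the statements to isql on its standard input:
#   bulk_load_sql dumps '*.nq.gz' http://example.org/fallback | /opt/virtuoso/bin/isql 1111
```

Printing the statements first makes it easy to review them before they are sent to the server.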

Related