Transactional Bulk Loading of RDF Data into Virtuoso DBMS via the Jena RDF Framework

What

A simulation application for demonstrating how to transactionally bulk load RDF datasets into a Virtuoso DBMS using the Apache Jena framework.

Why

When bulk loading large amounts of data, it is vital to use transactions that honor the principles of atomicity, consistency, isolation, and durability (a.k.a. ACID), in conjunction with an appropriate concurrency mode (e.g., optimistic or pessimistic). Failure to do so will compromise DBMS integrity and lead to unexpected behaviour with regard to perceived performance and scale.

How

As RDF usage becomes more mainstream, its usage profile is evolving from solely decision support to a mix that now includes OLTP. In response to this reality, we’ve created a Jena RDF Loader application that uses the Virtuoso Native Jena Provider to perform transactional bulk load operations against a Virtuoso instance. The application's source code is in RDFLoader.java.

The objective here is to enable developers to understand and exercise the spectrum of modalities associated with the ACID characteristics of transaction and concurrency control.
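
To make the idea concrete, here is a minimal Java sketch of one way to load data in batches, with one transaction per batch, rolling back and retrying a batch when the DBMS reports a conflict (as can happen under optimistic concurrency or on a deadlock). It is not taken from RDFLoader.java: the `TxSink` interface, `ConflictException`, `InMemorySink`, and the `maxRetries` parameter are hypothetical names, and the in-memory sink only simulates a DBMS connection.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: none of these names come from RDFLoader.java. */
final class BatchedTransactionSketch {

    /** Hypothetical stand-in for a transactional connection to the DBMS. */
    interface TxSink<T> {
        void begin();
        void insert(List<T> batch);
        void commit();
        void rollback();
    }

    /** Hypothetical signal that the DBMS rejected a transaction (e.g. a deadlock). */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) { super(message); }
    }

    /** In-memory sink for demonstration: the first `failures` commits are rejected. */
    static final class InMemorySink<T> implements TxSink<T> {
        final List<T> stored = new ArrayList<>();
        private final List<T> pending = new ArrayList<>();
        private int failuresLeft;

        InMemorySink(int failures) { this.failuresLeft = failures; }

        public void begin() { pending.clear(); }
        public void insert(List<T> batch) { pending.addAll(batch); }
        public void commit() {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new ConflictException("simulated conflict");
            }
            stored.addAll(pending);
            pending.clear();
        }
        public void rollback() { pending.clear(); }
    }

    /**
     * Loads items in batches of batchSize, one transaction per batch.
     * A batch that hits a conflict is rolled back and retried up to maxRetries times.
     * Returns the number of committed transactions.
     */
    static <T> int load(List<T> items, int batchSize, int maxRetries, TxSink<T> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int committed = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            List<T> batch = items.subList(from, Math.min(from + batchSize, items.size()));
            int attempt = 0;
            while (true) {
                sink.begin();
                try {
                    sink.insert(batch);
                    sink.commit();
                    committed++;
                    break;
                } catch (ConflictException e) {
                    sink.rollback(); // atomicity: nothing from the failed batch is kept
                    if (++attempt > maxRetries) {
                        throw e;
                    }
                }
            }
        }
        return committed;
    }
}
```

For example, `load(Arrays.asList(1, 2, 3, 4, 5), 2, 3, new InMemorySink<>(1))` splits the input into three batches, retries the first batch once, and returns 3. Against a real DBMS, the choice of isolation and concurrency mode determines how often such conflicts occur and how much work a retry repeats.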

To Use:

  1. Clone the git repository:
git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
  2. Go to the RDFLoader directory:
cd Virtuoso_Java_Samples/Jena/RDFLoader/
  3. Compile the application with the command:
./gradlew clean build
  4. Run the application with the command:
./gradlew run
  5. Application configuration settings are in the config.json file, where:

"conn" block:

   "isolationMode": read_uncommitted | read_committed | repeatable_read | serializable 
                  default isolationMode = repeatable_read 

   "concurrencyMode": default | optimistic | pessimistic 
                  default concurrencyMode = default 

   "batch_size": the size of one chunk of data sent to the server

   "useAutoCommit": false | true

   "clear_graph": the name of a graph to clear before inserting data, if required; may be empty

   "max_threads": maximum number of worker threads; by default, this equals the number of files to upload

   "data_dir": directory name with data files

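As an illustration of what the "isolationMode" values could correspond to, the following Java sketch maps them to the standard JDBC transaction isolation constants on `java.sql.Connection`. This is an assumption made for the sake of example, not a description of how RDFLoader.java handles the setting; the class and method names are hypothetical, and the case-insensitive matching is a convenience added here. The fallback to repeatable_read mirrors the default stated above.

```java
import java.sql.Connection;
import java.util.Locale;

/** Illustrative only: the mapping below is an assumption, not RDFLoader's code. */
final class IsolationModeSketch {

    /** Translates an "isolationMode" string into a JDBC isolation level constant. */
    static int toJdbcLevel(String isolationMode) {
        // A missing or empty value falls back to the documented default.
        String mode = (isolationMode == null || isolationMode.isEmpty())
                ? "repeatable_read"
                : isolationMode.toLowerCase(Locale.ROOT);
        switch (mode) {
            case "read_uncommitted":
                return Connection.TRANSACTION_READ_UNCOMMITTED;
            case "read_committed":
                return Connection.TRANSACTION_READ_COMMITTED;
            case "repeatable_read":
                return Connection.TRANSACTION_REPEATABLE_READ;
            case "serializable":
                return Connection.TRANSACTION_SERIALIZABLE;
            default:
                throw new IllegalArgumentException("Unknown isolationMode: " + isolationMode);
        }
    }
}
```

With plain JDBC, the resulting value would be passed to `Connection.setTransactionIsolation(int)`, and a "useAutoCommit" value of false would correspond to `Connection.setAutoCommit(false)` followed by explicit commits.
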
"data" block
Contains the list of files comprising the data to be loaded into a Virtuoso DBMS instance.
By default, this app starts one DBMS connection (on its own thread) for each source file.

   "file" : file name that includes its path
   "type" : content-type, which may be one of: 
               "RDF/XML" | "TURTLE" | "TTL" | "N3" | "NTRIPLES" | "JSON-LD" | 
               "JSON-LD10" | "JSON-LD11" | "RDF/JSON" | "TRIG" | "NQUADS" | 
               "RDF-PROTO" | "RDF-THRIFT" | "SHACLC" | "TRIX"
   "graph": named graph, denoted by an IRI, that names the internal DBMS storage of the data. Note that this may be left empty if the source data comprises quads.
   "clear_graph": true | false; true clears any existing data in the destination named graph before the new load run begins.

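The one-thread-per-file default and the "max_threads" cap can be sketched in Java as follows. This is a hedged illustration rather than the code in RDFLoader.java: the class, the `poolSize` and `loadAll` methods, and the `loadOneFile` callback are hypothetical, and treating a "max_threads" value of 0 (or less) as "use the default" is an assumption suggested by the sample configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Illustrative only: names and the handling of max_threads = 0 are assumptions. */
final class ParallelLoadSketch {

    /** Number of worker threads: one per file unless maxThreads sets a lower cap. */
    static int poolSize(int maxThreads, int fileCount) {
        if (fileCount <= 0) {
            return 1; // a fixed thread pool needs at least one thread
        }
        return (maxThreads <= 0) ? fileCount : Math.min(maxThreads, fileCount);
    }

    /**
     * Runs loadOneFile once per file on a bounded pool and waits for all of them.
     * Returns the number of files that completed without an exception.
     */
    static int loadAll(List<String> files, int maxThreads, Consumer<String> loadOneFile) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize(maxThreads, files.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String file : files) {
                // In a real loader, each task would open its own DBMS connection.
                pending.add(pool.submit(() -> loadOneFile.accept(file)));
            }
            int done = 0;
            for (Future<?> task : pending) {
                task.get(); // rethrows a failure raised inside the worker
                done++;
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A file failed to load", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With ten source files, `poolSize(0, 10)` yields ten threads, while `poolSize(4, 10)` caps the pool at four so that files queue for a free worker.
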
Example Usage

$ git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
Cloning into 'Virtuoso_Java_Samples'...
remote: Enumerating objects: 505, done.
remote: Counting objects: 100% (505/505), done.
remote: Compressing objects: 100% (260/260), done.
remote: Total 505 (delta 229), reused 442 (delta 179), pack-reused 0
Receiving objects: 100% (505/505), 2.18 MiB | 30.51 MiB/s, done.
Resolving deltas: 100% (229/229), done
$ cd Virtuoso_Java_Samples/Jena/RDFLoader/
$ ls -l
total 36
-rw-rw-r-- 1 ubuntu ubuntu  587 May 20 10:15 build.gradle
-rw-rw-r-- 1 ubuntu ubuntu 1565 May 20 10:15 config.json
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 gradle
-rwxrwxr-x 1 ubuntu ubuntu 5766 May 20 10:15 gradlew
-rw-rw-r-- 1 ubuntu ubuntu 2763 May 20 10:15 gradlew.bat
drwxrwxr-x 2 ubuntu ubuntu 4096 May 20 10:15 lib
-rw-rw-r-- 1 ubuntu ubuntu 1619 May 20 10:15 readme.txt
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 src
$ cat config.json
{
  "conn": {
     "host": "localhost",
     "port": 1111,
     "uid": "dba",
     "pwd": "dba",
     "isolationMode": "repeatable_read",
     "concurrencyMode": "default",
     "batch_size": 5000,
     "useAutoCommit": false,
     "clear_graph": "test:insert",
     "data_dir":  ".",
     "max_threads": 0
  },
  "data": 
    [ 
      {"file": "xaa.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xab.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xac.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xad.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xae.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xaf.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xag.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xah.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xai.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xaj.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      }

    ]
}
$ ./gradlew clean build

BUILD SUCCESSFUL in 2s
6 actionable tasks: 6 executed
$ ./gradlew run

> Task :run
===========================================================================

App will use next options
    hostname = localhost
        port = 1111
         UID = dba
         PWD = dba
   isolation = read_committed
 concurrency = optimistic
  chunk_size = 20000
useAutoCommit= false
===========================================================================

   ==[] Start clear graph = test:insert
   ==[] End clear graph = test:insert
   ==[Thread-8] Start load data = xai.ttl
   ==[Thread-3] Start load data = xad.ttl
   ==[Thread-5] Start load data = xaf.ttl
   ==[Thread-7] Start load data = xah.ttl
   ==[Thread-1] Start load data = xab.ttl
   ==[Thread-0] Start load data = xaa.ttl
   ==[Thread-9] Start load data = xaj.ttl
   ==[Thread-6] Start load data = xag.ttl
   ==[Thread-2] Start load data = xac.ttl
   ==[Thread-4] Start load data = xae.ttl
   ==[Thread-1] End load data = 
   ==[Thread-1] DONE = 
   ==[Thread-0] End load data = 
   ==[Thread-0] DONE = 
   ==[Thread-2] End load data = 
   ==[Thread-2] DONE = 
   ==[Thread-7] End load data = 
   ==[Thread-7] DONE = 
   ==[Thread-8] End load data = 
   ==[Thread-8] DONE = 
   ==[Thread-9] End load data = 
   ==[Thread-9] DONE = 
   ==[Thread-4] End load data = 
   ==[Thread-4] DONE = 
   ==[Thread-6] End load data = 
   ==[Thread-6] DONE = 
   ==[Thread-3] End load data = 
   ==[Thread-3] DONE = 
   ==[Thread-5] End load data = 
   ==[Thread-5] DONE = 

BUILD SUCCESSFUL in 10m 13s
2 actionable tasks: 1 executed, 1 up-to-date
$

Related