Transactional Bulk Loading of RDF Data into Virtuoso DBMS via the Jena RDF Framework

What

A simulation application demonstrating how to transactionally bulk-load RDF datasets into a Virtuoso DBMS using the Apache Jena framework.

Why

When bulk-loading large amounts of data, it is vital to use transactions that honor the principles of atomicity, consistency, isolation, and durability (ACID), in conjunction with an appropriate concurrency mode (e.g., optimistic or pessimistic). Failure to do so will compromise DBMS integrity and lead to an inaccurate perception of performance and scale.

How

As RDF becomes more mainstream, its usage profile is evolving from solely decision support to a mix that now includes OLTP. In response, we’ve created a Jena RDF Loader application that uses the Virtuoso Native Jena Provider to perform transactional bulk-load operations against a Virtuoso instance. The source code for the RDFLoader.java application is also available.

The objective here is to enable developers to understand and exercise the spectrum of modalities associated with the ACID characteristics of transaction and concurrency control.
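
To make the idea of a transactional bulk load concrete, here is a minimal sketch of one common pattern: split the input into fixed-size batches and wrap each batch in its own transaction, rolling back when a batch fails. It is not taken from RDFLoader.java. The names BatchLoader, TxSink, and MemorySink are hypothetical, and the in-memory sink stands in for a real DBMS connection so that the example runs without a database.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: loads items in batches, one transaction per batch. */
class BatchLoader {
    /** Returns the number of items committed; stops at the first failed batch. */
    static <T> int load(List<T> items, int batchSize, TxSink<T> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int loaded = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            List<T> batch = items.subList(from, to);
            sink.begin();
            try {
                sink.add(batch);
                sink.commit();          // the batch becomes visible as a unit
                loaded += batch.size();
            } catch (Exception e) {
                sink.rollback();        // nothing from the failed batch is kept
                break;                  // earlier batches stay committed
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            triples.add("triple-" + i);
        }
        MemorySink sink = new MemorySink();
        int n = load(triples, 5, sink);
        System.out.println(n + " items in " + sink.commits + " commits");
    }
}

/** Hypothetical stand-in for a transactional target such as a DBMS connection. */
interface TxSink<T> {
    void begin();
    void add(List<T> batch) throws Exception;
    void commit();
    void rollback();
}

/** In-memory sink, present only so the sketch runs without a database. */
class MemorySink implements TxSink<String> {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    int commits = 0;
    int rollbacks = 0;

    public void begin() { pending.clear(); }

    public void add(List<String> batch) throws Exception {
        if (batch.contains("BAD")) {
            throw new Exception("rejected batch"); // simulated server-side failure
        }
        pending.addAll(batch);
    }

    public void commit() { committed.addAll(pending); pending.clear(); commits++; }

    public void rollback() { pending.clear(); rollbacks++; }
}
```

Note that this pattern makes each batch atomic, not the load as a whole: if a later batch fails, the earlier batches remain committed. Whether that is acceptable depends on the workload.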

To Use

  1. Clone the Git repository:

    git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
    
  2. Go to the RDFLoader directory:

    cd Virtuoso_Java_Samples/Jena/RDFLoader/
    
  3. Compile the application with the command:

    gradlew clean build
    
  4. Run the application with the command:

    gradlew run
    
  5. Application configuration settings are in the config.json file, where:

    conn block:

    "isolationMode": read_uncommitted | read_committed | 
                     repeatable_read | serializable 
                     default isolationMode = repeatable_read 
    
    "concurrencyMode": default | optimistic | pessimistic 
                       default concurrencyMode = default 
    
    "batch_size": the size of one chunk of data sent to the server
    
    "useAutoCommit": false | true
    
    "clear_graph": the name of the graph that will be cleared before inserting 
                   data, if required; may be empty
    
    "max_threads": maximum number of worker threads
                   default max_threads = number of files to load
    
    "data_dir": directory containing the data files
    

    data block:
    Contains the list of files whose data will be loaded into a Virtuoso DBMS instance.
    By default, this app starts ONE DBMS connection (with a single thread) PER SOURCE FILE.

    "file" : file name that includes its path
    
    "type" : content-type, which may be one of: 
                "RDF/XML" | "TURTLE" | "TTL" | "N3" | "NTRIPLES" | "JSON-LD" | 
                "JSON-LD10" | "JSON-LD11" | "RDF/JSON" | "TRIG" | "NQUADS" | 
                "RDF-PROTO" | "RDF-THRIFT" | "SHACLC" | "TRIX"
    
    "graph": IRI of the named graph in which the data is stored in the DBMS. 
             May be left empty if the source data comprises quads
    
    "clear_graph": true - indicates that the destination named graph will be 
                   cleared of existing data before the new data load run 
                   begins.
    
    "gzipped": true - indicates that the file was compressed with gzip. It is 
               set automatically if the file name has the additional extension 
               .gz or .z, for example test.ttl.gz.
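
As an illustration of how two of the settings above might be interpreted, the sketch below maps the isolationMode strings to the standard java.sql.Connection isolation constants and applies the documented .gz/.z rule for gzipped. The class ConfigHints and its methods are hypothetical and are not taken from RDFLoader.java. Treating the values and extensions case-insensitively is an assumption. An empty isolationMode falls back to repeatable_read, the documented default.

```java
import java.sql.Connection;
import java.util.Locale;

/** Sketch only: hypothetical helpers for interpreting config.json values. */
class ConfigHints {
    /** Maps an isolationMode string to the matching java.sql.Connection constant. */
    static int toJdbcIsolation(String mode) {
        if (mode == null || mode.trim().isEmpty()) {
            return Connection.TRANSACTION_REPEATABLE_READ; // documented default
        }
        switch (mode.trim().toLowerCase(Locale.ROOT)) {
            case "read_uncommitted": return Connection.TRANSACTION_READ_UNCOMMITTED;
            case "read_committed":   return Connection.TRANSACTION_READ_COMMITTED;
            case "repeatable_read":  return Connection.TRANSACTION_REPEATABLE_READ;
            case "serializable":     return Connection.TRANSACTION_SERIALIZABLE;
            default:
                throw new IllegalArgumentException("Unknown isolationMode: " + mode);
        }
    }

    /** True if the flag is set or the file name ends in .gz or .z (case-insensitive here). */
    static boolean looksGzipped(String fileName, boolean gzippedFlag) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return gzippedFlag || lower.endsWith(".gz") || lower.endsWith(".z");
    }

    public static void main(String[] args) {
        System.out.println(looksGzipped("test.ttl.gz", false)); // true
        System.out.println(looksGzipped("xaa.ttl", false));     // false
        System.out.println(toJdbcIsolation("serializable")
                == Connection.TRANSACTION_SERIALIZABLE);        // true
    }
}
```

The integer returned by toJdbcIsolation is the kind of value that the standard JDBC method Connection.setTransactionIsolation(int) accepts.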
    

Example Usage

$ git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
Cloning into 'Virtuoso_Java_Samples'...
remote: Enumerating objects: 505, done.
remote: Counting objects: 100% (505/505), done.
remote: Compressing objects: 100% (260/260), done.
remote: Total 505 (delta 229), reused 442 (delta 179), pack-reused 0
Receiving objects: 100% (505/505), 2.18 MiB | 30.51 MiB/s, done.
Resolving deltas: 100% (229/229), done
$ cd Virtuoso_Java_Samples/Jena/RDFLoader/
$ ls -l
total 36
-rw-rw-r-- 1 ubuntu ubuntu  587 May 20 10:15 build.gradle
-rw-rw-r-- 1 ubuntu ubuntu 1565 May 20 10:15 config.json
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 gradle
-rwxrwxr-x 1 ubuntu ubuntu 5766 May 20 10:15 gradlew
-rw-rw-r-- 1 ubuntu ubuntu 2763 May 20 10:15 gradlew.bat
drwxrwxr-x 2 ubuntu ubuntu 4096 May 20 10:15 lib
-rw-rw-r-- 1 ubuntu ubuntu 1619 May 20 10:15 readme.txt
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 src
$ cat config.json
{
  "conn": {
     "host": "localhost",
     "port": 1111,
     "uid": "dba",
     "pwd": "dba",
     "isolationMode": "repeatable_read",
     "concurrencyMode": "default",
     "batch_size": 5000,
     "useAutoCommit": false,
     "clear_graph": "test:insert",
     "data_dir":  ".",
     "max_threads": 0
  },
  "data": 
    [ 
      {"file": "xaa.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xab.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xac.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xad.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xae.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xaf.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xag.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xah.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xai.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      },
      {"file": "xaj.ttl",
       "type": "ttl",
       "graph": "test:insert",
       "clear_graph": false
      }

    ]
}
$ ./gradlew clean build

BUILD SUCCESSFUL in 2s
6 actionable tasks: 6 executed
$ ./gradlew run

> Task :run
===========================================================================

App will use next options
    hostname = localhost
        port = 1111
         UID = dba
         PWD = dba
   isolation = read_committed
 concurrency = optimistic
  chunk_size = 20000
useAutoCommit= false
===========================================================================

   ==[] Start clear graph = test:insert
   ==[] End clear graph = test:insert
   ==[Thread-8] Start load data = xai.ttl
   ==[Thread-3] Start load data = xad.ttl
   ==[Thread-5] Start load data = xaf.ttl
   ==[Thread-7] Start load data = xah.ttl
   ==[Thread-1] Start load data = xab.ttl
   ==[Thread-0] Start load data = xaa.ttl
   ==[Thread-9] Start load data = xaj.ttl
   ==[Thread-6] Start load data = xag.ttl
   ==[Thread-2] Start load data = xac.ttl
   ==[Thread-4] Start load data = xae.ttl
   ==[Thread-1] End load data = 
   ==[Thread-1] DONE = 
   ==[Thread-0] End load data = 
   ==[Thread-0] DONE = 
   ==[Thread-2] End load data = 
   ==[Thread-2] DONE = 
   ==[Thread-7] End load data = 
   ==[Thread-7] DONE = 
   ==[Thread-8] End load data = 
   ==[Thread-8] DONE = 
   ==[Thread-9] End load data = 
   ==[Thread-9] DONE = 
   ==[Thread-4] End load data = 
   ==[Thread-4] DONE = 
   ==[Thread-6] End load data = 
   ==[Thread-6] DONE = 
   ==[Thread-3] End load data = 
   ==[Thread-3] DONE = 
   ==[Thread-5] End load data = 
   ==[Thread-5] DONE = 

BUILD SUCCESSFUL in 10m 13s
2 actionable tasks: 1 executed, 1 up-to-date
$
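
The interleaved Thread-N lines in the run above are what one would expect when each source file is handled by its own worker. The sketch below shows one way to size such a pool from max_threads and the number of files, using the standard java.util.concurrent executor API. It is an illustration and not the code in RDFLoader.java. The class LoaderPool is hypothetical, the load step is a placeholder, and reading a max_threads value of 0 as "use the default" is an inference from the sample config.json.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch only: one task per source file on a bounded thread pool. */
class LoaderPool {
    /** Zero or a negative maxThreads is treated as "one thread per file" (assumption). */
    static int workerCount(int maxThreads, int fileCount) {
        return (maxThreads <= 0) ? fileCount : Math.min(maxThreads, fileCount);
    }

    /** Runs one placeholder task per file; returns file names in submission order. */
    static List<String> loadAll(List<String> files, int maxThreads) {
        List<String> finished = new ArrayList<>();
        if (files.isEmpty()) {
            return finished;
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(workerCount(maxThreads, files.size()));
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String file : files) {
                pending.add(pool.submit(() -> {
                    // Placeholder: a real worker would open its own DBMS
                    // connection here and load the file's contents.
                    return file;
                }));
            }
            for (Future<String> f : pending) {
                finished.add(f.get()); // wait for each task to complete
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A load task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return finished;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("xaa.ttl", "xab.ttl", "xac.ttl");
        System.out.println(loadAll(files, 0));
    }
}
```

Capping the pool at the number of files avoids creating idle threads when max_threads is larger than the file list.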

Related