What
A simulation application demonstrating how to transactionally bulk-load RDF datasets into a Virtuoso DBMS using the Apache Jena framework.
Why
When bulk-loading large amounts of data, it is vital to use transactions that honor the principles of atomicity, consistency, isolation, and durability (a/k/a ACID), in conjunction with an appropriate concurrency mode (e.g., optimistic or pessimistic). Failure to do so can compromise DBMS integrity and give a misleading picture of performance and scalability.
How
As RDF becomes more mainstream, its usage profile is evolving from decision support alone to a mix that now includes OLTP. In response, we’ve created a Jena RDF Loader application that uses the Virtuoso Native Jena Provider to perform transactional bulk-load operations against a Virtuoso instance. The source code for the RDFLoader.java application is also available.
The objective here is to enable developers to understand and exercise the spectrum of modalities associated with the ACID characteristics of transaction and concurrency control.
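The core idea of a transactional bulk load, sending data in chunks and making each chunk atomic, can be sketched as follows. This is a minimal illustration and not the RDFLoader.java source: the TxSink interface, the MemorySink class, and the load method are hypothetical names, and committing once per chunk is an assumption about how the loader behaves when auto-commit is off.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a transactional store; NOT the Virtuoso Jena Provider API. */
interface TxSink {
    void begin();
    void add(List<String> batch);
    void commit();
    void rollback();
}

final class BatchLoadSketch {
    /**
     * Sends items in chunks of batchSize, one transaction per chunk (an assumption).
     * Returns the number of items committed. A failed chunk is rolled back and the
     * exception is rethrown, so earlier committed chunks stay in place.
     */
    static int load(List<String> items, int batchSize, TxSink sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int committed = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            List<String> batch = items.subList(from, to);
            sink.begin();
            try {
                sink.add(batch);
                sink.commit();
                committed += batch.size();
            } catch (RuntimeException e) {
                sink.rollback(); // discard the partially sent chunk
                throw e;
            }
        }
        return committed;
    }
}

/** In-memory sink used only to exercise the sketch. */
class MemorySink implements TxSink {
    final List<String> store = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    int commits = 0;

    public void begin() { pending.clear(); }
    public void add(List<String> batch) { pending.addAll(batch); }
    public void commit() { store.addAll(pending); pending.clear(); commits++; }
    public void rollback() { pending.clear(); }
}
```

Calling BatchLoadSketch.load(items, 2, new MemorySink()) with five items would issue three commits. In a real run, the isolation and concurrency modes chosen for the connection determine how such per-chunk transactions from parallel threads interact.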
To Use
- Clone the git repository:
  git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
- Go to the RDFLoader directory:
  cd Virtuoso_Java_Samples/Jena/RDFLoader/
- Compile the application with the command:
  gradlew clean build
- Run the application with the command:
  gradlew run
- Application configuration settings are in the file config.json, where:
  conn block:
    "isolationMode": read_uncommitted | read_committed | repeatable_read | serializable
        default isolationMode = repeatable_read
    "concurrencyMode": default | optimistic | pessimistic
        default concurrencyMode = default
    "batch_size": the size of one chunk of data that will be sent to the server
    "useAutoCommit": false | true
    "clear_graph": the name of the graph that will be cleared before inserting data, if required; may be empty
    "max_threads": maximum number of worker threads
        default max_threads = number of files to be uploaded
    "data_dir": name of the directory containing the data files
  data block:
    Contains the list of files comprising the data to be loaded into a Virtuoso DBMS instance.
    By default, this app starts ONE DBMS connection (with a single thread) PER SOURCE FILE.
    "file": file name, including its path
    "type": content type, which may be one of: "RDF/XML" | "TURTLE" | "TTL" | "N3" | "NTRIPLES" | "JSON-LD" | "JSON-LD10" | "JSON-LD11" | "RDF/JSON" | "TRIG" | "NQUADS" | "RDF-PROTO" | "RDF-THRIFT" | "SHACLC" | "TRIX"
    "graph": named graph, denoted by an IRI, that names the internal DBMS storage of the data. Note that this may be left empty if the source data comprises quads.
    "clear_graph": true indicates that the destination named graph will be cleared of existing data before the new data load run begins.
    "gzipped": true indicates that the file was compressed with GZip. It is set automatically if a file has the additional extension .gz or .z, for example test.ttl.gz.
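How a loader might interpret a few of these settings can be sketched as below. This is an illustrative guess rather than the actual RDFLoader.java logic: the ConfigSketch class and its method names are hypothetical, treating max_threads <= 0 as "use the default" is an assumption, case-insensitive extension matching is an assumption, and mapping isolationMode onto the standard java.sql.Connection constants is an assumption about how the mode reaches the DBMS.

```java
import java.sql.Connection;
import java.util.Locale;

final class ConfigSketch {
    /** True when the name carries an additional .gz or .z extension (case-insensitive by assumption). */
    static boolean isGzipped(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".z");
    }

    /**
     * Number of worker threads to start. A non-positive max_threads is treated
     * as "default", i.e. one thread per source file (an assumption).
     */
    static int workerCount(int maxThreads, int fileCount) {
        if (fileCount <= 0) {
            return 0;
        }
        return maxThreads <= 0 ? fileCount : Math.min(maxThreads, fileCount);
    }

    /** Maps an isolationMode string to a JDBC isolation constant; a missing value falls back to repeatable_read. */
    static int isolationLevel(String mode) {
        if (mode == null || mode.isEmpty()) {
            return Connection.TRANSACTION_REPEATABLE_READ;
        }
        switch (mode.toLowerCase(Locale.ROOT)) {
            case "read_uncommitted": return Connection.TRANSACTION_READ_UNCOMMITTED;
            case "read_committed":   return Connection.TRANSACTION_READ_COMMITTED;
            case "repeatable_read":  return Connection.TRANSACTION_REPEATABLE_READ;
            case "serializable":     return Connection.TRANSACTION_SERIALIZABLE;
            default:
                throw new IllegalArgumentException("unknown isolationMode: " + mode);
        }
    }
}
```

Under these assumptions, ConfigSketch.workerCount(0, 10) yields 10 workers for ten source files, and ConfigSketch.isGzipped("test.ttl.gz") is true.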
Example Usage
$ git clone https://github.com/OpenLinkSoftware/Virtuoso_Java_Samples
Cloning into 'Virtuoso_Java_Samples'...
remote: Enumerating objects: 505, done.
remote: Counting objects: 100% (505/505), done.
remote: Compressing objects: 100% (260/260), done.
remote: Total 505 (delta 229), reused 442 (delta 179), pack-reused 0
Receiving objects: 100% (505/505), 2.18 MiB | 30.51 MiB/s, done.
Resolving deltas: 100% (229/229), done
$ cd Virtuoso_Java_Samples/Jena/RDFLoader/
$ ls -l
total 36
-rw-rw-r-- 1 ubuntu ubuntu 587 May 20 10:15 build.gradle
-rw-rw-r-- 1 ubuntu ubuntu 1565 May 20 10:15 config.json
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 gradle
-rwxrwxr-x 1 ubuntu ubuntu 5766 May 20 10:15 gradlew
-rw-rw-r-- 1 ubuntu ubuntu 2763 May 20 10:15 gradlew.bat
drwxrwxr-x 2 ubuntu ubuntu 4096 May 20 10:15 lib
-rw-rw-r-- 1 ubuntu ubuntu 1619 May 20 10:15 readme.txt
drwxrwxr-x 3 ubuntu ubuntu 4096 May 20 10:15 src
$ cat config.json
{
"conn": {
"host": "localhost",
"port": 1111,
"uid": "dba",
"pwd": "dba",
"isolationMode": "repeatable_read",
"concurrencyMode": "default",
"batch_size": 5000,
"useAutoCommit": false,
"clear_graph": "test:insert",
"data_dir": ".",
"max_threads": 0
},
"data":
[
{"file": "xaa.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xab.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xac.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xad.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xae.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xaf.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xag.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xah.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xai.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
},
{"file": "xaj.ttl",
"type": "ttl",
"graph": "test:insert",
"clear_graph": false
}
]
}
$ ./gradlew clean build
BUILD SUCCESSFUL in 2s
6 actionable tasks: 6 executed
$ ./gradlew run
> Task :run
===========================================================================
App will use next options
hostname = localhost
port = 1111
UID = dba
PWD = dba
isolation = read_committed
concurrency = optimistic
chunk_size = 20000
useAutoCommit= false
===========================================================================
==[] Start clear graph = test:insert
==[] End clear graph = test:insert
==[Thread-8] Start load data = xai.ttl
==[Thread-3] Start load data = xad.ttl
==[Thread-5] Start load data = xaf.ttl
==[Thread-7] Start load data = xah.ttl
==[Thread-1] Start load data = xab.ttl
==[Thread-0] Start load data = xaa.ttl
==[Thread-9] Start load data = xaj.ttl
==[Thread-6] Start load data = xag.ttl
==[Thread-2] Start load data = xac.ttl
==[Thread-4] Start load data = xae.ttl
==[Thread-1] End load data =
==[Thread-1] DONE =
==[Thread-0] End load data =
==[Thread-0] DONE =
==[Thread-2] End load data =
==[Thread-2] DONE =
==[Thread-7] End load data =
==[Thread-7] DONE =
==[Thread-8] End load data =
==[Thread-8] DONE =
==[Thread-9] End load data =
==[Thread-9] DONE =
==[Thread-4] End load data =
==[Thread-4] DONE =
==[Thread-6] End load data =
==[Thread-6] DONE =
==[Thread-3] End load data =
==[Thread-3] DONE =
==[Thread-5] End load data =
==[Thread-5] DONE =
BUILD SUCCESSFUL in 10m 13s
2 actionable tasks: 1 executed, 1 up-to-date
$