Insert Performance of Virtuoso Jena Provider

Jason · July 11, 2019, 2:48pm

Hi everyone,
I need to insert millions of triples into Virtuoso as quickly as possible. My goal is something like an R2RML implementation based on Spark.
The data in the RDB is first fetched and converted into Jena Triple objects, then VirtGraph sends them to Virtuoso, for example:

// Jena 3 creates nodes through NodeFactory (Node.createURI is the Jena 2 form)
Node foo1 = NodeFactory.createURI("http://example.org/#foo1");
Node bar1 = NodeFactory.createURI("http://example.org/#bar1");
Node baz1 = NodeFactory.createURI("http://example.org/#baz1");
VirtGraph graph = new VirtGraph ("GraphName", url, "dba", "dba");
graph.add(new Triple(foo1, bar1, baz1));

However, the insert speed is not good enough: 400,000 to 500,000 triples per 5 minutes. Our hardware seems to be mostly idle, with 70% of the CPU unused.
Our environment is:
Virtuoso: Virtuoso Open Source Edition v7.2.5.1
JDBC: VirtJDBC4_2
Jena Provider: VirtJena3
Data scale: 1 million rows per table in the RDB are converted into 6 million or more triples per graph.
Ideal speed: 1 million or more triples inserted per 5 minutes

And these are, I think, the relevant parameters in virtuoso.ini:

[Parameters]
MaxClientConnections = 10000
ServerThreads            = 1000
ServerThreadSize         = 500000
MainThreadSize           = 1000000
ThreadCleanupInterval    = 0
ThreadThreshold          = 10
SingleCPU                = 0
ThreadsPerQuery          = 16
AsyncQueueMaxThreads     = 10
NumberOfBuffers          = 5450000
MaxDirtyBuffers          = 4000000
ColumnStore              = 1
[Flags]
enable_mt_txn = 1
enable_mt_transact = 1
enable_qp = 1
qp_thread_min_usec = 100
mp_local_rc_sz = 0
dbf_explain_level = 0
enable_exact_p_stat = 1
hash_join_enable = 2
enable_g_in_sec = 1
qrc_tolerance = 40
dbf_max_itc_samples = 5

By the way, would it be quicker to convert the RDB data into SPARQL Update strings and use the SPARQL protocol?
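For reference, a SPARQL 1.1 Update request carrying a batch of triples could be assembled roughly as below. This is only a sketch to show the shape of such a string. The class and method names are hypothetical, triples are represented as plain String arrays of three IRIs rather than Jena objects, and the IRIs are assumed to be valid already (no escaping, no literals or blank nodes). It says nothing about whether this route is faster than the Jena provider.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one INSERT DATA request for a batch of IRI-only triples.
final class SparqlInsertBuilder {

    /**
     * Each element of triples is {subjectIri, predicateIri, objectIri}.
     * Assumption: every IRI is already valid and contains no '>' character.
     */
    static String insertData(String graphIri, List<String[]> triples) {
        StringBuilder sb = new StringBuilder();
        sb.append("INSERT DATA { GRAPH <").append(graphIri).append("> {\n");
        for (String[] t : triples) {
            sb.append("  <").append(t[0]).append("> <")
              .append(t[1]).append("> <")
              .append(t[2]).append("> .\n");
        }
        sb.append("} }");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> batch = new ArrayList<>();
        batch.add(new String[] {
            "http://example.org/#foo1",
            "http://example.org/#bar1",
            "http://example.org/#baz1"
        });
        // Prints the update string that would be sent to a SPARQL endpoint.
        System.out.println(insertData("http://example.org/GraphName", batch));
    }
}
```

Grouping many triples into one request, rather than sending one request per triple, is the design choice being illustrated; a suitable batch size would have to be found by measurement.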

Thank you very much!

Jason

hwilliams · July 11, 2019, 3:39pm

Have you looked at the Jena 3 VirtuosoSPARQLExample11.java, which uses the newer public Model add(List<Statement> statements) method? On typical commodity-level hardware you should be able to achieve insert/load rates of 30,000 to 40,000 triples per second. I assume your 5-minute rate does not include the time to convert the RDB rows to triples?
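To put the figures in this thread on the same scale, the per-5-minute rates can be converted to triples per second. The small sketch below does only that arithmetic (the class name is hypothetical): 400,000 to 500,000 triples per 5 minutes is about 1,333 to 1,667 triples per second, and the stated target of 1 million per 5 minutes is about 3,333 per second.

```java
// Hypothetical helper: converts the rates quoted in this thread to triples per second.
final class InsertRates {

    /** Triples per second for a given number of triples inserted over a number of minutes. */
    static double triplesPerSecond(long triples, double minutes) {
        return triples / (minutes * 60.0);
    }

    public static void main(String[] args) {
        System.out.printf("reported, low end:  %.0f triples/s%n", triplesPerSecond(400_000, 5));
        System.out.printf("reported, high end: %.0f triples/s%n", triplesPerSecond(500_000, 5));
        System.out.printf("stated target:      %.0f triples/s%n", triplesPerSecond(1_000_000, 5));
    }
}
```

Against a rate of 30,000 to 40,000 triples per second, the reported rate is lower by a factor of roughly 18 to 30.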

The quickest way to load large volumes of RDF data into Virtuoso is to create RDF datasets and load with the Virtuoso RDF Bulk Loader …

Jason · July 11, 2019, 5:01pm

Thank you for the reply.

According to the Jena documentation, only Triples, Quads and Nodes are serializable objects, which is what Spark requires. Spark is a distributed cluster-computing framework, and data (rows and triples) is stored in a distributed dataset (called an RDD), which requires all objects to be serializable.
So the public Model add(List<Statement> statements) method seems hard to use, because Statement is not serializable.
I think the 5-minute rate does not include the time to convert the RDB rows to triples, but I will look into it.
The Virtuoso RDF Bulk Loader was considered when the project started, but it’s very hard to make an RDD generate manageable RDF files to be bulk loaded.

So, are there any parameters or other methods to speed things up on the Virtuoso side?

Thank you very much!
Jason

smalinin · July 12, 2019, 11:12am

The SimpleBulkUpdateHandler interface was removed in Jena 3, so for now you could do the following:

1. create an in-memory Jena Model
2. add your data to this model
3. insert the data from the in-memory model with the VirtModel.add method

Something like:

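import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import virtuoso.jena.driver.VirtModel;

// Collect the triples in an in-memory model first
Model model = ModelFactory.createDefaultModel();
Graph graph = model.asGraph();

Node foo1 = NodeFactory.createURI("http://example.org/#foo1");
Node bar1 = NodeFactory.createURI("http://example.org/#bar1");
Node baz1 = NodeFactory.createURI("http://example.org/#baz1");

Node foo2 = NodeFactory.createURI("http://example.org/#foo2");
Node bar2 = NodeFactory.createURI("http://example.org/#bar2");
Node baz2 = NodeFactory.createURI("http://example.org/#baz2");

// These three nodes are used below, so they are defined here as well
Node foo3 = NodeFactory.createURI("http://example.org/#foo3");
Node bar3 = NodeFactory.createURI("http://example.org/#bar3");
Node baz3 = NodeFactory.createURI("http://example.org/#baz3");

graph.add(new Triple(foo1, bar1, baz1));
graph.add(new Triple(foo2, bar2, baz2));
graph.add(new Triple(foo3, bar3, baz3));
graph.add(new Triple(foo1, bar2, baz2));
graph.add(new Triple(foo1, bar3, baz3));

VirtModel vm = VirtModel.openDatabaseModel("test", "jdbc:virtuoso://localhost:1111", "dba", "dba");

// Send the whole in-memory model to Virtuoso in one call
vm.add(model);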

or similar.

It will work much faster, because a batch insert will be used.
There aren’t any methods for inserting a List of Triples in Jena 3; they were removed from the API.
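With several million triples, one option is to feed the in-memory model approach described above in fixed-size chunks instead of building a single very large model. The sketch below shows only the chunking logic, using the Java standard library. The class name, the method name and the batch size are hypothetical, and the sink is a placeholder for whatever does the actual insert (for example, code that fills a fresh in-memory Model from the chunk and passes it to VirtModel.add).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: drains an iterator in fixed-size chunks.
final class BatchFeeder {

    /**
     * Hands the items to sink in lists of at most batchSize elements
     * and returns the number of lists that were handed over.
     */
    static <T> int feedInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> buffer = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            buffer.add(items.next());
            if (buffer.size() == batchSize) {
                sink.accept(buffer);
                buffer = new ArrayList<>(batchSize); // start a fresh chunk
                batches++;
            }
        }
        if (!buffer.isEmpty()) { // last, possibly shorter, chunk
            sink.accept(buffer);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice the iterator would yield triples.
        Iterator<String> triples = Arrays.asList("t1", "t2", "t3", "t4", "t5").iterator();
        int n = feedInBatches(triples, 2,
                chunk -> System.out.println("would insert " + chunk.size() + " triples"));
        System.out.println(n + " batches");
    }
}
```

Taking an Iterator as input is a deliberate choice here, on the assumption that the triples arrive as a stream (for instance one partition at a time) rather than as a list that already sits in memory.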

Jason · July 13, 2019, 6:40am

Thanks, it got better! 6 million triples inserted in 6 minutes, including conversion.

But I still have a question. I used the code below:

Model model = ModelFactory.createDefaultModel();
model.add(model.asStatement(triple));
VirtModel md = VirtModel.openDatabaseModel(GraphName, url, login, passwd);
md.add(model);

Is there any performance difference between this code and the code above?

Thank you very much!!

smalinin · July 13, 2019, 9:20am

There isn’t any difference.

The call above also adds the data with batch optimization.