Hi everyone,
I need to insert millions of triples into Virtuoso as quickly as possible. My goal is something like an R2RML implementation based on Spark.
The data in the RDB is first fetched and converted into Jena Triple objects, then VirtGraph sends them to Virtuoso, for example:
However, the insert speed is not good enough: 400,000 to 500,000 triples per 5 minutes. Our hardware seems to be mostly sleeping, with 70% of the CPU idle.
Our environment is:
Virtuoso: Virtuoso Open Source Edition v7.2.5.1
JDBC: VirtJDBC4_2
Jena Provider: VirtJena3
Data scale: 1 million rows per table in the RDB are converted into 6 million or more triples per graph.
Ideal speed: 1 million or more triples inserted per 5 minutes
And these are the related parameters in virtuoso.ini, I think:
Have you looked at the Jena 3 VirtuosoSPARQLExample11.java, which uses the newer public Model add(List<Statement> statements) method? On typical commodity-level hardware you should be able to achieve insert/load rates of 30 to 40 KTriples per second. I assume your 5 minute rate does not include the time to convert the RDB rows to triples?
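To put the two figures in the same unit, here is a small sketch that converts "triples per 5 minutes" into triples per second. The class and method names (InsertRate, triplesPerSecond) are made up for illustration; the input numbers are the ones quoted in this thread.

```java
public class InsertRate {
    // Convert "N triples per M minutes" into triples per second.
    static double triplesPerSecond(long triples, double minutes) {
        return triples / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // Observed range reported above: 400,000 to 500,000 triples per 5 minutes.
        System.out.printf("observed low:  %.0f triples/s%n", triplesPerSecond(400_000, 5));
        System.out.printf("observed high: %.0f triples/s%n", triplesPerSecond(500_000, 5));
        // Target stated above: 1 million triples per 5 minutes.
        System.out.printf("target:        %.0f triples/s%n", triplesPerSecond(1_000_000, 5));
    }
}
```

So the observed rate is roughly 1,300 to 1,700 triples per second and the target is about 3,300 per second, both well below the 30 to 40 KTriples per second mentioned above.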
The quickest way to load large volumes of RDF data into Virtuoso is to create RDF datasets and load with the Virtuoso RDF Bulk Loader …
According to the Jena documentation, only Triples, Quads and Nodes are serializable, which is what Spark requires. Spark is a distributed cluster-computing framework, and the data (rows and triples) is stored in a distributed dataset (called an RDD), which requires all objects to be serializable.
So the public Model add(List<Statement> statements) method seems hard to use, because Statement is not serializable.
I think the 5 minute rate does not include the time to convert the RDB rows to triples, but I will look into it.
The Virtuoso RDF Bulk Loader was considered when the project started, but it's very hard to make the RDD generate manageable RDF files for bulk loading.
So, are there any parameters or other methods to speed this up on the Virtuoso side?
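One possible way around this, as a rough sketch only: keep the serializable Triples in the RDD and do both the conversion to Statement and the insert inside each partition on the executor (for example in foreachPartition), so that a Statement never has to be shipped between nodes. The helper below is plain Java with no Spark, Jena or Virtuoso calls in it. PartitionBatcher, forEachBatch and the sink are hypothetical names; the sink marks the place where a real job would convert each batch (Jena's Model.asStatement(Triple) might fit) and call add(List<Statement>). The batch size is an assumption to be tuned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class PartitionBatcher {
    // Drain one partition's iterator in fixed-size batches and hand each
    // batch to the sink. Returns the number of batches delivered.
    static <T> int forEachBatch(Iterator<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> buffer = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            buffer.add(items.next());
            if (buffer.size() == batchSize) {
                sink.accept(buffer);
                buffer = new ArrayList<>(batchSize); // fresh list, the sink may keep the old one
                batches++;
            }
        }
        if (!buffer.isEmpty()) { // final partial batch
            sink.accept(buffer);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Strings stand in for Triples; the sink stands in for the
        // convert-and-insert step (hypothetical, not shown here).
        Iterator<String> partition = Arrays.asList("t1", "t2", "t3", "t4", "t5").iterator();
        int n = forEachBatch(partition, 2, batch -> System.out.println("insert " + batch));
        System.out.println(n + " batches");
    }
}
```

Since the non-serializable objects are only ever created inside the sink, nothing but Triples has to cross the Spark serialization boundary.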
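For what it's worth, one way the files might be kept manageable is to write one N-Triples file per RDD partition. The sketch below is plain Java and only an illustration under assumptions: NTriplesPartitionWriter and its methods are hypothetical names, each row is assumed to be a {subject IRI, predicate IRI, plain literal} string array, and the output directory is assumed to be one the Virtuoso server can read (the bulk loader only reads from directories listed in DirsAllowed). Typed literals, language tags, blank nodes and IRI validation are left out.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

public class NTriplesPartitionWriter {
    // Escape the characters that may not appear raw inside an N-Triples literal.
    static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Format one triple whose object is a plain string literal.
    static String toLine(String subjectIri, String predicateIri, String literal) {
        return "<" + subjectIri + "> <" + predicateIri + "> \"" + escapeLiteral(literal) + "\" .";
    }

    // Write all rows of one partition to <dir>/part-<index>.nt and return the file.
    static Path writePartition(Path dir, int partitionIndex, Iterator<String[]> rows)
            throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve("part-" + partitionIndex + ".nt");
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            while (rows.hasNext()) {
                String[] r = rows.next(); // {subject, predicate, literal}
                out.write(toLine(r[0], r[1], r[2]));
                out.write('\n');
            }
        }
        return file;
    }
}
```

With one file per partition, the number and size of the files follow the RDD's partitioning, so repartitioning before the write would be the knob for controlling them.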