Virtuoso bulk loader and database problem

virtuoso
bulkload
java
jena

#1

I am new to Virtuoso, and I ran into many problems while reading its chaotic documentation and implementing some code in Java.

I finally managed to write an RDF loader, but I see in the documentation that there is a special bulk loader, which I do not know how to use from Java.

Here is my code, which works fine:

 VirtDataset virtuosoDataset = new VirtDataset(virtuosoLocation, virtuosoUser, virtuosoPass);      
 virtuosoDataset.begin(ReadWrite.WRITE);
 Model model = virtuosoDataset.getDefaultModel();
 model.read(reader, ConfigurationFile.getProperty(JENAXMLBASE));
 virtuosoDataset.commit();
 virtuosoDataset.close();

When I query Virtuoso, I can retrieve the objects that were inserted.

My questions are:

  1. Can I see the imported model somehow in the Virtuoso GUI at http://localhost:8890/conductor?

  2. Is this the right way to bulk load objects into Virtuoso?


#2

Thank you for bringing this here.

As I said on your original post on StackOverflow:

For a number of reasons, the RDF Bulk Loader is the optimal method. Other means may be used, but will be less efficient in many ways.

Your follow-up comment there will help others to assist you here:

I am using VirtDataset(localhost:1111), so it loads the model into some database. Where can I see the imported model on localhost:8890/conductor?

Also, about the RDF Bulk Loader, can you give me an example of an implementation in Java?


#3

So can you send me an example of the RDF Bulk Loader used from Java?


#4

The best way to bulk load RDF data into Virtuoso is via the SQL Stored Procedures provided in the Bulk Loader Documentation.

Conceptually, you are performing the following steps (via the ISQL command-line or variant in the HTML-based Conductor UI):

  1. Identify the location of RDF documents to be uploaded – this MUST be a folder (directory) listed as a value of the “DirsAllowed” INI setting

  2. Determine a destination Named Graph IRI that identifies the internal (Virtuoso hosted) target document into which you plan to load this RDF data

  3. Direct the Bulk Loader to the source directory comprising RDF documents that you want to schedule for loading, an RDF document selection pattern or specific name (if a single file), and a target Named Graph IRI

  4. Verify that the Bulk Loader successfully registered the target RDF documents

  5. Run the Bulk Loader

  6. Verify that the Bulk Loading process was successful

Given the following –

  • a directory named /tmp/data/
  • RDF-Turtle documents with the extension .ttl have been copied to the directory above
  • Virtuoso Named Graph IRI file:tmp:data is the target of the bulk upload

– here are the steps you would perform, using the following “;”-separated SQL Stored Procedure calls:

  1. LD_DIR ('/tmp/data/', '*.ttl', 'file:tmp:data') ;

  2. SELECT * from DB.DBA.load_list ; – where ll_state value 0 indicates an item hasn’t been loaded

  3. RDF_LOADER_RUN() ;

  4. SPARQL SELECT COUNT (*) FROM <file:tmp:data> WHERE {?s ?p ?o} ;

  5. SPARQL SELECT (SAMPLE (?s) AS ?sample) (COUNT (*) AS ?count) ?o FROM <file:tmp:data> WHERE {?s a ?o} GROUP BY ?o ; – to get a quick analysis of entities and entity types associated with the Named Graph <file:tmp:data>

  6. Done.
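For readers who want to drive these steps from Java rather than from ISQL, here is a minimal sketch that sends the same calls over a JDBC connection. It is an illustration under assumptions, not a documented recipe: it assumes the Virtuoso JDBC driver (virtuoso.jdbc4.Driver) is on the classpath, that the connection URL has the form jdbc:virtuoso://localhost:1111, and that the loader procedures can be executed as plain statement text in the same way ISQL accepts them. The class and method names (BulkLoadSketch, loaderCalls, bulkLoad) are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

/** Sketch: run the bulk loader calls from the list above over JDBC (names are hypothetical). */
final class BulkLoadSketch {

    /** Wraps a value in single quotes, doubling any embedded single quote. */
    static String sqlLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds the registration call (step 1) and the loader run (step 3). */
    static List<String> loaderCalls(String dir, String filePattern, String graphIri) {
        return List.of(
            "LD_DIR (" + sqlLiteral(dir) + ", " + sqlLiteral(filePattern) + ", "
                + sqlLiteral(graphIri) + ")",
            "RDF_LOADER_RUN ()");
    }

    /**
     * Registers the files, runs the loader, and returns the triple count of the
     * target graph (step 4). Assumption: the server accepts these calls as plain
     * statement text over JDBC, as it does in ISQL.
     */
    static long bulkLoad(String jdbcUrl, String user, String password,
                         String dir, String filePattern, String graphIri) throws SQLException {
        if (graphIri.contains(">")) {
            throw new IllegalArgumentException("graph IRI must not contain '>'");
        }
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement()) {
            for (String call : loaderCalls(dir, filePattern, graphIri)) {
                stmt.execute(call);
            }
            try (ResultSet rs = stmt.executeQuery(
                    "SPARQL SELECT COUNT (*) FROM <" + graphIri + "> WHERE {?s ?p ?o}")) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

With the values used above, this would be called as BulkLoadSketch.bulkLoad("jdbc:virtuoso://localhost:1111", "dba", "dba", "/tmp/data/", "*.ttl", "file:tmp:data"). The directory still has to be listed in DirsAllowed, because the server, not the Java client, reads the files.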


#5

If you must do this in Java you will simply need to invoke the ISQL command-line interpreter from your Java app.

Template (from iSQL docs):
isql [port] [username] [pwd] VERBOSE=OFF 'EXEC={semi-colon-separated-commands}' {additional-file1-with-sql-commands}.sql {additional-file2-with-sql-commands}.sql -i arg1 arg2

Example (everything happens via EXEC()):
isql 1111 dba dba "EXEC=LD_DIR ('/tmp/data/', '*.ttl', 'file:tmp:data'); SELECT * FROM DB.DBA.load_list; RDF_LOADER_RUN (); SPARQL SELECT (SAMPLE (?s) AS ?sample) (COUNT (*) AS ?count) ?o FROM <file:tmp:data> WHERE {?s a ?o} GROUP BY ?o;"
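
Invoking ISQL from a Java app could look roughly like the sketch below, which uses ProcessBuilder from the standard library. The class and method names (IsqlRunnerSketch, isqlCommand, run) are hypothetical, and the path to the isql binary is an assumption that depends on your installation. Because ProcessBuilder passes each argument to the process directly, the whole EXEC= string is one argument and needs no shell quoting.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: launch the isql command-line tool from Java (names are hypothetical). */
final class IsqlRunnerSketch {

    /** Builds the argument list; all statements go into a single EXEC= argument. */
    static List<String> isqlCommand(String isqlPath, int port, String user, String password,
                                    List<String> statements) {
        List<String> command = new ArrayList<>();
        command.add(isqlPath);
        command.add(Integer.toString(port));
        command.add(user);
        command.add(password);
        command.add("VERBOSE=OFF");
        command.add("EXEC=" + String.join("; ", statements) + ";");
        return command;
    }

    /** Starts isql, forwards its output to this process, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

For example, IsqlRunnerSketch.run(IsqlRunnerSketch.isqlCommand("isql", 1111, "dba", "dba", List.of("LD_DIR ('/tmp/data/', '*.ttl', 'file:tmp:data')", "RDF_LOADER_RUN ()"))) would issue the registration and the loader run in one isql invocation.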


#6

The Java code using the Jena API is OK.

It is the proper way if you want to load data via the Jena API, but it may not be very fast for large datasets.

P.S. The Virtuoso Jena provider sends data in chunks of 5000 triples by default. This can be changed via

model.setBatchSize(chunk_size);

Also, loading is slower inside a transaction and requires more server resources (for transaction logging).


#7

If I set BatchSize and remove transactions, will the loading be faster?
I am using a Jena Model, so do I need to change it to VirtModel, which needs a VirtGraph rather than a VirtDataset as in my code, and then set the BatchSize for the loader to be properly configured?

About the ISQL command line: I won't use that in my Java code. I thought there was some way to use the bulk loader from Java without calling ISQL.

Nobody answered my question about the Virtuoso GUI. I still cannot find where to see the imported model in localhost:8890/conductor.


#8

Virtuoso is an RDF Quad Store so your “Model” ultimately equates to an RDF Graph in Virtuoso, which can be seen from the Linked Data → Graphs → Graphs tab of the Conductor …


#9

If you remove transactions, it will work faster. This is true of any DBMS.

The default BatchSize is already 5000. That is enough for many cases, but you could increase it and compare performance in your environment.


#10

Your code with setBatchSize() will look like:

VirtDataset virtuosoDataset = new VirtDataset(virtuosoLocation, virtuosoUser, virtuosoPass);      
 virtuosoDataset.begin(ReadWrite.WRITE);
 VirtModel model = virtuosoDataset.getDefaultModel();
 model.setBatchSize(6000);
 model.read(reader, ConfigurationFile.getProperty(JENAXMLBASE));
 virtuosoDataset.commit();
 virtuosoDataset.close();

The VirtModel class also has a setBatchSize() method.


#11

I tried with VirtModel and batchSize, but I cannot see any drastic difference. When I try to import a file that is larger than 1 GB, I get this error:

SR325: Transaction aborted because its log after image size went above the limit

One question about deleting objects from Virtuoso. I tried all of these:

 virtuosoDataset.clear();
//    model.removeAll();
//    model.remove(model);

and clear() works: it removes the objects from Virtuoso, but the virtuoso.db file is still 3 GB. Is there some other way to remove the objects from Virtuoso and also shrink the virtuoso.db file?

I also want to ask what the average time is to create one object in Virtuoso, because I need to create 200 objects in Virtuoso in less than 30 ms.


#12

The “transaction aborted” message results from an INI setting being too low for your current bulk-loading activity – but note that the current setting may be appropriate for later ongoing activity. See the manual’s discussion of TransactionAfterImageLimit

  • TransactionAfterImageLimit = N bytes, default 50000000. When the roll-forward log entry of a transaction exceeds this size, the transaction is too large and is marked as uncommittable. This works as an upper limit on otherwise unbounded transactions. The default is 50 MB. Also note that transaction roll-back data takes about 2x the space of roll-forward data. Hence, when the transaction roll-forward data is 50 MB, the total transient consumption is closer to 150 MB.
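
The arithmetic in that manual excerpt can be written out as a small helper, which may be handy when deciding how far to raise the limit for a one-off load. This is only a restatement of the rule of thumb quoted above (roll-back data at about twice the roll-forward size); the class and method names are hypothetical, and the real figure will vary.

```java
/** Sketch: rough transient space for one transaction, per the 2x rule quoted above. */
final class AfterImageEstimate {

    /** Roll-forward bytes plus roll-back bytes at roughly twice that size. */
    static long transientBytes(long rollForwardBytes) {
        long rollBackBytes = 2L * rollForwardBytes;
        return rollForwardBytes + rollBackBytes;
    }
}
```

For the default limit, transientBytes(50_000_000L) gives 150,000,000 bytes, which matches the 150 MB figure in the manual text.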

Deleting data from the DB does not immediately free the disk space previously occupied by that data. Virtuoso has an auto-compaction feature which will eventually free space. New data will be loaded into the space previously occupied by deleted data, but of course this reuse will be imperfect. You may be able to free some space by running a CHECKPOINT; or the DB..VACUUM (); procedure (depending on your workflow, it may make sense to run CHECKPOINT; and DB..VACUUM (); twice or three times in a cycle). You can also do a backup-dump-and-reload to immediately reclaim disk – though you must temporarily consume substantially more disk space during this process.
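
If you would rather trigger that cycle from your Java app than from ISQL, a sketch along these lines may work. It assumes you already hold a java.sql.Connection to Virtuoso with auto-commit left on, and that CHECKPOINT and DB..VACUUM () are accepted as plain statement text over JDBC, as they are in ISQL. The class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Sketch: repeat CHECKPOINT and DB..VACUUM () a few times (names are hypothetical). */
final class CompactionSketch {

    /** Builds the statement sequence for the requested number of cycles. */
    static List<String> compactionStatements(int cycles) {
        List<String> statements = new ArrayList<>();
        for (int i = 0; i < cycles; i++) {
            statements.add("CHECKPOINT");
            statements.add("DB..VACUUM ()");
        }
        return statements;
    }

    /** Executes the cycle on an existing connection; the caller owns the connection. */
    static void compact(Connection conn, int cycles) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : compactionStatements(cycles)) {
                stmt.execute(sql);
            }
        }
    }
}
```

Calling CompactionSketch.compact(conn, 2) corresponds to running the pair twice, as suggested above. Whether this frees a noticeable amount of space depends on your data and workflow.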


I would encourage you to describe what you’re really trying to achieve – both starting points and desired end results – and to define your terms as you go.

You have expressed a wish to “create 200 objects in Virtuoso for less than 30 ms” (by which I think you mean “≤ 30 ms total to create 200 objects”). However, it is not clear if those “objects” would be triples of a few dozen or hundred bytes each, for which your wish should be easily granted, or if those “objects” are graphs of thousands or millions of triples and multiple GB each, for which your wish might not be so easily granted, if at all, especially if everything is moving over the network and not moving between disk and RAM and disk on a single box…

As is true of all software, the speed of all Virtuoso activity is impacted by infrastructure between action points, and by the resources Virtuoso has to work with on its local host(s). Proper tuning matters a great deal – such as ensuring that Virtuoso knows how much memory it should use for active work, and leaving enough free RAM and disk for task-specific activities.

You might benefit by exploring some of the benchmarks in active use –


#13

Thank you for your answer. Adjusting the TransactionAfterImageLimit property fixed my problem.

About the 200 objects that need to be inserted into Virtuoso in less than 30 ms, this is the structure of one object:

 <cim:Substation rdf:ID="_0425e670-fcbd-11e6-835d-f0def1611578">
   <cim:IdentifiedObject.name>RTP Domžale</cim:IdentifiedObject.name>
   <cim:Substation.Region rdf:resource="#_052259c0-fcbb-11e6-835d-f0def1611578"/>
 </cim:Substation>

I want to ask another question. By default, I am now using the Quad Store DB. Is there an option to save RDF triples with a SPARQL query into the SQL database? I saw a converter in the documentation, but I did not find an example in Java. I can connect to the DB with virtuoso.jdbc4.Driver, but I don't know how triples can be saved in SQL (should I create separate tables, etc., or will Virtuoso handle that just by changing the connection driver?).


#14

I am glad to hear that adjusting the TransactionAfterImageLimit resolved your data load issue.

For many reasons, it is best to keep each topic focused on one question or issue. Your new questions (about insertion speeds, and about mixing RDF Graphs with SQL Tables) would each be better raised in a new thread/question/topic.


#15

Okay, sorry. I created two more questions. Thank you.