HowTo - Bulk Loading RDF into a Virtuoso Single-Server Instance

What

Bulk loading large RDF datasets into a Virtuoso single-server instance as quickly as possible.

Why

Large RDF datasets (billions of triples) can take a significant time to load on commodity hardware. With a Virtuoso single-server instance it is therefore important that bulk loading makes full use of the machine's resources, particularly the available CPU cores and RAM.

How

This can be achieved by running multiple rdf_loader_run() processes in parallel on the Virtuoso single-server instance, using the Virtuoso RDF Bulk Loader procedure as follows.

  1. Copy all the RDF datasets to be loaded to a location on the machine hosting the Virtuoso instance that the server can read. A local directory or a mounted share (e.g., NFS) can be used.
  2. Run the ld_dir() or ld_dir_all() command to register the dataset files in the load_list table for loading.
  3. Once the files are registered, rdf_loader_run() can be run several times in parallel. A maximum of (number of cores) / 2.5 concurrent loaders is recommended, to parallelize the data load and maximize load speed. A typical script (e.g., bulk_load.sh) that can be run from the command line might look like:
# Before loading: call http_lock() and turn off automatic checkpoints and the scheduler
isql 1111 dba <pwd> exec="http_lock();" &
isql 1111 dba <pwd> exec="checkpoint_interval(0);" &
isql 1111 dba <pwd> exec="scheduler_interval(0);" &

# Start six loader processes in parallel
isql 1111 dba <pwd> exec="rdf_loader_run();" &
isql 1111 dba <pwd> exec="rdf_loader_run();" &
isql 1111 dba <pwd> exec="rdf_loader_run();" &
isql 1111 dba <pwd> exec="rdf_loader_run();" &
isql 1111 dba <pwd> exec="rdf_loader_run();" &
isql 1111 dba <pwd> exec="rdf_loader_run();" &

# Wait for all background jobs to finish
wait

# After loading: checkpoint, restore the intervals, and call http_unlock()
isql 1111 dba <pwd> exec="checkpoint;" &
isql 1111 dba <pwd> exec="checkpoint_interval(60);" &
isql 1111 dba <pwd> exec="scheduler_interval(10);" &
isql 1111 dba <pwd> exec="http_unlock();" &

The sample script above assumes the server has 15 CPU cores available (6 * 2.5) and therefore starts six rdf_loader_run() processes on the single-server instance. Refer to the Virtuoso RDF Bulk Loader documentation for additional details on usage.
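
For machines with a different core count, the registration step and the number of loaders can be derived rather than hard-coded. The following is a minimal dry-run sketch, not a definitive implementation: it only prints the isql commands it would issue. The helper names loader_count and print_load_plan, the directory /data/rdf, the file mask *.nt, and the graph IRI http://example.org/graph are all illustrative assumptions; ld_dir() is called in its three-argument form (directory, file mask, target graph), and the directory is assumed to be readable by the server (see the DirsAllowed setting in the bulk loader documentation).

```shell
#!/bin/sh
# Dry-run sketch: prints the isql commands instead of executing them.
# ISQL_ARGS and all paths/IRIs below are placeholders, not values from the article.
ISQL_ARGS='1111 dba <pwd>'

# Number of rdf_loader_run() processes for a given core count, following the
# cores / 2.5 guideline (integer form: cores * 2 / 5), with a minimum of one.
loader_count() {
    n=$(( $1 * 2 / 5 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Print the ld_dir() registration command, then one background
# rdf_loader_run() command per loader, then "wait".
# Arguments: core count, data directory, file mask, target graph IRI.
print_load_plan() {
    printf "isql %s exec=\"ld_dir('%s', '%s', '%s');\"\n" "$ISQL_ARGS" "$2" "$3" "$4"
    count=$(loader_count "$1")
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'isql %s exec="rdf_loader_run();" &\n' "$ISQL_ARGS"
        i=$(( i + 1 ))
    done
    echo "wait"
}

# Example: plan for a 15-core machine (hypothetical directory and graph).
print_load_plan 15 /data/rdf '*.nt' http://example.org/graph
```

On Linux the core count could be taken from nproc instead of being passed by hand, e.g. print_load_plan "$(nproc)" ... (nproc is part of GNU coreutils and may not exist on other platforms). Load progress can then be checked by querying the DB.DBA.load_list table.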
