HowTo - Bulk Loading RDF into a Virtuoso Share-Nothing Cluster Instance

What

Bulk loading large RDF datasets into a Virtuoso Elastic scale-out Cluster at optimal load speed.

Why

Large RDF datasets (billions of triples) can take a significant time to load on commodity hardware. With an Elastic scale-out cluster, it is therefore important that the additional resources provided by horizontal scaling are used to bulk load data with maximum platform utilisation, particularly of the available CPUs and RAM.

How

This can be achieved by performing bulk loading on all the nodes of the Virtuoso Elastic Cluster, using the Virtuoso RDF Bulk Loader, as follows.

  1. Copy all the RDF datasets to be loaded to the same location on each machine that hosts a Virtuoso cluster instance node. A shared NFS mount is ideal; otherwise, place physical copies on each machine.
  2. Run the ld_dir() or ld_dir_all() command to register the dataset files in the load_list table for loading.
  3. Once the files are registered for loading, the rdf_ld_srv() function can be run multiple times to parallelize the data load and hence maximize load speed; the recommended maximum is the number of cores / 2.5. Use the cluster cl_exec() function to run an instance of the rdf_loader_run() function on each node of the cluster. A typical script (e.g., bulk_load.sh) that can be run from the command line on one of the cluster nodes (typically the master) might look like:
isql 1111 dba <pwd> exec="cl_exec('http_lock()');" &
isql 1111 dba <pwd> exec="cl_exec('checkpoint_interval(0)');" &
isql 1111 dba <pwd> exec="cl_exec('scheduler_interval(0)');" &

isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" & 
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" & 
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" & 
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" & 
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" & 
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &  

wait 

isql 1111 dba <pwd> exec="cl_exec('checkpoint');" &
isql 1111 dba <pwd> exec="cl_exec('checkpoint_interval(60)');" &
isql 1111 dba <pwd> exec="cl_exec('scheduler_interval(10)');" &
isql 1111 dba <pwd> exec="cl_exec('http_unlock()');" &

The sample script above assumes each cluster node has 15 CPUs (6 * 2.5) available, and uses the cl_exec('rdf_ld_srv()') command to start an rdf_loader_run() function process on each node of the cluster. Refer to the Virtuoso RDF Bulk Loader documentation for additional details on usage.
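
The number of loader lines in the script follows from the cores / 2.5 guideline, so it can be derived rather than hard-coded. The following is a minimal sketch, not part of the Virtuoso tooling: the function names loader_count and loader_commands are hypothetical, the core count is passed in by hand, and the script only prints the isql lines (a dry run) instead of executing them. The port, user and <pwd> placeholder are copied unchanged from the sample script.

```shell
#!/bin/sh
# Hypothetical helper: derive the number of loaders from the core count
# using the cores / 2.5 guideline. Shell arithmetic is integer-only, so
# cores / 2.5 is computed as cores * 2 / 5 (rounded down).
loader_count() {
    cores=$1
    n=$(( cores * 2 / 5 ))
    # Always start at least one loader, even on very small machines.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Hypothetical helper: print (not run) one isql line per loader, in the
# same form as the sample script. <pwd> stays a placeholder.
loader_commands() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "isql 1111 dba <pwd> exec=\"cl_exec('rdf_ld_srv()');\" &"
        i=$(( i + 1 ))
    done
}

# Dry run for the 15-CPU nodes assumed above: prints six loader lines.
loader_commands "$(loader_count 15)"
```

Printing the commands first makes it easy to review the generated lines before pasting them into bulk_load.sh between the settings calls and the wait.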

Related