What
Bulk loading large RDF datasets into a Virtuoso Elastic scale-out Cluster at optimal load speed.
Why
Large RDF datasets (billions of triples) can take a significant amount of time to load on commodity hardware. With an Elastic scale-out cluster, it is therefore important to ensure that the additional resources made available by horizontal scaling, particularly CPUs and RAM, are fully utilised during bulk loading.
How
This can be achieved by performing bulk loading on all nodes of the Virtuoso Elastic Cluster, using the Virtuoso RDF Bulk Loader as follows.
- Copy all the RDF datasets to be loaded to the same location on each machine that hosts a Virtuoso cluster node. A shared NFS mount is ideal; otherwise, place a physical copy on each machine.
- Run the ld_dir() or ld_dir_all() command to register the dataset files in the load_list table for loading.
- Once the files are registered for loading, the rdf_ld_srv() function can be run multiple times to parallelize the data load and hence maximize load speed; a maximum of (number of cores / 2.5) instances is recommended. Use the cluster cl_exec() function to run an instance of the rdf_loader_run() function on each node of the cluster. A typical script (e.g., bulk_load.sh) that can be run from the command line on one of the cluster nodes (typically the master) might look like:
isql 1111 dba <pwd> exec="cl_exec('http_lock()');" &
isql 1111 dba <pwd> exec="cl_exec('checkpoint_interval(0)');" &
isql 1111 dba <pwd> exec="cl_exec('scheduler_interval(0)');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
isql 1111 dba <pwd> exec="cl_exec('rdf_ld_srv()');" &
wait
isql 1111 dba <pwd> exec="cl_exec('checkpoint');" &
isql 1111 dba <pwd> exec="cl_exec('checkpoint_interval(60)');" &
isql 1111 dba <pwd> exec="cl_exec('scheduler_interval(10)');" &
isql 1111 dba <pwd> exec="cl_exec('http_unlock()');" &
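The registration step that precedes a script like this one can be sketched in the same style. This is a minimal sketch, not a definitive procedure: the data directory, file mask, and graph IRI are hypothetical placeholders, and the commands are printed rather than executed so they can be reviewed first. The status query assumes the ll_state column of DB.DBA.load_list as described in the RDF Bulk Loader documentation (0 = to be loaded, 1 = in progress, 2 = done).

```shell
#!/bin/sh
# Sketch: print the isql commands that register files for bulk loading
# and check load progress. All values below are placeholders (assumptions).
PORT=1111
DATA_DIR="/data/rdf"                  # hypothetical: same path on every node
FILE_MASK="*.nt.gz"                   # hypothetical file mask
GRAPH_IRI="http://example.org/graph"  # hypothetical target graph

# Compose the ld_dir(path, mask, graph) call as a SQL string.
# Arguments stay quoted so the shell never expands the file mask.
register_sql() {
    printf "ld_dir('%s', '%s', '%s');" "$1" "$2" "$3"
}

SQL=$(register_sql "$DATA_DIR" "$FILE_MASK" "$GRAPH_IRI")

# Print the registration command for review instead of running it.
echo "isql $PORT dba <pwd> exec=\"$SQL\""

# Print a progress query: counts of files per load state.
echo "isql $PORT dba <pwd> exec=\"select ll_state, count(*) from DB.DBA.load_list group by ll_state;\""
```

Replace the placeholders and <pwd> before running the printed commands; ld_dir_all() takes the same arguments and also descends into subdirectories.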
The sample bulk_load.sh script above assumes each cluster node has 15 CPU cores available (6 loaders * 2.5) and uses the cl_exec('rdf_ld_srv()') command to start an rdf_loader_run() process on each node of the cluster, six times over. Refer to the Virtuoso RDF Bulk Loader documentation for additional details on usage.
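The cores / 2.5 rule can also be computed rather than hard-coded. The following is a rough sketch under stated assumptions: every node is assumed to have the same core count as the machine the script runs on, getconf _NPROCESSORS_ONLN is assumed to be available (it is common but not guaranteed), and the helper name loaders_for_cores is made up for illustration. The loader commands are printed, not executed.

```shell
#!/bin/sh
# Sketch: derive the loader count from the core count (cores / 2.5,
# rounded down) and print one cl_exec('rdf_ld_srv()') command per loader.
PORT=1111

# cores / 2.5 equals cores * 2 / 5; shell integer arithmetic rounds down.
loaders_for_cores() {
    echo $(( $1 * 2 / 5 ))
}

# Assumption: all nodes have the same core count as this machine.
# Fall back to 1 core if getconf is unavailable.
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
LOADERS=$(loaders_for_cores "$CORES")

# Always start at least one loader, even on very small machines.
if [ "$LOADERS" -lt 1 ]; then
    LOADERS=1
fi

# Print the background loader commands followed by the wait barrier.
i=0
while [ "$i" -lt "$LOADERS" ]; do
    echo "isql $PORT dba <pwd> exec=\"cl_exec('rdf_ld_srv()');\" &"
    i=$((i + 1))
done
echo "wait"
```

For a 15-core node, loaders_for_cores 15 yields 6, which matches the six loader lines in the sample script.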
Related
- Virtuoso RDF Bulk Loader
- HowTo - Bulk Loading RDF into a Virtuoso Single-Server Instance
- Virtuoso Functions Guide & Reference
- Virtuoso LD Meter RDF load rate monitor