How to link Virtuoso distributed version to Hadoop

d.selma · December 2, 2019, 10:33am

I have a cluster of 4 nodes, on which I installed Hadoop + Spark (GraphX)…

Now I have to process a big RDF dataset. My question is: can I install Virtuoso on the cluster to store this RDF dataset and execute distributed SPARQL queries?

I also need a web endpoint where users can submit their SPARQL queries.

In other words: is Virtuoso a good solution that works in a Hadoop cluster and can use Spark to execute the distributed queries?

Is there any free access to the distributed version for academic users?

hwilliams · December 2, 2019, 4:55pm

Virtuoso does not support Hadoop storage (HDFS, etc.), as it uses its own storage system. Nor does Virtuoso support Spark, so it cannot be used as you describe.

What are these “big RDF datasets” you are seeking to host in terms of number of triples? What sort of distributed queries would you be executing against them that can’t be run using the Virtuoso SPARQL query engine?

TallTed · December 13, 2019, 3:31pm

(also asked on Stack Overflow)

The Apache Spark website indicates that Spark SQL can be used to query across JDBC and JSON data sources:

DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.

Virtuoso (both Open Source and Enterprise Edition) can deliver SPARQL results as JSON serializations, so that is an option.
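As a rough illustration of the JSON route, here is a minimal Python sketch that sends a query to a SPARQL endpoint and flattens the standard SPARQL JSON results format into plain rows. The endpoint URL is an assumption (port 8890 and the `/sparql` path are common Virtuoso defaults, but your installation may differ), and the function names are hypothetical helpers, not part of any Virtuoso or Spark API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; adjust host, port, and path to match your installation.
ENDPOINT = "http://localhost:8890/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(doc):
    """Flatten a SPARQL JSON results document into a list of dicts.

    Variables that are unbound in a given solution are mapped to None.
    """
    variables = doc["head"]["vars"]
    return [
        {v: binding.get(v, {}).get("value") for v in variables}
        for binding in doc["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query to the endpoint and return the flattened rows."""
    with urlopen(build_request(query, endpoint)) as resp:
        return rows_from_results(json.load(resp))
```

The resulting list of dicts could then be handed to Spark, for example through `spark.createDataFrame` with an explicit schema, though for large result sets it would probably make more sense to page through results or use the JDBC route.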

We also provide JDBC drivers for Virtuoso (again, both Open Source and Enterprise Edition), so that is also an option.
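For the JDBC route, a PySpark read might look roughly like the sketch below. Treat every specific here as an assumption to verify against your own setup: the `jdbc:virtuoso://host:port` URL form, the `virtuoso.jdbc4.Driver` class name (which depends on which Virtuoso JDBC jar you put on Spark's classpath), the `dba` account, and the idea of prefixing the statement with `SPARQL` so that Virtuoso's SQL layer runs it as a SPARQL query. This has not been tested against a live Virtuoso instance, and the helper names are hypothetical.

```python
def virtuoso_jdbc_options(host, port, user, password, sparql):
    """Build Spark JDBC reader options for a Virtuoso server (assumed values)."""
    return {
        # Assumed URL shape; 1111 is the usual Virtuoso SQL port.
        "url": f"jdbc:virtuoso://{host}:{port}",
        # Assumed driver class; it must match the JDBC jar on the classpath.
        "driver": "virtuoso.jdbc4.Driver",
        "user": user,
        "password": password,
        # Assumption: a "SPARQL" prefix makes Virtuoso run this as SPARQL.
        "query": "SPARQL " + sparql,
    }


def load_dataframe(spark, options):
    """Load the query result as a DataFrame via Spark's JDBC data source."""
    return spark.read.format("jdbc").options(**options).load()


if __name__ == "__main__":
    opts = virtuoso_jdbc_options(
        "localhost", 1111, "dba", "dba",
        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
    )
    print(opts["url"])
```

Note that Spark wraps whatever is passed in `query` as a subquery when it resolves the schema, so whether a given SPARQL statement survives that wrapping is something to check on your own installation.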

We are not Apache Spark experts, so we cannot provide much guidance for getting these working beyond assisting with Virtuoso JDBC URLs and/or retrieving SPARQL query results in JSON serialization.

In the other direction, Virtuoso (Enterprise Edition; not Open Source Edition) can be used to query against external ODBC data sources, and there are ODBC drivers available for Hadoop/Spark data sources, so this is also an option.

We are not Apache Spark experts, so we cannot provide much guidance for getting their drivers working, but once you have a functional DSN, we can assist in getting Virtuoso connected to it.