(also asked on Stack Overflow)
The Apache Spark website indicates that Spark SQL can be used to query across JDBC and JSON data sources:
"DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources."
Virtuoso (both Open Source and Enterprise Edition) can deliver SPARQL results as JSON serializations, so that is an option.
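As a rough sketch of the JSON route, the Python below builds a request URL for SPARQL results in JSON and flattens a results document (in the W3C SPARQL Query Results JSON layout) into one JSON object per line. The endpoint address `http://localhost:8890/sparql`, the `format` request parameter, the sample data, and both function names are assumptions made for illustration; adjust them to your own instance.

```python
import json
from urllib.parse import urlencode


def sparql_json_url(endpoint, query):
    """Build a GET URL asking a SPARQL endpoint for JSON-serialized results."""
    # Assumption: the endpoint honors a "format" parameter for the result type.
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def bindings_to_json_lines(results):
    """Flatten a SPARQL JSON results document into JSON Lines text."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # An unbound variable is absent from the binding, so it becomes null.
        row = {v: binding.get(v, {}).get("value") for v in variables}
        rows.append(json.dumps(row))
    return "\n".join(rows)


# Hypothetical sample shaped like a SPARQL JSON results document.
sample = {
    "head": {"vars": ["s", "label"]},
    "results": {
        "bindings": [
            {
                "s": {"type": "uri", "value": "http://example.org/a"},
                "label": {"type": "literal", "value": "A"},
            },
            {"s": {"type": "uri", "value": "http://example.org/b"}},
        ]
    },
}

if __name__ == "__main__":
    print(sparql_json_url("http://localhost:8890/sparql",
                          "SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
    print(bindings_to_json_lines(sample))
```

Text in this one-object-per-line shape can be saved to a file and loaded with Spark's `spark.read.json(path)`, which by default expects one JSON object per line.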
We also provide JDBC drivers for Virtuoso (again, both Open Source and Enterprise Edition), so that is also an option.
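For the JDBC route, a minimal sketch of the Spark side might look like the Python below. It only assembles the options that Spark's generic JDBC reader takes and passes them to an existing `SparkSession`. The host, port `1111`, `dba` credentials, table name `Demo.demo.Customers`, driver class `virtuoso.jdbc4.Driver`, and both function names are assumptions for illustration; check them against the driver jar and instance you actually have, and note that the Virtuoso JDBC jar must be on Spark's classpath.

```python
def virtuoso_jdbc_options(host, port, user, password, table):
    """Assemble options for Spark's generic JDBC data source."""
    return {
        # Assumption: basic Virtuoso JDBC URL form, jdbc:virtuoso://host:port
        "url": f"jdbc:virtuoso://{host}:{port}",
        # Assumption: driver class name; verify against your driver jar.
        "driver": "virtuoso.jdbc4.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_virtuoso_table(spark, options):
    """Load a table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


if __name__ == "__main__":
    # Hypothetical connection details for a local instance.
    opts = virtuoso_jdbc_options("localhost", 1111, "dba", "dba",
                                 "Demo.demo.Customers")
    print(opts["url"])
```

Keeping the option building separate from the read makes it easy to confirm the JDBC URL on its own before involving Spark.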
We are not Apache Spark experts, so we cannot provide much guidance for getting these working beyond assisting with Virtuoso JDBC URLs and/or retrieving SPARQL query results in JSON serialization.
In the other direction, Virtuoso (Enterprise Edition; not Open Source Edition) can be used to query against external ODBC data sources, and there are ODBC drivers available for Hadoop/Spark data sources, so this is also an option.
We are not Apache Spark experts, so we cannot provide much guidance for getting those third-party ODBC drivers working, but once you have a functional DSN, we can assist in getting Virtuoso connected to it.