From Databricks Tables to a Virtual Knowledge Graph with Virtuoso

Introduction

Modern AI and analytics workloads increasingly demand graph reasoning capabilities—the ability to traverse relationships, discover patterns across connected data, and answer multi-hop questions that flat tables and vector searches cannot address efficiently. Yet most enterprise data remains locked in relational warehouses and lakehouses, where relationships are modeled using implicit coarse-grained foreign keys rather than fine-grained entity relationships defined by machine-computable relationship type semantics described in an ontology. In addition, these relationships and the entities they connect aren’t named using hyperlinks which impedes intelligent navigation across instance boundaries, as and when required.

This guide demonstrates how to unlock machine-computable entity relationships across your existing Databricks data without migration, duplication, or platform lock-in.

Using open standards—including RDF for expressing entity relationships, SPARQL for graph-oriented query and reasoning, and R2RML for exposing relational data as graphs—you can transform existing warehouse tables into a navigable web of linked entities without moving the underlying data.

At the heart of this approach are hyperlinks (IRIs), which provide globally unique identifiers for the entities, relationships, and data spaces represented in your data. These hyperlinks enable data from disparate systems to be connected, referenced, explored, and reasoned over as a coherent whole.

The result is a Knowledge Graph that extends beyond the boundaries of any single database, platform, or application. Rather than creating another data silo, it establishes a loosely coupled semantic layer in which identities, relationships, and meanings become explicitly computable and interoperable across both enterprise and public data spaces.

By virtualizing graph structures over existing data assets, organizations can create a foundation for AI agents, analytics, automation, and knowledge-driven applications that operate over a web of linked entities and relationships while preserving existing investments in data infrastructure.

What you get:

  • Graph reasoning on data you already have: Query relationships, traverse paths, and discover patterns across your Databricks tables using SPARQL, without moving data out of your governance perimeter.
  • Standards-based interoperability: RDF and SPARQL are W3C standards supported across hundreds of tools, triple stores, and semantic web platforms. Your graph isn’t locked to a single vendor’s query language or data model.
  • Zero data movement: Virtuoso’s Virtual Database architecture attaches to Databricks via ODBC, leaving your source data in place while exposing it through a live, queryable RDF layer.
  • Semantic enrichment: R2RML mappings let you define rich ontologies, infer relationships that aren’t explicit in your schema, and integrate external vocabularies for cross-domain reasoning.
  • Production-grade infrastructure: Virtuoso has powered enterprise Knowledge Graphs for decades, with proven scalability, ACID compliance, and full support for federated SPARQL queries across heterogeneous data sources.

This approach is ideal when you need graph capabilities over analytical data that lives in a warehouse, when governance or scale requires data to stay where it is, or when you want to test the value of graph patterns before committing to a parallel graph stack. For workloads requiring millisecond-latency traversals or high-frequency writes, a native graph store remains the better fit—but for batch reasoning, GraphRAG over reference data, or exploratory graph analytics, this standards-based virtualization layer delivers graph power without the operational overhead.

Use case: We’ll use the public samples.bakehouse dataset from Databricks—a fictional bakery chain with customers, franchises, suppliers, transactions, and reviews—to demonstrate how relational tables become a navigable, SPARQL-queryable Knowledge Graph in minutes.


Step 1: Create the ODBC DSN

Configure an ODBC Data Source Name (DSN) called databricks_odbc to connect to your Databricks warehouse.

Prerequisites

  • Install the Databricks ODBC Driver
  • Obtain your Databricks workspace Host, HTTP Path, and Personal Access Token

Configuration Files

Use the example configuration files as templates:

Copy these files and update them with your actual paths and credentials:

odbcinst.ini:

[ODBC Drivers]
databricks_odbc = Installed 

[databricks_odbc]
Driver = /path/to/libdatabricksodbc64.so

odbc.ini:

[ODBC Data Sources]
databricks_odbc = Databricks ODBC Driver

[databricks_odbc]
Driver          = /path/to/libdatabricksodbc64.so
Host            = your-workspace.cloud.databricks.com
Port            = 443
HTTPPath        = /sql/1.0/warehouses/your-warehouse-id
SSL             = 1
AuthMech        = 3
UID             = token
PWD             = dapiYOURTOKENHERE
ThriftTransport = 2
Schema          = bakehouse
Catalog         = samples

Place these files in your system’s ODBC configuration directory (e.g., /etc/odbc.ini and /etc/odbcinst.ini on Linux/macOS).


Step 2: Register the DSN in Virtuoso

Open the External Data Sources Manager in Virtuoso Conductor:

http://localhost:8890/conductor/vdb_conn_dsn.vspx

Or use your custom hostname/port:

http(s)://cname:port/conductor/vdb_conn_dsn.vspx

Step 3: Connect to the Data Source

  1. In the External Data Sources Manager, locate databricks_odbc in the list
  2. Click Connect
  3. Provide the username and password for your DSN (typically token and your Databricks personal access token)
  4. Click Connect to establish the connection

Step 4: Clone the Demo Repository

Clone the repository containing the R2RML mappings and setup scripts:

git clone https://github.com/danielhmills/databricks-sample-kg.git
cd databricks-sample-kg

Step 5: Run the Quick Setup Script

Execute the quick_setup.sql script using Virtuoso’s isql command-line tool:

isql {port} {username} {password} verbose=off bakehouse/quick_setup.sql

Example:

isql 1111 dba dba verbose=off bakehouse/quick_setup.sql

What the Script Does

1. Attach Remote Tables

The script attaches each Databricks table to Virtuoso’s local catalog, making them queryable via SQL:

ATTACH TABLE "bakehouse"."sales_customers" 
PRIMARY KEY(customerID)
AS "databricks"."bakehouse"."sales_customers"
FROM 'databricks_odbc';

vdd_disconnect_data_source('databricks_odbc');

This creates a local reference to the remote table without copying data. The vdd_disconnect_data_source call releases the connection after each attach operation.

2. Grant SPARQL Access

Each attached table receives SPARQL query privileges:

GRANT SELECT ON "databricks"."bakehouse"."sales_customers" TO SPARQL_SELECT;

This allows the SPARQL engine to query the attached tables when generating RDF views.

3. Load R2RML Mappings and Ontology

The script loads the R2RML mapping definitions and ontology vocabulary into named graphs:

SPARQL LOAD <https://raw.githubusercontent.com/danielhmills/databricks-sample-kg/refs/heads/main/bakehouse/r2rml.ttl> 
INTO <urn:databricks:bakehouse:r2rml>;

SPARQL LOAD <https://raw.githubusercontent.com/danielhmills/databricks-sample-kg/refs/heads/main/bakehouse/ontology.ttl> 
INTO <http://www.databricks.com/bakehouse#>;

The R2RML mapping defines how relational tables map to RDF triples (entities, properties, relationships).

4. Generate Quad Maps

The script converts R2RML mappings into Virtuoso’s internal Quad Map format for efficient SPARQL query execution:

EXEC ('SPARQL ' ||
 DB.DBA.R2RML_MAKE_QM_FROM_G('urn:databricks:bakehouse:r2rml')
);

Quad Maps are Virtuoso’s optimized representation of RDF views over relational data.

5. Configure Linked Data Rewrite Rules

The script sets up URL rewrite rules for content negotiation, enabling entity URIs to be dereferenceable:

DB.DBA.URLREWRITE_CREATE_REGEX_RULE (
'databricks_bakehouse_rule1',
1,
'(/databricks/bakehouse/[^#]*)',
vector('path'),
1,
'/describe/?url=http%%3A//^{URIQADefaultHost}^%U%%23this&graph=http%%3A//^{URIQADefaultHost}^/databricks-bakehouse-r2rml%%23&distinct=0',
vector('path'),
null,
null,
2,
303
);

These rules allow browsers to view entity descriptions and RDF clients to retrieve machine-readable data.


Step 6: Test the Attached Tables

Before generating the Knowledge Graph, verify that the Databricks tables are properly attached and accessible via SQL. You can use Virtuoso’s isql command-line tool or the Conductor web interface.

Using the Conductor UI:
Navigate to: http://localhost:8890/conductor/isql_main.vspx

Or use your custom hostname/port:
http(s)://cname:port/conductor/isql_main.vspx

Test Query:

SELECT TOP 10 * FROM databricks.bakehouse.sales_customers;

iSQL Query Results
Query results showing attached Databricks table data accessible via SQL

If successful, you should see rows from the sales_customers table, confirming that Virtuoso can access your Databricks data through the ODBC connection.


Step 7: Verify the Knowledge Graph

Confirm that the Knowledge Graph is accessible by running the following SPARQL query in Virtuoso’s isql or the SPARQL endpoint:

SPARQL 
SELECT *
FROM <http://demo.openlinksw.com/databricks-bakehouse-r2rml#>
WHERE
{
  ?s a ?o
}
LIMIT 10

Links:

If successful, you should see RDF triples representing entities from the Databricks bakehouse schema (customers, franchises, suppliers, transactions, and reviews).


Sample Query: Revenue by Franchise

This query demonstrates how to join data across multiple tables (Transaction and Franchise) using SPARQL to calculate total revenue by franchise and city:

PREFIX : <http://www.databricks.com/bakehouse#>

SELECT
?franchise
?franchiseCity
SUM(?totalPrice) as ?revenue

FROM <http://demo.openlinksw.com/databricks-bakehouse-r2rml#>
WHERE
{
  #Transaction Table
  ?transaction a :Transaction;
   :franchise ?franchise;
   :totalPrice ?totalPrice.

  #Franchise Table
  ?franchise :city ?franchiseCity.   
}
GROUP BY ?franchise ?franchiseCity
ORDER BY DESC(?revenue)
LIMIT 10

Links:

SPARQL Query Results
Query results showing revenue aggregated by franchise with clickable entity URIs

Each franchise URI in the results (e.g., http://demo.openlinksw.com/databricks/bakehouse/franchise-3000046#this) is clickable and navigates to a detailed entity description page.

Entity Description from Query Results
Clicking a franchise URI from query results navigates to its full entity description


Graph Visualization with SPARQLWorks

The Knowledge Graph can be visualized interactively using SPARQLWorks, a graph visualization tool that renders SPARQL query results as interactive node-link diagrams.

Try it: View Transaction-Franchise Relationships

This visualization uses a SPARQL CONSTRUCT query to extract a subgraph showing relationships between transactions, franchises, and cities:

PREFIX : <http://www.databricks.com/bakehouse#>

CONSTRUCT
{
  #Transaction Table
  ?transaction a :Transaction;
   :franchise ?franchise;
   :totalPrice ?totalPrice.

  #Franchise Table
  ?franchise :city ?franchiseCity.   
}
WHERE
{
  GRAPH <http://demo.openlinksw.com/databricks-bakehouse-r2rml#>
  {
    #Transaction Table
    ?transaction a :Transaction;
     :franchise ?franchise;
     :totalPrice ?totalPrice.

    #Franchise Table
    ?franchise :city ?franchiseCity.   
  }
}
LIMIT 100

The interactive visualization allows you to:

  • Explore entity relationships visually
  • Click nodes to navigate to entity descriptions
  • Zoom and pan to examine graph structure
  • See property values on hover

SPARQLWorks Visualization
Interactive graph visualization showing transaction-franchise-city relationships

Clicking on any entity node in the visualization navigates to its description page. Try it: Franchise 3000024 Description


Linked Data: Navigable Entity URIs

One of the key benefits of this approach is that every entity in your Knowledge Graph has a dereferenceable HTTP URI. This means you can navigate directly to any customer, franchise, supplier, or transaction by visiting its URL in a browser.

Try it: http://demo.openlinksw.com/databricks/bakehouse/franchise-3000046#this

How It Works

When you visit an entity URI, Virtuoso’s rewrite rules implement content negotiation:

For browsers (requesting HTML):

  • The server returns a 303 redirect to a human-readable description page
  • You see all the properties of the franchise: name, location, coordinates, supplier relationships, etc.
  • Links to related entities (suppliers, cities, countries) are clickable, letting you navigate the graph

Linked Data Hyperlink Navigation
Clicking on hyperlinked entities in the graph visualization navigates to their Linked Data descriptions

For RDF clients (requesting RDF/XML, Turtle, JSON-LD):

  • The server executes a SPARQL DESCRIBE query against the Knowledge Graph
  • Returns machine-readable RDF data about the entity in the requested format
  • Perfect for automated agents, data integration, and semantic web applications

Linked Data Entity Description
Human-readable entity description page showing all properties and relationships

This is the essence of Linked Data: every entity has a globally unique identifier (URI) that both humans and machines can dereference to discover what that entity is and how it relates to other entities. Your Databricks warehouse data becomes part of a navigable web of interconnected information, accessible through standard HTTP and queryable through standard SPARQL—without moving the data or changing your existing infrastructure.


Next Steps


Troubleshooting

  • DSN not found: Verify ODBC configuration files are in the correct location and the DSN name matches exactly (databricks_odbc)
  • Connection fails: Check Databricks credentials, ensure the SQL warehouse is running, and verify network connectivity
  • No SPARQL results: Confirm the R2RML mapping loaded successfully and R2RML_MAKE_QM_FROM_G executed without errors
  • Permission errors: Ensure SPARQL_SELECT grants were applied to all attached tables

Resources