Virtuoso HowTo: Generating Linked Data from CSV version of Northwind Database

The Northwind Database is a broadly used and generally understood Database and associated Schema (or Data Dictionary). Thus, it provides a great foundation for demonstrating recent innovations in the Database Management Systems (DBMS) realm, especially in regards to “Graph Databases” vs “RDF Graph Databases” discussions.

This post covers the process and implications of transforming the Northwind Database (sourced from a collection of CSV documents) into a KnowledgeGraph that manifests as a Semantic Web of Linked Data. Fundamentally, it showcases how hyperlinks as entity identifiers alter the dynamics of entity relationship models by bringing the data to life via the hypermedia medium provided by HTTP-based hyperlinks (or HTTP URIs).

Northwind Database Entity Relationship Model

  1. Download the Northwind Database (a collection of CSV Documents packed in a Zip Archive) click HERE to download; you can also situate yourself in a download folder via your OS command-line interface and issue the following command: curl -O http://bitnine.net/tutorial/import-northwind-dataset.zip ;
  2. From your downloads folder unzip the Zip Archive using the command: unzip import-northwind-dataset.zip or simply click on the archive file name from your File Manager or Finder UI
  3. Move into the newly created “import-northwind-dataset” folder
  4. Copy all the extracted CSV documents to your Virtuoso installations “vad” sub-folder
  5. Download the SQL Script containing the commands by simply clicking HERE or you can manually download from https://github.com/OpenLinkSoftware/SPASQL-Utility-Showcase-Queries/blob/master/northwind-csv-spasql-demo.sql
  6. Start the Virtuoso iSQL command-line interface and then run the command: load ./northwind-csv-spasql-demo.sql

What’s Happening Here?

  1. Virtuoso will attach (rather than import) all the CSV documents and then automatically map each document to a SQL Table (i.e., map the content of each document to an N-Tuple Relation – colloquially referred to as Tables) – note its use of a file: scheme URI to identify each external CSV document

  2. Once each document is attached, it is worthwhile verifying usability using a sampling of SQL and SPARQL queries

-- ATTACH CATEGORIES TABLE

DROP TABLE csv.northwind.categories;
ATTACH_FROM_CSV ('csv.northwind.categories', 'file:../vad/categories.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.categories;

--ATTACH CUSTOMERS TABLE

DROP TABLE csv.northwind.customers;
ATTACH_FROM_CSV ('csv.northwind.customers', 'file:../vad/customers.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.customers;


--ATTACH EMPLOYEE TERRITORY TABLE

DROP TABLE csv.northwind.employee_territories;
ATTACH_FROM_CSV ('csv.northwind.employee_territories', 'file:../vad/employee_territories.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.employee_territories;


--ATTACH EMPLOYEES TABLE

DROP TABLE csv.northwind.employees;
ATTACH_FROM_CSV ('csv.northwind.employees', 'file:../vad/employees.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.employees;


--ATTACH ORDER DETAILS TABLE

DROP TABLE csv.northwind.orders_details;
ATTACH_FROM_CSV ('csv.northwind.orders_details', 'file:../vad/orders_details.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.orders_details;


--ATTACH ORDERS TABLE

DROP TABLE csv.northwind.orders;
ATTACH_FROM_CSV ('csv.northwind.orders', 'file:../vad/orders.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.orders;

--ATTACH PRODUCTS TABLE

DROP TABLE csv.northwind.products;
ATTACH_FROM_CSV ('csv.northwind.products', 'file:../vad/products.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.products;

--ATTACH REGIONS TABLE

DROP TABLE csv.northwind.regions;
ATTACH_FROM_CSV ('csv.northwind.regions', 'file:../vad/regions.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.regions;

--ATTACH REGIONS TABLE

DROP TABLE csv.northwind.regions;
ATTACH_FROM_CSV ('csv.northwind.regions', 'file:../vad/regions.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.regions;

--ATTACH SHIPPERS TABLE

DROP TABLE csv.northwind.shippers;
ATTACH_FROM_CSV ('csv.northwind.shippers', 'file:../vad/shippers.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.shippers;

--ATTACH SUPPLIERS TABLE

DROP TABLE csv.northwind.suppliers;
ATTACH_FROM_CSV ('csv.northwind.suppliers', 'file:../vad/suppliers.csv', ',', '\n', null, 1, vector(1));
SELECT TOP 5 * FROM csv.northwind.suppliers;


--ATTACH TERRITORIES TABLE

DROP TABLE csv.northwind.territories;
ATTACH_FROM_CSV ('csv.northwind.territories', 'file:../vad/territories.csv', ',', '\n', null, 1, vector(1,2));
SELECT TOP 5 * FROM csv.northwind.territories;

Create & Verify Sample Transformations

-- Generate SPARQL from Order CSV 

SPARQL 
DEFINE
get:soft "replace"
SELECT * 
FROM <file://../vad/order_details.csv>
	{
		?s ?p ?o.
	}

Limit 100;


-- CREATE RDF VIEW using built-in Sponger Cartridge for mapping CSV to RDF Statement Graphs

DROP TABLE csv.northwind.sparql_order_details;

CREATE VIEW csv.northwind.sparql_order_details AS
SPARQL 
SELECT 
		?orderID
		?productID
		?unitPrice
		?quantity
		?discount

FROM <file://../vad/order_details.csv>
WHERE
	{
		?s a <file://../vad/order_details.csv#class>;
		<file://../vad/order_details.csv#productID> ?orderID;
		<file://../vad/order_details.csv#productID> ?productID;
		<file://../vad/order_details.csv#unitPrice> ?unitPrice;
		<file://../vad/order_details.csv#productID> ?quantity;
		<file://../vad/order_details.csv#productID> ?discount.
	};

-- Modified Test Query

SELECT TOP 5 Y.orderID
FROM (
SPARQL
SELECT
       ?orderID
       ?productID
       ?unitPrice
       ?quantity
       ?discount

FROM <file://../vad/order_details.csv>
WHERE
   {
       ?s a <file://../vad/order_details.csv#class>;
       <file://../vad/order_details.csv#orderID> ?orderID; 
       <file://../vad/order_details.csv#productID> ?productID;
       <file://../vad/order_details.csv#unitPrice> ?unitPrice;
       <file://../vad/order_details.csv#quantity> ?quantity;
       <file://../vad/order_details.csv#discount> ?discount. 
   }

LIMIT 5 ) 
AS X
INNER JOIN  "csv"."Northwind"."Orders" Y
ON X.orderID = Y.orderID;

Tables to RDF Graphs Mappings and their Storage

At this juncture, based on script above, Virtuoso would have only generated Virtual 3-Tuple Relations (Triples) associated with a Named Graph (Data Source Identifier/Name) associated with its Virtual RDF Storage rather than Physical 3-Tuple Relations (Triples) associated with a Named Graph associated with its Physical RDF Storage.

A Virtual Storage can be identified by an IRI using define input:storage {storage-IRI} if the default storage isn’t desired

A Physical Storage is identified by the IRI: virtrdf: .

A combined Virtual and Physical Storage is identified by the IRI: virtrdf:DefaultQuadMap, and is the default mapping used whenever you use the built-in wizards mapping Tables to RDF Graphs.

To generate Physical Triples the following would need to be executed:

SPARQL
SELECT COUNT(*) FROM <http://demo.openlinksw.com/csv_northwind#>
WHERE {?s ?p ?o} ;

RDF_VIEW_SYNC_TO_PHYSICAL ('http://demo.openlinksw.com/csv_northwind#', 1, 'urn:demo:csv_northwind:data');


SPARQL
SELECT COUNT(*) FROM <urn:demo:csv_northwind:data>
WHERE {?s ?p ?o} ;

What’s the difference between Virtual and Physical Triples?

Virtual Triples are directly translated to SQL “on the fly” when executed, but it they also have a fundamental shortcoming with regards to the performance of queries that operate on instances of classes via rdf:type (isA) relations. Why? Due to the fact that unions need to be constructed across the Tables associated with instances of classes when the object of rdf:type relations takes the form of a variable.

Physical Triples don’t have any performance issues with variables identifying the objects of an rdf:type relation, but they do introduce a change-sensitivity challenge i.e., what happens when data is altered in the original CSV document or some other relevant Table (e.g., once referenced via an ODBC connection)?

Change-sensitivity is addressed via the following options:

  1. Use of a Trigger on the source SQL Table (e.g., in this exercise we have a native Virtuoso Table [where triggers are supported] generated from a CSV document using a file system reference, courtesy of file: scheme URI)

  2. Use of a store procedure that can apply a delta-calculation to the source en route to applying or deleting triples from a Named Graph associated with Physical RDF Store.

RDF Views

After successfully verifying the binding between the filesystem hosted CSV documents and Relations in Virtuoso, you can now proceed to generating a highly navigable KnowledgeGraph based on RDF Views that manifests as a Semantic Web of Linked Data.

You have the following options for generating these powerful views:

  • Virtuoso’s built-in Transformation Middleware (“Sponger”) – which generates RDF Statement Graphs “on the fly” from CSV documents (as per the file: reference and usage in the examples above)

  • Virtuoso’s built-in RDF Views Wizard – which generates a Knowledge Graph that manifests as a Semantic Web of Linked Data Automatically

  • R2RML Scripts that comprise a collection of RDF-Turtle statements that describe your desired mappings from Relational Tables to RDF Statement Graphs – HERE is a an R2RML Script for the CSV edition of the Northwind Database used in this post.

  • Virtuoso Quad Maps that comprise SPASQL (fusion of SQL Procedures and SPARQL) statements that map out your desired mappings from Relational Tables to RDF Statement Graphs

Here are live examples of various entities, identified by hyperlinks (specifically, HTTP URIs), produced by the RDF Views generating options outlined above:

Conclusion
By collectively standardizing on the user of hyperlinks (specifically, HTTP URIs) to identify documents the world build a global Document Network via the Internet commonly known as the World Wide Web (a/k/a “Web”).

Extending the same approach for identifying documents to the more fine-grained act of structured data representation, based on the Entity Relationship Model, results in a Web of Data (a/k/a “Linked Data”).

Finally, if you describe the Relationship Types (a/k/a Relations) in regards to issues like equivalence, inversion, transitivity, sub-classing, symmetry etc., this new web-like Entity Relationship Model becomes a Knowledgebase (or KnowledgeGraph) that manifests as a Semantic Web of Linked Data.

The implications of this powerful data access, integration, and management innovation are profound and dwarfs everything unleashed by the Web as currently understood and experienced.

In Virtuoso we have developed an innovative platform that brings all of this power to bear without undue complexity, confusion, and costs. Simply install Virtuoso and let it guide you through the process of making current and future projects associated with the following themes ultra-successful:

Links

Related