Use Case Exercise – Knowledge Graph Generation from existing Relational CSV Project
Direct link to video:
https://www.openlinksw.com/data/screencasts/2026_Data_Engineering_Survey_CSV_to_Knowledge_Graph_Splash_Video.mp4
When teams say they need a “Knowledge Graph,” what they usually mean is a production system capable of ingesting messy source data, transforming it into structured, queryable knowledge, and publishing it with durable identifiers and a user-friendly browsing experience.
This post demonstrates how that goal can be achieved using our Virtuoso platform, which supports the full lifecycle of Knowledge Graph creation and deployment. Virtuoso not only hosts RDF data but also provides a multi-model database engine, integrated SPARQL and SQL capabilities, and a native faceted search and browsing service for Linked Data — all within a performant and scalable environment.
The walkthrough shows how Knowledge Graphs can be generated unobtrusively from CSV-based relational data through Virtuoso’s end-to-end pipeline, covering CSV transformation, ontology generation, query-language interoperability, and identifier resolution.
A Knowledge Graph Pipeline That Starts with CSV
In reality, many Knowledge Graph projects begin with spreadsheets or CSV exports. Virtuoso embraces this reality by enabling repeatable and transparent CSV-to-RDF transformations, allowing rows to become entities and columns to become attributes or relationships.
This approach delivers several practical advantages:
- Deterministic transformation: CSVs can be consistently converted into RDF, enabling reproducible graph builds.
- Traceable provenance: Each entity retains lineage back to the original dataset, making governance and auditing much simpler.
- Iterative modeling: As the dataset evolves, the transformation can be refined without rewriting downstream systems.
In practice, this means a CSV document can easily evolve into a Knowledge Graph, deployed using Linked Data principles, with entities, entity types, and relationships — all within a Virtuoso-managed workflow.
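As a minimal sketch of the row-to-entity mapping described above (the base IRI, prefix handling, and column-to-property naming are illustrative assumptions, not the pipeline’s actual output):

```python
import csv
import io

# Hypothetical base IRI for the generated graph (an assumption for illustration).
BASE = "http://example.com/survey#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def row_to_triples(row_id, row):
    """Map one CSV row to an entity: the row becomes a subject,
    each non-empty column becomes an attribute triple."""
    subject = f"<{BASE}survey_entry_{row_id}>"
    triples = [f"{subject} {RDF_TYPE} <{BASE}SurveyResponse> ."]
    for column, value in row.items():
        if value:  # skip empty cells rather than emitting empty literals
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}{column}> "{escaped}" .')
    return triples

# Tiny in-memory CSV standing in for the real survey export.
sample = io.StringIO("role,industry\nData Engineer,Finance\n")
for i, row in enumerate(csv.DictReader(sample)):
    for t in row_to_triples(i, row):
        print(t)
```

Because the mapping is a pure function of the row and its position, re-running it over the same CSV yields the same triples, which is what makes graph builds reproducible.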
Ontology Generation and Modeling in Practice
A Knowledge Graph is only useful if it is actually informed by an ontology. Virtuoso supports ontology generation and management, allowing teams to express explicit entity types (classes), relationship types (or attributes / properties), and constraints that align with the semantics of a discourse domain. This is what makes data interoperable and machine‑interpretable.
Some practical benefits of ontology‑driven modeling:
- Consistency across datasets: Shared vocabularies reduce ambiguity and improve interoperability.
- Semantic interoperability: Other systems can interpret the data without custom glue code delivered via so-called integrations.
- Extensibility: New entity types (classes) and relationship types can be added without breaking existing usage; everything remains loosely coupled.
This is especially important when publishing Knowledge Graphs deployed using Linked Data principles, where ontology use allows others to integrate with ease.
SQL and SPARQL: Both First‑Class Citizens
One of Virtuoso’s most powerful differentiators is that it natively supports both SQL and SPARQL, and allows them to work together. This isn’t just a convenience; it unlocks real operational flexibility:
- SQL for analysts: Analysts and data engineers can query graph‑backed datasets using familiar SQL.
- SPARQL for graph workloads: RDF‑native queries, reasoning, and graph patterns remain first‑class.
- SPASQL (SPARQL inside SQL): Virtuoso allows SPARQL, as subqueries, to be embedded inside SQL, making it easy to bridge and leverage both conventional relational tables and entity relationship graph thinking in a single query.
This dual‑language support means you can expose the same knowledge graph to multiple user groups without forcing them into a single query paradigm. Those familiar with SQL are simply presented with conventional SQL views, equipped with the powerful lookups provided by hyperlinks functioning as platform‑agnostic super-keys.
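To make the SPASQL idea concrete, here is an illustrative query in the style Virtuoso accepts, with a SPARQL pattern used as a derived table inside ordinary SQL. The graph IRI, prefix, and class/property names are assumptions for illustration, not the actual survey graph:

```sql
-- SQL outer query aggregating over rows produced by an embedded SPARQL subquery.
SELECT r.role, COUNT(*) AS respondents
  FROM (SPARQL PREFIX s: <http://example.com/survey#>
        SELECT ?role
          FROM <http://example.com/survey>
         WHERE { ?entry a s:SurveyResponse ; s:role ?role }) AS r
 GROUP BY r.role;
```

The SPARQL variables surface as columns of the derived table, so analysts can stay in SQL while the graph pattern matching happens natively underneath.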
Identifier Resolution as a Deployment Feature
Linked Data requires that entities have resolvable, stable standardized identifiers (e.g., HTTP-based Hyperlinks). Virtuoso supports this out of the box with a native faceted search and browsing service. Identifiers resolve to rich entity descriptions, which can be browsed by humans and consumed by machines (e.g., AI Agents).
This has several implications for deployment:
- Durable, dereferenceable IRIs: Entities can be linked and reused across systems.
- Native faceted navigation: Users can explore the graph without writing queries.
- Live data exploration: The browsing service operates directly against the live graph.
This turns a knowledge graph from a static dataset into a living, explorable product, mesh, or web.
Steps for Creating the Knowledge Graph from CSV
The transformation of the 2026 Data Engineering Survey from a flat CSV into a Linked Data Knowledge Graph followed these seven steps:
1. Ingestion and Schema Analysis
We began by analyzing the survey_2026_data_engineering.csv file to identify core entities (Respondents, Industries, Roles) and their literal attributes (Bottlenecks, AI usage, etc.).
2. Semantic Enrichment (Classification)
We enriched the raw data by mapping text-based categories to official standards based identifiers:
- Industries: Mapped to SIC (Standard Industrial Classification) and NAICS (North American Industry Classification System) codes.
- Roles: Mapped to SOC (Standard Occupational Classification) codes using keyword-based logic.
3. IRI Grounding
For each standard code, we derived authoritative Internationalized Resource Identifiers (IRIs). These links point to official registries (OSHA, Census Bureau, and O*NET) and utilize the #this fragment to denote the specific concept.
4. Ontological Modeling (The Schema)
We generated a custom RDFS/OWL ontology (survey_ontology.ttl) to define the rules of the graph.
- Class Definitions: Created :SurveyResponse, :Industry, and :Role.
- Property Mapping: Bound local attributes (like :modelingApproach) to global standards (Schema.org and DBpedia) using rdfs:subPropertyOf.
- Provenance: Applied rdfs:isDefinedBy to all terms for governance.
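A small Turtle fragment in the spirit of survey_ontology.ttl may help; the prefixes, the base namespace, and the choice of schema:additionalProperty as the super-property are assumptions for illustration, not the published ontology:

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix :       <http://example.com/survey#> .

# A class definition with provenance back to the ontology itself.
:SurveyResponse a rdfs:Class ;
    rdfs:label "Survey Response" ;
    rdfs:isDefinedBy : .

# A local property bound to a global vocabulary via rdfs:subPropertyOf.
:modelingApproach a rdf:Property ;
    rdfs:subPropertyOf schema:additionalProperty ;
    rdfs:isDefinedBy : .
```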
5. Subject and Object Denotation (Hashing)
To ensure stable identifiers in the Knowledge Graph:
- Survey Entries: Assigned incrementing IDs (:survey_entry_0, etc.).
- Entities: Used SHA-256 hashing on Industry and Role names to create unique, repeatable IRIs (e.g., :d0da7ab8bc927960) grounded in the project’s base URI.
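The hashing step can be sketched as follows; the normalization (trim and lowercase), the truncation to 16 hex characters, and the base IRI are assumptions made for illustration, so the resulting identifiers will not match the project’s actual ones:

```python
import hashlib

# Hypothetical base IRI for minted entity identifiers.
BASE = "http://example.com/survey#"

def mint_iri(name: str) -> str:
    """Derive a short, repeatable local identifier from an entity name
    by hashing its normalized form with SHA-256."""
    normalized = name.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{BASE}{digest[:16]}"
```

The key property is repeatability: the same Industry or Role name always mints the same IRI, so rebuilding the graph never fragments an entity into duplicates.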
6. RDF Triplification (Turtle Serialization)
We converted each row of the enriched CSV into RDF triples.
- Lossless Transformation: Included all original response text as literals.
- Spatial Context: Mapped regions to DBpedia resources (e.g., dbr:United_States_Canada).
- Format Compliance: Formatted timestamps as strict xsd:dateTime (ISO 8601).
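The timestamp normalization can be sketched like this; the source format string is an assumption (the real export format may differ), and the UTC assignment is illustrative:

```python
from datetime import datetime, timezone

XSD_DATETIME = "<http://www.w3.org/2001/XMLSchema#dateTime>"

def to_xsd_datetime(raw: str) -> str:
    """Parse a raw survey timestamp (assumed MM/DD/YYYY HH:MM:SS) and emit
    a strict ISO 8601 xsd:dateTime literal in Turtle/N-Triples syntax."""
    dt = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    iso = dt.isoformat().replace("+00:00", "Z")  # prefer the compact Z suffix
    return f'"{iso}"^^{XSD_DATETIME}'
```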
7. Syntactic Refinement and Validation
Finally, we optimized the Turtle file for SPARQL parsers:
- Multi-line Handling: Used triple quotes (""") for all literals to prevent line breaks from breaking the syntax.
- Escaping: Thoroughly escaped internal double-quotes and backslashes to ensure the graph successfully passes syntax checking in environments like URIBurner.
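The escaping step can be sketched as a small helper; escaping every interior double-quote (rather than only runs adjacent to the closing delimiter) is a conservative simplification:

```python
def escape_turtle_long_literal(text: str) -> str:
    """Wrap arbitrary response text in a Turtle triple-quoted literal,
    escaping backslashes first so quote escapes are not double-escaped."""
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"""{text}"""'
```

Order matters here: escaping backslashes before quotes keeps the output parseable, while the triple quotes let embedded newlines pass through untouched.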
Scaling the Graph in Production
Virtuoso is inherently scalable and capable of handling large‑scale graphs (or tables) with high query throughput demands. It supports:
- Multi‑model storage: RDF, relational, and other models in the same engine.
- Optimized query execution: Efficient SPARQL processing at scale.
- Operational maturity: Built for production workloads with high concurrency.
For Knowledge Graph deployment, this means you can move from prototype to production without migrating between systems.
Publishing the Generated Knowledge Graph
This showcases how Virtuoso’s core value is that it treats Knowledge Graphs as a complete system, not just a data format. It covers:
- CSV (and many other formats) transformation into RDF-based Knowledge Graphs
- Ontology generation and refinement
- First‑class SQL and SPARQL support – authenticate as user ‘Demo’ using password ‘Demo’
- SPASQL for hybrid querying
- Linked Data identifier resolution
- Native faceted search and browsing – just click to interact with engine and explore
- Production‑grade scalability
This combination makes Virtuoso a pragmatic choice for teams who want to build and deploy Knowledge Graphs that are both technically rigorous and operationally durable.
Why Should You Care?
Converting CSVs into Knowledge Graphs is more than a technical exercise — it fundamentally reshapes how you capture, contextualize, and leverage data. Here’s why this matters today:
- Context-Rich Insights: Knowledge Graphs explicitly encode entities, relationships, and types. This allows you to reason over your data, uncover hidden connections, and generate insights that are difficult—or impossible—to extract from flat tables. Context Engineering ensures that these relationships are meaningful, coherent, and reusable.
- AI Agents Ready: Structured, context-rich Knowledge Graphs form the backbone for AI Agents. By providing a semantically harmonized dataset, AI Agents can reliably answer questions, perform tasks, and coordinate actions across multiple data spaces. CSVs alone lack this richness, making AI assistance shallow and brittle.
- Interoperability Across Systems: Using Linked Data principles and standardized identifiers (hyperlinks) ensures that data is globally addressable, machine-readable, and reusable across systems. AI Agents, LLMs, or other automated workflows can integrate and augment your Knowledge Graph without fragile, brittle mappings.
- Provenance and Reusability: Every entity and relationship in the graph retains traceability to its source. This makes governance, auditing, and iterative improvement easier while enabling the same Knowledge Graph to support multiple AI Agents and analytical workflows.
- Future-Proofing Data & Decisions: Context Engineering plus AI Agents unlocks the next generation of knowledge-driven workflows. Your data is no longer just a static record — it becomes a dynamic, actionable asset, capable of supporting reasoning, inference, and multi-modal AI interactions.
In short: what was once isolated CSV data now becomes an interoperable, contextually rich, and AI-ready Knowledge Graph — empowering smarter decisions, more reliable AI assistance, and scalable, machine-actionable knowledge.
Conclusion
Knowledge Graph creation and deployment requires more than just an RDF store. It requires tooling that supports ingestion, modeling, querying (using SPARQL and/or SQL [via SPASQL]), and publication in a way that is interoperable and scalable. Virtuoso provides those capabilities in a single, integrated platform—making it possible to go from CSV to a fully navigable Linked Data experience with minimal friction.
If your organization needs to bridge relational and graph worlds while publishing Linked Data at scale, Virtuoso is a powerful platform that handles the task natively, while also delivering high performance and scalability across a variety of platforms (Unix/Linux, Windows, and macOS).
Related
- NotebookLM Slide Deck about this CSV to RDF based Knowledge Graph Generation Exercise
- Knowledge Graph Edition of 2026 Data Engineering Survey
- Google Gemini Project used for CSV to RDF Knowledge Graph Transformation
- Real-World AI Coding Agent Exercise
- Trillion-Dollar++ Agentic Software Opportunity
- Agent Skills, Filesystems, and Context Graphs: Collapsing Software Complexity at the Root
- About Virtuoso