SemMed Database
The SemMed Database is a biomedical medical database deployed using the MySQL Database Management System.
This post is about generating a Knowledge Graph that manifests as a Semantic Web from this database by leveraging the multi-model database management system capabilities of Virtuoso.
Relational Data Details
ENTITY
table
This table contains entity information whose data come from ENTITY
output generated using full fielded output. It includes the following data fields:
-
ENTITY_ID
— Auto-generated primary key for each unique entity -
SENTENCE_ID
— The foreign key toSENTENCE
table -
CUI
— The CUI of the entity -
NAME
— The preferred name of the entity -
TYPE
— The semantic type of the entity -
GENE_ID
— The EntrezGene ID of the entity -
GENE_NAME
— The EntrezGene name of the entity -
TEXT
— The text in the utterance that maps to the entity -
START_INDEX
— The first character position (in document) of the text denoting the entity -
END_INDEX
— The last character position (in document) of the text denoting the entity -
SCORE
— The confidence score
PREDICATION
table
Each record in this table identifies a unique predication. The data fields are as follows:
-
PREDICATION_ID
— Auto-generated primary key for each unique predication -
SENTENCE_ID
— Foreign key to theSENTENCE
table -
PMID
— The PubMed identifier of the citation to which the predication belongs -
PREDICATE
— The string representation of each predicate (for exampleTREATS
,PROCESS_OF
) -
SUBJECT_CUI
— The CUI of the subject of the predication -
SUBJECT_NAME
— The preferred name of the subject of the predication -
SUBJECT_SEMTYPE
— The semantic type of the subject of the predication -
SUBJECT_NOVELTY
— The novelty of the subject of the predication -
OBJECT_CUI
— The CUI of the object of the predication -
OBJECT_NAME
— The preferred name of the object of the predication -
OBJECT_SEMTYPE
— The semantic type of the object of the predication -
OBJECT_NOVELTY
— The novelty of the object of the predication
SENTENCE
table
This table contains information about individual sentences from PubMed citations and includes the following data fields:
-
SENTENCE_ID
— Auto-generated primary key for each sentence -
PMID
— The PubMed identifier of the citation to which the sentence belongs -
TYPE
—'ti'
for the title of the citation,'ab'
for the abstract -
NUMBER
— The location of the sentence within the title or abstract -
SENTENCE
— The actual string or text of the sentence -
SECTION_HEADER
— Section header name of structured abstract (from Version 3.1) -
NORMALIZED_SECTION_HEADER
— Normalized section header name of structured abstract (from Version 3.1)
Knowledge Graph Creation Steps
-
Download the SemMed Database MySQL data dump file (see data sources table)
-
Load the MySQL database dump file into a MySQL database
-
Attach MySQL Database to Virtuoso via ODBC
-
Generate RDF Views using Attached Tables.
The Virtuoso installation architecture for achieving this task could be as depicted below.
At this stage, you have basic RDF 3-Tuple based Relations (RDF Graphs) generated from N-Tuple Relations (SQL Tables) in place that reflect the following:
- Each Table is treated as a Class (Entity Type)
- Each Table Record is treated as an Instance (or Individual) of the Class for that Table
- Each Column Name is a Relation Name (RDF sentence Predicate)
- Each Column Value is a Relation Value (RDF sentence Object)
- Each Column is also associated with a Class via an
rdfs:domain
Relation - Each Column Value is associate with a Class via an
rdfs:range
Relation
Based on the above, you can generate a domain-specific Knowledge Graph to, for example, associate genes with diseases.
-
Generate of additional triples using SPARQL
INSERT
-
Run test queries to verify the state and quality of the resulting Knowledge Graph
Example Queries
Genes – Live Link
PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>
SELECT DISTINCT *
FROM <urn:semmed:rdf:data:new3>
FROM <urn:semmed:rdf:data:genes3>
WHERE
{
?s a semmed-gene:this ;
owl:sameAs ?o
}
LIMIT 100
Genes associated with Asthma – [Live Link]
PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#>
PREFIX copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>
SELECT DISTINCT *
FROM <urn:semmed:rdf:data:genes3>
FROM <urn:semmed:rdf:data:new3>
WHERE
{
?s a semmed-gene:this ;
rdfs:label ?genName ;
associated_with:this asthma:this .
}
Genes associated with Asthma, but not COPD 1 – [Live Link]
PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#>
PREFIX copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/>
PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>
SELECT DISTINCT *
FROM <urn:semmed:rdf:data:genes3>
FROM <urn:semmed:rdf:data:new3>
WHERE
{
?s a semmed-gene:this ;
rdfs:label ?genName ;
associated_with:this asthma:this .
FILTER ( NOT EXISTS { ?s associated_with:this copd:this }
)
}
Genes associated with Asthma, but not COPD 2 (using GROUP_CONCAT
) – [Live Link]
PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#>
PREFIX copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/>
PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>
SELECT DISTINCT ?s
( GROUP_CONCAT ( ?geneName , "\n" ) AS ?name )
FROM <urn:semmed:rdf:data:genes3>
FROM <urn:semmed:rdf:data:new3>
WHERE
{
?s a semmed-gene:this ;
rdfs:label ?geneName ;
associated_with:this asthma:this .
FILTER ( NOT EXISTS { ?s associated_with:this copd:this }
)
}
Genes associated with Asthma and COPD – [Live Link]
PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#>
PREFIX copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/>
PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>
SELECT DISTINCT *
FROM <urn:semmed:rdf:data:genes3>
FROM <urn:semmed:rdf:data:new3>
WHERE
{
?s a semmed-gene:this ;
rdfs:label ?genName ;
associated_with:this asthma:this ,
copd:this .
}
Entity Reconciliation Demo Sequence Example
Using B3GNT2
as the industry standard identifier of interest.
Here’s LOD, as a superior resolver:
-
B3GNT2
Text Search results filtered by type which reveals Bio2RDF effect (this Data Space was the original resolver for Biological Identifiers)