Semantic MEDLINE (SemMed) Knowledge Graph Generation from MySQL Database

kidehen · December 10, 2019, 8:31pm

SemMed Database

The SemMed Database is a biomedical medical database deployed using the MySQL Database Management System.

This post is about generating a Knowledge Graph that manifests as a Semantic Web from this database by leveraging the multi-model database management system capabilities of Virtuoso.

Relational Data Details

`ENTITY` table

This table contains entity information whose data come from ENTITY output generated using full fielded output. It includes the following data fields:

ENTITY_ID — Auto-generated primary key for each unique entity
SENTENCE_ID — The foreign key to SENTENCE table
CUI — The CUI of the entity
NAME — The preferred name of the entity
TYPE — The semantic type of the entity
GENE_ID — The EntrezGene ID of the entity
GENE_NAME — The EntrezGene name of the entity
TEXT — The text in the utterance that maps to the entity
START_INDEX — The first character position (in document) of the text denoting the entity
END_INDEX — The last character position (in document) of the text denoting the entity
SCORE — The confidence score

`PREDICATION` table

Each record in this table identifies a unique predication. The data fields are as follows:

PREDICATION_ID — Auto-generated primary key for each unique predication
SENTENCE_ID — Foreign key to the SENTENCE table
PMID — The PubMed identifier of the citation to which the predication belongs
PREDICATE — The string representation of each predicate (for example TREATS, PROCESS_OF)
SUBJECT_CUI — The CUI of the subject of the predication
SUBJECT_NAME — The preferred name of the subject of the predication
SUBJECT_SEMTYPE — The semantic type of the subject of the predication
SUBJECT_NOVELTY — The novelty of the subject of the predication
OBJECT_CUI — The CUI of the object of the predication
OBJECT_NAME — The preferred name of the object of the predication
OBJECT_SEMTYPE — The semantic type of the object of the predication
OBJECT_NOVELTY — The novelty of the object of the predication

`SENTENCE` table

This table contains information about individual sentences from PubMed citations and includes the following data fields:

SENTENCE_ID — Auto-generated primary key for each sentence
PMID — The PubMed identifier of the citation to which the sentence belongs
TYPE — 'ti' for the title of the citation, 'ab' for the abstract
NUMBER — The location of the sentence within the title or abstract
SENTENCE — The actual string or text of the sentence
SECTION_HEADER — Section header name of structured abstract (from Version 3.1)
NORMALIZED_SECTION_HEADER — Normalized section header name of structured abstract (from Version 3.1)

Knowledge Graph Creation Steps

Download the SemMed Database MySQL data dump file (see data sources table)
Load the MySQL database dump file into a MySQL database
Attach MySQL Database to Virtuoso via ODBC
Generate RDF Views using Attached Tables.

The Virtuoso installation architecture for achieving this task could be as depicted below.

At this stage, you have basic RDF 3-Tuple based Relations (RDF Graphs) generated from N-Tuple Relations (SQL Tables) in place that reflect the following:

Each Table is treated as a Class (Entity Type)
Each Table Record is treated as an Instance (or Individual) of the Class for that Table
Each Column Name is a Relation Name (RDF sentence Predicate)
Each Column Value is a Relation Value (RDF sentence Object)
Each Column is also associated with a Class via an rdfs:domain Relation
Each Column Value is associate with a Class via an rdfs:range Relation

Based on the above, you can generate a domain-specific Knowledge Graph to, for example, associate genes with diseases.

Generate of additional triples using SPARQL INSERT
Run test queries to verify the state and quality of the resulting Knowledge Graph

Example Queries

Genes – Live Link

PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT *
FROM <urn:semmed:rdf:data:new3>
FROM <urn:semmed:rdf:data:genes3> 
WHERE
  {
    ?s a          semmed-gene:this ; 
       owl:sameAs ?o 
  }
LIMIT 100

Genes associated with Asthma – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this .
  }

Genes associated with Asthma, but not COPD 1 – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this .
    FILTER ( NOT EXISTS { ?s associated_with:this copd:this } 
           )
      }

Genes associated with Asthma, but not COPD 2 (using `GROUP_CONCAT`) – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT                                        ?s 
                ( GROUP_CONCAT ( ?geneName , "\n" ) AS ?name )
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?geneName ;
       associated_with:this asthma:this .
    FILTER ( NOT EXISTS { ?s associated_with:this copd:this } 
           )
  }

Genes associated with Asthma and COPD – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this , 
                            copd:this .
  }

Entity Reconciliation Demo Sequence Example

Using B3GNT2 as the industry standard identifier of interest.

Here’s LOD, as a superior resolver:

B3GNT2 Identifier lookup using Text Search
B3GNT2 Text Search results filtered by type which reveals Bio2RDF effect (this Data Space was the original resolver for Biological Identifiers)
Bio2RDF entry description
Bio2RDF URI
Identifiers.org URI

Semantic MEDLINE (SemMed) Knowledge Graph Generation from MySQL Database

SemMed Database

Relational Data Details

`ENTITY` table

`PREDICATION` table

`SENTENCE` table

Knowledge Graph Creation Steps

Example Queries

Genes – Live Link

Genes associated with Asthma – [Live Link]

Genes associated with Asthma, but not COPD 1 – [Live Link]

Genes associated with Asthma, but not COPD 2 (using `GROUP_CONCAT`) – [Live Link]

Genes associated with Asthma and COPD – [Live Link]

Entity Reconciliation Demo Sequence Example

Alternative route, still leveraging LOD:

Related

Semantic MEDLINE (SemMed) Knowledge Graph Generation from MySQL Database

SemMed Database

Relational Data Details

ENTITY table

PREDICATION table

SENTENCE table

Knowledge Graph Creation Steps

Example Queries

Genes – Live Link

Genes associated with Asthma – [Live Link]

Genes associated with Asthma, but not COPD 1 – [Live Link]

Genes associated with Asthma, but not COPD 2 (using GROUP_CONCAT) – [Live Link]

Genes associated with Asthma and COPD – [Live Link]

Entity Reconciliation Demo Sequence Example

Alternative route, still leveraging LOD:

Related

`ENTITY` table

`PREDICATION` table

`SENTENCE` table

Genes associated with Asthma, but not COPD 2 (using `GROUP_CONCAT`) – [Live Link]