Semantic MEDLINE (SemMed) Knowledge Graph Generation from MySQL Database

SemMed Database

The SemMed Database is a biomedical medical database deployed using the MySQL Database Management System.

This post is about generating a Knowledge Graph that manifests as a Semantic Web from this database by leveraging the multi-model database management system capabilities of Virtuoso.

Relational Data Details

ENTITY table

This table contains entity information whose data come from ENTITY output generated using full fielded output. It includes the following data fields:

  • ENTITY_ID — Auto-generated primary key for each unique entity
  • SENTENCE_ID — The foreign key to SENTENCE table
  • CUI — The CUI of the entity
  • NAME — The preferred name of the entity
  • TYPE — The semantic type of the entity
  • GENE_ID — The EntrezGene ID of the entity
  • GENE_NAME — The EntrezGene name of the entity
  • TEXT — The text in the utterance that maps to the entity
  • START_INDEX — The first character position (in document) of the text denoting the entity
  • END_INDEX — The last character position (in document) of the text denoting the entity
  • SCORE — The confidence score

PREDICATION table

Each record in this table identifies a unique predication. The data fields are as follows:

  • PREDICATION_ID — Auto-generated primary key for each unique predication
  • SENTENCE_ID — Foreign key to the SENTENCE table
  • PMID — The PubMed identifier of the citation to which the predication belongs
  • PREDICATE — The string representation of each predicate (for example TREATS, PROCESS_OF)
  • SUBJECT_CUI — The CUI of the subject of the predication
  • SUBJECT_NAME — The preferred name of the subject of the predication
  • SUBJECT_SEMTYPE — The semantic type of the subject of the predication
  • SUBJECT_NOVELTY — The novelty of the subject of the predication
  • OBJECT_CUI — The CUI of the object of the predication
  • OBJECT_NAME — The preferred name of the object of the predication
  • OBJECT_SEMTYPE — The semantic type of the object of the predication
  • OBJECT_NOVELTY — The novelty of the object of the predication

SENTENCE table

This table contains information about individual sentences from PubMed citations and includes the following data fields:

  • SENTENCE_ID — Auto-generated primary key for each sentence
  • PMID — The PubMed identifier of the citation to which the sentence belongs
  • TYPE'ti' for the title of the citation, 'ab' for the abstract
  • NUMBER — The location of the sentence within the title or abstract
  • SENTENCE — The actual string or text of the sentence
  • SECTION_HEADER — Section header name of structured abstract (from Version 3.1)
  • NORMALIZED_SECTION_HEADER — Normalized section header name of structured abstract (from Version 3.1)

Knowledge Graph Creation Steps

  1. Download the SemMed Database MySQL data dump file (see data sources table)

  2. Load the MySQL database dump file into a MySQL database

  3. Attach MySQL Database to Virtuoso via ODBC

  4. Generate RDF Views using Attached Tables.

The Virtuoso installation architecture for achieving this task could be as depicted below.

At this stage, you have basic RDF 3-Tuple based Relations (RDF Graphs) generated from N-Tuple Relations (SQL Tables) in place that reflect the following:

  1. Each Table is treated as a Class (Entity Type)
  2. Each Table Record is treated as an Instance (or Individual) of the Class for that Table
  3. Each Column Name is a Relation Name (RDF sentence Predicate)
  4. Each Column Value is a Relation Value (RDF sentence Object)
  5. Each Column is also associated with a Class via an rdfs:domain Relation
  6. Each Column Value is associate with a Class via an rdfs:range Relation

Based on the above, you can generate a domain-specific Knowledge Graph to, for example, associate genes with diseases.

  1. Generate of additional triples using SPARQL INSERT

  2. Run test queries to verify the state and quality of the resulting Knowledge Graph

Example Queries

Genes – Live Link

PREFIX semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT *
FROM <urn:semmed:rdf:data:new3>
FROM <urn:semmed:rdf:data:genes3> 
WHERE
  {
    ?s a          semmed-gene:this ; 
       owl:sameAs ?o 
  }
LIMIT 100

Genes associated with Asthma – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this .
  } 

Genes associated with Asthma, but not COPD 1 – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this .
    FILTER ( NOT EXISTS { ?s associated_with:this copd:this } 
           )
      } 

Genes associated with Asthma, but not COPD 2 (using GROUP_CONCAT) – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT                                        ?s 
                ( GROUP_CONCAT ( ?geneName , "\n" ) AS ?name )
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?geneName ;
       associated_with:this asthma:this .
    FILTER ( NOT EXISTS { ?s associated_with:this copd:this } 
           )
  } 

Genes associated with Asthma and COPD – [Live Link]

PREFIX associated_with: <http://semmed.demo.openlinksw.com/terms/associated_with#> 
PREFIX            copd: <http://semmed.demo.openlinksw.com/about/c0024117#>
PREFIX          asthma: <http://semmed.demo.openlinksw.com/about/c0004096#>
PREFIX          semmed: <http://semmed.demo.openlinksw.com/schemas/semmed/> 
PREFIX     semmed-gene: <http://semmed.demo.openlinksw.com/schemas/Gene#>  

SELECT DISTINCT * 
FROM <urn:semmed:rdf:data:genes3> 
FROM <urn:semmed:rdf:data:new3> 
WHERE
  { 
    ?s a                    semmed-gene:this ;
       rdfs:label           ?genName ;
       associated_with:this asthma:this , 
                            copd:this .
  } 

Entity Reconciliation Demo Sequence Example

Using B3GNT2 as the industry standard identifier of interest.

  1. Gene Example

  2. NIH sameAs cross-ref

Here’s LOD, as a superior resolver:

  1. B3GNT2 Identifier lookup using Text Search

  2. B3GNT2 Text Search results filtered by type which reveals Bio2RDF effect (this Data Space was the original resolver for Biological Identifiers)

  3. Bio2RDF entry description

  4. Bio2RDF URI

  5. Identifiers.org URI

Alternative route, still leveraging LOD:

  1. Text Search Results filtered by Type

  2. Results page that includes identifiers.org URIs

Related