Virtuoso and h2oGPT: talking RDF

What

h2oGPT is a private, on-premises, open-source chat/GPT application incorporating model management, a vector database, a langchain-based document-ingestion/RAG workflow, and a web UI, all in one.

Its vectorDB is segmented both by user and by “collection”, the latter typically used to separate disparate fields of interest.

It has a wide variety of “tokenisers” by which many different file formats can be ingested, including PDF, Excel, plain text, HTML, CSV, HDF, JSON, and more.

Why

To explore and demonstrate data interoperability between Virtuoso and h2oGPT, especially for RDF in its various formats.

How

Goal: to demonstrate the langchain workflow, progressing from wrong to correct answers using a mixture of RDF and HTML resources about the OpenLink company and its products.

Install h2oGPT either via Docker or as a regular native Python app (preferred).
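
For the native route, the short version looks something like the following; a minimal sketch, in which the model argument is illustrative (the session below happens to use a Mistral model), so check the h2oGPT README for current requirements and options:

git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt
# model name is illustrative; pick any model h2oGPT supports
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat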

At the end of the process there’ll be an HTTP listener on http://0.0.0.0:7860/ to use:


Overview of the h2oGPT primary interaction tab. Note the ask-or-ingest input area and buttons for submit, ingest and upload.

Document Selection - an overview of what collections and documents have been loaded. Currently empty.

An initial attempt at answering the question “What products does OpenLink Software offer?”. Note that, by default, the Mistral model gets this completely wrong, talking about OpenLink Financial (a company that no longer exists independently) and even conflating our Virtuoso product with them.

Data sources can be ingested via the Upload button or by entering a URL and hitting “Ingest”. To bulk-import many documents by URL, we create a file with the extension .urls.

We attempt to improve the vectorDB’s knowledge of the real OpenLink Software by feeding it a combination of our webpages and the URLs of the SPARQL queries that populate the data on them, gathered using the OSDS (Sniffer) browser extension:

The finished list of 14 URLs to be ingested - save as opl-h2o.urls so h2oGPT understands it:

https://www.openlinksw.com/
https://www.openlinksw.com/c/LDvkFRUCT
https://www.openlinksw.com/c/54sTf4jxvW
https://www.openlinksw.com/c/8hGE5BTPZ7
https://uda.openlinksw.com/
https://uda.openlinksw.com/c/4Pg2oWxDRs
https://uda.openlinksw.com/c/5eJ7HXEEbR
https://virtuoso.openlinksw.com/
https://virtuoso.openlinksw.com/c/8ZXKtjHq8V
https://virtuoso.openlinksw.com/c/AYbsooMty7
https://virtuoso.openlinksw.com/c/7NnjUpXsCT
https://virtuoso.openlinksw.com/c/5uijiweVMf
https://virtuoso.openlinksw.com/c/6ATbPwtmQf
https://virtuoso.openlinksw.com/c/4HVuBTvVC5

Back in h2oGPT, we use the Upload button to send the file:

Once complete, we see the “Doc Counts” figure has increased to 14, matching the number of URLs in our opl-h2o.urls file, and we re-run the same query about our products. This time, it gets it right, and talks something like plausible sense:

Finally, scrolling down a little, we can see which sources are behind this answer - in this case, one of the SPARQL queries contributed to it with a score of 0.21:

A screencast video of the whole process can be seen over on YouTube: h2oGPT screencast video

A Dash of Virtuoso Magic

What

Goal: finally, we demonstrate using a SPARQL query to “seed” h2oGPT with a list of URLs for further retrieval.

Why

The OpenLink websites host a wealth of data with precise assertions about HOWTO documents, FAQs, product information and more.

h2oGPT’s langchain implementation (using ChromaDB as the vector-database backend) depends on the quality of the input documents and on its tokenisers’ ability to extract sentences - specifically, we wish to avoid conflating keywords and phrases through mere proximity in apparently adjacent sentences.

By isolating individual Questions, Answers and HOWTO Steps with a URI for each, we bring greater clarity to the documents stored in h2oGPT and therefore increase the potential quality of answer given within the OpenLink web realm.

(Example: see https://data.openlinksw.com/oplweb/faq/general/A1#this)

h2oGPT accepts a filename.urls format: a plain-text list of URLs to retrieve, one per line.

Virtuoso’s SPARQL engine can be used to generate a list of URIs of individual entities, each of which is dereferenceable to reveal the question, answer, or HOWTO-step text. This list can be preserved as a resource in DAV, where it will be short-term cached (regenerated automatically every 50 minutes for 2 days).

Putting these two together gives maximum precision: the crawl is limited to a finite, specific set of dynamically generated and maintained URLs produced by the SPARQL query. While h2oGPT can be configured to crawl webpages itself, that crawling is tricky to control and likely to follow too many links out of context.

How

First, we configure the SPARQL endpoint to allow saving query results in DAV.

Note: this requires that the VAL package not be installed, as its SPARQL endpoint differs.

  • For full instructions, see the “save to dav” link in the SPARQL endpoint

  • Edit the SPARQL user to enable DAV access and create a home directory

  • Create a DAV collection /DAV/home/SPARQL/saved-sparql-results/ of type Dynamic Resources

  • On reloading the /sparql endpoint, the options for saving the results to DAV should now be visible

SPARQL Query

The SPARQL query simply retrieves a list of distinct entity URIs present in a grab-bag of RDF/Turtle documents hosted on our websites. A handful of pragmas (defines) at the start will tell the SPARQL engine to invoke the built-in Sponger and retrieve the Turtle resources.

# Sponger pragmas: get:soft fetches each FROM graph if it is not
# already loaded; grab-limit and grab-depth bound the crawl;
# grab-all extends retrieval to all IRIs encountered
define get:soft "soft"
define input:grab-limit 200
define input:grab-depth 5
define input:grab-all "yes"

# pad each URI with newlines so every entry lands on its own line
select distinct concat("\n", ?url, "\n")
from <https://www.openlinksw.com/data/turtle/general/openlink-general-product-faq.ttl>
from <https://www.openlinksw.com/data/turtle/general/openlink-value-propositions.ttl>
from <https://www.openlinksw.com/data/turtle/general/virtuoso-howto-install-guides.ttl>
from <https://www.openlinksw.com/data/turtle/general/virtuoso-howto-page-descriptions.ttl>
from <https://www.openlinksw.com/data/turtle/general/virtuoso-general-faq-2023.ttl>
where {
  ?url ?p ?o
}
limit 1000

Query Output Format & Destination

The output should be in CSV format (the closest to plain text), and we specify a location in DAV, /DAV/home/SPARQL/saved-sparql-results/misc-website-stuff-for-crawl.urls, in which it should be stored.
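
If you prefer the command line to the endpoint’s HTML form, the same CSV can be fetched directly; a sketch assuming the query above is saved as entity-urls.rq and the endpoint listens on localhost:8889 (this saves locally, bypassing the DAV caching described above, but is handy for testing):

curl -s "http://localhost:8889/sparql" \
  --data-urlencode "query@entity-urls.rq" \
  -H "Accept: text/csv" \
  -o misc-website-stuff-for-crawl.urls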

Ingestion

Using a macOS client, in the Finder we simply mount the DAV resource http://localhost:8889/DAV/home/ as a network volume. In h2oGPT, click the Upload button and navigate to the resource misc-website-stuff-for-crawl.urls in the mounted DAV volume.

Over the course of several minutes, h2oGPT will attempt to retrieve all the URIs listed and the number of documents present in the current collection will increase dramatically.

New h2o session

We can now ask h2oGPT the question “On what operating systems can Virtuoso be installed?”.

Where previously the answer was a rather vague:

Based on the context provided, Virtuoso can be installed on both Windows and macOS operating systems. The instructions given in the context detail the steps to install Virtuoso on Windows, but there is also a mention of instructions for macOS installation further in the text. Therefore, Virtuoso is compatible with at least these two operating systems.

it now becomes the more informative:

Virtuoso, based on the context provided in the document, can be installed on both Windows and macOS operating systems. The context details the steps to install Virtuoso on Windows, but there is also mention of instructions for macOS installation further in the text. Therefore, Virtuoso is compatible with at least these two operating systems.

For Windows, the installation process involves downloading the latest operating system and Virtuoso updates, obtaining a license file (if not already possessed), and selecting the components to install, which can be either the Full version (installing both Client and Server components) or Client version (installing Client components only).

For macOS, the installation process is mentioned but not detailed in the context provided. However, it is clear that instructions for macOS installation exist within the document.

In summary, Virtuoso can be installed on Windows and macOS operating systems, with separate installation processes for each platform.

Completing the Circle

We have already seen how to use data collated by Virtuoso as a seed for h2oGPT. As a final challenge, we use h2oGPT to generate some JSON-LD and automatically ingest it back into Virtuoso. The key concept here is Virtuoso’s rdf_sink, a DET (a “magic” DAV collection): any file inserted into that collection is automatically parsed and its content ingested into the RDF quadstore.

For completeness, since the h2oGPT UI seems not to have a way of saving its output to a file, we will also use the llm(1) utility to interact with the h2oGPT instance via the command line.

Preparing Virtuoso

Nothing is required except a running Virtuoso instance!

Preparing llm(1)

Install llm by either

  • macOS/homebrew: brew install llm
  • linux/other: pip install llm

Create a file

  • macOS: ~/Library/Application Support/io.datasette.llm/extra-openai-models.yaml
  • linux: ~/.config/io.datasette.llm/extra-openai-models.yaml

with these contents:

- model_id: h2oGPT
  model_name: "h2oGPT"
  aliases: ["h2o"]
  api_base: http://localhost:5000/v1

You can test this by running llm -m h2o "What are you?" or similar prompt text.
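
A quick sanity check that the OpenAI-compatible server itself is up; this assumes h2oGPT exposes the standard /v1/models route:

curl -s http://localhost:5000/v1/models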

Preparing the client

macOS

Using the Finder, mount your Virtuoso instance’s DAV root home collection, e.g., http://localhost:8889/DAV/home/.
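
The same mount can be made from a terminal with macOS’s built-in mount_webdav; a sketch, with an arbitrary mount point:

mkdir -p /Volumes/davhome
# -i prompts interactively for the DAV credentials
mount_webdav -i http://localhost:8889/DAV/home/ /Volumes/davhome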

Linux

Use davfs to mount the Virtuoso instance’s DAV root home collection, e.g., http://localhost:8889/DAV/home/.

zsh% sudo mount.davfs http://localhost:8889/DAV/home/ dav/
Please enter the username to authenticate with server
http://localhost:8889/DAV/home/ or hit enter for none.
  Username: dav
Please enter the password to authenticate user dav with server
http://localhost:8889/DAV/home/ or hit enter for none.
  Password:  

Using h2oGPT to generate JSON-LD

The following command retrieves a webpage and passes it through llm(1) to h2oGPT with a prompt to generate JSON-LD of the HOWTO steps involved in installing Virtuoso:

curl -s "https://wikis.openlinksw.com/VirtuosoWikiWeb/Virtuoso8InstallWin64" | \
  llm -m h2o "summarize this, using compact JSON-LD output format, listing the HOWTO steps required to install Virtuoso. No extra commentary required." | \
  jq . | \
  tee virtuoso-81-windows-install.jsonld

(We use jq(1) to validate and prettify the JSON-LD.)

After a while, JSON-LD should be seen and the file virtuoso-81-windows-install.jsonld created.
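
Since LLM output can easily be malformed, it is worth confirming the file parses as RDF before ingesting it; one option, assuming Apache Jena’s riot tool is installed:

riot --validate virtuoso-81-windows-install.jsonld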

  • macOS: Using the Finder, simply copy the file virtuoso-81-windows-install.jsonld into the location /DAV/home/dav/rdf_sink.
  • linux: cp virtuoso-81-windows-install.jsonld ~/dav/dav/rdf_sink/ (with the /DAV/home/ collection mounted at ~/dav, as above)
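
Alternatively, skip the mount entirely and PUT the file straight into the rdf_sink over WebDAV; a sketch using curl, where the dav:dav credentials are placeholders:

# credentials are placeholders; substitute your own DAV user and password
curl -T virtuoso-81-windows-install.jsonld \
  -u dav:dav \
  http://localhost:8889/DAV/home/dav/rdf_sink/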

Testing the Results

The following SPARQL query will show the ingested JSON-LD summary:

SELECT DISTINCT * 
FROM <http://localhost:8889/DAV/home/dav/rdf_sink/virtuoso-81-windows-install.jsonld> 
{
  ?s ?p ?o
} 
ORDER BY ?s ?p ?o
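
A quick variant of this check can also be run from the shell via isql; a sketch counting the ingested triples, assuming default dba credentials and the standard 1111 SQL port:

isql 1111 dba dba exec="SPARQL SELECT COUNT(*) FROM <http://localhost:8889/DAV/home/dav/rdf_sink/virtuoso-81-windows-install.jsonld> { ?s ?p ?o };"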


Update: Open-weight H2O-Danube3-4B

H2O have recently released the open-weight H2O-Danube3 model in two sizes, 4B and 500M, alongside iOS and Android apps, allowing a small model to be run on an iPhone.

I loaded the model from Hugging Face into my previous h2oGPT installation and ran the iOS app on the Mac desktop.

In terms of speed, the desktop cranks out 8-9 tokens/s and feels quite responsive even when the machine is under load. Similarly, the iPhone app is very responsive, possibly enhanced by the haptic feedback buzzing each time.

In terms of accuracy, their quoted “80%” seems not to extend to our areas of interest: two initial prompts, for a SPARQL query and a JSON-LD summary, both fail.

Both screenshots show h2oGPT in the background with the desktop app in the foreground:

  1. Using DBpedia, write a SPARQL query listing movies by Spike Lee encoded in a clickable link.

  2. Describe Spike Lee in JSON-LD using terms from Schema.org.

Could do better.

What happens when you request RDF-Turtle rather than JSON-LD?

Abject FAIL:

Explanation:
Introduction:
Context: The text provides information about Spike Lee's work, including his films, awards, and notable achievements.
Object: Spike Lee is described as an American filmmaker, director, and actor.
Object: Spike Lee is described as an American filmmaker, director, and actor.
Object: Spike Lee is described as an American filmmaker, director, and actor.
... [the same line repeats]

These are the prompts, for future reference:

  1. Using the DBpedia endpoint, list all movies by Spike Lee and encode the sparql query in a clickable link.
  2. Describe Spike Lee in JSON-LD using terms from Schema.org.
  3. Describe Spike Lee in RDF-Turtle using terms from Schema.org.

/cc @hwilliams @danielhm @imitko @smalinin