HowTo: Web Crawling via Bard, Bing+GPT, ChatGPT, and ChatGPT + Virtuoso Sponger

Web crawling is an exciting feature that is now being incorporated into both Google Bard and OpenAI’s ChatGPT services. In this post, I use a challenging crawl to gauge the current state of affairs, particularly regarding my ongoing thesis about the potent effects of combining the capabilities of the GPT (4 and 3.5) LLMs with SPARQL fine-tuning for exploring the Semantic Web, facilitated by Virtuoso’s Sponger Middleware module.

Before proceeding, here is some vital background information about SPARQL and the Virtuoso Sponger.

SPARQL

SPARQL is a combination of a declarative query language, an HTTP-based wire protocol, and query solution serialization formats. A SPARQL query solution can be represented as a document identified by a Web Address (URL), with content negotiated by an HTTP client.
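To make the "query solution as a document identified by a URL" idea concrete, here is a minimal Python sketch. It assumes a hypothetical endpoint at https://example.org/sparql and a placeholder query; neither comes from this post. The `query` parameter and the two result media types are defined by the SPARQL Protocol and its result formats. The requests are only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and placeholder query, used only for illustration.
ENDPOINT = "https://example.org/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"


def solution_url(endpoint: str, query: str) -> str:
    """Return the URL that identifies the solution document for `query`."""
    return endpoint + "?" + urlencode({"query": query})


def negotiated_request(endpoint: str, query: str, media_type: str) -> Request:
    """Build (but do not send) a GET request asking for one serialization."""
    return Request(solution_url(endpoint, query), headers={"Accept": media_type})


# Same URL, two different serializations requested via the Accept header.
json_req = negotiated_request(ENDPOINT, QUERY, "application/sparql-results+json")
xml_req = negotiated_request(ENDPOINT, QUERY, "application/sparql-results+xml")

print(json_req.full_url)
print(json_req.get_header("Accept"))
print(xml_req.get_header("Accept"))
```

Both requests share one URL; only the Accept header differs, which is what lets an HTTP client negotiate the format of the same query solution.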

Fine-tuning ChatGPT using SPARQL Query Templates

Here’s the template set that’s ultimately invoked for the Virtuoso Sponger crawling exercise showcased later in this post. Each line pairs a natural-language prompt with its SPARQL query, separated by a semicolon.

Count FAQs associated with knowledge graph <G>, using terms from schema.org;SELECT COUNT(*) FROM <G> WHERE { ?page a schema:FAQPage; schema:mainEntity ?question. ?question (schema:name|schema:title) ?name; (schema:acceptedAnswer|schema:suggestedAnswer) ?answer. ?answer schema:text ?answerText.}  ;
Count FAQs from <Org> associated with knowledge graph <G>, using terms from schema.org;SELECT COUNT(*) FROM <G> WHERE { ?page a schema:FAQPage; schema:mainEntity ?question. ?question (schema:name|schema:title) ?name; (schema:acceptedAnswer|schema:suggestedAnswer) ?answer. ?answer schema:text ?answerText. FILTER (CONTAINS(STR(?page),'<Org>'))} ;

This template can be loaded using prompts such as:

  • Remember the following …
  • Load the following template

Note: in my recent experiments, the second prompt style isn’t always understood.
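As a rough illustration of what happens once such a template is loaded, the Python sketch below splits one template line into its prompt and query parts and substitutes the `<G>` and `<Org>` placeholders. The substitution rules (keeping the angle brackets around the graph IRI, dropping them around the organization string) are my reading of the template, and the graph URL and organization values are made up.

```python
# One template line from above: "<prompt>;<query> ;"
TEMPLATE = (
    "Count FAQs from <Org> associated with knowledge graph <G>, "
    "using terms from schema.org;"
    "SELECT COUNT(*) FROM <G> WHERE { ?page a schema:FAQPage; "
    "schema:mainEntity ?question. "
    "?question (schema:name|schema:title) ?name; "
    "(schema:acceptedAnswer|schema:suggestedAnswer) ?answer. "
    "?answer schema:text ?answerText. "
    "FILTER (CONTAINS(STR(?page),'<Org>'))} ;"
)


def split_template(line: str) -> tuple[str, str]:
    """Split at the first semicolon: prompt on the left, query on the right."""
    prompt, query = line.split(";", 1)
    return prompt.strip(), query.strip().rstrip(";").strip()


def fill(query: str, graph: str, org: str) -> str:
    """Substitute placeholders (assumed rules, see the lead-in)."""
    return query.replace("<G>", f"<{graph}>").replace("<Org>", org)


prompt, query = split_template(TEMPLATE)
# Hypothetical values for illustration only.
filled = fill(query, graph="https://example.org/page", org="example.org")

print(prompt)
print(filled)
```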

Virtuoso Sponger

Virtuoso Sponger is a middleware layer hosted by a DBMS that allows deep integration of HTTP functionality into the production pipeline of SPARQL query solutions. This enables comprehensive data access by reference, both for constants and variables, within a SPARQL query. It can crawl the Web (including private intranets) and the Semantic Web, where links represent relationship types that resolve to machine-computable entity relationship type semantics, as part of the query solution production pipeline.

Web Page Crawling Exercise

As I have noted in many past posts about the growing number of RDF-based structured data islands on web pages, Apple has consistently led the way. Its product pages routinely embed such structured data islands.

For this exercise, I will use a page about the Apple Watch Ultra as the test case, where I aim to determine the number of embedded FAQ questions.
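For readers unfamiliar with structured data islands, the following Python sketch shows one way to count FAQ questions by hand. It assumes the island is JSON-LD inside a `<script type="application/ld+json">` element. Islands can also be written as Microdata or RDFa, which this sketch ignores. The sample HTML is made up and is not Apple's actual markup.

```python
import json
from html.parser import HTMLParser

# Made-up page fragment for illustration; it is not Apple's markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "FAQPage",
 "mainEntity": [
   {"@type": "Question", "name": "Q1?",
    "acceptedAnswer": {"@type": "Answer", "text": "A1."}},
   {"@type": "Question", "name": "Q2?",
    "acceptedAnswer": {"@type": "Answer", "text": "A2."}}
 ]}
</script>
</head><body></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_island = False
        self._buffer = []
        self.islands = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_island = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_island:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_island:
            self.islands.append("".join(self._buffer))
            self._in_island = False


def count_faq_questions(html: str) -> int:
    """Count mainEntity entries of every top-level FAQPage node."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    total = 0
    for text in parser.islands:
        doc = json.loads(text)
        nodes = doc if isinstance(doc, list) else [doc]
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") == "FAQPage":
                entities = node.get("mainEntity", [])
                if isinstance(entities, dict):
                    entities = [entities]
                total += len(entities)
    return total


print(count_faq_questions(SAMPLE_HTML))  # prints 2
```

This covers only the simplest case of a top-level FAQPage node. A generic RDF processing pipeline would not need such per-shape handling.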

What follows are the results for Bard, Bing+GPT, ChatGPT, and ChatGPT + Virtuoso Sponger.

Bard

Q: How many FAQ questions are there in the following page https://www.apple.com/shop/buy-watch/apple-watch-ultra?preSelect=false&product=MQF33LL/A&step=detail?

A:
The page you linked does not have any FAQs.

Bard Response Screenshot

Bing+GPT

Q: How many FAQ questions are there in the following page https://www.apple.com/shop/buy-watch/apple-watch-ultra?preSelect=false&product=MQF33LL/A&step=detail?

A:
I’m sorry but I couldn’t find the exact number of FAQ questions on the page you provided. However, I can tell you that there is a “Frequently Asked Questions” section on the page that contains several questions and answers about the Apple Watch Ultra¹. You can also find more information about the Apple Watch Ultra on the official Apple website⁵.

Is there anything else I can help you with?

Source: Conversation with Bing, 5/17/2023
(1) Frequently asked questions about the Apple Watch Ultra.
(2) Apple Watch - Official Apple Support.
(3) 5 Reasons the Apple Watch Ultra Isn’t a Great Choice for Most People - MUO. https://www.makeuseof.com/apple-watch-ultra-isnt-for-most-people/.
(4) I wore the Apple Watch Ultra for 6 months - Tom’s Guide.
(5) Apple Watch Ultra User Guide - Apple Support.

Bing+GPT Response Screenshot

ChatGPT + Web Plugin

Q: How many FAQ questions are there in the following page https://www.apple.com/shop/buy-watch/apple-watch-ultra?preSelect=false&product=MQF33LL/A&step=detail?

A:
The page you provided contains a total of 8 Frequently Asked Questions (FAQs). Here they are:

  1. How do I find my band size?
  2. How do I customize my Apple Watch?
  3. What’s the difference between GPS and GPS + Cellular?
  4. Which carriers support Apple Watch cellular service?
  5. Which iPhone models are compatible with Apple Watch?
  6. Is it easy to transfer my data and set up my new Apple Watch?
  7. How can I pay using Apple Card Monthly Installments?
  8. How does Apple Trade In work?

ChatGPT + Web Plugin Screenshot

ChatGPT + Virtuoso Sponger

Q: How many FAQ questions are there in the following page https://www.apple.com/shop/buy-watch/apple-watch-ultra?preSelect=false&product=MQF33LL/A&step=detail?

A:
Sure, let’s construct the SPARQL query using the template:

DEFINE get:soft "soft"
SELECT (COUNT(?question) AS ?questionCount) 
FROM <https://www.apple.com/shop/buy-watch/apple-watch-ultra?preSelect=false&product=MQF33LL/A&step=detail> 
WHERE {
  ?page a schema:FAQPage; 
  schema:mainEntity ?question. 
} 

And here is the clickable SPARQL link:

Execute the query

Virtuoso SPARQL Query Solution Page Screenshot
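A clickable SPARQL link of this kind is the query text URL-encoded into the endpoint's `query` parameter. The Python sketch below shows one way to build it. The endpoint address is an assumption, since this post does not name the Virtuoso instance used. The page URL in the FROM clause carries its own query string, so the sketch also shows why plain concatenation would break the link.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; the post does not name the Virtuoso instance it used.
ENDPOINT = "https://example.org/sparql"

PAGE = ("https://www.apple.com/shop/buy-watch/apple-watch-ultra"
        "?preSelect=false&product=MQF33LL/A&step=detail")

QUERY = f"""DEFINE get:soft "soft"
SELECT (COUNT(?question) AS ?questionCount)
FROM <{PAGE}>
WHERE {{
  ?page a schema:FAQPage;
        schema:mainEntity ?question.
}}"""


def clickable_link(endpoint: str, query: str) -> str:
    """Percent-encode the whole query so its '?' and '&' stay inside it."""
    return endpoint + "?" + urlencode({"query": query})


link = clickable_link(ENDPOINT, QUERY)

# Decoding the link gives back exactly one parameter holding the full query.
recovered = parse_qs(urlsplit(link).query)["query"][0]
print(recovered == QUERY)

# Plain concatenation lets the page URL's own parameters leak out.
naive = ENDPOINT + "?query=" + QUERY
naive_params = parse_qs(naive.split("?", 1)[1])
print(sorted(naive_params))
```

In the naive version, `product` and `step` are parsed as separate request parameters, so the endpoint would receive a truncated query.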


In addition, here is a Faceted Search & Browsing Page link that shows details of the FAQ Questions used in this exercise.

Virtuoso Entity Description Page for Apple Watch Ultra

Virtuoso Entity Description Page for Apple Watch Ultra Screenshot

This page comprises entries for each of the ten FAQ questions.

Tools Used

Virtuoso Sponger Middleware

Combined Multi-Model Database & HTTP Server Technology

OpenLink Structured Data Sniffer – Multi-Browser Extension

Related