German umlauts with SPARQL queries

Hi guys,

First of all, thank you for reading my post.

I am new to Virtuoso and at a loss. I have an instance running on the Virtuoso 7.2.5 open source Docker image prepared by tenforce (Docker Hub).

The triples loaded to my SPARQL endpoint were encoded with UTF-8. I have set the parameters for HTTPServer (Charset=UTF-8) and Client (SQL_UTF8_EXECS=1) accordingly.

However, when filtering for strings with umlaut characters (e.g., München), I get nothing. I found out that the web client is transforming ü to Ã¼. When I change the filter string to MÃ¼nchen, I get the correct results, displayed as München.
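The symptom described here is consistent with UTF-8 bytes being read as ISO-8859-1 somewhere between the browser and the query engine. The following Python sketch only reproduces that kind of garbling outside Virtuoso; the function names are made up for illustration, and it does not show where in Virtuoso the conversion actually happens.

```python
# Illustration only: what happens when UTF-8 bytes are decoded as
# ISO-8859-1 (Latin-1). This is not Virtuoso code.

def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Reverse the mistake: re-encode as Latin-1, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misread_utf8_as_latin1("München")
print(garbled)                   # "ü" now shows up as two characters
print(repair_mojibake(garbled))  # round-trips back to the original string
```

Each non-ASCII character becomes two characters after the wrong decode, which is why a filter string typed in the garbled form can line up with the stored bytes.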

I have also tried restarting my image and reloading my data, to no avail.

Any tip would be greatly appreciated :slight_smile:

We recommend trying the OpenLink Virtuoso Open Source Docker image to see whether the problem occurs with that as well.

Should it occur, please provide the steps you use to insert the data and then query it, so that we can attempt to recreate …

Thanks for the reply.

I just tried the Docker image you suggested and I see the same problem. To replicate:

I have an .nt file (UTF-8) containing this line:

I registered this file in the DB.DBA.load_list table and loaded it using rdf_loader_run(). I then queried it through the SPARQL web client, which returned nothing:

PREFIX dct: <http://purl.org/dc/terms/>
       
SELECT DISTINCT ?item ?title
WHERE {
 ?item dct:title ?title FILTER regex(str(?title), "Räuber", "i") .
}
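To take the web form out of the picture, the same query can be sent with the umlaut explicitly percent-encoded as UTF-8. This is a minimal sketch: the endpoint URL is an assumption (adjust host and port to your instance), and `build_request_url` is a hypothetical helper, not part of any Virtuoso client.

```python
from urllib.parse import urlencode

# Assumed endpoint; replace with the address of your own instance.
ENDPOINT = "http://localhost:8890/sparql"

QUERY = """PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?item ?title
WHERE { ?item dct:title ?title FILTER regex(str(?title), "Räuber", "i") . }"""


def build_request_url(endpoint: str, query: str) -> str:
    """Return a GET URL whose query parameter is percent-encoded as UTF-8."""
    # urlencode encodes str values as UTF-8 by default before quoting them.
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

If the request built this way behaves the same as the web form, the form's own encoding is probably not the cause.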

However, searching for R.*uber returns the triple, with the literals displayed correctly.
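One possible reading of this observation, offered as an assumption and not something confirmed in this thread, is that only the non-ASCII part of the pattern gets damaged, so an ASCII-only pattern still matches. The sketch below imitates that with Python's `re` module; it does not use Virtuoso's regex engine.

```python
import re

title = "Die Räuber"  # the stored literal, assumed to be intact

# Assumption: the pattern's UTF-8 bytes were read as Latin-1 on the way in.
garbled_pattern = "Räuber".encode("utf-8").decode("latin-1")

garbled_hit = re.search(garbled_pattern, title, re.IGNORECASE)
ascii_hit = re.search("R.*uber", title, re.IGNORECASE)
intact_hit = re.search("Räuber", title, re.IGNORECASE)

print("garbled pattern matches:", garbled_hit is not None)
print("ASCII-only pattern matches:", ascii_hit is not None)
print("intact pattern matches:", intact_hit is not None)
```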

I checked the Docker image’s locale setting; it says ‘POSIX’. Would this matter?

Checking current_charset() returns:

SQL> select current_charset();
current_charset
VARCHAR 
_____________________________________________________________________
ISO-8859-1

I have trouble understanding the documentation for define_charset().

Also, there seems to be an open GitHub issue from 2012 that is similar to mine: "UTF-8 Encoding Problem in SPARQL Select Results XML and RDF/XML" (Issue #17, openlink/virtuoso-opensource).

I don’t understand your triple definition, i.e., <IBDPENVFMNA4TNCNMJ772MB72RITSYMK> <dct*:title> "Die Räuber"@de . In particular, I don’t understand the predicate <dct*:title>. Why the “*” in the name, and why the <...> brackets? In your query you reference the predicate as dct:title, using the prefix definition PREFIX dct: <http://purl.org/dc/terms/>, which expands the predicate to http://purl.org/dc/terms/title, whereas you have loaded it as the IRI dct*:title.

Hi Hugh,

Sorry about the confusion. I had to post that unintelligible triple because I couldn’t submit my original post with more than one URL. I’ve edited my original post and replaced the triple with a screenshot.

I’ve also attempted a SPARQL INSERT and deliberately introduced a syntax error by omitting the prefix. Notice that in the error message, the original umlaut character was transformed.

This seems to be a problem with the str() function in your query, which is not handling the umlaut character correctly. We are going to look into it, but str() is not needed in this case, as the following works without it:

SQL> SPARQL PREFIX dct: <http://purl.org/dc/terms/> SELECT DISTINCT ?item ?title FROM <mt555XAny> WHERE { ?item dct:title ?title FILTER regex(?title, "Räuber", "i") . };
item                                                                              title
LONG VARCHAR                                                                      LONG VARCHAR
_______________________________________________________________________________

IBDPENVFMNA4TNCNMJ772MB72RITSYMK                                                  Die Räuber

1 Rows. -- 1 msec.
SQL>

Thanks a lot @hwilliams! Without str() it works! :clap: