Errors when loading ChEMBL 30.0 dataset into Virtuoso open source edition v.7.2.7 running on Microsoft Windows Server 2016

When trying to bulk-load the ChEMBL 30.0 RDF dataset (https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/30.0/) into an instance of Virtuoso open source edition v.7.2.7 (the latest released version of the open source edition as of now) running on Microsoft Windows Server 2016, the following three files fail to load due to errors:

  1. chembl_30.0_activity.ttl.gz: 37000 [Vectorized Turtle loader] TURTLE RDF loader, line 63108355: SP029: TURTLE RDF loader, line 63108355: Unterminated short double-quoted string at
  2. chembl_30.0_assay.ttl.gz: 37000 [Vectorized Turtle loader] TURTLE RDF loader, line 53084351: SP029: TURTLE RDF loader, line 53084351: syntax error
  3. chembl_30.0_molecule.ttl.gz: 37000 [Vectorized Turtle loader] TURTLE RDF loader, line 40695228: SP029: TURTLE RDF loader, line 40695228: syntax error

The same files loaded without any errors on the same machine into Virtuoso open source edition “Version 07.20.3217-threads for Win64 as of Apr 25 2016”.

Unzipping the above files and navigating to the reported error locations does not reveal the cause of the problem. For example, all statements in the vicinity of line 53084351 in chembl_30.0_assay.ttl appear to be correct Turtle (I have added line numbers to each source line; the reported offending line is marked with *):

53084339 chembl_activity:CHEMBL_ACT_17887045 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084340
53084341 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887046 .
53084342
53084343 chembl_activity:CHEMBL_ACT_17887046 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084344
53084345 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887047 .
53084346
53084347 chembl_activity:CHEMBL_ACT_17887047 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084348
53084349 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887048 .
53084350
*53084351 chembl_activity:CHEMBL_ACT_17887048 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084352
53084353 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887049 .
53084354
53084355 chembl_activity:CHEMBL_ACT_17887049 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084356
53084357 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887050 .
53084358
53084359 chembl_activity:CHEMBL_ACT_17887050 cco:hasAssay chembl_assay:CHEMBL3991643 .
53084360
53084361 chembl_assay:CHEMBL3991643 cco:hasActivity chembl_activity:CHEMBL_ACT_17887051 .
53084362
53084363 chembl_activity:CHEMBL_ACT_17887051 cco:hasAssay chembl_assay:CHEMBL3991643 .
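
The inspection described above can also be done without fully unzipping the file. The following is a minimal Python sketch, not part of Virtuoso; the function name `show_context` is hypothetical. It assumes that the loader counts lines by newline characters, which may not match Virtuoso's own line counting exactly. Lines are returned as raw bytes so that stray control characters or invalid UTF-8 near the reported line remain visible.

```python
import gzip


def show_context(path, target_line, radius=3):
    """Return (line_number, raw_bytes) pairs around target_line (1-based).

    The file is streamed, so a multi-gigabyte .ttl.gz is never held in memory.
    """
    first = max(1, target_line - radius)
    last = target_line + radius
    context = []
    # Binary mode: nothing is decoded, so odd bytes show up in repr().
    with gzip.open(path, "rb") as stream:
        for number, raw in enumerate(stream, start=1):
            if number > last:
                break
            if number >= first:
                context.append((number, raw))
    return context


# Example use (path and line number taken from the error report above):
# for number, raw in show_context("chembl_30.0_assay.ttl.gz", 53084351):
#     print(number, repr(raw))
```

Printing `repr(raw)` rather than the decoded text is deliberate: a line that looks like correct Turtle in an editor may still contain an unexpected byte.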

Should I also report this issue on GitHub (openlink/virtuoso-opensource)?

Those 3 Chembl 30 datasets load fine for me on my Linux Virtuoso Open Source instance:

SQL> ld_dir ('chembl', '*.ttl.gz', 'urn:chembl:data');

Done. -- 1 msec.
SQL> select * from load_list;
ll_file                                                                           ll_graph                                                                          ll_state    ll_started           ll_done              ll_host     ll_work_time  ll_error
VARCHAR NOT NULL                                                                  VARCHAR                                                                           INTEGER     TIMESTAMP            TIMESTAMP            INTEGER     INTEGER     VARCHAR
_______________________________________________________________________________

chembl/chembl_30.0_activity.ttl.gz                                                urn:chembl:data                                                                   0           NULL                 NULL                 NULL        NULL        NULL
chembl/chembl_30.0_assay.ttl.gz                                                   urn:chembl:data                                                                   0           NULL                 NULL                 NULL        NULL        NULL
chembl/chembl_30.0_molecule.ttl.gz                                                urn:chembl:data                                                                   0           NULL                 NULL                 NULL        NULL        NULL

3 Rows. -- 0 msec.
SQL> rdf_loader_run();

Done. -- 5139282 msec.
SQL> select * from load_list;
ll_file                                                                           ll_graph                                                                          ll_state    ll_started           ll_done              ll_host     ll_work_time  ll_error
VARCHAR NOT NULL                                                                  VARCHAR                                                                           INTEGER     TIMESTAMP            TIMESTAMP            INTEGER     INTEGER     VARCHAR
_______________________________________________________________________________

chembl/chembl_30.0_activity.ttl.gz                                                urn:chembl:data                                                                   2           2022.6.7 12:8.42 138787000  2022.6.7 13:34.21 419399000  0           NULL        NULL
chembl/chembl_30.0_assay.ttl.gz                                                   urn:chembl:data                                                                   2           2022.6.7 12:8.49 997281000  2022.6.7 12:25.39 775072000  0           NULL        NULL
chembl/chembl_30.0_molecule.ttl.gz                                                urn:chembl:data                                                                   2           2022.6.7 12:8.55 785654000  2022.6.7 13:23.33 4816000  0           NULL        NULL

3 Rows. -- 1 msec.
SQL> sparql select count(*) from <urn:chembl:data> where {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

604067485

1 Rows. -- 5249 msec.
SQL> status('');
REPORT
VARCHAR
_______________________________________________________________________________

OpenLink Virtuoso  Server
Version 07.20.3234-pthreads for Linux as of May 23 2022 
Started on: 2022-06-07 12:02 GMT+0

Since you indicate you are running on Windows, I will have to repeat the test against the latest Windows Virtuoso Open Source build, in case the problem is Windows-specific …

I installed the Windows Virtuoso binaries from Virtuoso_OpenSource_Server_7.2.x64.exe (file version 7.2.7.0). I suspect this error may be caused by a bug in the TTL parser that results in some kind of buffer overrun condition, which, in turn, results in the corruption of data read from the source files and false ‘syntax error’ reporting. These types of errors depend on many factors (memory alignment, etc.) and are platform- and compiler-specific. As I already mentioned, the same files were imported without any errors on the same machine into a much earlier version of Virtuoso built several years ago.

BTW, did you run
select * from db.dba.load_list where ll_error is not null;
?

I only loaded the 3 dataset files you indicated having problems with to limit resources required and time to load, etc., so the WHERE clause in the load_list table query was not needed to filter any errors…

Were you loading all the datasets? If this is a memory-buffer-overrun related issue, then perhaps all need to be loaded, in which case I will need to assess the dataset size and make sure I have enough memory, as my test instance had just about enough memory allocated for the load. But then those 3 datasets seem to be the largest and main ones?

Just to make sure: in your test, the values in the ll_error column were all NULL, right? In my tests, I tried importing these files both by themselves and as part of a bigger import (the entire ChEMBL + a bunch of other stuff), and these three files were always imported partially but never completely. (It’s my understanding that, when the loader encounters an error while loading a file, it stops processing that file and does not try to recover and continue.) With the old version (Version 07.20.3217-threads for Win64 as of Apr 25 2016), no problems were encountered when importing the entire ChEMBL 30 + other very large datasets (a total of 6.7 billion triples) on the same machine and with the same ini file settings.

I get the same error you encounter testing with the latest Windows Virtuoso 7.2 Open Source build:

SQL> select * from load_list;
ll_file                                                                           ll_graph                                                                          ll_state    ll_started           ll_done              ll_host     ll_work_time  ll_error
VARCHAR NOT NULL                                                                  VARCHAR                                                                           INTEGER     TIMESTAMP            TIMESTAMP            INTEGER     INTEGER     VARCHAR
_______________________________________________________________________________

chembl/chembl_30.0_activity.ttl.gz                                                urn:chembl:data                                                                   2           2022.6.7 21:58.2 656659000  2022.6.7 22:12.38 158833000  0           NULL        37000 [Vectorized Turtle loader] TURTLE RDF loader, line 63108355: SP029: TURTLE RDF loader, line 63108355: Unterminated short double-quoted string at
chembl/chembl_30.0_assay.ttl.gz                                                   urn:chembl:data                                                                   2           2022.6.7 21:58.6 969118000  2022.6.7 22:10.8 267834000  0           NULL        37000 [Vectorized Turtle loader] TURTLE RDF loader, line 53084351: SP029: TURTLE RDF loader, line 53084351: syntax error
chembl/chembl_30.0_molecule.ttl.gz                                                urn:chembl:data                                                                   2           2022.6.7 21:58.33 469199000  2022.6.7 22:11.56 596225000  0           NULL        37000 [Vectorized Turtle loader] TURTLE RDF loader, line 40695228: SP029: TURTLE RDF loader, line 40695228: syntax error

3 Rows. -- 78 msec.
SQL> sparql select count(*) from <urn:chembl:data> where {?s ?p ?o};
callret-0
INTEGER
_______________________________________________________________________________

107171652

1 Rows. -- 781 msec.
SQL> status('');
REPORT
VARCHAR
_______________________________________________________________________________

OpenLink Virtuoso  Server
Version 07.20.3234-threads for Win64 as of May 18 2022
Started on: 2022-06-07 21:56 GMT+0
...

I have reported this to development to look into and fix in the Windows build …

Has this already been reported as an issue on GitHub? I could not find it in the openlink/virtuoso-opensource issue list.

This is a really strange error affecting only the Windows binaries. Can it be related to the gz decompression (that is, reading from the compressed stream and gunzipping it on the fly)? I have not tried to unzip those files to *.ttl before loading them. What is also interesting about it is that it is reproducible, which hints it is not caused by some random memory corruption or trying to read beyond some array boundary.
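
One way to test the decompression hypothesis is to unzip the files outside Virtuoso and load the plain *.ttl files instead. Below is a minimal Python sketch under that assumption; the function name `gunzip_to` is hypothetical and not a Virtuoso facility. It streams the data in chunks, so the full file is never held in memory.

```python
import gzip
import shutil


def gunzip_to(src_gz, dst_ttl, chunk_size=1024 * 1024):
    """Decompress src_gz into dst_ttl, streaming chunk_size bytes at a time."""
    with gzip.open(src_gz, "rb") as src, open(dst_ttl, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_ttl


# Example use (file names from this thread):
# gunzip_to("chembl_30.0_assay.ttl.gz", "chembl_30.0_assay.ttl")
```

The decompressed files could then be registered with ld_dir using a '*.ttl' mask instead of '*.ttl.gz'. If the plain files load cleanly on Windows, that would point at the on-the-fly gunzip path; if they fail at the same lines, the parser itself is the more likely suspect.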

The issue has been reported on our internal issue tracking system for development to look into and fix; the public git issue system is for end users.

This is probably a build issue with the Windows open source build, as the commercial Windows build does not have the problem. Thus, you could consider compiling your own build, if you are comfortable doing so, until development can schedule time to look into this. Alternatively, if possible, you could use the prebuilt open source Linux standalone installer, the Docker container, or the macOS standalone installer.

Unfortunately, I cannot use the Linux version of Virtuoso in our application that uses the triple store, because the backend depends on a number of Windows-specific components, and hosting both the triple store and all other parts of the backend on a single Windows Server AWS instance greatly simplifies the ETL operations, support and maintenance.

However, in general, when I see errors in software written in C/C++ that ‘vanish’ when a different compiler or runtime libraries are used to build the same source code, they are usually caused by errors in that code, most typically, buffer overrun or heap memory allocation errors, and/or parallel thread synchronization errors, which are highly sensitive to the target machine architecture and OS, memory alignment, built-in data type sizes, specific implementations of C/C++ library functions and data structures, and execution timing. Only once in my entire software development career many years ago did I come across a bug in one of the C compilers I used that actually produced incorrect machine instructions from correct C source.

You mentioned that the commercial Windows build did not have this problem. Does the commercial version differ in the part that parses the compressed Turtle and inserts the quads into the quads table from the open source version? BTW, there is still a small probability that something is actually wrong with those ChEMBL ttl.gz files (for example, they actually do have Turtle syntax errors), but in this case it is not clear why those errors would only be detected by the open source Windows build.
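
The possibility that the downloaded archives themselves are damaged can at least be ruled out independently of Virtuoso. The sketch below is a hedged Python example with a hypothetical function name, `check_gzip`. It reads each stream to the end, which makes Python's gzip module verify the CRC and length stored in the gzip trailer. It does not validate Turtle syntax; it only confirms that the compressed container is intact and reports how many newline characters it contains.

```python
import gzip
import zlib


def check_gzip(path, chunk_size=1024 * 1024):
    """Read the whole gzip stream; return (ok, newline_count, error_message)."""
    newlines = 0
    try:
        with gzip.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                newlines += chunk.count(b"\n")
    except (OSError, EOFError, zlib.error) as exc:
        # BadGzipFile (an OSError subclass) covers CRC/header problems,
        # EOFError covers truncated files, zlib.error covers bad deflate data.
        return False, newlines, str(exc)
    return True, newlines, None


# Example use (file name from this thread):
# print(check_gzip("chembl_30.0_molecule.ttl.gz"))
```

If the check passes and the newline count is well beyond the reported error line, a corrupted download is unlikely to explain the failures.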

It is hard to say where the problem might be until development have the time to look into this.

The Virtuoso 8 commercial code does differ from the Virtuoso 7 open source code.

Have you tried making a Windows Virtuoso open source build yourself?

Hi Hugh –
I am wondering if the developers have been able to look into this issue.

Development has not yet scheduled this issue for investigation.

In the meantime we have a Jena RDF Bulk Loader program you could consider using.