Virtuoso open source “gdb” core file stack trace generation for analysis

What

The GNU gdb debugger can process a core file and produce a stack trace of the calls a process made leading up to its crash.

Why

Development can use a gdb stack trace to analyse the sequence of calls leading up to a crash, correlate it with the source code, and possibly determine the cause of the crash and a fix.

How

For Virtuoso commercial builds an unstripped debug version of the Virtuoso binary would need to be provided by OpenLink.

For Virtuoso open source the following steps need to be performed to build an unstripped debug version of the Virtuoso binary with symbols left intact in the binary, such that a meaningful stack trace can be obtained:

  1. Configure the Virtuoso build using the --with-debug option:
    ./configure --with-debug
    
  2. In the Makefile, confirm that CFLAGS includes the -g option, which generates debug information:
    CFLAGS = -g -O2
    
  3. Run make clean all followed by make to build a new unstripped debug Virtuoso binary (virtuoso-t):
    make clean all
    make
    
  4. Ensure core file creation is enabled with the command:
    ulimit -c unlimited
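
The four steps above could be wrapped in a small script. This is only a sketch: the function names has_debug_cflags and build_debug_virtuoso are hypothetical, and it assumes it is run from the top of the Virtuoso source tree with a POSIX shell.

```shell
# Hypothetical helper: succeed only if a CFLAGS assignment in the given
# Makefile contains -g (or a variant such as -g3 or -ggdb).
has_debug_cflags() {
    makefile="${1:-Makefile}"
    grep -Eq '^CFLAGS[[:space:]]*[:+]?=(.*[[:space:]])?-g[[:alnum:]]*([[:space:]]|$)' "$makefile"
}

# Hypothetical wrapper around steps 1 to 4; it is defined here but not called.
build_debug_virtuoso() {
    ./configure --with-debug || return 1
    if ! has_debug_cflags Makefile; then
        echo "CFLAGS in Makefile lacks -g; the binary would have no debug information" >&2
        return 1
    fi
    make clean all && make || return 1
    # Applies to this shell and to any server started from it afterwards
    ulimit -c unlimited
}
```

Calling build_debug_virtuoso from the source directory would run the whole sequence and stop at the first failing step.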
    

Start the Virtuoso database with the new unstripped debug binary and reproduce the crash condition to generate a new core file, then:

  1. Use gdb to load the core file with the command:
    gdb virtuoso-t corefile
    
  2. At the (gdb) prompt, type bt (or backtrace) to print the stack trace, and provide the complete output:
    $ gdb /opt/virtuoso/bin/virtuoso-t core.31731
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
    Copyright (C) 2010 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /opt/virtuoso/bin/virtuoso...done.
    [New Thread 6025]
    [New Thread 6028]
    .
    .
    .
    Core was generated by `/opt/virtuoso/bin/virtuoso'.
    Program terminated with signal 6, Aborted.
    #0  0x0000003ad2a324f5 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libaio-0.3.107-10.el6.x86_64 libgcc-4.4.7-23.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64
    (gdb) bt
    #0  0x0000003ad2a324f5 in raise () from /lib64/libc.so.6
    #1  0x0000003ad2a33cd5 in abort () from /lib64/libc.so.6
    #2  0x0000003ad2a70417 in __libc_message () from /lib64/libc.so.6
    #3  0x0000003ad2a75e5e in malloc_printerr () from /lib64/libc.so.6
    #4  0x0000003ad2a78cad in _int_free () from /lib64/libc.so.6
    #5  0x0000000000b60468 in dk_free (ptr=0x7f8cf8fb3218, sz=65544)
        at Dkalloc.c:947
    #6  0x0000000000b65aad in dk_free_tree (box=0x7f8cf8fb3220) at Dkbox.c:808
    #7  0x00000000006a665a in ssl_free_data_v (sl=0x7f8c2400a8a0, 
        data=0x7f8cf8fb3020 "", inst=0x7f8c2400d338) at sqlrun.c:226
    #8  0x00000000006a6ee1 in qi_inst_state_free (qi_box=0x7f8c2400d338)
        at sqlrun.c:556
    #9  0x00000000006b0cf4 in qi_free (inst=0x7f8c2400d338) at sqlrun.c:3113
    #10 0x00000000007a0e09 in cl_qf_exec (clt=0x7f8da84da610, clo=0x7f8c24016b58)
        at clop.c:2298
    #11 0x00000000007a2006 in cls_qf (clt=0x7f8da84da610, clo=0x7f8c24016b58, 
        is_continue=0) at clop.c:2568
    #12 0x00000000007a49ac in clt_process_cm (clt=0x7f8da84da610, 
        cm=0x7f8da89d4160) at clop.c:3343
    #13 0x00000000007202c5 in cluster_thread_func (clt=0x7f8da84da610)
        at clsrv.c:364
    #14 0x0000000000de4eff in _thread_boot (arg=0x7f8da8d87b20)
        at sched_pthread.c:303
    #15 0x0000003ad3607aa1 in start_thread () from /lib64/libpthread.so.0
    #16 0x0000003ad2ae8c4d in clone () from /lib64/libc.so.6
    (gdb) 
    (gdb) quit
    $
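
The same trace can also be captured without an interactive session, which makes it easier to attach to a support case. A minimal sketch using gdb's -batch and -ex options; collect_backtrace and the GDB override variable are hypothetical names, and the paths in the example are the ones from the session above.

```shell
# Hypothetical helper: write the crashing thread's backtrace, followed by a
# full backtrace of every thread, to a text file.
# GDB may be set to another debugger path; it defaults to gdb on PATH.
collect_backtrace() {
    binary="$1"
    core="$2"
    out="${3:-backtrace.txt}"
    "${GDB:-gdb}" -batch \
        -ex "bt" \
        -ex "thread apply all bt full" \
        "$binary" "$core" > "$out" 2>&1
}

# Example:
# collect_backtrace /opt/virtuoso/bin/virtuoso-t core.31731 backtrace.txt
```

The "thread apply all bt full" command adds local variables for each frame of each thread, so the output is considerably longer than a plain bt.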
    

Start Virtuoso with gdb

Virtuoso can also be started within gdb and left running there, using the commands below from the database directory:

gdb virtuoso-t
(gdb) run -f

Should Virtuoso crash, it will remain in the debugger, and the stack trace can be obtained:

$ gdb ./virtuoso-t
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/virtuoso/bin/virtuoso-t ...done.
(gdb) run -f
Starting program: /opt/virtuoso/bin/virtuoso-t -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

		Wed Oct 23 2019
21:03:21 { Loading plugin 1: Type `plain', file `wikiv' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/wikiv.so>
Missing separate debuginfo for ../hosting/wikiv.so
21:03:21   WikiV version 0.6 from OpenLink Software
21:03:21   Support functions for WikiV collaboration tool
21:03:21   SUCCESS plugin 1: loaded from ../hosting/wikiv.so }
21:03:21 { Loading plugin 2: Type `plain', file `mediawiki' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/mediawiki.so>
Missing separate debuginfo for ../hosting/mediawiki.so
21:03:21   MediaWiki version 0.1 from OpenLink Software
21:03:21   Support functions for MediaWiki collaboration tool
21:03:21   SUCCESS plugin 2: loaded from ../hosting/mediawiki.so }
21:03:21 { Loading plugin 3: Type `plain', file `creolewiki' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/creolewiki.so>
Missing separate debuginfo for ../hosting/creolewiki.so
21:03:21   CreoleWiki version 0.1 from OpenLink Software
21:03:21   Support functions for CreoleWiki collaboration tool
21:03:21   SUCCESS plugin 3: loaded from ../hosting/creolewiki.so }
21:03:21 { Loading plugin 4: Type `plain', file `im' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/im.so>
Missing separate debuginfo for ../hosting/im.so
21:03:21   IM version 0.63 from OpenLink Software
21:03:21   Support functions for Image Magick 6.9.9
21:03:21   SUCCESS plugin 4: loaded from ../hosting/im.so }
21:03:21 { Loading plugin 5: Type `plain', file `wbxml2' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/wbxml2.so>
Missing separate debuginfo for ../hosting/wbxml2.so
21:03:21   WBXML2 version 0.9 from OpenLink Software
21:03:21   Support functions for WBXML2 0.9.2 Library
21:03:21   SUCCESS plugin 5: loaded from ../hosting/wbxml2.so }
21:03:21 { Loading plugin 6: Type `attach', file `libphp5.so' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/libphp5.so>
Missing separate debuginfo for ../hosting/libphp5.so
21:03:21   SUCCESS plugin 6: loaded from ../hosting/libphp5.so }
21:03:21 { Loading plugin 7: Type `Hosting', file `hosting_php.so' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/hosting_php.so>
Missing separate debuginfo for ../hosting/hosting_php.so
21:03:21   Hosting version 3309 from OpenLink Software
21:03:21   PHP engine version 5.6.37
21:03:21   SUCCESS plugin 7: loaded from ../hosting/hosting_php.so }
21:03:21 { Loading plugin 8: Type `plain', file `qrcode' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/qrcode.so>
Missing separate debuginfo for ../hosting/qrcode.so
21:03:21   QRcode version 0.1 from OpenLink Software
21:03:21   Support functions for ISO/IEC 18004:2006, using QR Code encoder (C) 2006 Kentaro Fukuchi <fukichi@megaui.net>
21:03:21   SUCCESS plugin 8: loaded from ../hosting/qrcode.so }
21:03:21 { Loading plugin 10: Type `plain', file `proj4' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/proj4.so>
Missing separate debuginfo for ../hosting/proj4.so
21:03:21   plain version 1.0.3309 from OpenLink Software
21:03:21   Cartographic Projections support based on Frank Warmerdam's proj4 library
21:03:21   SUCCESS plugin 10: loaded from ../hosting/proj4.so }
21:03:21 { Loading plugin 11: Type `plain', file `geos' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/geos.so>
Missing separate debuginfo for ../hosting/geos.so
21:03:21   plain version 1.0.3309 from OpenLink Software
21:03:21   GEOS plugin based on Geometry Engine Open Source library from Open Source Geospatial Foundation
21:03:21   SUCCESS plugin 11: loaded from ../hosting/geos.so }
21:03:21 { Loading plugin 12: Type `plain', file `shapefileio' in `../hosting'
warning: Ignoring non-absolute filename: <../hosting/shapefileio.so>
Missing separate debuginfo for ../hosting/shapefileio.so
21:03:21   ShapefileIO version 0.1virt71 from OpenLink Software
21:03:21   Shapefile support based on Frank Warmerdam's Shapelib
21:03:21   SUCCESS plugin 12: loaded from ../hosting/shapefileio.so }
21:03:21 OpenLink Virtuoso Universal Server
21:03:21 Version 07.20.3230-pthreads for Darwin as of Oct 14 2019
21:03:21 uses parts of OpenSSL, PCRE, Html Tidy
21:03:21 Enabled Cluster Extension
21:03:21 Enabled Column Store Extension
21:03:21 Enabled Virtual Database Extension
21:03:21 Enabled Replication Extension
21:03:21 Enabled Scalable ACL Extension
21:03:21 Enabled Custom Reasoning & Inference Rules
21:03:21 Database version 3126
[New Thread 0x7fffde190700 (LWP 37719)]
21:03:21 SQL Optimizer enabled (max 1000 layouts)
21:03:22 Compiler unit is timed at 0.000171 msec
[New Thread 0x7fffcfe9d700 (LWP 37723)]
21:03:24 Roll forward started
21:03:24 Roll forward complete
[New Thread 0x7fffcce7c700 (LWP 37727)]
21:03:25 Checkpoint started
21:03:25 Checkpoint finished, log reused
[New Thread 0x7fffb6ffd700 (LWP 37730)]
[New Thread 0x7fffb67fc700 (LWP 37731)]
21:03:27 HTTP/WebDAV server online at 8890
21:03:27 Server online at 1111 (pid 37715)
21:03:27 ZeroConfig registration virtuoso (LOCALHOST)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7bca6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x00007ffff7bca6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000cbe7a8 in semaphore_enter (sem=0x1dcfc10) at sched_pthread.c:961
#2  0x000000000041254c in main (argc=2, argv=0x1a4e7f0) at viunix.c:766
(gdb)
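
If nobody will be watching the gdb prompt, the session can be scripted so the traces are written to a log as soon as the server stops. This is a sketch under assumptions: run_under_gdb and the log file name are hypothetical, it is run from the database directory, and gdb exits (ending the server process) once the commands have been executed. gdb also stops on some signals that are not crashes, so the log should be checked for the signal that was received.

```shell
# Hypothetical helper: start the server in the foreground under gdb and,
# when it stops (for example on SIGSEGV), record the stack of every thread.
run_under_gdb() {
    log="${1:-gdb-session.log}"
    "${GDB:-gdb}" -batch \
        -ex "run" \
        -ex "bt" \
        -ex "info threads" \
        -ex "thread apply all bt" \
        --args ./virtuoso-t -f > "$log" 2>&1
}
```

The server's own startup messages end up in the same log, ahead of the gdb output.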

Attach gdb to running Virtuoso instance to force core file generation

A core file can be created by attaching gdb to a running Virtuoso debug binary process and forcing a core file to be generated, as follows:

  1. Attach gdb to the Virtuoso instance using its Linux process ID:
   gdb -p <virtuoso pid>
  2. Generate the core file in gdb with the generate-core-file command:
   (gdb) generate-core-file <core-filename>

Close or quit gdb and the Virtuoso instance will then continue running as normal.
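
The attach, dump, and quit sequence can be run in one step. A sketch only: dump_core_of_pid is a hypothetical name, the default core file name is an arbitrary choice, and attaching normally requires running as the same user as the server (or root).

```shell
# Hypothetical helper: attach to a running process, write a core file,
# then detach so the process keeps running.
dump_core_of_pid() {
    pid="$1"
    corefile="${2:-core.$pid}"
    "${GDB:-gdb}" -p "$pid" -batch \
        -ex "generate-core-file $corefile" \
        -ex "detach"
}

# Example:
# dump_core_of_pid <virtuoso pid> virtuoso-hang.core
```

The server is paused while the core file is written, which can take a while for a process with a large memory footprint.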

Hello, Hugh @hwilliams

Thank you for the great article.

Sometimes our Virtuoso instance hangs, sometimes it crashes with core dump.
Usually we try to trace the queries that led to the failure, but we don’t always understand the reason for failure.

It seems that usually Virtuoso hangs if we put too much load on it. Maybe mass migrations or import or heavy queries, but we don’t have an algorithm to trace down the exact reasons.

When we do get core dumps, we usually can’t decipher what is written in them.

Maybe you could give us some hints and methods to understand what leads to hanging and crashes?

We are pushing system metrics, virtuoso.log and the QRL log to Kibana, but we don’t know what those metrics mean.
Here is the script to export from QRL to CSV: https://github.com/shedy2/virtuoso_qrl

Thank you!
Stas

@Branovitskiy: Core file analysis requires understanding of the Virtuoso code base, which is why we request gdb stack traces for analysis by development.

If you are encountering crashes and a core file is being produced, then a gdb stack trace from a Virtuoso binary with symbols in place is generally required to see where the crash is occurring and the sequence of calls leading up to it. Based on the trace, development can perform further analysis, such as printing variable values, if they have the binary and core file, or they may ask the user to run specific gdb commands to get such information.

If hangs are occurring and you cannot access the server via HTTP or SQL, you can also force the Virtuoso server to dump core by sending it a signal whose default action is to write a core file, for example kill -ABRT <pid> (kill -9 sends SIGKILL, which terminates the process without producing a core file). A gdb stack trace can then be obtained in the same way to see the state of the Virtuoso server at that point, i.e., the sequence of calls leading up to the hang, threads in use, etc., which may enable development to determine the cause and a possible remedy to prevent such hangs. Note also that if you are running Virtuoso from within gdb when it hangs, a stack trace and threads-in-use info can be obtained directly from within gdb without having to kill the server to force core file creation.
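
Whichever signal is used, a core file is only written if the core size limit of the running server process allows it, and that limit is fixed when the process starts. A sketch for checking it beforehand, assuming Linux and its /proc/<pid>/limits file; soft_core_limit and core_dump_possible are hypothetical helper names, and the optional second argument exists only so that a saved copy of the limits file can be inspected.

```shell
# Hypothetical helper: print the soft core-size limit of a process,
# as shown in the "Max core file size" row of /proc/<pid>/limits.
soft_core_limit() {
    pid="$1"
    limits_file="${2:-/proc/$pid/limits}"
    awk '/^Max core file size/ { print $5 }' "$limits_file"
}

# Hypothetical helper: succeed if that limit is non-zero
# (a number of bytes or "unlimited").
core_dump_possible() {
    limit=$(soft_core_limit "$@")
    [ -n "$limit" ] && [ "$limit" != "0" ]
}

# Example:
# core_dump_possible <pid> || echo "core size limit is 0; no core file would be written"
```

If the limit turns out to be 0, the server would need to be restarted from a shell where ulimit -c unlimited has been set.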

For the virtuoso.qrl file, the query logging documentation provides details of the metrics available. Some are self-explanatory, but once again an understanding of the Virtuoso engine is required to make meaningful use of many of them.

Thank you, @hwilliams!

That is great advice!
We didn’t know about them.
We’ll try to generate a core dump the next time Virtuoso hangs.

Are there any materials that would help us understand how the Virtuoso engine functions?

There are documents on the Virtuoso engine, like the Core Database Engine and the Hybrid RDBMS/Graph Column Store documents, but these do not have the level of detail required to diagnose the problems you seem to be encountering. Those require low-level knowledge and understanding of the Virtuoso source code, which is available in git for the open source product.

Thank you, @hwilliams

We will try to read those instructions.

Concerning core dumps…

We did have a problem once where we were getting a core dump, and we were lucky to see the culprit query right inside the backtrace.

That is not always the case, though. I agree that there is almost no use in us trying to understand the source code.
But maybe we could find some lifehack to work out which query leads to a hang or crash… Even if we don’t understand the details, if we know the exact query, we could just rewrite it in a different way.

What do you think? Is there a way to find the culprit queries when the database crashes?

Thank you!
Stas