Load the full Freebase dump into Virtuoso

Austin · December 20, 2018, 2:10pm

I want to load the full freebase dump (397G when unzipped) into Virtuoso, but it takes tooooo long time. Could you provide a virtuoso.db file that already has the dump loaded? Thanks a lot.

TallTed · December 20, 2018, 2:20pm

We do not have such a pre-loaded Freebase-in-Virtuoso instance to offer at this time.

We may be able to provide some advice toward speeding up your dataload, if you give us more detail about what you’ve tried, what “tooooo long time” means, how your Virtuoso is configured, what system resources (primarily RAM, but also disk and processors) are available, etc.

(Also, it is unclear what “397G … full freebase dump” you’re trying to load, as the official dumps only claim to be 260GB including all downloadable archives.)

Austin · December 20, 2018, 3:12pm

Thanks for your reply. I downloaded the official dump freebase-rdf-latest (32G zipped, 397G unzipped, I don’t know why) and am trying to load it into Virtuoso with pdf_loader_run(). According to the official statement, it contains 1.9 billion triples in total. The load has been running for 3 days and has finished 0.3 billion triples so far.
MemAvailable: 490033504 kB
disk available: 400G
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
By the way, I am not the only user of this machine, so the resources actually available to me may be smaller.
Thanks.

TallTed · December 20, 2018, 4:16pm

A few thoughts…

I think you meant rdf_loader_run()?
Is your Virtuoso instance properly performance-tuned? Most important is to set NumberOfBuffers and MaxDirtyBuffers to make best use of available RAM (which may be substantially less than is physically present; please do read that whole section, if not the whole article).
Have you tried following the advice about running multiple loaders, in our Bulk Loader documentation? It looks like the processor in this machine has 20 cores, and 40 threads, so you should definitely be able to run more than 1 load thread. IF you were the only machine user, we would suggest 10, 15, or even 19 (or more, if there’s more than one E5-2698 in the machine) – and this still might be worthwhile, given that the load will finish much faster this way.
If the above thoughts do not speed the load enough for you, it would be worth checking version information for your Virtuoso instance (the first section of output from virtuoso-t -? or virtuoso-iodbc-t is usually best).
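
As a rough illustration of the buffer sizing mentioned above, here is a minimal sketch. It assumes the commonly quoted rule of thumb of giving about two thirds of the RAM you can actually dedicate to Virtuoso to page buffers of roughly 8 KB each, with MaxDirtyBuffers at about three quarters of NumberOfBuffers. The function name and the 64 GB figure are hypothetical; the tuning article's own table should take precedence over this arithmetic.

```shell
#!/bin/sh
# Approximate NumberOfBuffers / MaxDirtyBuffers for a given amount of usable RAM.
# Assumption: ~66% of usable RAM goes to buffers, ~8000 bytes per buffer.
# usable_gb is a hypothetical input: RAM (in GB) you may dedicate to Virtuoso,
# which on a shared machine can be far less than what is physically installed.
virtuoso_buffers() {
    usable_gb=$1
    # bytes * 0.66 / 8000 bytes per buffer, in integer arithmetic
    number_of_buffers=$(( usable_gb * 1000000000 * 66 / 100 / 8000 ))
    # keep dirty buffers at about three quarters of the total
    max_dirty_buffers=$(( number_of_buffers * 3 / 4 ))
    printf 'NumberOfBuffers = %s\n' "$number_of_buffers"
    printf 'MaxDirtyBuffers = %s\n' "$max_dirty_buffers"
}

# Example: print settings for a hypothetical 64 GB share of the machine.
virtuoso_buffers 64
```

The two printed lines can be compared with the values already in the [Parameters] stanza of virtuoso.ini; the server needs a restart for changed values to take effect.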

Austin · December 21, 2018, 5:58am

Thanks. May I know how to split the dumps into multiple files?

hwilliams · December 21, 2018, 12:11pm

@Austin: The Freebase dataset file is in N-Triples format, so, provided it contains no blank nodes and you’re on Linux, the split shell command can be used to split the dataset into multiple files, with something like:

split -l 1000000 /path/to/large/rdf/file.nt

The above command will split your input data file into several smaller files, of 1,000,000 lines each.
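
A slightly more elaborate variant of the same idea is sketched below. It writes numbered chunks into a separate directory and keeps a .nt extension on each one, so that they are easy to match with a file mask such as *.nt later. It assumes GNU coreutils split (for -d and --additional-suffix); the function name, the part- prefix, and the paths in the usage note are hypothetical.

```shell
#!/bin/sh
# Split a line-oriented N-Triples file into numbered chunks named part-00000.nt, ...
# Assumes GNU coreutils split; -d and --additional-suffix are GNU options.
split_ntriples() {
    input=$1              # path to the large .nt file
    out_dir=$2            # directory that will receive the chunks
    lines=${3:-1000000}   # lines (one triple per line) per chunk
    mkdir -p "$out_dir"
    # -a 5 allows up to 100000 chunks with numeric suffixes
    split -l "$lines" -d -a 5 --additional-suffix=.nt "$input" "$out_dir/part-"
}
```

Usage would look like `split_ntriples /path/to/large/rdf/file.nt /path/to/chunks 1000000`. Splitting by line is only safe because N-Triples puts one complete triple on each line, and the blank-node caveat above still applies.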

I would also refer you to the following Virtuoso git issue, where a user details how they successfully loaded the Freebase datasets. Their approach included splitting the file for optimum platform utilisation during the load, and disabling the Free Text index build during the load, which significantly improves the load time.
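
Once the file is split, the multiple-loader advice from earlier in this thread can be sketched roughly as follows. This is only an outline under assumptions: port 1111 and dba/dba are Virtuoso's defaults and stand in for your real connection details, the VIRT_* and ISQL variable names and the function names are hypothetical, and the chunk directory must be listed in DirsAllowed in virtuoso.ini. It uses the Bulk Loader's ld_dir() to register the files and then starts several rdf_loader_run() sessions in parallel.

```shell
#!/bin/sh
# Run one SQL statement through the isql command-line client.
# On some distributions the binary is called isql-v or isql-vt; set ISQL accordingly.
run_isql() {
    "${ISQL:-isql}" "${VIRT_PORT:-1111}" "${VIRT_USER:-dba}" "${VIRT_PASS:-dba}" exec="$1"
}

# Register all *.nt chunks in data_dir for the given graph, then load them
# with several concurrent rdf_loader_run() sessions and checkpoint at the end.
bulk_load() {
    data_dir=$1          # directory holding the chunks (must be in DirsAllowed)
    graph=$2             # target graph IRI
    loaders=${3:-4}      # number of parallel loader sessions
    run_isql "ld_dir('$data_dir', '*.nt', '$graph');"
    i=0
    while [ "$i" -lt "$loaders" ]; do
        run_isql "rdf_loader_run();" &
        i=$((i + 1))
    done
    wait                 # block until every loader session has exited
    run_isql "checkpoint;"
}
```

A call such as `bulk_load /path/to/chunks http://example.org/freebase 10` would start ten loaders; the graph IRI here is a placeholder. While the load runs, `select ll_state, count(*) from DB.DBA.load_list group by ll_state;` in another isql session shows how many files are pending (0), in progress (1), or done (2).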

Austin · December 23, 2018, 9:26am

Thanks for your advice. I followed the git issue, but the process entered the ‘S’ state and the system is stuck, even after killing the process (not because of RAM, disk, or IO). I can’t tell how many tuples I have already loaded into Virtuoso. I have already set NumberOfBuffers and MaxDirtyBuffers according to my resources. Do you have any suggestions for this problem? Thank you.

Austin · December 23, 2018, 10:23am

And the virtuoso.db file have be broken, I can’t remove it. the ls, mv, cp operation all can’t be finished. I don’t know why. Do you have any advice? Thank you!

TallTed · December 24, 2018, 12:22am

@Austin - We don’t have enough information to diagnose or advise much further.

Commandline output showing the S process state you’re describing might help us understand what you mean.
Seeing your current INI file (virtuoso.ini by default; including at least the full stanza with the NumberOfBuffers and MaxDirtyBuffers settings, if not the whole file) would help.
Content of the log file (virtuoso.log by default) would also be helpful.
Full text of any errors that result when you try “the ls mv cp operation” that “can’t be finished” would be helpful.
“the virtuoso.db file have be broken, I can’t remove it” suggests that some process has the file open – and once that process exits, you should be able to remove it.
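
One possible way to gather the details listed above in a single pass is sketched here. It assumes a Linux machine with a procps-style ps, a server binary named virtuoso-t, and the default file names virtuoso.ini, virtuoso.log, and virtuoso.db in one directory; the function name and its argument are hypothetical, and lsof is used only if it happens to be installed.

```shell
#!/bin/sh
# Print process state, buffer settings, recent log lines, and open-file holders.
# db_dir is a hypothetical argument: the directory holding the Virtuoso files.
collect_diagnostics() {
    db_dir=${1:-.}

    echo "== Virtuoso processes (STAT column: S, D, R, ...) =="
    ps -o pid,stat,etime,args -C virtuoso-t 2>/dev/null \
        || echo "no virtuoso-t process found (or this ps lacks -C)"

    echo "== Buffer settings in virtuoso.ini =="
    grep -E '^[[:space:]]*(NumberOfBuffers|MaxDirtyBuffers)' "$db_dir/virtuoso.ini" \
        || echo "settings not found"

    echo "== Last 50 lines of virtuoso.log =="
    tail -n 50 "$db_dir/virtuoso.log"

    echo "== Processes holding virtuoso.db open =="
    if command -v lsof >/dev/null 2>&1; then
        lsof "$db_dir/virtuoso.db" 2>/dev/null || echo "none reported by lsof"
    else
        echo "lsof is not installed"
    fi
}
```

Running `collect_diagnostics /path/to/database/dir > report.txt` gives something that can be pasted into a reply. When reading the STAT column, note that S is ordinary interruptible sleep, whereas D (uninterruptible sleep, usually waiting on I/O) is the state more typically associated with commands like ls or cp hanging on a file.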