Would loading a large graph be sped up if I dropped all indices first and then re-added them?

I’m loading a large chunk of dbpedia (about 1 billion triples) on Virtuoso Open Source 7.2.6. I’ve been running count(*) queries periodically, and I get the sense that the whole load process is gradually slowing down. I wonder whether this is because, as indices grow in size, inserts become more expensive, so the bigger the index, the slower the inserts.

So I’m wondering whether it would speed things up if I:

  1. Dropped all indices
  2. Loaded in all the data
  3. Re-indexed with the default indices

Would I be right to expect that this would be overall faster than my current approach?
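For concreteness, the kind of thing I have in mind is sketched below. The index names follow my recollection of the default Virtuoso 7.x RDF_QUAD layout (PSOG primary key plus POGS, SP, OP, GS), so this is an assumption to be checked against the actual schema (e.g. via `SYS_KEYS`) rather than a tested recipe:

```sql
-- Hypothetical sketch: verify the real index names on your
-- instance before dropping anything.
DROP INDEX RDF_QUAD_POGS;
DROP INDEX RDF_QUAD_SP;
DROP INDEX RDF_QUAD_OP;
DROP INDEX RDF_QUAD_GS;
-- (the PSOG primary key cannot be dropped; it is the table itself)

-- ... bulk load the data here ...

-- Re-create what I believe are the default auxiliary indexes:
CREATE BITMAP INDEX RDF_QUAD_POGS ON DB.DBA.RDF_QUAD (P, O, G, S);
CREATE INDEX RDF_QUAD_SP ON DB.DBA.RDF_QUAD (S, P);
CREATE INDEX RDF_QUAD_OP ON DB.DBA.RDF_QUAD (O, P);
CREATE BITMAP INDEX RDF_QUAD_GS ON DB.DBA.RDF_QUAD (G, S);
CHECKPOINT;
```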

Have you performance-tuned your Virtuoso instance for RDF usage, as detailed in the linked guide? If your load rate is slowing, the Virtuoso server is probably running out of resources (memory) during the load and having to swap to and from disk, which will degrade performance.

For 1 billion triples you would need at least 10GB of RAM to host the dataset in memory during the load for optimum performance, preferably with fast storage devices (SSDs etc.) to minimise the performance loss if swapping to disk is required.
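The key knobs are the buffer settings in the `[Parameters]` section of `virtuoso.ini`; the tuning guide sizes them at roughly two-thirds of available RAM in 8KB pages. The values below are illustrative for a machine with about 16GB of RAM dedicated to Virtuoso — consult the guide's table for your actual memory size:

```ini
[Parameters]
; Example sizing for ~16 GB RAM dedicated to Virtuoso:
; roughly 2/3 of RAM held as 8 KB database buffers.
NumberOfBuffers  = 1360000
MaxDirtyBuffers  = 1000000
```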

What are the specs of the machine you are using to load the data, in terms of memory and CPUs especially? Are you using the Virtuoso RDF Bulk Loader to load the datasets? Note also the Virtuoso LD Meter utility for monitoring RDF data load rates.
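A typical bulk-loader session from `isql` looks like the following; the directory path and graph IRI here are placeholders, and the source directory must be listed under `DirsAllowed` in `virtuoso.ini`:

```sql
-- Register the files to load (path and graph IRI are examples)
ld_dir ('/data/dbpedia', '*.ttl.gz', 'http://dbpedia.org');

-- Run the loader; for large datasets, run rdf_loader_run()
-- concurrently in several isql sessions to parallelise the load
rdf_loader_run ();

-- Persist the loaded state to disk
checkpoint;
```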

I doubt dropping the RDF_QUAD table indexes would help: if you are running out of resources loading with them in place, you would hit the same problem when dropping and recreating the indexes.

Thank you for the prompt response! I realized that the two mistakes I was making were:

  • I forgot to performance-tune my installation, so Virtuoso was indeed under-utilizing resources
  • I was running Virtuoso within a Docker container that was configured to use only a fraction of my machine’s overall resources.
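For anyone hitting the same issue, the fix on the Docker side was to raise the container's resource limits. A sketch of the kind of invocation that worked for me (image name, ports, and limit values are illustrative — adjust to your host):

```shell
# Give the Virtuoso container most of the host's RAM and CPUs
# (values are examples; size them to your machine)
docker run -d \
  --name virtuoso \
  --memory 60g --memory-swap 60g \
  --cpus 16 \
  -v /data/virtuoso:/database \
  -p 1111:1111 -p 8890:8890 \
  openlink/virtuoso-opensource-7
```

If you use Docker Desktop, the VM-level memory/CPU limits in its settings also cap what any container can use, regardless of the flags above.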

After performance-tuning Virtuoso and re-configuring Docker to make full use of my machine’s resources, I achieved an 80 million triple/hour insert rate, which is good enough for me. Thanks again.