Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deletion of large number of graphs #1338

Open
nicolastoira opened this issue Jan 13, 2025 · 3 comments
Open

Deletion of large number of graphs #1338

nicolastoira opened this issue Jan 13, 2025 · 3 comments

Comments

@nicolastoira
Copy link

In my Virtuoso database instance I have around 30k graphs that I want to delete. I've search for an efficient solution but didn't find any so far. Trying to delete so many graphs leads to big performance problems. It would take less time to restart the database from scratch and re-ingest all the 30k graphs.

I tried with different logics like:

DELETE FROM rdf_quad WHERE g IN (iri_to_id('graph1', 'graph2')
SPARQL CLEAR GRAPH <graph1>;
SPARQL CLEAR GRAPH <graph2>;

In my opinion, the main issue is that when some graphs are removed from the rdf_quad then a re-indexing is triggered and this takes the majority of the computational time. I tried to remove the indexes associated with the table rdf_quad but this was creating inconsistencies.
The best approach would be to first remove all the graphs from the rdf_quad table (even in batches it should work), then at the end of the removal we trigger the re-indexing. By doing so we can reduce the overhead be the re-indexing and only execute it the least amount of times possible. It's fine if the database is not accessible during the deletion.

Do you have any suggestion on how this could be achieved? Or other any recommendations on how to remove a large number of graphs from the database? Thank you.

@HughWilliams
Copy link
Collaborator

HughWilliams commented Jan 13, 2025

How many triples are in these 30K graphs ?

Are you deleting the graphs with transaction log mode to set to autocommit on, i.e., with log_enable(3,1) as detailed in this document on How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?

Are you ultimately seeking to delete and reload these 30K graphs, as you indicate above It would take less time to restart the database from scratch and re-ingest all the 30k graphs? If so, note that Virtuoso supports a Delta-aware bulk loading of datasets into Virtuoso option, if the datasets are in nquad format for optimal loading of such data.

@nicolastoira
Copy link
Author

nicolastoira commented Jan 13, 2025

That was an example, the number and size of the graphs can vary, but I would say that it is possible to have in total up to billions of triples.

Currently, we are completely disabling it with log_enable(2) and then reactivating it after the load, log_enable(1). Setting log_enable(3,1) will not slow down the process?

Yes, the goal is that there might be some graphs that needs to be removed and then reloaded into the database. The type of files loaded might be turtle, trig, or nquads.

Thank you.

@HughWilliams
Copy link
Collaborator

HughWilliams commented Jan 13, 2025

As detailed in the log_enable documentation:

log_enable(2) — turns off transaction logging and turns on auto-commit mode
log_enable(3) — turns on transaction logging and turns on auto-commit mode

So, log_enable(2) would be faster as it does not write to the transaction log file, but then in the event of failure for some reason, the operations performed are lost, which is not the case with log_enable(3) as the transaction log is will replay the operations on database startup.

The key is to not be in manual commit mode which can quickly deplete/consume memory when deleting large graphs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants