Delete data over million UIDs

Lyiang · March 20, 2024, 2:56am

Hi, my code looks like this:

query := `
  query {
    var(func: has(resourceId)) @filter(not eq(resourceId, "11111111")) {
      uids as uid
    }
  }
`

mu := &api.Mutation{
	DelNquads: []byte(`uid(uids) * * .`),
}
req := &api.Request{
	Query:     query,
	Mutations: []*api.Mutation{mu},
	CommitNow: true,
}
_, err := txn.Do(context.Background(), req)

But when the database holds a lot of data, an error is reported:

rpc error: code = Unknown desc = var [uids] has over million UIDs

So, how can I bypass the one-million limit with a single txn.Do call?

My self-hosted cluster config:

dgraph zero --my=127.0.0.1:5080 --bindall=false
dgraph alpha --my=127.0.0.1:7080 --zero=127.0.0.1:5080 --bindall=false

thx.

vnium · March 20, 2024, 12:37pm

Don’t bypass it. Paginate the results (Pagination - Query language) and do multiple requests.

amaster507 · March 20, 2024, 8:48pm

This is a limitation of Dgraph, unfortunately, and it could lead to corrupt data if you don’t carefully manage the pagination to ensure that all expected requests completed. Take a look at upserts, which might help process the next pagination set without needing to control the pagination externally.

This same limitation applies at an even smaller scale when updating data.
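
A minimal sketch of the paginated-upsert idea, assuming the query from the original post is simply capped with `first:` so that each request selects (and deletes) at most one chunk. The names `chunkedDeleteQuery`, `keepID`, and `chunkSize` are hypothetical; the function only builds the query string, which would be sent through `txn.Do` together with the `uid(uids) * * .` delete, as in the original code.

```go
package main

import "fmt"

// chunkedDeleteQuery returns the query part of an upsert request that
// selects at most chunkSize nodes whose resourceId differs from keepID.
// Paired with the delete N-Quad `uid(uids) * * .`, each request then
// removes one chunk, keeping the variable well under the UID limit.
func chunkedDeleteQuery(keepID string, chunkSize int) string {
	return fmt.Sprintf(`query {
  var(func: has(resourceId), first: %d) @filter(not eq(resourceId, %q)) {
    uids as uid
  }
}`, chunkSize, keepID)
}
```

Because deleted nodes no longer match `has(resourceId)`, repeating the same request works through the data without an offset. The caller still needs its own stop condition, for example a separate count query that reports zero remaining matches.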

Damon · March 21, 2024, 1:06pm

Good point. I think amaster means that if you retrieve UIDs to delete, especially in parallel, in pages of 1,000, and those pages are retrieved at different times, they may not line up perfectly due to additions or deletions that occur between the various page queries.

Even if single threaded, if you retrieve a sequence of page queries, you may get an incomplete list due to concurrent changes.

To avoid issues, consider using after: and first: (vs offset: and first:) on UIDs to do the pagination. UIDs are sequential (or at least semi-sequential via block allocation - not sure), so earlier pages won’t be perturbed as you work through your millions of deletes.
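
A rough sketch of cursor-style paging on UIDs, assuming the filter from the original post. `pageQuery`, `nextCursor`, and the block name `page` are hypothetical names for illustration; decoding the response into a slice of UID strings is left to the caller.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageQuery builds a query for one page of at most pageSize UIDs that
// come strictly after afterUID (use "0x0" for the first page).
func pageQuery(afterUID string, pageSize int) string {
	return fmt.Sprintf(`{
  page(func: has(resourceId), first: %d, after: %s) @filter(not eq(resourceId, "11111111")) {
    uid
  }
}`, pageSize, afterUID)
}

// nextCursor returns the largest UID of a page in hex form (e.g. "0x2a"),
// to be passed as afterUID for the following page.
func nextCursor(uids []string) (string, error) {
	if len(uids) == 0 {
		return "", fmt.Errorf("empty page: no cursor to advance")
	}
	var largest uint64
	for _, u := range uids {
		n, err := strconv.ParseUint(u, 0, 64) // base 0 accepts the "0x" prefix
		if err != nil {
			return "", fmt.Errorf("bad uid %q: %w", u, err)
		}
		if n > largest {
			largest = n
		}
	}
	return fmt.Sprintf("%#x", largest), nil
}
```

An empty page signals the end of the scan. Since the cursor only moves forward, changes to UIDs behind it do not shift later pages the way they would with offset:.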

Alternatively, for any batch operation you can run single threaded and write the selection query in a way that limits it to unprocessed items. For deletes that is easier since a “processed” item is gone, so no need to worry about processing it twice.

General approach for bulk edits/deletes, in your language of choice:

for i = 1 to (totalNum div CHUNK_SIZE) + 5   // add 5 extra chunks for safety
   queryUnprocessed( your query for UIDs here, first chunk of CHUNK_SIZE UIDs )
   processThem( your mutation using the UIDs above )

E.g. to rename a field from firstName to givenName, your query would look for the first 1,000 items that do not yet have givenName, so you never reprocess an item and don’t have to page.
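
The loop above could be sketched in Go roughly as follows. This is an illustration, not Dgraph client code: `processInChunks`, `fetch`, and `process` are hypothetical names, and the two callbacks stand in for your own query and mutation calls. Instead of computing the round count from a total, it stops when a fetch returns nothing and uses `maxRounds` only as a safety cap.

```go
package main

import "fmt"

// processInChunks repeatedly fetches up to chunkSize unprocessed UIDs and
// hands them to process, until fetch returns nothing or maxRounds is hit.
// It returns the number of UIDs processed.
func processInChunks(
	fetch func(limit int) ([]string, error),
	process func(uids []string) error,
	chunkSize, maxRounds int,
) (int, error) {
	total := 0
	for round := 0; round < maxRounds; round++ {
		uids, err := fetch(chunkSize)
		if err != nil {
			return total, fmt.Errorf("fetch round %d: %w", round, err)
		}
		if len(uids) == 0 {
			return total, nil // nothing left to process
		}
		if err := process(uids); err != nil {
			return total, fmt.Errorf("process round %d: %w", round, err)
		}
		total += len(uids)
	}
	return total, fmt.Errorf("stopped after %d rounds with work possibly remaining", maxRounds)
}
```

This only terminates correctly if `fetch` selects unprocessed items, as described above; if processed items still match the query, the loop keeps seeing the same chunk until the cap is reached.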
