Soft deleting nodes + query performance

What I want to do

Soft delete nodes without any performance penalties

What I did

I soft delete by flagging nodes as deleted (setting a deletedAt predicate) and applying a filter on every node reference in the query, so something like this

query {
  d(func: uid(1)) {
    uid
    pred {
      uid
    }
  }
}

turns into this

query {
  d(func: uid(1)) @filter(not(has(deletedAt))) {
    uid
    pred @filter(not(has(deletedAt))) {
      uid
    }
  }
}

which does work, and I have been using this for over two years now. However, I recently investigated the performance of this approach and found that it roughly doubles the query time.
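For completeness: the “delete” itself is just a mutation that sets the deletedAt predicate instead of dropping the node. A minimal sketch of how that could look with the dgo Go client (the connection setup, import version, and the softDelete helper are illustrative, not our production code):

package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

// softDelete marks a node as deleted by setting its deletedAt predicate,
// so queries can later exclude it with @filter(not(has(deletedAt))).
func softDelete(ctx context.Context, dg *dgo.Dgraph, uid string) error {
	payload, err := json.Marshal(map[string]string{
		"uid":       uid,
		"deletedAt": time.Now().UTC().Format(time.RFC3339),
	})
	if err != nil {
		return err
	}
	txn := dg.NewTxn()
	defer txn.Discard(ctx)
	_, err = txn.Mutate(ctx, &api.Mutation{SetJson: payload, CommitNow: true})
	return err
}

func main() {
	// Connect to a local Alpha; address is illustrative.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	if err := softDelete(context.Background(), dg, "0x1"); err != nil {
		log.Fatal(err)
	}
}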

I wrote some benchmarks, which confirm my original findings. They show that for a query with around 350 @filter statements (a usual amount for some of our queries), queries are significantly slower even without any data stored in Dgraph:

Benchmark_query-8                  	     267	   4513224 ns/op
Benchmark_query_without_filter-8   	     348	   3732911 ns/op

(if you want I can share the code for the benchmarks later).
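Roughly, the benchmark generates a query with a few hundred filtered predicate blocks and runs it against an otherwise empty local cluster. A simplified sketch of the idea (the newClient helper, the generated query shape, and the count of 350 are illustrative, not our exact code):

package softdelete

import (
	"context"
	"fmt"
	"strings"
	"testing"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

// newClient connects to a local Alpha; illustrative only.
func newClient(tb testing.TB) *dgo.Dgraph {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		tb.Fatal(err)
	}
	return dgo.NewDgraphClient(api.NewDgraphClient(conn))
}

// buildQuery creates a query with n aliased predicate blocks, each either
// carrying the soft-delete filter or not.
func buildQuery(withFilter bool, n int) string {
	filter := ""
	if withFilter {
		filter = " @filter(not(has(deletedAt)))"
	}
	var b strings.Builder
	fmt.Fprintf(&b, "query {\n  d(func: uid(1))%s {\n    uid\n", filter)
	for i := 0; i < n; i++ {
		fmt.Fprintf(&b, "    p%d: pred%s {\n      uid\n    }\n", i, filter)
	}
	b.WriteString("  }\n}\n")
	return b.String()
}

func benchQuery(b *testing.B, withFilter bool) {
	dg := newClient(b)
	q := buildQuery(withFilter, 350)
	ctx := context.Background()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := dg.NewReadOnlyTxn().Query(ctx, q); err != nil {
			b.Fatal(err)
		}
	}
}

func Benchmark_query(b *testing.B)                { benchQuery(b, true) }
func Benchmark_query_without_filter(b *testing.B) { benchQuery(b, false) }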

I am curious whether you can think of a better implementation for this. Obviously, the easiest way would be to hard delete the nodes, but do you see a way of keeping soft deletes without such a heavy performance penalty?


Is that in nanoseconds?

3.73 ms and 4.51 ms seem pretty fast.

https://www.google.com/search?q=4513224++nanosecond+to+ms

It is fast, but as I said, this benchmark runs against zero stored data. On our production data the difference is around 200 ms with the @filter statements vs 100 ms without them.

I see,

Can you say under what conditions it reaches 200 ms?

In day-to-day use, does a simple query for a small amount of data take 200 ms, or is that only for “bulk queries”?

As an aside, I suspect the application of filters in Dgraph is not distributed. I think the gathering of predicates itself is distributed, while the application of filters and parameters is centralized in the instance you are connected to. Decentralizing this could help in cases like this one.

But that’s just a theory; I have no idea how the code is actually implemented.

However, it is evident that a query gains more “weight” (consumes more RAM and CPU) when you apply filters and parameters. There is no way to get the same latency as a clean query.

It’s 200 ms under zero load. It is a pretty big query (around 1200 lines), but not uncommon for us.