Dgraph can't idle without being OOM-killed after large data ingestion

When it reboots, Dgraph must be trying to reapply the write-ahead log.

Looks like OpenCensus traces are taking up most of your memory.

(pprof) top
Showing nodes accounting for 15.09GB, 91.05% of 16.57GB total
Dropped 354 nodes (cum <= 0.08GB)
Showing top 10 nodes out of 100
      flat  flat%   sum%        cum   cum%
    6.74GB 40.70% 40.70%     6.74GB 40.70%  contrib.go.opencensus.io/exporter/jaeger.attributeToTag
    2.78GB 16.76% 57.46%     9.52GB 57.46%  contrib.go.opencensus.io/exporter/jaeger.spanDataToThrift
    2.32GB 13.98% 71.44%     2.42GB 14.61%  fmt.Sprintf
    1.45GB  8.75% 80.19%     1.45GB  8.75%  github.com/dgraph-io/dgraph/edgraph.buildUpsertQuery

I’d set the tracing to zero, or something really low.

$ dgraph alpha --help

      --trace float                    The ratio of queries to trace. (default 1)

On that note, @martinmr, can we set this to 0.01 by default?
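
For reference, the trace ratio is effectively a sampling probability. Here's a minimal Go sketch of what a 0.01 rate looks like in OpenCensus terms (illustrative only, not Dgraph's exact wiring):

package main

import "go.opencensus.io/trace"

func main() {
    // Sample roughly 1 in 100 traces so span data (and exporter buffers such
    // as jaeger.spanDataToThrift) stop dominating memory.
    trace.ApplyConfig(trace.Config{
        DefaultSampler: trace.ProbabilitySampler(0.01),
    })
}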


Ah @mrjn, here's a better snapshot. Apologies, I forgot I had tracing still on for the last one. Your suspicion of a CPU bottleneck on those hot predicates seems correct; furthermore, it appears to be the root cause of the cluster instability that results in my big memory blowups.

It's hard to catch the spike that takes it to 27 GB, but here I caught one at 9 GB.

pprofheappostinglist.pg.gz (110.7 KB)

Here is the CPU on the alpha getting hammered: 26 cores in use at roughly 200 upserts/s (I'm assuming this is the alpha with the project/xid predicates).

Once I kick the concurrency up, that particular alpha starts to get overwhelmed, with logs like the ones below; it also seems to have connection issues with its peers and Zero:

W0509 01:20:03.764147       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 814ms. Breakdown: [{disk 814ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
W0509 01:22:22.778423       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 412ms. Breakdown: [{disk 412ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
W0509 01:23:02.882394       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 686ms. Breakdown: [{disk 623ms} {advance 62ms} {proposals 1ms}] Num entries: 0. MustSync: false
W0509 01:23:12.926267       1 node.go:420] Unable to send message to peer: 0x3. Error: Unhealthy connection
W0509 01:23:20.017682       1 node.go:420] Unable to send message to peer: 0x2. Error: Unhealthy connection
W0509 01:23:33.758306       1 pool.go:254] Connection lost with slafferty-dgraph-zero-1.slafferty-dgraph-zero-headless.slafferty.svc.cluster.local:5080. Error: rpc error: code = Unavailable desc = transport is closing
E0509 01:23:44.711037       1 groups.go:994] While proposing delta with MaxAssigned: 11979 and num txns: 0. Error=Server overloaded with pending proposals. Please retry later. Retrying...
I0509 01:23:57.994460       1 log.go:34] 1 [logterm: 15, index: 10299, vote: 1] rejected MsgPreVote from 3 [logterm: 15, index: 10272] at term 15
I0509 01:23:57.994660       1 log.go:34] 1 [logterm: 15, index: 10299, vote: 1] rejected MsgPreVote from 3 [logterm: 15, index: 10272] at term 15
E0509 01:24:29.603307       1 groups.go:994] While proposing delta with MaxAssigned: 11989 and num txns: 0. Error=Server overloaded with pending proposals. Please retry later. Retrying...
E0509 01:24:57.604428       1 groups.go:994] While proposing delta with MaxAssigned: 11989 and num txns: 0. Error=Server overloaded with pending proposals. Please retry later. Retrying...
I0509 01:25:28.257217       1 groups.go:929] Zero leadership changed. Renewing oracle delta stream.
E0509 01:25:28.257931       1 groups.go:905] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0509 01:25:29.264755       1 groups.go:865] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0509 01:25:29.265130       1 groups.go:874] Got Zero leader: slafferty-dgraph-zero-2.slafferty-dgraph-zero-headless.slafferty.svc.cluster.local:5080
W0509 01:25:50.204076       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 1.357s. Breakdown: [{disk 1.357s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
W0509 01:25:52.878036       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 807ms. Breakdown: [{advance 807ms} {disk 0s} {proposals 0s}] Num entries: 0. MustSync: false
W0509 01:25:53.625515       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 641ms. Breakdown: [{advance 641ms} {disk 0s} {proposals 0s}] Num entries: 0. MustSync: false
W0509 01:25:54.827916       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 698ms. Breakdown: [{disk 698ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
W0509 01:26:00.724864       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 1.141s. Breakdown: [{disk 1.141s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false

Once this alpha goes down, the other alphas must try to take over with their replicas; however, there's a massive memory spike (25+ GB) when they do, causing an OOM. It's a bit of a perpetual cycle, and it's hard for them to break out of it (especially if more mutations keep coming in).

So yes, it seems like if mutations were able to run concurrently on the same predicate in ludicrous mode, that might mitigate my issue :slight_smile:

BTW, having so many Jaeger traces throughout the core is fantastic :+1:

So, an interesting thought: there's no real reason all my entities need to have their xid live in the same predicate. For instance, if I have TypeA…TypeZ in Dgraph, I think I could just "namespace" the ids by type: TypeA.id, TypeB.id, etc. This would balance the load of my "xid" predicate upserts across my Dgraph types. Since I have a lot of types, this may be an effective solution for me, and I'll spend some time testing it.
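
As a rough sketch of what a per-type xid upsert could look like with the dgo client (the VM_id predicate, the schema it implies, and the connection details are my own assumptions for illustration, not anything Dgraph prescribes):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/dgraph-io/dgo/v2"
    "github.com/dgraph-io/dgo/v2/protos/api"
    "google.golang.org/grpc"
)

// upsertVM keys a node by a type-namespaced external id (VM_id) instead of a
// shared "xid" predicate, so upsert load spreads across one predicate per type.
// Assumes a schema along the lines of: VM_id: string @index(exact) @upsert .
func upsertVM(ctx context.Context, dg *dgo.Dgraph, xid string) error {
    query := fmt.Sprintf(`query { v as vm(func: eq(VM_id, %q)) { uid } }`, xid)
    mu := &api.Mutation{
        SetNquads: []byte(fmt.Sprintf("uid(v) <VM_id> %q .\nuid(v) <dgraph.type> \"VM\" .", xid)),
    }
    req := &api.Request{Query: query, Mutations: []*api.Mutation{mu}, CommitNow: true}
    _, err := dg.NewTxn().Do(ctx, req)
    return err
}

func main() {
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
    if err := upsertVM(context.Background(), dg, "vm-123"); err != nil {
        log.Fatal(err)
    }
}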

I still have the project predicate though, which is effectively my means of multi-tenancy. I suppose I could make it per-type as I'm describing with xid, but that feels a little weird. I was also thinking about making project a node that has a relationship to all-the-things… which I think would be cheaper, since on disk that's just a sorted list of ints that gets compressed. Might work. I still have the risk of schema collisions across tenants… but that's another thing to tackle :stuck_out_tongue:
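
A hedged sketch of what querying through such a project node might look like, expressed as a DQL query string in Go (the project.name and owns predicate names are made up for illustration):

package main

import "fmt"

// Illustrative only: if "project" were a node with an `owns` edge out to every
// entity, a tenant-scoped query would start at that single project node and
// fan out over `owns`. The predicate names here are hypothetical.
const projectScopedQuery = `
{
  tenant(func: eq(project.name, "acme")) {
    owns @filter(type(VM)) {
      uid
      VM_id
    }
  }
}`

func main() {
    fmt.Println(projectScopedQuery)
}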

The one concern there is that you might have a single node which fans out to millions of other nodes. That can slow queries down a bit.

The memory usage is due to some query returning lots of UIDs. See this: Optimize Dgraph Memory Usage - #2 by mrjn

Appreciate the analysis; roaring bitmaps seem cool! I will attempt sharding my ids by type and report back in the meantime.

Woah @mrjn check this out!

I switched from having a single xid predicate on everything to namespacing it by type, i.e. VM_id, VCPU_id, etc. (across about 200 types), and found some ludicrous performance.

1st mark: concurrency 1, batch 200
2nd mark: concurrency 1, batch 1000
3rd mark: concurrency 10, batch 1000

The last one was using only 10 CPUs and 12 GB of memory to push almost 80k upserts/s! I was struggling to break 1,000/s with 28 cores and 28 GB of memory with the xid-on-everything approach… so that's an 80x performance improvement on half the hardware. Nice!

I was curious whether I could push higher than this, but any increase in concurrency or batch size seems to hit diminishing returns, even though I'm below max CPU utilization.

Is there much of a downside to making my predicates as granular as possible? I suppose there would be some sort of read performance penalty, but since my use case is less about searching for "name" across everything and more about searching for a "name, type" tuple, perhaps it's a better format for me.

Additionally, if I hit the nodes too hard and they get OOM-killed, they'll struggle to come back online due to big memory spikes (the heap is quite small, but the total memory used by the process is high). Maybe I've exacerbated the replica-moving thrash by having so many predicates? I'll try to grab some more data on that.

Given enough time or heavy inserting, I get back to the point where the cluster becomes unhealthy, swallows lots of RAM, and boot-loops from OOM kills (even with all querying and inserting stopped).

Here’s kubectl top
Screen Shot 2020-05-11 at 7.05.40 PM

And here's top on the node itself:

Screen Shot 2020-05-11 at 7.04.54 PM

And here's a heap profile, which accounts for less than half the memory in use:
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.048.pb.gz (71.7 KB)

The logs look pretty bad: 50-minute pauses.

I0511 22:07:47.213177       1 draft.go:822] Blocked pushing to applyCh for 50m11.988s
W0511 22:07:47.528009       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 50m16.661s. Breakdown: [{proposals 50m16.592s} {disk 70ms} {advance 0s}] Num entries: 3144. MustSync: true

alpha2logs (449.2 KB)

Maybe having so many predicates is causing a lot of thrash when a node tries to recover after a reboot?

@ibrahim, looking at the logs, it looks like Badger replays are taking a while. Any ideas why that would be the case?

Also, this one:

I0511 22:33:12.274514       1 log.go:34] LOG Compact 1->2, del 9 tables, add 9 tables, took 17m32.13744674s
I0511 22:33:12.274622       1 log.go:34] Compaction for level: 1 DONE

Looks like the compaction took over 17 mins. That’s very slow! Any ideas what might be going on there?

Showing nodes accounting for 9297.22MB, 96.58% of 9626.05MB total
Dropped 189 nodes (cum <= 48.13MB)
Showing top 10 nodes out of 99
      flat  flat%   sum%        cum   cum%
 3645.55MB 37.87% 37.87%  4793.26MB 49.79%  github.com/dgraph-io/badger/v2/table.OpenTable
 2235.26MB 23.22% 61.09%  2246.26MB 23.34%  github.com/DataDog/zstd.Decompress
  850.11MB  8.83% 69.92%  1004.61MB 10.44%  github.com/dgraph-io/dgraph/protos/pb.(*Mutations).Unmarshal
  752.19MB  7.81% 77.74%  1149.21MB 11.94%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
  549.98MB  5.71% 83.45%   549.98MB  5.71%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  397.02MB  4.12% 87.58%   397.02MB  4.12%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
     256MB  2.66% 90.24%      256MB  2.66%  github.com/dgraph-io/ristretto.newCmRow
  249.61MB  2.59% 92.83%   249.61MB  2.59%  github.com/dgraph-io/badger/v2/skl.newArena
     207MB  2.15% 94.98%      207MB  2.15%  github.com/dgraph-io/badger/v2/table.NewTableBuilder
  154.51MB  1.61% 96.58%   154.51MB  1.61%  github.com/dgraph-io/dgraph/protos/pb.(*DirectedEdge).Unmarshal

The 4.7 GB of memory usage is because of the LoadToRAM setting used for opening Badger. It could be improved by setting it to mmap instead, @seanlaff.
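
For context, here's a minimal sketch of that difference at the Badger v2 options level (the directory path is a placeholder; Dgraph sets these options internally, so this is only illustrative):

package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
    "github.com/dgraph-io/badger/v2/options"
)

func main() {
    // LoadToRAM reads every SSTable fully into memory when the DB opens,
    // which is where the ~4.7 GB in table.OpenTable comes from. MemoryMap
    // lets the OS page table data in and out on demand instead.
    opts := badger.DefaultOptions("/path/to/p")          // placeholder data directory
    opts = opts.WithTableLoadingMode(options.MemoryMap)  // instead of options.LoadToRAM

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}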

The 2.2 GB of memory is because of zstd.Decompress. I think @ibrahim made a change to optimize that, IIRC.

Very nice! :slight_smile:

Not really. That's what we do for our GraphQL implementation. Each type's fields are separate predicates internally (typename.fieldname).

That’d only happen if you have multiple shards. Also, Zero only does that calculation at a regular interval, so there would only be one move happening every so often. The time duration for that is configurable via a flag in Zero.

Thanks for the feedback. Good to hear about the predicates, and the mmap and zstd compression changes will be good tweaks to have in the back pocket.

The cluster has been idle for about 12 hours now and is still struggling to get out of its unhealthy state. Are there any other data, traces, or logs I could pull that would help in diagnosing the slow Badger replays and the high CPU/memory usage?

I'll wait for @ibrahim's take on this. It might help to take a copy of the p or w directory, whichever is slow to open.

@mrjn @seanlaff The Badger DB doesn't take too long to open:

I0511 20:47:14.593184      17 log.go:34] All 258 tables opened in 1m11.905s
I0511 20:47:56.702918      17 log.go:34] All 864 tables opened in 38.226s
I0511 21:01:46.561048       1 log.go:34] All 258 tables opened in 54.605s
I0511 21:02:31.180809       1 log.go:34] All 911 tables opened in 40.731s

The maximum is about a minute for 258 tables, which is not super slow. It is strange, though, that 900+ tables were opened in around 40s while it took over a minute to open 258 tables.

I see that there was a point after which compactions got slow

I0511 21:16:50.409226       1 log.go:34] LOG Compact 2->3, del 5 tables, add 3 tables, took 1.356606356s
I0511 21:16:52.690498       1 log.go:34] LOG Compact 2->3, del 5 tables, add 4 tables, took 2.280474683s
I0511 21:16:53.854120       1 log.go:34] LOG Compact 2->3, del 5 tables, add 4 tables, took 3.594580144s
I0511 22:15:40.136795       1 log.go:34] LOG Compact 0->1, del 14 tables, add 10 tables, took 2m58.029109802s
I0511 22:33:12.274514       1 log.go:34] LOG Compact 1->2, del 9 tables, add 9 tables, took 17m32.13744674s
I0511 22:35:55.973987       1 log.go:34] LOG Compact 2->3, del 4 tables, add 3 tables, took 2m45.303871907s

I see only these three compactions that took multiple minutes. I have two possible explanations for the slow compactions:

  1. I think it might be because the block cache doesn't contain these blocks, so they always have to be decompressed. We have a 1 GB block cache in Alpha, and this compaction alone would've tried to insert about 576 MB (64 MB × 9) of data during the 17-minute-long compaction, which would lead to eviction of existing blocks from the cache.
    The 9 tables from level 1 were decompressed, and the new 9 tables on level 2 were recompressed.
    If performance is important, we should increase the cache size here (see the sketch after this list); if memory is more important, we should keep the cache at the default 1 GB.
  2. Edit: the following point is invalid. Compactions don't cascade from level 1 onwards; we do that only for level 0.
    I see there were multiple compactions running at the same time (https://gist.github.com/jarifibrahim/5fdd80275401ce950c07467e4bf780d2). I believe it took 17 minutes because the lower levels might not have had enough space to accommodate the new tables, so the compaction cascaded to the lower levels.
    So the cache plus cascading compactions could've caused the slow compactions.
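
As a rough sketch of what a bigger cache looks like at the Badger options level (assuming the MaxCacheSize option in the Badger v2 release current at the time; the 2 GB value and the directory path are arbitrary placeholders, and Dgraph wires this through its own configuration rather than code like this):

package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
)

func main() {
    // A larger block cache keeps more decompressed blocks resident, so
    // compactions re-decompress less often. 2 GB is an arbitrary example;
    // the default discussed above is 1 GB.
    opts := badger.DefaultOptions("/path/to/p") // placeholder data directory
    opts = opts.WithMaxCacheSize(2 << 30)       // 2 GB block cache

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}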

This PR reduces the amount of memory we allocate: https://github.com/dgraph-io/badger/pull/1308

@seanlaff we made some performance improvements to compression (which improve the speed of compactions) in v20.03.1 (Release Dgraph v20.03.1 · dgraph-io/dgraph · GitHub). Can you double-check your Dgraph version?

I'm currently running v20.03.1. As my test went on, I added a few more (non-namespaced) predicates that had a moderate chance of being applied to any node. I also set the cardinality of the values for these predicates super low to see what would happen (I generate an int between 1 and 15, to simulate something like "orgID", "deployEnv", etc.). Maybe the Badger posting lists for these just ballooned in size, as would be expected? I noticed large tablets in the new Ratel UI for the low-cardinality predicates I sent in.

200 MB is pretty chunky, so I could see that causing some churn in the 1 GB cache you mentioned :slight_smile: Perhaps I need to be as aggressive as possible about namespacing all predicates to avoid that.

@seanlaff Those are pretty big tables. The default table size is 64 MB, and we store all duplicate keys (with different versions) in a single table, so this means you have a lot of duplicate keys. @mrjn, is this expected? If the tables are 200 MB in size, compactions and compression/decompression/caching would be affected.

How do we know that? Sean’s image shows Dgraph tablets, not Badger tables.

Also, what I saw from the logs was that the value log replay went on for a long while after the tables were opened. The question is, why did the value log need to be replayed from so far back? For around 14 mins, it was replaying the value log.

I0511 21:02:31.196860       1 log.go:34] Replaying file id: 559 at offset: 0
I0511 21:02:33.668348       1 log.go:34] Replaying file id: 564 at offset: 0
I0511 21:02:36.043134       1 log.go:34] Replaying file id: 565 at offset: 0
I0511 21:02:39.134462       1 log.go:34] Replaying file id: 566 at offset: 0
I0511 21:02:42.870889       1 log.go:34] Replaying file id: 567 at offset: 0
...
I0511 21:15:56.718166       1 log.go:34] Replaying file id: 786 at offset: 0
I0511 21:15:58.766321       1 log.go:34] Replaying file id: 787 at offset: 0
I0511 21:16:01.757383       1 log.go:34] Replaying file id: 788 at offset: 0
I0511 21:16:03.753972       1 log.go:34] Replaying file id: 789 at offset: 0
I0511 21:16:06.446048       1 log.go:34] Replaying file id: 790 at offset: 0
I0511 21:16:08.663018       1 log.go:34] Replaying file id: 791 at offset: 0

My bad. I confused Badger tables with Dgraph tablets.

This is because we keep level 0 in memory. So, in the case of a crash, the data in level 0 and the memtable has to be replayed. This means more data has to be ingested during replay, leading to more compactions, which leads to longer replays.


Heya, so I ran another test with all the predicates namespaced by type. Everything ran pretty smooth and level; however, since Go is greedy about not returning memory to the OS, usage slowly climbs and the process eventually gets OOM-killed.

The good news is I don't see outrageous memory spikes on reboot using this namespace-everything pattern; however, it looks like the rebooting alpha is taking a really long time to catch back up. My mutations are hanging (which is the back pressure I'd expect to see), but I'm 25 minutes in and it's still exhibiting this behavior.

Here are some logs:
dgraphalphareboot (131.6 KB)

Do the replays happen serially? Maybe the replay is slowed down because I have 1,000 predicates? Any advice on how to speed this up? If I run more than 3 Dgraph alphas, maybe it'll be easier for the cluster to hot-swap over to a new alpha while the rebooting one catches up, although I'm not sure whether that rebooting one would trigger cluster-wide unhealthiness.

@ibrahim is testing switching L0 back to on-disk (instead of keeping it in memory). If that happens, a reboot should be fast from the Badger side.

Cool stuff. If you guys have a branch that starts to look promising, let me know; I'd be happy to give it a run in my system.

We just need to set 'KeepL0InMemory' to false in Dgraph. I think that should solve the problem of value log replays.
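
At the Badger v2 options level, that change is roughly the following (a sketch only; the real change is in how Dgraph opens its Badger instances, and the directory path here is a placeholder):

package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
)

func main() {
    // With KeepL0InMemory disabled, level-0 tables live on disk, so after a
    // crash there is far less data to rebuild from the value log and
    // memtable on the next start.
    opts := badger.DefaultOptions("/path/to/p") // placeholder data directory
    opts = opts.WithKeepL0InMemory(false)

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}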

@Paras @martinmr, maybe we can make this change in a branch and give Sean a binary to test?

@seanlaff Can you try this build with @Paras’s change in PR #5442? It sets KeepL0InMemory to false. The Docker image is called dgraph/dgraph:v20.03.1-l0notinmemory.
