Dgraph crashes after predicate move (another OOM crash?)

Hi team, I am running a Dgraph cluster in the following layout -
Machine A - 1 Zero, 3 Alphas
Machine B - 3 Alphas

with a replication factor of 3, thus forming two groups. This can also be seen in the image below.

Both machines are AWS EC2 c5.2xlarge instances. I am inserting data with the following schema -

entity_type: string .
entity_value: string @index(exact) @count .
phone: [uid] @count @reverse .
device: [uid] @count @reverse .

I am trying Dgraph for a use case where I create links between different users (entity_values) based on shared phone/device nodes. Each user can have anywhere from 10 to 500 such links. I am inserting the data from Kafka. Everything works fine when I start, and all the predicates were initially served by the first group of Alphas. After some time, Dgraph initiated rebalancing and moved the entity_type predicate to group 2; everything was still fine at that point. But when Dgraph moved the phone predicate to group 2, the node crashed a minute or so later, and now both reads and writes have stopped. Only about half a million nodes had been inserted by then.
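For context, this is roughly how I model the links (a minimal sketch with made-up values; the real inserts come from Kafka) - two users pointing at the same phone node become linked through it, and the @reverse edge lets me traverse from the phone back to its users:

{
  set {
    _:alice <entity_type>  "user" .
    _:alice <entity_value> "user_alice" .
    _:bob   <entity_type>  "user" .
    _:bob   <entity_value> "user_bob" .
    _:ph1   <entity_type>  "phone" .
    _:ph1   <entity_value> "9999999999" .
    # both users share the same phone node, which links them
    _:alice <phone> _:ph1 .
    _:bob   <phone> _:ph1 .
  }
}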

Here are some logs from Zero -

Groups sorted by size: [{gid:1 size:0} {gid:2 size:0}]

I0818 07:00:45.461448      15 tablet.go:213] size_diff 0
I0818 07:08:45.461439      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:0} {gid:1 size:0}]

I0818 07:08:45.461462      15 tablet.go:213] size_diff 0
I0818 07:16:45.461506      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:0} {gid:2 size:0}]

I0818 07:16:45.461532      15 tablet.go:213] size_diff 0
I0818 07:24:45.461429      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:0} {gid:1 size:49693467594}]

I0818 07:24:45.461454      15 tablet.go:213] size_diff 49693467594
I0818 07:24:45.461639      15 tablet.go:108] Going to move predicate: [entity_type], size: [22 GB] from group 1 to 2
I0818 07:24:45.461740      15 tablet.go:135] Starting move: predicate:"entity_type" source_gid:1 dest_gid:2 txn_ts:911263
I0818 07:24:47.536635      15 tablet.go:150] Move at Alpha done. Now proposing: tablet:<group_id:2 predicate:"entity_type" force:true space:22067073752 move_ts:911263 >
I0818 07:24:47.536874      15 tablet.go:156] Predicate move done for: [entity_type] from group 1 to 2
I0818 07:32:45.461442      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 07:32:45.461498      15 tablet.go:213] size_diff 20813563481
I0818 07:40:45.461432      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 07:40:45.461466      15 tablet.go:213] size_diff 20813563481
I0818 07:48:45.461428      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 07:48:45.461459      15 tablet.go:213] size_diff 20813563481
I0818 07:56:45.461443      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 07:56:45.461478      15 tablet.go:213] size_diff 20813563481
I0818 08:04:45.461445      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:04:45.461480      15 tablet.go:213] size_diff 20813563481
I0818 08:12:45.461462      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:12:45.462072      15 tablet.go:213] size_diff 20813563481
I0818 08:20:45.461425      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:20:45.461461      15 tablet.go:213] size_diff 20813563481
I0818 08:28:45.461469      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:28:45.461511      15 tablet.go:213] size_diff 20813563481
I0818 08:36:45.461452      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:36:45.461474      15 tablet.go:213] size_diff 20813563481
I0818 08:44:45.461428      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:44:45.461461      15 tablet.go:213] size_diff 20813563481
I0818 08:52:45.461413      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:42880637233}]

I0818 08:52:45.461435      15 tablet.go:213] size_diff 20813563481
I0818 09:00:45.461431      15 tablet.go:208]

Groups sorted by size: [{gid:2 size:22067073752} {gid:1 size:59877864994}]

I0818 09:00:45.461466      15 tablet.go:213] size_diff 37810791242
I0818 09:00:45.461658      15 tablet.go:108] Going to move predicate: [phone], size: [12 GB] from group 1 to 2
I0818 09:00:45.461742      15 tablet.go:135] Starting move: predicate:"phone" source_gid:1 dest_gid:2 txn_ts:1380036
I0818 09:00:53.291965      15 tablet.go:150] Move at Alpha done. Now proposing: tablet:<group_id:2 predicate:"phone" force:true space:12327615775 move_ts:1380036 >
I0818 09:00:53.292342      15 tablet.go:156] Predicate move done for: [phone] from group 1 to 2
I0818 09:07:28.926155      15 zero.go:417] Got connection request: cluster_info_only:true
I0818 09:07:28.926358      15 zero.go:435] Connected: cluster_info_only:true
I0818 09:08:45.461446      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]

I0818 09:08:45.461467      15 tablet.go:213] size_diff 23813934864
I0818 09:16:45.461453      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]

I0818 09:16:45.461498      15 tablet.go:213] size_diff 23813934864
I0818 09:24:45.461445      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]

I0818 09:24:45.461491      15 tablet.go:213] size_diff 23813934864
I0818 09:32:45.461436      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]

I0818 09:32:45.461487      15 tablet.go:213] size_diff 23813934864
I0818 09:40:45.461459      15 tablet.go:208]

Groups sorted by size: [{gid:1 size:10580754663} {gid:2 size:34394689527}]

I0818 09:40:45.461507      15 tablet.go:213] size_diff 23813934864

And here are the alpha logs from one of the crashed alphas -

I0818 08:50:58.293612      14 log.go:34] Compaction for level: 0 DONE
W0818 08:52:28.076360      14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 227ms. Breakdown: [{disk 195ms} {proposals 31ms} {advance 0s}] Num entries: 0. MustSync: false
I0818 08:53:35.459517      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:53:35.459551      14 log.go:34] Running for level: 0
I0818 08:53:37.107415      14 log.go:34] LOG Compact 0->1, del 7 tables, add 2 tables, took 1.647849234s
I0818 08:53:37.107440      14 log.go:34] Compaction for level: 0 DONE
I0818 08:56:13.761489      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:56:13.761529      14 log.go:34] Running for level: 0
I0818 08:56:14.800691      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.039142093s
I0818 08:56:14.800714      14 log.go:34] Compaction for level: 0 DONE
I0818 08:58:47.459494      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 08:58:47.459524      14 log.go:34] Running for level: 0
I0818 08:58:48.543593      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.084055233s
I0818 08:58:48.543620      14 log.go:34] Compaction for level: 0 DONE
I0818 09:00:45.681933      14 index.go:1183] Dropping predicate: [phone]
I0818 09:00:45.681956      14 log.go:34] DropPrefix Called
I0818 09:00:45.681964      14 log.go:34] Writes flushed. Stopping compactions now...
I0818 09:00:45.909719      14 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 5 112 104 111 110 101]}
I0818 09:00:45.909766      14 log.go:34] Running for level: 0
I0818 09:00:46.981479      14 log.go:34] LOG Compact 0->1, del 3 tables, add 1 tables, took 1.071693499s
I0818 09:00:46.981511      14 log.go:34] Compaction for level: 0 DONE
I0818 09:00:46.981536      14 log.go:34] DropPrefix done
I0818 09:00:46.981545      14 log.go:34] Resuming writes
I0818 09:00:46.981556      14 schema.go:103] Deleting schema for predicate: [phone]
W0818 09:00:48.403542      14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 479ms. Breakdown: [{proposals 478ms} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
I0818 09:00:48.414585      14 predicate_move.go:46] Writing 381906 keys
I0818 09:00:50.246610      14 schema.go:452] Setting schema for attr phone: [uid], tokenizer: [], directive: REVERSE, count: true

I0818 09:00:52.886656      14 log.go:34] 6 is starting a new election at term 2
I0818 09:00:52.886671      14 log.go:34] 6 became pre-candidate at term 2
I0818 09:00:52.886674      14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:00:52.886936      14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 4 at term 2
I0818 09:00:52.887441      14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 5 at term 2
I0818 09:00:52.887652      14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:00:53.478101      14 log.go:34] 6 became follower at term 2
I0818 09:00:53.501749      14 log.go:34] raft.node: 6 elected leader 5 at term 2
I0818 09:00:57.097364      14 log.go:34] 6 [logterm: 2, index: 1848399, vote: 0] cast MsgPreVote for 4 [logterm: 2, index: 1848399] at term 2
I0818 09:00:58.084532      14 predicate_move.go:46] Writing 78460 keys
I0818 09:00:58.719426      14 schema.go:452] Setting schema for attr phone: [uid], tokenizer: [], directive: REVERSE, count: true

I0818 09:00:58.828682      14 log.go:34] 6 is starting a new election at term 2
I0818 09:00:58.828697      14 log.go:34] 6 became pre-candidate at term 2
I0818 09:00:58.828701      14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:00:58.829111      14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 5 at term 2
I0818 09:00:58.829546      14 log.go:34] 6 [logterm: 2, index: 1848399] sent MsgPreVote request to 4 at term 2
I0818 09:00:58.829757      14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:00:58.829773      14 log.go:34] 6 became follower at term 2
I0818 09:00:58.829777      14 log.go:34] raft.node: 6 elected leader 5 at term 2
W0818 09:00:58.829902      14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 791ms. Breakdown: [{advance 747ms} {proposals 43ms} {disk 0s}] Num entries: 0. MustSync: false
I0818 09:01:36.459544      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:01:36.459594      14 log.go:34] Running for level: 0
I0818 09:01:37.577325      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.117714397s
I0818 09:01:37.577352      14 log.go:34] Compaction for level: 0 DONE
I0818 09:04:14.459463      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:04:14.459496      14 log.go:34] Running for level: 0
I0818 09:04:15.666420      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.206907277s
I0818 09:04:15.666445      14 log.go:34] Compaction for level: 0 DONE
I0818 09:06:58.459514      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:06:58.459544      14 log.go:34] Running for level: 0
I0818 09:06:59.736871      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.277311873s
I0818 09:06:59.736906      14 log.go:34] Compaction for level: 0 DONE
I0818 09:07:28.239184      14 log.go:34] 6 is starting a new election at term 2
I0818 09:07:28.239367      14 log.go:34] 6 became pre-candidate at term 2
I0818 09:07:28.239454      14 log.go:34] 6 received MsgPreVoteResp from 6 at term 2
I0818 09:07:28.241978      14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgPreVote request to 5 at term 2
I0818 09:07:28.243636      14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgPreVote request to 4 at term 2
I0818 09:07:28.246158      14 log.go:34] raft.node: 6 lost leader 5 at term 2
I0818 09:07:29.041093      14 log.go:34] 6 received MsgPreVoteResp from 5 at term 2
I0818 09:07:29.041114      14 log.go:34] 6 [quorum:2] has received 2 MsgPreVoteResp votes and 0 vote rejections
I0818 09:07:29.041125      14 log.go:34] 6 became candidate at term 3
I0818 09:07:29.041129      14 log.go:34] 6 received MsgVoteResp from 6 at term 3
I0818 09:07:29.041896      14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgVote request to 5 at term 3
I0818 09:07:29.042170      14 log.go:34] 6 [logterm: 2, index: 1909465] sent MsgVote request to 4 at term 3
I0818 09:07:29.043823      14 log.go:34] 6 received MsgVoteResp rejection from 5 at term 3
I0818 09:07:29.043842      14 log.go:34] 6 [quorum:2] has received 1 MsgVoteResp votes and 1 vote rejections
I0818 09:07:30.248547      14 log.go:34] 6 [term: 3] ignored a MsgPreVoteResp message with lower term from 4 [term: 2]
I0818 09:07:30.248593      14 log.go:34] 6 received MsgVoteResp rejection from 4 at term 3
I0818 09:07:30.248599      14 log.go:34] 6 [quorum:2] has received 1 MsgVoteResp votes and 2 vote rejections
I0818 09:07:30.248609      14 log.go:34] 6 became follower at term 3
I0818 09:07:32.758567      14 log.go:34] 6 [logterm: 2, index: 1909465, vote: 6] cast MsgPreVote for 5 [logterm: 2, index: 1909466] at term 3
I0818 09:07:32.759571      14 log.go:34] 6 [term: 3] received a MsgVote message with higher term from 5 [term: 4]
I0818 09:07:32.759733      14 log.go:34] 6 became follower at term 4
I0818 09:07:32.760040      14 log.go:34] 6 [logterm: 2, index: 1909465, vote: 0] cast MsgVote for 5 [logterm: 2, index: 1909466] at term 4
I0818 09:07:32.760681      14 log.go:34] raft.node: 6 elected leader 5 at term 4
I0818 09:09:38.459441      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:09:38.459475      14 log.go:34] Running for level: 0
I0818 09:09:40.010848      14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 1.551353146s
I0818 09:09:40.010876      14 log.go:34] Compaction for level: 0 DONE
I0818 09:12:17.459512      14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0818 09:12:17.459545      14 log.go:34] Running for level: 0
I0818 09:12:18.799571      14 log.go:34] LOG Compact 0->1, del 6 tables, add 2 tables, took 1.340010232s
I0818 09:12:18.799607      14 log.go:34] Compaction for level: 0 DONE
W0818 09:14:21.846563      14 draft.go:1183] Raft.Ready took too long to process: Timer Total: 222ms. Breakdown: [{advance 148ms} {disk 71ms} {proposals 3ms}] Num entries: 0. MustSync: false

The logs look fairly normal to me. What could be happening here? How do I revive the crashed node without losing data? And how should I proceed after such crashes?

Hey @praneelrathore, I don’t see any crashes in the logs. Why do you think there was a crash? If the pod/container restarted because of memory, you would’ve seen a SIGKILL in the logs and status as OOM in your pod/container.

From the Alpha logs that you shared, it seems like there is a lot of Raft activity:

I0818 09:07:28.239184      14 log.go:34] 6 is starting a new election at term 2
..
I0818 09:07:28.246158      14 log.go:34] raft.node: 6 lost leader 5 at term 2
...
I0818 09:07:29.041125      14 log.go:34] 6 became candidate at term 3
...
I0818 09:07:32.759733      14 log.go:34] 6 became follower at term 4
...

This isn’t a problem if the cluster stabilizes, but if elections keep happening, there might be an underlying issue.

Can you monitor the memory usage of the Alphas? If you see any Alpha using unusually high memory, please take a memory profile: https://dgraph.io/docs/howto/retrieving-debug-information/#memory-profile
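For reference, the heap profile can be pulled from the Alpha’s HTTP port (assuming the default 8080; adjust for any port offset you use), for example:

# inspect interactively
go tool pprof http://<alpha-address>:8080/debug/pprof/heap

# or save it to a file so you can attach it here
curl http://<alpha-address>:8080/debug/pprof/heap -o alpha.heap.pb.gz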

@ibrahim I took the memory profile for the instance that became unresponsive. Attaching them here - pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb (92.7 KB) pprof.dgraph.samples.cpu.001.pb (35.3 KB)

A simple top command on the memory profile shows this -

Showing nodes accounting for 1231.84MB, 93.48% of 1317.75MB total
Dropped 105 nodes (cum <= 6.59MB)
Showing top 10 nodes out of 78
      flat  flat%   sum%        cum   cum%
  256.79MB 19.49% 19.49%   256.79MB 19.49%  github.com/dgraph-io/ristretto.newCmRow
  200.50MB 15.22% 34.70%   351.75MB 26.69%  github.com/dgraph-io/dgraph/worker.(*node).processApplyCh.func1
  166.41MB 12.63% 47.33%   166.41MB 12.63%  github.com/dgraph-io/badger/v2/skl.newArena
  148.26MB 11.25% 58.58%   148.26MB 11.25%  github.com/dgraph-io/badger/v2/table.glob..func1
  137.23MB 10.41% 69.00%   137.23MB 10.41%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
   86.82MB  6.59% 75.58%    96.40MB  7.32%  github.com/dgraph-io/badger/v2/table.OpenTable
      69MB  5.24% 80.82%       69MB  5.24%  github.com/dgraph-io/badger/v2/table.NewTableBuilder
      66MB  5.01% 85.83%       72MB  5.46%  github.com/dgraph-io/dgraph/protos/pb.(*Proposal).Unmarshal
      62MB  4.70% 90.53%       62MB  4.70%  go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
   38.83MB  2.95% 93.48%    49.67MB  3.77%  github.com/dgraph-io/badger/v2/y.(*WaterMark).process.func1

I also have some Prometheus metrics captured just before the instance became unresponsive.
https://snapshot.raintank.io/dashboard/snapshot/qs3M3YbaVDOetqo42WZj0QBry25YCd0W?orgId=2

These are the Badger metrics from right before the crash. The predicate move happened right at 14:30. I was tracking other metrics as well but couldn’t find any unusual behaviour there. Attaching the logs from the Alphas that crashed: alpha1.txt (714.3 KB) alpha2.txt (714.3 KB) alpha3.txt (666.6 KB)

Let me know if you need anything else. This behaviour is more or less reproducible and I am unable to figure out where I am going wrong. I revisited my schema and also ran in ludicrous mode, but it keeps happening. Below are the config file and the Alpha command I am using.

lru_mb: 2048
zero: 10.50.34.90:5080
my: 10.50.34.68:7080
postings: /home/ubuntu/server1/p
wal: /home/ubuntu/server1/w
expose_trace: true
whitelist: 172.17.0.1
telemetry: true
max_retries: 3
mutations: allow
normalize_node_limit: 1000
query_edge_limit: 1000
idx: 4
snapshot_after: 50000
log_dir: /home/ubuntu/logs
dgraph alpha --ludicrous_mode --log_dir /home/ubuntu/logs/ --config /config.yaml

Is it possible that Dgraph, while shifting the tablet, moved too much data into the second instance and caused a memory spike there?

@praneelrathore The memory profile looks fine. I would like to see a memory profile of the Alpha that’s about to OOM (I’m guessing the OOM already happened and the nodes are stable now).

From the logs you shared, it seems like there was a predicate move which caused writes to be blocked, but then the writes started again. I also see the Raft.Ready took too long message, but I’m guessing the Alpha was able to process the pending requests and then carried on.

From the logs, I see that all three Alphas restarted at the same time. Did you restart them?
alpha1 restarted at 2020/08/19 09:57:54.
alpha2 restarted at 2020/08/19 09:57:55
alpha3 restarted at 2020/08/19 09:57:54

Also, there are too many connection errors. Did you change the IP:Port or something after starting the cluster?

E0819 09:56:49.479505      15 groups.go:1095] Error from alpha client subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.50.34.90:7082: connect: network is unreachable"
E0819 09:56:49.579623      15 groups.go:1095] Error from alpha client subscribe: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.50.34.90:7082: connect: network is unreachable"

Are the reads/writes still blocked? If so, is it on all the Alphas? You can run dgraph increment --alpha "alphaAddress", which performs a simple increment operation using an upsert (we use it to check whether the cluster is responding). Please run dgraph increment against all three Alphas. If it works, your queries should be working.
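For example (assuming the default client gRPC port 9080 plus whatever port offsets you started each Alpha with):

dgraph increment --alpha "10.50.34.68:9080"
dgraph increment --alpha "10.50.34.68:9081"
dgraph increment --alpha "10.50.34.90:9080"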

@ibrahim

I would like to see a memory profile of the Alpha that’s about to OOM (I’m guessing the OOM already happened and the nodes are stable now).

I think the problem is not related to OOM. Earlier the instance crashed and I assumed that was the cause, but looking at the logs now, I don’t think that is the issue.

Did you restart them?

Yes, I did reboot the EC2 instance as it was unresponsive.

Also, there are too many connection errors. Did you change the IP:Port or something after starting the cluster?

No, I didn’t change the IP:port. Not sure why it is reporting these errors.

If it works, your queries should be working.

Yes my queries started working.

The logs were on a rotation policy, so they might not give us the full picture.
I had inserted about half a million users when this happened. If I start inserting the other half now, I am quite sure it will happen again. What should we do if such a case happens in production? I still think that Dgraph moved a heavy predicate to the other instance and that caused the unresponsiveness. What is the optimal way to distribute predicates? The data sizes are attached in the image in the question.

Interesting. Can you confirm whether that IP points to alpha3 (or the Alpha running with -o 102)? Port 7082 is used for internal gRPC communication.

@praneelrathore if you’re sure that there was a crash, can you attempt the same operation in your staging/local environment? Trying to reproduce a crash on prod might not be the best approach :slight_smile:
If a similar issue happens, you should definitely save the logs. If you have monitoring set up on Dgraph, you should be actively watching memory and CPU when the tablet rebalance happens. If you see any unusual activity, please take a CPU/memory profile so that we can find the root cause.

When a tablet move happens, the predicate being moved is read-only. Were you trying to write to that same predicate?

You don’t need to do anything to distribute the predicates. Dgraph Zero monitors the predicate sizes and moves them automatically. See https://dgraph.io/docs/master/deploy/cluster-setup/#shard-rebalancing
Also, I highly recommend reading Dgraph Whitepapers: In-Depth Insights and Analysis if you’re interested in understanding Dgraph internals.
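For reference, if you ever do want to trigger a move yourself, Zero exposes an HTTP endpoint for it (a sketch assuming Zero’s default HTTP port 6080), although the automatic rebalancing should normally be enough:

# ask Zero to move the phone predicate to group 2
curl "http://<zero-address>:6080/moveTablet?tablet=phone&group=2"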

@ibrahim

Interesting. Can you confirm whether that IP points to alpha3 (or the Alpha running with -o 102)? Port 7082 is used for internal gRPC communication.

Yeah, it is correct. Maybe the logs were from the time when the instance was unresponsive.

If you’re sure that there was a crash, can you attempt the same operation in your staging/local environment?

Yeah, I am actually evaluating Dgraph in a test environment for now, to decide whether to move to prod with it and to avoid such surprises there :stuck_out_tongue:

If you have monitoring set up on Dgraph, you should be actively watching memory and CPU when the tablet rebalance happens. If you see any unusual activity, please take a CPU/memory profile so that we can find the root cause.

CPU was normal during the test, but there was a memory spike during the predicate move. The instance has an EBS volume attached with 100 provisioned IOPS, and when the predicate move happened there was a sharp spike in the crashed instance’s EBS readBytes and writeBytes metrics. The heavy reads/writes during the tablet move can also be seen in the Badger Prometheus plot I attached earlier. But yeah, I get it: without profiles we are clueless about what exactly happened at that time. I will definitely try to repeat this exercise and provide a profile snapshot if I can reproduce it.

When a tablet move happens, the predicate being moved is read-only. Were you trying to write to that same predicate?

Yes, I was streaming events from Kafka to Dgraph, so it would have tried to write data to that predicate. Should this be handled in application logic?

You don’t need to do anything to distribute the predicates. Dgraph Zero monitors the predicate sizes and moves them automatically. See https://dgraph.io/docs/master/deploy/cluster-setup/#shard-rebalancing
Also, I highly recommend reading Dgraph Whitepapers: In-Depth Insights and Analysis if you’re interested in understanding Dgraph internals.

I read the docs and will give the paper a read as well. Does Dgraph check the available memory on the other instance before deciding to move a predicate? I am curious because the size distribution in the Ratel snapshot doesn’t look even at all, which is why I was considering a mix of manual and automatic predicate movement.

@ibrahim Sorry, I have some more questions -

  • While one of the machines was unresponsive, every query to Dgraph failed, although the other instance was up and running. Is this the expected behaviour? If yes, then what is the use of the replication factor?

  • If my queries involve predicates from two different machines and one of them goes down, will my query fail?

  • I feel that Dgraph would run better with vertical scaling rather than horizontal, as there would be less predicate movement between machines. Dgraph seems to be quite resource-hungry, as the docs also mention. What are your thoughts on this?

Hey @praneelrathore, apologies for the delayed reply.

Interesting. If the log keeps showing up, please do report it.

Yes. Dgraph will return a "Commits are blocked on predicate xxx because of tablet move" error if you try to write to a predicate that is being moved.
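If you want to handle that in application logic, a minimal retry sketch using the Go client (dgo) could look like the following. The address, the placeholder N-Quad, the retry count, and the error-string check are all assumptions on my part, not guarantees from Dgraph:

package main

import (
	"context"
	"log"
	"strings"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// assumed Alpha gRPC address; adjust for your setup
	conn, err := grpc.Dial("10.50.34.68:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// hypothetical placeholder mutation
	mu := &api.Mutation{
		SetNquads: []byte(`_:u <entity_value> "user_example" .`),
		CommitNow: true,
	}

	for attempt := 1; attempt <= 5; attempt++ {
		_, err = dg.NewTxn().Mutate(context.Background(), mu)
		if err == nil {
			return // mutation committed
		}
		// retry only when the commit is blocked by a tablet move
		if strings.Contains(err.Error(), "Commits are blocked") {
			time.Sleep(time.Duration(attempt) * time.Second)
			continue
		}
		log.Fatal(err) // some other error
	}
	log.Fatal("mutation still blocked after retries")
}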

I might be wrong about this one, but we don’t check memory on the other Alpha. The predicate is moved when we detect that one group has more data than the other group.
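You can see the per-group and per-tablet sizes that Zero is tracking via its /state endpoint (assuming Zero’s default HTTP port 6080):

# returns cluster membership and tablet sizes as JSON
curl -s "http://<zero-address>:6080/state"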

Did every query to the unresponsive instance fail, or did every query to any instance fail? Healthy instances would respond to queries.

If the query cannot resolve the data it needs because the group serving that predicate has no healthy node, then your query will fail. You cannot get a partial result back.

A heavy predicate move might have caused the issue (memory spike) you saw. Also, predicates are not moved very often. As for resources, it totally depends on what operations you’re performing.