Not able to run queries and mutations on Alpha


Report a Dgraph Bug

What version of Dgraph are you using?

$ dgraph version
 
1.0.17

Have you tried reproducing the issue with the latest release?

No

What is the hardware spec (RAM, OS)?

16-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz running Ubuntu 18.04.5 LTS with 33 GB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

dgraph alpha --my=alpha_3:7082 --lru_mb=12048 --zero=zero_1:5080 -o 2

Expected behaviour and actual result.

Alpha should be able to execute queries and mutations. Instead, queries and mutations against our Alphas stopped completing.


What you wanted to do

We have had the same setup for a very long time, and all of a sudden our servers stopped executing queries and mutations.
We have no clue what the root cause of this problem is yet…

Our Grafana Dashboard:


**Cluster Setup**
3 servers, each running an Alpha/Zero pair, on Docker Swarm

Alpha 1 logs

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2018 Dgraph Labs, Inc.


I0120 15:26:36.934976       1 run.go:461] x.Config: {DebugMode:false PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I0120 15:26:36.934996       1 run.go:462] worker.Config: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:alpha_1:7080 ZeroAddr:zero_1:5080 RaftId:0 ExpandEdge:true WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false SnapshotAfter:10000 AbortOlderThan:5m0s}
I0120 15:26:36.935014       1 run.go:463] edgraph.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap WALDir:w MutationsMode:0 AuthToken: AllottedMemory:12048}
I0120 15:26:36.935042       1 server.go:115] Setting Badger table load option: mmap
I0120 15:26:36.935048       1 server.go:127] Setting Badger value log load option: mmap
I0120 15:26:36.935052       1 server.go:155] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x1f8b490 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:65500 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0120 15:26:37.619444       1 node.go:88] All 8 tables opened in 677ms
I0120 15:26:37.620099       1 node.go:88] Replaying file id: 4884 at offset: 749917
I0120 15:26:37.620157       1 node.go:88] Replay took: 48.588µs
I0120 15:26:37.620189       1 node.go:88] Replaying file id: 4885 at offset: 0
I0120 15:26:37.633398       1 node.go:88] Replay took: 13.202292ms
I0120 15:26:37.633426       1 node.go:88] Replaying file id: 4886 at offset: 0
I0120 15:26:37.634874       1 node.go:88] Replay took: 1.442207ms
I0120 15:26:37.635026       1 server.go:115] Setting Badger table load option: mmap
I0120 15:26:37.635045       1 server.go:127] Setting Badger value log load option: mmap
I0120 15:26:37.635051       1 server.go:169] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x1f8b490 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0120 15:26:38.783748       1 node.go:88] All 12 tables opened in 1.146s
I0120 15:26:38.792483       1 node.go:88] Replaying file id: 54 at offset: 413356010
I0120 15:26:39.759737       1 node.go:88] Replay took: 967.229684ms
I0120 15:26:39.759818       1 node.go:88] Replaying file id: 55 at offset: 0
I0120 15:26:40.028641       1 node.go:88] Replay took: 268.805722ms
I0120 15:26:40.028951       1 groups.go:96] Current Raft Id: 0x1
I0120 15:26:40.029057       1 worker.go:81] Worker listening at address: [::]:7080
I0120 15:26:40.029088       1 run.go:392] gRPC server started.  Listening on port 9080
I0120 15:26:40.029103       1 run.go:393] HTTP server started.  Listening on port 8080
2021/01/20 15:26:40 Endpoint has been deprecated. Please use CollectorEndpoint instead.
I0120 15:26:40.129163       1 pool.go:138] CONNECTED to zero_1:5080
W0120 15:26:40.191245       1 pool.go:226] Connection lost with zero_1:5080. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup zero_1 on 127.0.0.11:53: no such host"
I0120 15:27:04.921881       1 pool.go:138] CONNECTED to zero_2:5081
I0120 15:27:05.415040       1 groups.go:119] Connected to group zero. Assigned group: 0
I0120 15:27:05.415071       1 groups.go:121] Raft Id after connection to Zero: 0x1
I0120 15:27:05.415099       1 pool.go:138] CONNECTED to alpha_2:7081
I0120 15:27:05.415110       1 pool.go:138] CONNECTED to alpha_3:7082
I0120 15:27:05.431551       1 pool.go:138] CONNECTED to zero_3:5082
I0120 15:27:05.431824       1 draft.go:74] Node ID: 0x1 with GroupID: 1
I0120 15:27:05.431949       1 node.go:157] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc019823320 Applied:10862439 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1f8b490 DisableProposalForwarding:false}
I0120 15:27:05.979889       1 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:10862439 Term:71 XXX_unrecognized:[]}
I0120 15:27:05.979937       1 node.go:317] Found hardstate: {Term:80 Vote:0 Commit:10870127 XXX_unrecognized:[]}
I0120 15:27:09.303274       1 node.go:326] Group 1 found 7689 entries
I0120 15:27:09.303307       1 draft.go:1321] Restarting node for group: 1
I0120 15:27:09.303359       1 node.go:193] Setting conf state to nodes:1 nodes:2 nodes:3 
I0120 15:27:12.524082       1 node.go:88] 1 became follower at term 80
I0120 15:27:12.524460       1 node.go:88] newRaft 1 [peers: [1,2,3], term: 80, commit: 10870127, applied: 10862439, lastindex: 10870127, lastterm: 80]
I0120 15:27:12.524500       1 groups.go:139] Server is ready
I0120 15:27:12.524571       1 draft.go:720] Found Raft progress: 10870110
I0120 15:27:12.524579       1 groups.go:650] Got address of a Zero leader: zero_2:5081
I0120 15:27:12.524808       1 groups.go:663] Starting a new membership stream receive from zero_2:5081.
I0120 15:27:12.573386       1 index.go:71] Long term for exact index on predicate: [03ee20c4-d814-4c4f-a678-9475e3a21405]. Consider switching to hash for better performance.
I0120 15:27:12.590515       1 index.go:71] Long term for exact index on predicate: [03ee20c4-d814-4c4f-a678-9475e3a21405]. Consider switching to hash for better performance.
I0120 15:27:12.710234       1 groups.go:680] Received first state update from Zero: counter:1397183 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"alpha_1:7080" last_update:1611155812 > > members:<key:2 value:<id:2 group_id:1 addr:"alpha_2:7081" last_update:1611156430 > > members:<key:3 value:<id:3 group_id:1 addr:"alpha_3:7082" last_update:1611156069 > > tablets:<key:"00022c08-0b6e-43b1-a55a-bf7811ba043f" value:<group_id:1 predicate:"00022c08-0b6e-43b1-a55a-bf7811ba043f" space:511 > > tablets:<key:"0003055d-1a3c-4aef-a1d9-442430da42f3" value:<group_id:1 predicate:"0003055d-1a3c-4aef-a1d9-442430da42f3" space:83 > > 

bunch of schema 

predicate:"wspcPersonDefaultUserGroup" space:21887 > > tablets:<key:"~10aa1897-822b-491c-bb71-8415f8d6f851" value:<group_id:1 predicate:"~10aa1897-822b-491c-bb71-8415f8d6f851" > > tablets:<key:"~6ebb1f51-6806-4605-8d1f-97d25dc2ba11" value:<group_id:1 predicate:"~6ebb1f51-6806-4605-8d1f-97d25dc2ba11" > > tablets:<key:"~74e72b2b-9df5-4af5-a295-8d7a3e1e8e87" value:<group_id:1 predicate:"~74e72b2b-9df5-4af5-a295-8d7a3e1e8e87" > > tablets:<key:"~944dac0d-3197-4e7e-9d49-42bfce283043" value:<group_id:1 predicate:"~944dac0d-3197-4e7e-9d49-42bfce283043" > > tablets:<key:"~a6e30999-c971-4c0c-b1ef-dd6ea0994a19" value:<group_id:1 predicate:"~a6e30999-c971-4c0c-b1ef-dd6ea0994a19" > > tablets:<key:"~c3699619-9cd4-4d33-b24a-0b8c7831ee6a" value:<group_id:1 predicate:"~c3699619-9cd4-4d33-b24a-0b8c7831ee6a" > > snapshot_ts:9141515 checksum:2463542097647359390 > > zeros:<key:1 value:<id:1 addr:"zero_1:5080" > > zeros:<key:2 value:<id:2 addr:"zero_2:5081" leader:true > > zeros:<key:3 value:<id:3 addr:"zero_3:5082" > > maxLeaseId:1570000 maxTxnTs:9200000 maxRaftId:3 cid:"0f521e8e-e045-46ef-8b13-ce9a9610dd37" 
I0120 15:27:14.527721       1 node.go:88] 1 [logterm: 80, index: 10870127, vote: 0] cast MsgPreVote for 2 [logterm: 80, index: 10870127] at term 80
I0120 15:27:14.624895       1 node.go:88] 1 is starting a new election at term 80
I0120 15:27:14.624923       1 node.go:88] 1 became pre-candidate at term 80
I0120 15:27:14.624932       1 node.go:88] 1 received MsgPreVoteResp from 1 at term 80
I0120 15:27:14.625547       1 node.go:88] 1 [logterm: 80, index: 10870127] sent MsgPreVote request to 2 at term 80
I0120 15:27:14.626824       1 node.go:88] 1 [logterm: 80, index: 10870127] sent MsgPreVote request to 3 at term 80
I0120 15:27:15.063423       1 node.go:88] 1 [logterm: 80, index: 10870127, vote: 0] cast MsgPreVote for 3 [logterm: 80, index: 10870127] at term 80
I0120 15:27:15.065336       1 node.go:88] 1 [term: 80] received a MsgVote message with higher term from 2 [term: 81]
I0120 15:27:15.065765       1 node.go:88] 1 became follower at term 81
I0120 15:27:15.066899       1 node.go:88] 1 [logterm: 80, index: 10870127, vote: 0] cast MsgVote for 2 [logterm: 80, index: 10870127] at term 81
I0120 15:27:15.067859       1 node.go:88] 1 [logterm: 80, index: 10870127, vote: 2] rejected MsgVote from 3 [logterm: 80, index: 10870127] at term 81
I0120 15:27:15.529616       1 node.go:88] raft.node: 1 elected leader 2 at term 81

Zero 1 logs

I0120 15:26:58.815437       1 init.go:88] 

Dgraph version   : v1.0.17
Commit SHA-1     : 42deb299
Commit timestamp : 2019-08-30 12:46:31 -0700
Branch           : HEAD
Go version       : go1.12.7

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2018 Dgraph Labs, Inc.


I0120 15:26:58.816191       1 run.go:98] Setting up grpc listener at: 0.0.0.0:5080
I0120 15:26:58.816344       1 run.go:98] Setting up http listener at: 0.0.0.0:6080
badger 2021/01/20 15:26:59 INFO: All 1 tables opened in 198ms
badger 2021/01/20 15:26:59 INFO: Replaying file id: 17 at offset: 58958052
badger 2021/01/20 15:26:59 INFO: Replay took: 11.258µs
2021/01/20 15:26:59 Endpoint has been deprecated. Please use CollectorEndpoint instead.
I0120 15:26:59.046902       1 node.go:157] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc0000c1bc0 Applied:1396059 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x1f8b490 DisableProposalForwarding:false}
I0120 15:26:59.048375       1 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:1396059 Term:41 XXX_unrecognized:[]}
I0120 15:26:59.048414       1 node.go:317] Found hardstate: {Term:47 Vote:3 Commit:1397181 XXX_unrecognized:[]}
I0120 15:26:59.409223       1 node.go:326] Group 0 found 1123 entries
I0120 15:26:59.409253       1 raft.go:433] Restarting node for dgraphzero
I0120 15:26:59.410506       1 node.go:193] Setting conf state to nodes:1 nodes:2 nodes:3 
I0120 15:26:59.430666       1 pool.go:138] CONNECTED to alpha_1:7080
I0120 15:26:59.430697       1 pool.go:138] CONNECTED to alpha_2:7081
I0120 15:26:59.430719       1 pool.go:138] CONNECTED to alpha_3:7082
I0120 15:26:59.430730       1 pool.go:138] CONNECTED to zero_2:5081
I0120 15:26:59.430752       1 pool.go:138] CONNECTED to zero_3:5082
I0120 15:26:59.776292       1 node.go:88] 1 became follower at term 47
I0120 15:26:59.776432       1 node.go:88] newRaft 1 [peers: [1,2,3], term: 47, commit: 1397181, applied: 1396059, lastindex: 1397181, lastterm: 47]
I0120 15:26:59.778021       1 run.go:282] Running Dgraph Zero...
I0120 15:26:59.791191       1 oracle.go:106] Purged below ts:9141515, len(o.commits):1, len(o.rowCommit):0
I0120 15:27:00.157307       1 zero.go:396] Got connection request: cluster_info_only:true 
I0120 15:27:00.157632       1 node.go:88] 1 no leader at term 47; dropping index reading msg
W0120 15:27:02.157746       1 node.go:668] [0x1] Read index context timed out
I0120 15:27:02.158265       1 node.go:88] 1 no leader at term 47; dropping index reading msg
I0120 15:27:03.459602       1 node.go:88] 1 [term: 47] received a MsgHeartbeat message with higher term from 2 [term: 48]
I0120 15:27:03.459812       1 node.go:88] 1 became follower at term 48
I0120 15:27:03.459830       1 node.go:88] raft.node: 1 elected leader 2 at term 48
W0120 15:27:03.548916       1 raft.go:707] Raft.Ready took too long to process: Timer Total: 3.766s. Breakdown: [{proposals 3.758s} {sync 8ms} {disk 0s} {advance 0s}]. Num entries: 0. MustSync: true
W0120 15:27:04.158397       1 node.go:668] [0x1] Read index context timed out
I0120 15:27:04.689838       1 zero.go:414] Connected: cluster_info_only:true 
I0120 15:27:04.690008       1 zero.go:396] Got connection request: cluster_info_only:true 
I0120 15:27:04.778808       1 zero.go:414] Connected: cluster_info_only:true 
I0120 15:27:04.778948       1 zero.go:396] Got connection request: cluster_info_only:true 
I0120 15:27:04.870006       1 zero.go:414] Connected: cluster_info_only:true 

It would be important to know what is happening. When you get blocked, what error do you see, or what else happens?

Do you have access to ${swarm_IP}:8080/state?

Do you mean ip:6080/state?

And is there a way to uncommit the transaction that is getting stuck?

It can be; that one is served by Zero. The response on 8080 is good to know, because it means that your Alpha is accessible.
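For reference, a minimal Node.js sketch of hitting both endpoints (the host is a placeholder for your swarm IP; the ports assume the defaults seen in your logs above):

```js
// Illustrative only: /state is served by Zero's HTTP port (6080 here),
// while Alpha's HTTP port (8080) serves /health and the query endpoints.
// SWARM_IP is a placeholder, not taken from the report.
const http = require("http");

function get(url) {
  return new Promise((resolve, reject) => {
    http
      .get(url, (res) => {
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", reject);
  });
}

get("http://SWARM_IP:6080/state").then((s) => console.log("zero state:", s));
get("http://SWARM_IP:8080/health").then((h) => console.log("alpha health:", h));
```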

If you have used the transaction procedure, yes. But if you have already committed, or used "commit now", then no, you can’t.
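As a rough dgraph-js sketch (endpoint and payload are placeholders, not from your setup): an open transaction can still be discarded, but once commit() has returned, or the mutation was sent with "commit now", there is nothing left to roll back:

```js
// Illustrative dgraph-js sketch; values are placeholders.
const dgraph = require("dgraph-js");
const grpc = require("grpc"); // or @grpc/grpc-js, depending on your dgraph-js version

async function tryMutation() {
  const stub = new dgraph.DgraphClientStub("localhost:9080", grpc.credentials.createInsecure());
  const client = new dgraph.DgraphClient(stub);

  const txn = client.newTxn();
  try {
    const mu = new dgraph.Mutation();
    mu.setSetJson({ name: "example" });
    // mu.setCommitNow(true); // "commit now": the write is final as soon as the mutation succeeds
    await txn.mutate(mu);
    await txn.commit();       // after this point the write cannot be "uncommitted"
  } finally {
    await txn.discard();      // rolls back an uncommitted txn; no-op after commit
  }
  stub.close();
}
```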

What do you mean by getting stuck?

When I hit ${swarm_IP}:8080/state I get:
"Dgraph browser is available for running separately using the dgraph-ratel binary"
and we are using dgraph-js for our codebase.

I see why there’s no /state at 8080. Is there any reason to be using an old version?

Yes, we are using the old version for a reason… we have not yet updated our code to be compatible with the breaking changes of v1.1.0.

Okay, but without a step-by-step way to reproduce it, there’s no way to help.

Mate, I really wish I could reproduce it.

By the way, we have some repeating logs in our leader Alpha:

I0120 19:24:11.950429       1 draft.go:1061] TryAbort: No aborts found. Quitting.
I0120 19:24:11.950434       1 draft.go:1087] Done abortOldTransactions for 1 txns. Error: <nil>
I0120 19:25:11.947316       1 draft.go:1084] Found 1 old transactions. Acting to abort them.
I0120 19:25:11.950368       1 draft.go:1045] TryAbort 1 txns with start ts. Error: <nil>
I0120 19:25:11.950381       1 draft.go:1061] TryAbort: No aborts found. Quitting.
I0120 19:25:11.950386       1 draft.go:1087] Done abortOldTransactions for 1 txns. Error: <nil>
I0120 19:26:11.947323       1 draft.go:1084] Found 1 old transactions. Acting to abort them.
I0120 19:26:11.950186       1 draft.go:1045] TryAbort 1 txns with start ts. Error: <nil>
I0120 19:26:11.950198       1 draft.go:1061] TryAbort: No aborts found. Quitting.
I0120 19:26:11.950204       1 draft.go:1087] Done abortOldTransactions for 1 txns. Error: <nil>

Can these help debug the problem?

When you get blocked, what error do you see, or what else happens?

What do you mean by getting stuck? Where do you get stuck? When? Does dgraph-js throw something?

Not really. This can be anything. It can be that you tried to mutate the same predicate with the @upsert directive, or that your mutation had some conflict, and so on. What you can do is restart the servers and wait for the abort to be completed; then this log will go away.
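If the aborts come from conflicts, the usual client-side pattern is to discard and retry the transaction. An illustrative dgraph-js sketch, assuming the library’s documented ERR_ABORTED error (this is general guidance, not something specific to your setup):

```js
// Illustrative sketch: retry a mutation that dgraph-js reports as aborted
// (for example, because of a conflict on an @upsert predicate).
const dgraph = require("dgraph-js");

async function mutateWithRetry(client, payload, maxAttempts = 3) {
  let lastErr;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const txn = client.newTxn();
    try {
      const mu = new dgraph.Mutation();
      mu.setSetJson(payload);
      await txn.mutate(mu);
      await txn.commit();
      return;                  // committed successfully
    } catch (err) {
      lastErr = err;
      if (err !== dgraph.ERR_ABORTED) {
        throw err;             // not a conflict abort; don't retry
      }
    } finally {
      await txn.discard();     // always clean up the transaction
    }
  }
  throw lastErr;               // still aborting after all attempts
}
```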

You can set the flag --abort_older_than to less than 5m.

But you need to find out where the problem is, and it is related to the usage itself.


We checked other posts on the forum and updated our compose files with --abort_older_than and max_retries… Dgraph is now healthy and responsive. We are still not sure why, or what caused this problem… should we make changes to our codebase, or is this related to Dgraph?

And thanks for the fast replies, @MichelDiz.

Cheers Mate

Not sure, there are too many scenarios to be precise. You should definitely evaluate your codebase, then analyze the rate of mutations in comparison to your resources. That should make aborts less common in the usual scenarios where they happen.

Ok…
One last thing, @MichelDiz… can I ask why Dgraph is deprecating the documentation of older versions?

Dunno, but in general we don’t support old versions.

This is my opinion, not official from Dgraph.

We might support old versions only starting from the last change in the versioning scheme, because there is long-term support there (not sure, there is a blog post about it).