Eternal pending queries

In my production instance of dgraph we’ve got another accident.

At some point of time all the requests from our client applications started to get timeouts.

From the log:
[ERROR] [2021-06-03T11:16:08.47] Can't send request to DGraph, error: rpc error: code = Canceled desc = context canceled

I looked on the metrics of Dgraph and found strange picture about pending queries:

The first peak is the time when we started to get problems.

We switched off every our that uses dgraph now.
So it’s now on his own. No requests.Nothing.

And still, for more then 8 hours it’s still the same pending queries.

More pictures:

Log alpha:

I0603 16:56:19.794686 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:01:19.794321 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:06:19.794374 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:11:19.794328 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:16:19.794590 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:21:19.794423 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:26:19.794840 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:31:19.795199 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:36:19.794521 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:41:19.794433 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0603 17:46:19.794487 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 

log zero:

I0601 20:04:19.775015 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:05:19.776063 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:06:19.775174 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:07:19.775911 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:08:19.774876 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:09:19.776385 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:10:19.774768 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:11:19.775761 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:12:19.776322 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:13:19.775757 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:14:19.775790 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I0601 20:15:19.784592 21 raft.go:780] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0

I get failed any query from ratel as well
image

3 Likes

What version of Dgraph are you running in production?

Can you share the debug events page for Alpha? This one, specifically: http://localhost:8080/debug/events?fam=Dgraph&b=0&exp=1. It shows the ApplyCh traces. According to your metrics it’s been stuck for a while, but it might still have useful logs.

Without the full logs and metrics, I’m not really sure what issue you’re seeing there that could be causing the blockage.

Have you tried restarting the Alpha?

Image: docker.io/dgraph/dgraph:v21.03.0

When	Elapsed
2021/05/31 20:50:34.146336	259916.932335	ApplyCh
#   github.com/dgraph-io/dgraph/worker.newNode			/ext-go/1/src/github.com/dgraph-io/dgraph/worker/draft.go:252
#   github.com/dgraph-io/dgraph/worker.StartRaftNodes		/ext-go/1/src/github.com/dgraph-io/dgraph/worker/groups.go:150
#   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run.func4	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:816
20:09:34.147120	 .   921	... (85256084 events discarded)
20:09:47.591511	13.444391	... Applying Oracle Delta for key: 4759450885
20:09:47.591598	 .    88	... Applied proposal with key: 4759450885, index: 55511264. Err: <nil>
20:10:05.920594	18.328996	... Applying Oracle Delta for key: 4989162050
20:10:05.920601	 .     7	... Applied proposal with key: 4989162050, index: 55511265. Err: <nil>
20:10:34.146644	28.226043	... Size of applyCh: 0
20:10:34.171783	 . 25140	... Size of previous map: 14
20:10:35.863971	1.692187	... Applying Oracle Delta for key: 5697329641
20:10:35.863979	 .     8	... Applied proposal with key: 5697329641, index: 55511266. Err: <nil>
20:11:05.896079	30.032100	... Applying Oracle Delta for key: 5319296562
20:11:05.896090	 .    11	... Applied proposal with key: 5319296562, index: 55511267. Err: <nil>
20:11:34.147488	28.251397	... Size of applyCh: 0
20:11:35.870711	1.723223	... Applying Oracle Delta for key: 4970003582
20:11:35.870752	 .    41	... Applied proposal with key: 4970003582, index: 55511268. Err: <nil>
20:12:05.901962	30.031210	... Applying Oracle Delta for key: 4397488615
20:12:05.901974	 .    12	... Applied proposal with key: 4397488615, index: 55511269. Err: <nil>
20:12:34.147098	28.245124	... Size of applyCh: 0
20:12:35.872789	1.725692	... Applying Oracle Delta for key: 6022497786
20:12:35.872808	 .    18	... Applied proposal with key: 6022497786, index: 55511270. Err: <nil>
20:13:05.871926	29.999118	... Applying Oracle Delta for key: 8475610403
20:13:05.871932	 .     6	... Applied proposal with key: 8475610403, index: 55511271. Err: <nil>
20:13:34.147261	28.275329	... Size of applyCh: 0
20:13:35.857717	1.710456	... Applying Oracle Delta for key: 5079480683
20:13:35.857747	 .    30	... Applied proposal with key: 5079480683, index: 55511272. Err: <nil>
20:14:05.866709	30.008962	... Applying Oracle Delta for key: 6484920751
20:14:05.866719	 .     9	... Applied proposal with key: 6484920751, index: 55511273. Err: <nil>
20:14:34.146780	28.280061	... Size of applyCh: 0
20:14:36.175125	2.028346	... Applying Oracle Delta for key: 5929185693
20:14:36.175132	 .     7	... Applied proposal with key: 5929185693, index: 55511274. Err: <nil>
20:15:06.175798	30.000665	... Applying Oracle Delta for key: 5687919422
20:15:06.175804	 .     7	... Applied proposal with key: 5687919422, index: 55511275. Err: <nil>
20:15:34.146614	27.970810	... Size of applyCh: 0
20:15:34.174259	 . 27645	... Size of previous map: 16
20:16:06.300849	32.126590	... Applying Oracle Delta for key: 5270640159
20:16:06.300887	 .    38	... Applied proposal with key: 5270640159, index: 55511276. Err: <nil>
20:16:34.146925	27.846038	... Size of applyCh: 0
20:17:06.145725	31.998800	... Applying Oracle Delta for key: 5013812873
20:17:06.145732	 .     7	... Applied proposal with key: 5013812873, index: 55511277. Err: <nil>
20:17:34.147433	28.001702	... Size of applyCh: 0
20:17:35.927117	1.779684	... Applying Oracle Delta for key: 5375621690
20:17:35.927193	 .    76	... Applied proposal with key: 5375621690, index: 55511278. Err: <nil>
20:18:05.902655	29.975462	... Applying Oracle Delta for key: 7397174323
20:18:05.902662	 .     8	... Applied proposal with key: 7397174323, index: 55511279. Err: <nil>
20:18:34.147340	28.244677	... Size of applyCh: 0
20:18:35.613324	1.465984	... Applying Oracle Delta for key: 7487914352
20:18:35.613343	 .    20	... Applied proposal with key: 7487914352, index: 55511280. Err: <nil>
20:19:05.605665	29.992322	... Applying Oracle Delta for key: 4966620082
20:19:05.605673	 .     8	... Applied proposal with key: 4966620082, index: 55511281. Err: <nil>
20:19:34.147182	28.541509	... Size of applyCh: 0
20:20:34.147710	60.000527	... Size of applyCh: 0
20:20:34.171683	 . 23974	... Size of previous map: 16
20:21:34.147562	59.975879	... Size of applyCh: 0
20:22:34.147633	60.000071	... Size of applyCh: 0
20:23:34.146640	59.999006	... Size of applyCh: 0
20:24:34.146993	60.000353	... Size of applyCh: 0
20:25:34.146640	59.999647	... Size of applyCh: 0
20:25:34.170694	 . 24055	... Size of previous map: 6
20:26:34.146819	59.976125	... Size of applyCh: 0
20:27:34.147654	60.000835	... Size of applyCh: 0
20:28:34.147222	59.999567	... Size of applyCh: 0
20:29:34.147427	60.000206	... Size of applyCh: 0
20:30:34.147248	59.999821	... Size of applyCh: 0
20:30:34.172665	 . 25418	... Size of previous map: 0
20:31:34.146964	59.974299	... Size of applyCh: 0
20:32:34.147056	60.000092	... Size of applyCh: 0
20:33:34.147505	60.000448	... Size of applyCh: 0
20:34:34.146745	59.999240	... Size of applyCh: 0
20:35:34.146793	60.000048	... Size of applyCh: 0
20:35:34.146849	 .    56	... Size of previous map: 0
20:36:34.147147	60.000298	... Size of applyCh: 0
20:37:34.147084	59.999937	... Size of applyCh: 0
20:38:34.147399	60.000315	... Size of applyCh: 0
20:39:34.146904	59.999505	... Size of applyCh: 0
20:40:34.147033	60.000129	... Size of applyCh: 0
20:40:34.147077	 .    44	... Size of previous map: 0
20:41:34.147717	60.000640	... Size of applyCh: 0
20:42:34.147256	59.999538	... Size of applyCh: 0
20:43:34.146674	59.999419	... Size of applyCh: 0
20:44:34.146727	60.000052	... Size of applyCh: 0
20:45:34.146513	59.999786	... Size of previous map: 0
20:45:34.146598	 .    85	... Size of applyCh: 0
20:46:34.146668	60.000070	... Size of applyCh: 0
20:47:34.146686	60.000018	... Size of applyCh: 0
20:48:34.147351	60.000665	... Size of applyCh: 0
20:49:34.147086	59.999734	... Size of applyCh: 0
20:50:34.146770	59.999685	... Size of applyCh: 0
20:50:34.146823	 .    52	... Size of previous map: 0
20:51:34.147379	60.000557	... Size of applyCh: 0
20:52:34.147563	60.000184	... Size of applyCh: 0
20:53:34.147482	59.999919	... Size of applyCh: 0
20:54:34.147047	59.999565	... Size of applyCh: 0
20:55:34.147016	59.999968	... Size of applyCh: 0
20:55:34.147072	 .    56	... Size of previous map: 0
20:56:34.146956	59.999884	... Size of applyCh: 0
20:57:34.147196	60.000239	... Size of applyCh: 0
20:58:34.146980	59.999785	... Size of applyCh: 0
20:59:34.147031	60.000050	... Size of applyCh: 0
21:00:34.147331	60.000300	... Size of previous map: 0
21:00:34.147343	 .    12	... Size of applyCh: 0
21:01:34.147281	59.999938	... Size of applyCh: 0
2021/05/31 20:50:19.811691	259931.280129	Schema
#   github.com/dgraph-io/dgraph/schema.(*state).init		/ext-go/1/src/github.com/dgraph-io/dgraph/schema/schema.go:62
#   github.com/dgraph-io/dgraph/schema.reset			/ext-go/1/src/github.com/dgraph-io/dgraph/schema/schema.go:800
#   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run		/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:776
#   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run		/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:775
#   github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:94
#   github.com/spf13/cobra.(*Command).execute			/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:832
#   github.com/spf13/cobra.(*Command).ExecuteC			/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:915
#   github.com/dgraph-io/dgraph/dgraph/cmd.Execute		/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
#   github.com/dgraph-io/dgraph/dgraph/cmd.Execute		/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:78
#   time.(*Ticker).Stop						/usr/local/go/src/time/tick.go:47
20:53:09.239383	15.336324	... (31 events discarded)

There are more, but I truncate it.

I haven’t tried to restart alpha yet, because I didn’t want to lose some info before consulting with you.

Should I restart all the pods?

Okay, I am trying to restart alpha.

I’ve got this:

I0608 12:09:22.489704 20 run.go:799] Caught Ctrl-C. Terminating now (this may take a few seconds)...
E0608 12:09:22.489796 20 run.go:476] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
E0608 12:09:22.489999 20 server.go:30] error from cmux serve: accept tcp [::]:8080: use of closed network connection
I0608 12:09:22.490032 20 run.go:829] GRPC and HTTP stopped.
I0608 12:09:22.490038 20 worker.go:127] Stopping group...
I0608 12:09:22.490078 20 groups.go:894] Closing processOracleDeltaStream
E0608 12:09:22.490096 20 groups.go:838] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
I0608 12:09:22.490174 20 groups.go:748] Closing sendMembershipUpdates
E0608 12:09:22.490122 20 server.go:72] Stopped taking more http(s) requests. Err: mux: listener closed
E0608 12:09:22.490123 20 groups.go:946] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0608 12:09:22.490136 20 groups.go:795] Closing receiveMembershipUpdates
I0608 12:09:22.490232 20 worker.go:131] Updating RAFT state before shutting down...
I0608 12:09:22.490462 20 worker.go:136] Stopping node...
I0608 12:09:22.490663 20 draft.go:1077] Stopping node.Run
I0608 12:09:22.490695 20 draft.go:1157] Raft node done.
I0608 12:09:22.535111 20 draft.go:202] Stopping all ongoing registered tasks...
I0608 12:09:22.535134 20 draft.go:206] Stopping op: opRollup...
I0608 12:11:19.795326 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-dropped: 86385 sets-rejected: 2313956 gets-dropped: 3378112 gets-kept: 2678765440 gets-total: 2787157750 hit-ratio: 0.15
I0608 12:16:19.795046 20 log.go:34] Block cache metrics: hit: 416779298 miss: 2370378452 keys-added: 181877370 keys-updated: 1334870 keys-evicted: 181851039 cost-added: 2549252377810 cost-evicted: 2548916053424 sets-droppe

after a few minutes in this state dgraph-alpha restarted and started to server requests.

Logs:

W0608 12:20:57.719306 20 log.go:36] Block cache might be too small. Metrics: hit: 104432 miss: 1654724 keys-added: 91377 keys-updated: 0 keys-evicted: 3169 cost-added: 1361972812 cost-evicted: 879006859 sets-dropped: 0 sets-rejected: 692 gets-dropped: 192 gets-kept: 1732032 gets-total: 1759156 hit-ratio: 0.06
W0608 12:20:57.719410 20 log.go:36] Cache life expectancy (in seconds):
-- Histogram:
Min value: 0
Max value: 5
Mean: 0.92
Count: 388
[0 B, 2 B) 280 72.16%
[2 B, 4 B) 99 25.52%
[4 B, 8 B) 9 2.32%
--

I0608 12:23:17.235077 20 log.go:34] [1] [E] LOG Compact 0->3 (5, 4 -> 168 tables with 2 splits). [2395904 2395815 2395820 2396641 2396642 . 2394952 2394935 2394945 2394950 .] -> [2396643 2396649 2396653 2396656 2396663 2396667 2396672 2396675 2396679 2396684 2396689 2396692 2396693 2396696 2396698 2396700 2396701 2396703 2396704 2396706 2396708 2396709 2396711 2396712 2396714 2396716 2396717 2396719 2396720 2396722 2396723 2396725 2396727 2396728 2396730 2396731 2396733 2396735 2396736 2396738 2396740 2396742 2396744 2396746 2396748 2396749 2396751 2396752 2396753 2396754 2396755 2396756 2396757 2396758 2396759 2396760 2396761 2396762 2396763 2396764 2396765 2396766 2396767 2396768 2396769 2396770 2396771 2396772 2396773 2396774 2396775 2396776 2396777 2396778 2396779 2396780 2396781 2396782 2396783 2396784 2396785 2396786 2396787 2396788 2396789 2396790 2396791 2396792 2396793 2396794 2396795 2396796 2396797 2396798 2396799 2396800 2396801 2396802 2396803 2396804 2396805 2396806 2396807 2396808 2396809 2396810 2396644 2396645 2396646 2396647 2396648 2396650 2396651 2396652 2396654 2396655 2396657 2396658 2396659 2396660 2396661 2396662 2396664 2396665 2396666 2396668 2396669 2396670 2396671 2396673 2396674 2396676 2396677 2396678 2396680 2396681 2396682 2396683 2396685 2396686 2396687 2396688 2396690 2396691 2396694 2396695 2396697 2396699 2396702 2396705 2396707 2396710 2396713 2396715 2396718 2396721 2396724 2396726 2396729 2396732 2396734 2396737 2396739 2396741 2396743 2396745 2396747 2396750 .], took 2.157s
I0608 12:23:26.262514 20 log.go:34] [1] [E] LOG Compact 5->6 (1, 4 -> 6 tables with 2 splits). [2396920 . 2396087 2396096 2396101 2396108 .] -> [2397145 2397153 2397157 2397158 2397146 2397147 .], took 2.092s
I0608 12:23:35.971117 20 log.go:34] [0] [E] LOG Compact 5->6 (1, 18 -> 18 tables with 5 splits). [2396914 . 2397315 2397316 2392094 2390169 2390172 2390158 2390165 2390173 2390177 2390157 2390166 2390174 2390178 2390160 2390168 2390175 2390176 2396287 .] -> [2397350 2397357 2397362 2397375 2397356 2397361 2397365 2397370 2397354 2397360 2397364 2397377 2397358 2397363 2397378 2397380 2397355 2397359 .], took 2.336s
I0608 12:23:39.055555 20 log.go:34] [0] [E] LOG Compact 5->6 (1, 14 -> 14 tables with 5 splits). [2396428 . 2396154 2396254 2396259 2396252 2396258 2396261 2396253 2396260 2396263 2396250 2396256 2396257 2396251 2397194 .] -> [2397414 2397426 2397427 2397413 2397424 2397430 2397416 2397429 2397432 2397411 2397421 2397422 2397412 2397418 .], took 2.501s
I0608 12:25:57.719313 20 log.go:34] Block cache metrics: hit: 1298521 miss: 3084447 keys-added: 322522 keys-updated: 126912 keys-evicted: 228249 cost-added: 2370083933 cost-evicted: 1956827222 sets-dropped: 0 sets-rejected:

May be this one:

E0608 12:09:22.490123      20 groups.go:946] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
E0608 12:09:22.490122      20 server.go:72] Stopped taking more http(s) requests. Err: mux: listener closed

is important?