I have a very basic (non-async) Dgraph setup where I am intermittently receiving the following exception in my client app:
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: No connection exists
In a ten-minute period, I see ~1000 successful requests and about 20 failures. My traffic will increase greatly, as likely will the Dgraph cluster. At this point, is there any advice around this error?
The exception is descriptive, but is it unequivocally a connection issue?
Is there any additional debugging or flags I can set in my client or on the server to help diagnose this?
This is what I see in the Alpha log (nothing corresponding in the Zero):
I0909 13:44:19.104995 17 groups.go:928] Zero leadership changed. Renewing oracle delta stream.
E0909 13:44:19.105082 17 groups.go:904] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0909 13:44:20.105150 17 groups.go:864] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0909 13:44:20.443098 17 groups.go:873] Got Zero leader: 10.61.1.4:5080
For clarification, this is not a connection issue between the client and Alpha, but a connection issue between Alpha and Zero. It is happening because the Zero leader is being re-elected. So, while Alpha is trying to connect to the new Zero leader, any mutation request that comes through receives this error.
Could you provide us with the following information:
The version of Dgraph you are using
Your configuration for starting the cluster
Zero and Alpha logs
Please make sure to use the -v3 flag while starting Alpha and Zero. That would give us a lot more information and help us see whether this can be improved.
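On the client-side debugging question: this is not Dgraph-specific, but gRPC Java emits its internal diagnostics through java.util.logging, so raising the level for the io.grpc loggers can surface connection and channel-state details around the failures. A minimal sketch (the class and method names here are just illustrative):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GrpcDebugLogging {
    // Keep a static reference so the JUL logger (and its level) isn't GC'd.
    private static final Logger GRPC_LOGGER = Logger.getLogger("io.grpc");

    // Call once at startup, before building the gRPC channel.
    public static void enable() {
        GRPC_LOGGER.setLevel(Level.FINE);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        GRPC_LOGGER.addHandler(handler);
    }
}
```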
I was going through the logs you sent for v20.03.3, and there were the following kinds of log lines in Zero corresponding to the Alpha log lines:
W0701 12:20:06.648898 14 raft.go:737] Raft.Ready took too long to process: Timer Total: 358ms. Breakdown: [{proposals 291ms} {sync 67ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
E0701 12:39:17.357236 14 oracle.go:522] Got error: Assigning IDs is only allowed on leader. while leasing timestamps: val:1
All of this indicates that the Zero leader was being re-elected, and the mutations which failed for you would have failed because the Alpha server was not able to get a UID for them from Zero.
I tried to reproduce this using dgraph4j on Dgraph v20.03.3, but didn't see any issues. In my case, it successfully inserted 98197 predicates without any error in 10 minutes.
Let us know if you face the same issue in v20.03.5.
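In case it helps with comparing against your own test, below is a minimal sketch of the kind of dgraph4j insertion loop I mean; the endpoint, predicate, RDF triple, and loop count are placeholders rather than anything taken from your setup:

```java
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.Transaction;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class MutationLoop {
    public static void main(String[] args) {
        // Placeholder endpoint; point this at your Alpha's gRPC port.
        ManagedChannel channel =
            ManagedChannelBuilder.forAddress("localhost", 9080).usePlaintext().build();
        DgraphClient client = new DgraphClient(DgraphGrpc.newStub(channel));

        for (int i = 0; i < 100_000; i++) {
            Transaction txn = client.newTransaction();
            try {
                // Each iteration commits one small mutation in its own transaction.
                Mutation mu = Mutation.newBuilder()
                    .setSetNquads(ByteString.copyFromUtf8(
                        "_:n <userseed> \"seed-" + i + "\" ."))
                    .build();
                txn.mutate(mu);
                txn.commit();
            } finally {
                txn.discard(); // no-op if the commit already succeeded
            }
        }
        channel.shutdown();
    }
}
```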
I'm seeing a lot more logging on the Dgraph and app side… On the app side I get:
{"log":"2020-09-18 04:39:13,095 [pulsar-external-listener-3-1] WARN com.overstock.crm.service.impl.RequestLogIdentityServiceImpl [] - Identity not saved, Upsert failed {} callChainId= com.overstock.crm.service.repository.UpsertException: Failed to upsert userseed node with ID 3941695847706992960
at com.overstock.crm.service.repository.DgraphOperationsImpl.upsert(DgraphOperationsImpl.java:112) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.saveIdentity(RequestLogIdentityServiceImpl.java:74) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.lambda$resolveIdentity$0(RequestLogIdentityServiceImpl.java:45) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
at java.util.Arrays$ArrayList.forEach(Arrays.java:4390) ~[?:?]
at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.resolveIdentity(RequestLogIdentityServiceImpl.java:42) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
at com.overstock.crm.consumer.pulsar.RequestLogIdentityMessageListener.received(RequestLogIdentityMessageListener.java:37) ~[classes!/:0.0.1-373-ea2ee84-43]
at org.apache.pulsar.client.impl.ConsumerImpl.lambda$triggerListener$8(ConsumerImpl.java:1237) ~[pulsar-client-2.6.0.jar!/:2.6.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [pulsar-client-2.6.0.jar!/:2.6.0]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:187) ~[dgraph4j-20.03.1.jar!/:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:214) ~[dgraph4j-20.03.1.jar!/:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:182) ~[dgraph4j-20.03.1.jar!/:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.32.1.jar!/:1.32.1]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478) ~[grpc-stub-1.32.1.jar!/:1.32.1]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413) ~[grpc-core-1.32.1.jar!/:1.32.1]
at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.32.1.jar!/:1.32.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742) ~[grpc-core-1.32.1.jar!/:1.32.1]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721) ~[grpc-core-1.32.1.jar!/:1.32.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.32.1.jar!/:1.32.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.32.1.jar!/:1.32.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
... 1 more
We recently had this PR go into master, and with it I hope you won't see the readTs less than minTs errors. However, that change is only in master, not in 20.03.5.
Do let us know if you get any other errors too. Also, please send over the logs for your new testing.
Since I wasn't able to reproduce it, it would be great if you could share your testing code. That would help us understand exactly what you are trying to do.
No, the logging will still be there. I was just saying that this error would now be seen less frequently in master, as we have merged a PR regarding the same issue. My point was that the fix is only in master, not in v20.03.5. The logs will still be there in 20.03.5.
@abhimanyusinghgaur Could you kindly confirm whether this issue was fixed in v20.03.6?
Actually, we are again seeing the same exception even in v20.07.2.
Hi @sreenathsp, I just checked: the PR for reducing readTs less than minTs errors is part of both v20.03.6 and v20.07.2.
Note that it only reduces the frequency of that error; it doesn't completely remove it.
@sreenathsp: our current recommendation is to retry when you encounter these errors, as they are transient issues. Could you elaborate on how often you encounter them, which client API you are using, and whether retrying would be an acceptable solution in your case?
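To make the retry suggestion concrete, a wrapper along these lines is one option (purely an illustrative sketch: the class name, attempt count, and backoff are arbitrary, and the operation passed in should be safe to repeat, e.g. an idempotent upsert):

```java
import java.util.function.Supplier;

public final class TransientRetry {
    // Runs `op` and retries up to `maxAttempts` times on a RuntimeException,
    // sleeping briefly between attempts. `op` should open a fresh transaction
    // on every attempt rather than re-committing an old one.
    public static <T> T withRetries(Supplier<T> op, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                // In practice you may want to inspect the cause chain for
                // io.grpc.StatusRuntimeException before deciding to retry.
                last = e;
                try {
                    Thread.sleep(100L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}
```

Usage would be something like `TransientRetry.withRetries(() -> upsert(node), 3)` around the existing call, where `upsert` is your own method that opens a fresh transaction on each invocation.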