Dgraph4j - StatusRuntimeException: UNKNOWN: No connection exists

damienburke · September 10, 2020, 8:03pm

Hi,

I have a very basic (non async) Dgraph setup where I am intermittently receiving the exception in my client app:

            Caused by: io.grpc.StatusRuntimeException: UNKNOWN: No connection exists

In a ten minute period, I have ~1000 successful and see about 20 failures. My traffic will increase greatly, as likely will the Dgraph cluster. At this point, is there any advice around this error?

the exception is descriptive, but is it unequivocally a connection issue?
is there any additional debugging/flags i can set in my client or on the server to help diagnose?

this is what i see in the alpha log (nothing corresponding in the zero)

           I0909 13:44:19.104995      17 groups.go:928] Zero leadership changed. Renewing oracle delta stream.
           E0909 13:44:19.105082      17 groups.go:904] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
           I0909 13:44:20.105150      17 groups.go:864] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
           I0909 13:44:20.443098      17 groups.go:873] Got Zero leader: 10.61.1.4:5080

Java 11
Dgraph4j 23.0.1

Thanks

damienburke · September 10, 2020, 8:26pm

Here is my client code if of any use (ill also paste in my mutation code)

      @Bean
      @DependsOn({"managedChannel"})
      public DgraphClient dgraphClient() {
        DgraphClient dgraphClient = null;
        try {
          DgraphGrpc.DgraphStub stub = DgraphGrpc.newStub(this.getManagedChannel());
          ClientInterceptor timeoutInterceptor = timeout();
          stub.withInterceptors(timeoutInterceptor);
          dgraphClient = new DgraphClient(stub);
        } catch (Exception exc) {
          LOGGER.error("An Exception has occured creating the DgraphClient - ", exc);
        }
        return dgraphClient;
      }
    
      @NotNull
      private ClientInterceptor timeout() {
        return new ClientInterceptor() {
          @Override
          public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
              MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
            return next.newCall(method, callOptions.withDeadlineAfter(500, TimeUnit.MILLISECONDS));
          }
        };
      }
    
      private File getCertFile(final String certFileName, final String contents) {
        File file = new File("temp-" + certFileName);
    
        try (InputStream is = new ByteArrayInputStream(Base64.getDecoder().decode(contents));
            FileOutputStream fos = new FileOutputStream(file)) {
    
          IOUtils.copy(is, fos);
          return file;
        } catch (Exception e) {
          LOGGER.error("Error occurred while getting the cert file {}.", certFileName, e);
          throw new RuntimeException(e);
        }
      }
    
      @Bean(name = "managedChannel", destroyMethod = "shutdown")
      public ManagedChannel managedChannel() {
        try {
          SslContextBuilder builder = GrpcSslContexts.forClient();
          builder.trustManager(getCertFile("ca.crt", dgraphCaCert));
    
          // Performing client verification
          builder.keyManager(
              getCertFile("client.testing.crt", dgraphClientCert),
              getCertFile("client.testing.java.key", dgrapghClientKey));
          SslContext sslContext = builder.build();
    
          this.managedChannel =
              NettyChannelBuilder.forAddress(dgraphHost, dgraphPort)
                  .sslContext(sslContext)
                  // .enableRetry()
                 //  .maxRetryAttempts(2) - okay i can retry, but load is v low for that to be solution.. 
                  .build();
    
        } catch (Exception e) {
          LOGGER.error("An Exception has occured creating the Managed Client. Exception is: {}", e);
        }
        return this.managedChannel;
      }
    
      @Bean
      public Gson gson() {
        return new Gson();
      }
    
      public ManagedChannel getManagedChannel() {
        return this.managedChannel;
      }

damienburke · September 10, 2020, 8:27pm

    final DgraphProto.Mutation mu =
          DgraphProto.Mutation.newBuilder().setSetJson(ByteString.copyFromUtf8(json)).build();
  
      try (Transaction transaction = getClient().newTransaction()) {
        DgraphProto.Request request =
            DgraphProto.Request.newBuilder()
                .setQuery(upsertQuery)
                .addMutations(mu)
                .setCommitNow(true)
                .build();
  
        final DgraphProto.Response response = transaction.doRequest(request);

abhimanyusinghgaur · September 16, 2020, 4:46pm

Hi @damienburke,

Could you please confirm the Dgraph version too that you are using?

You can increase the logging level in the Alpha server by using -v3 flag. That would print more logs in alpha.

I will try it at my end and update you on this by tomorrow.

Thanks

abhimanyusinghgaur · September 17, 2020, 5:00pm

Hi @damienburke,

For clarification, this is not a connection issue between the client and alpha, but a connection issue between alpha and zero. It is happening because of zero leader being re-elected. So, while alpha is trying to connect to the new zero leader, if some mutation request comes through, it receives this error.

Could you provide us with the following information:

The version of dgraph you are using
Your configuration for starting the cluster
Zero and Alpha logs

Please make sure to use -v3 flag while starting Alpha and Zero. That would help us get a lot of information and see if it can be improved.

Thanks

damienburke · September 18, 2020, 9:44am

docker_zero.sh (317 Bytes) docker_alpha.sh (579 Bytes)

The error was happening on version 20.03.3
We just upgraded to version 20.03.5 today and will resend on logs for that version, with the -v3 flag too

Thanks!

abhimanyusinghgaur · September 18, 2020, 10:43am

Thanks @damienburke.

I was going through the logs you had sent for v20.03.3, and there were following kinds of log lines in Zero corresponding to the Alpha log lines:

W0701 12:20:06.648898      14 raft.go:737] Raft.Ready took too long to process: Timer Total: 358ms. Breakdown: [{proposals 291ms} {sync 67ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
E0701 12:39:17.357236      14 oracle.go:522] Got error: Assigning IDs is only allowed on leader. while leasing timestamps: val:1

All this indicates that Zero leader was being re-elected, and the mutations which failed for you, would have failed because of the Alpha server not being able to get a uid for them from Zero.

I tried to reproduce the same using dgraph4j on dgraph v20.03.3, but didn’t see any issues. In my case, it successfully inserted 98197 predicates without any error in 10 mins.

Let us know if you face the same issue in v20.03.5.

damienburke · September 18, 2020, 10:47am

Seeing a lot more logging on the Dgraph and app side… On the app side I get:

	
{"log":"2020-09-18 04:39:13,095 [pulsar-external-listener-3-1] WARN com.overstock.crm.service.impl.RequestLogIdentityServiceImpl [] - Identity not saved, Upsert failed {} callChainId= com.overstock.crm.service.repository.UpsertException: Failed to upsert userseed node with ID 3941695847706992960
    at com.overstock.crm.service.repository.DgraphOperationsImpl.upsert(DgraphOperationsImpl.java:112) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.saveIdentity(RequestLogIdentityServiceImpl.java:74) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.lambda$resolveIdentity$0(RequestLogIdentityServiceImpl.java:45) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at java.util.Arrays$ArrayList.forEach(Arrays.java:4390) ~[?:?]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.resolveIdentity(RequestLogIdentityServiceImpl.java:42) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.consumer.pulsar.RequestLogIdentityMessageListener.received(RequestLogIdentityMessageListener.java:37) ~[classes!/:0.0.1-373-ea2ee84-43]
    at org.apache.pulsar.client.impl.ConsumerImpl.lambda$triggerListener$8(ConsumerImpl.java:1237) ~[pulsar-client-2.6.0.jar!/:2.6.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [pulsar-client-2.6.0.jar!/:2.6.0]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:187) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:214) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
    at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:182) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
    at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.32.1.jar!/:1.32.1]
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478) ~[grpc-stub-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    ... 1 more

damienburke · September 18, 2020, 10:51am

Aslo, as part of upgrade, i deleted old data (and added -v 3 flag)
I am now seeing X5 times the number of these errors!

abhimanyusinghgaur · September 18, 2020, 10:56am

We recently had this PR that went in master, with which I hope you won’t see the readTs less than minTs errors. But, this change is in master not in 20.03.5.

abhimanyusinghgaur · September 18, 2020, 11:04am

Do let us know if you get any other errors too. Also, please send over the logs for your new testing.
Since, I wasn’t able to reproduce it, would be great if you could share your testing code. That would help us know what exactly is it that you are trying.

damienburke · September 18, 2020, 11:35am

Thanks, Ill email u the app

damienburke · September 18, 2020, 11:41am

weird… as u can see from our script we are running 20.03.5…
can u re-confirm that this logging is not in this version?

abhimanyusinghgaur · September 18, 2020, 11:56am

No, the logging will be there. I was just saying that this error would now be seen less frequently in master as we have merged a PR regarding the same issue. The fix is only in master, not in v20.03.5 is what my point was. The logs will be there in 20.03.5

damienburke · September 18, 2020, 1:19pm

ah, ok. so to put another way, that PR has a good change of solving my issue?
when can i expect to see a 20.03.6 with this fix incluced?
thanks

abhimanyusinghgaur · September 18, 2020, 4:25pm

Yes, that PR has a change that can fix your issue.
I am not sure about if it would go to 20.03.6 or when will 20.03.6 be released.

Tagging @Paras for the same.

abhimanyusinghgaur · September 21, 2020, 8:59am

That PR would go to 20.03.6, and it should be released in the second half of next month.

sreenathsp · December 3, 2020, 1:07pm

@abhimanyusinghgaur Could you kindly confirm whether this issue was fixed in the v20.03.6 ?
Actually we again see the same exception occurring even in version v20.07.2.

abhimanyusinghgaur · December 7, 2020, 7:21am

Hi @sreenathsp, I just checked, the PR for reducing readTs less than minTs errors is part of both v20.03.6 and v20.07.2
Note that it only reduces the frequency of that error, it doesn’t completely remove the error.

vvbalaji · December 14, 2020, 11:47pm

@sreenathsp: our current recommendation is to retry when you encounter these errors as these are transient issues. Could you elaborate how often you encounter them or the client API you are using and if retrying would be an acceptable solution in your case?

Topic		Replies	Views
Exception in Graph java client while connecting to Remote Graph DB Dgraph	2	2526	June 26, 2019
No connection exist Users	2	1280	January 23, 2018
Using Java Client to connect to Dgraph failed Users java-client	2	897	May 23, 2020
io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason Users	3	7076	November 14, 2018
[dgraph4j] Using java client connect to dgraph alpha failed. (docker on AWS EC2) Users java-client	5	2007	February 13, 2020

Dgraph4j - StatusRuntimeException: UNKNOWN: No connection exists

Related topics