Dgraph4j - StatusRuntimeException: UNKNOWN: No connection exists

Hi,

I have a very basic (non async) Dgraph setup where I am intermittently receiving the exception in my client app:

            Caused by: io.grpc.StatusRuntimeException: UNKNOWN: No connection exists

In a ten minute period, I have ~1000 successful and see about 20 failures. My traffic will increase greatly, as likely will the Dgraph cluster. At this point, is there any advice around this error?

  • the exception is descriptive, but is it unequivocally a connection issue?

  • is there any additional debugging/flags i can set in my client or on the server to help diagnose?

  • this is what i see in the alpha log (nothing corresponding in the zero)

               I0909 13:44:19.104995      17 groups.go:928] Zero leadership changed. Renewing oracle delta stream.
               E0909 13:44:19.105082      17 groups.go:904] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
               I0909 13:44:20.105150      17 groups.go:864] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
               I0909 13:44:20.443098      17 groups.go:873] Got Zero leader: 10.61.1.4:5080
    

Java 11
Dgraph4j 23.0.1

Thanks

Here is my client code if of any use (ill also paste in my mutation code)

      @Bean
      @DependsOn({"managedChannel"})
      public DgraphClient dgraphClient() {
        DgraphClient dgraphClient = null;
        try {
          DgraphGrpc.DgraphStub stub = DgraphGrpc.newStub(this.getManagedChannel());
          ClientInterceptor timeoutInterceptor = timeout();
          stub.withInterceptors(timeoutInterceptor);
          dgraphClient = new DgraphClient(stub);
        } catch (Exception exc) {
          LOGGER.error("An Exception has occured creating the DgraphClient - ", exc);
        }
        return dgraphClient;
      }
    
      @NotNull
      private ClientInterceptor timeout() {
        return new ClientInterceptor() {
          @Override
          public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
              MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
            return next.newCall(method, callOptions.withDeadlineAfter(500, TimeUnit.MILLISECONDS));
          }
        };
      }
    
      private File getCertFile(final String certFileName, final String contents) {
        File file = new File("temp-" + certFileName);
    
        try (InputStream is = new ByteArrayInputStream(Base64.getDecoder().decode(contents));
            FileOutputStream fos = new FileOutputStream(file)) {
    
          IOUtils.copy(is, fos);
          return file;
        } catch (Exception e) {
          LOGGER.error("Error occurred while getting the cert file {}.", certFileName, e);
          throw new RuntimeException(e);
        }
      }
    
      @Bean(name = "managedChannel", destroyMethod = "shutdown")
      public ManagedChannel managedChannel() {
        try {
          SslContextBuilder builder = GrpcSslContexts.forClient();
          builder.trustManager(getCertFile("ca.crt", dgraphCaCert));
    
          // Performing client verification
          builder.keyManager(
              getCertFile("client.testing.crt", dgraphClientCert),
              getCertFile("client.testing.java.key", dgrapghClientKey));
          SslContext sslContext = builder.build();
    
          this.managedChannel =
              NettyChannelBuilder.forAddress(dgraphHost, dgraphPort)
                  .sslContext(sslContext)
                  // .enableRetry()
                 //  .maxRetryAttempts(2) - okay i can retry, but load is v low for that to be solution.. 
                  .build();
    
        } catch (Exception e) {
          LOGGER.error("An Exception has occured creating the Managed Client. Exception is: {}", e);
        }
        return this.managedChannel;
      }
    
      @Bean
      public Gson gson() {
        return new Gson();
      }
    
      public ManagedChannel getManagedChannel() {
        return this.managedChannel;
      }
    final DgraphProto.Mutation mu =
          DgraphProto.Mutation.newBuilder().setSetJson(ByteString.copyFromUtf8(json)).build();
  
      try (Transaction transaction = getClient().newTransaction()) {
        DgraphProto.Request request =
            DgraphProto.Request.newBuilder()
                .setQuery(upsertQuery)
                .addMutations(mu)
                .setCommitNow(true)
                .build();
  
        final DgraphProto.Response response = transaction.doRequest(request);

Hi @damienburke,

Could you please confirm the Dgraph version too that you are using?

You can increase the logging level in the Alpha server by using -v3 flag. That would print more logs in alpha.

I will try it at my end and update you on this by tomorrow.

Thanks

1 Like

Hi @damienburke,

For clarification, this is not a connection issue between the client and alpha, but a connection issue between alpha and zero. It is happening because of zero leader being re-elected. So, while alpha is trying to connect to the new zero leader, if some mutation request comes through, it receives this error.

Could you provide us with the following information:

  • The version of dgraph you are using
  • Your configuration for starting the cluster
  • Zero and Alpha logs

Please make sure to use -v3 flag while starting Alpha and Zero. That would help us get a lot of information and see if it can be improved.

Thanks

docker_zero.sh (317 Bytes) docker_alpha.sh (579 Bytes)

The error was happening on version 20.03.3
We just upgraded to version 20.03.5 today and will resend on logs for that version, with the -v3 flag too

Thanks!

Thanks @damienburke.

I was going through the logs you had sent for v20.03.3, and there were following kinds of log lines in Zero corresponding to the Alpha log lines:

W0701 12:20:06.648898      14 raft.go:737] Raft.Ready took too long to process: Timer Total: 358ms. Breakdown: [{proposals 291ms} {sync 67ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
E0701 12:39:17.357236      14 oracle.go:522] Got error: Assigning IDs is only allowed on leader. while leasing timestamps: val:1 

All this indicates that Zero leader was being re-elected, and the mutations which failed for you, would have failed because of the Alpha server not being able to get a uid for them from Zero.

I tried to reproduce the same using dgraph4j on dgraph v20.03.3, but didn’t see any issues. In my case, it successfully inserted 98197 predicates without any error in 10 mins.

Let us know if you face the same issue in v20.03.5.

Seeing a lot more logging on the Dgraph and app side… On the app side I get:

	
{"log":"2020-09-18 04:39:13,095 [pulsar-external-listener-3-1] WARN com.overstock.crm.service.impl.RequestLogIdentityServiceImpl [] - Identity not saved, Upsert failed {} callChainId= com.overstock.crm.service.repository.UpsertException: Failed to upsert userseed node with ID 3941695847706992960
    at com.overstock.crm.service.repository.DgraphOperationsImpl.upsert(DgraphOperationsImpl.java:112) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.saveIdentity(RequestLogIdentityServiceImpl.java:74) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.lambda$resolveIdentity$0(RequestLogIdentityServiceImpl.java:45) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at java.util.Arrays$ArrayList.forEach(Arrays.java:4390) ~[?:?]
    at com.overstock.crm.service.impl.RequestLogIdentityServiceImpl.resolveIdentity(RequestLogIdentityServiceImpl.java:42) ~[Requestlogidentityconsumer-service-0.0.1-373-ea2ee84-43.jar!/:0.0.1-373-ea2ee84-43]
    at com.overstock.crm.consumer.pulsar.RequestLogIdentityMessageListener.received(RequestLogIdentityMessageListener.java:37) ~[classes!/:0.0.1-373-ea2ee84-43]
    at org.apache.pulsar.client.impl.ConsumerImpl.lambda$triggerListener$8(ConsumerImpl.java:1237) ~[pulsar-client-2.6.0.jar!/:2.6.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [pulsar-client-2.6.0.jar!/:2.6.0]
    at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:187) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
    at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:214) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
    at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:182) ~[dgraph4j-20.03.1.jar!/:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1692) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:177) ~[?:?]
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: : cannot retrieve UIDs from list with key 0000087573657273656564020b74032d02f5c69b9a9cea56ca9e8bcad173ca38f27c1746ac0a2a5fb882c6ef1f: readTs: 12747 less than minTs: 31610079 for key: "\\x00\\x00\\buserseed\\x02\\vt\\x03-\\x02\\xf5ƛ\\x9a\\x9c\\xeaVʞ\\x8b\\xca\\xd1s\\xca8\\xf2|\\x17F\\xac\u000a*_\\xb8\\x82\\xc6\\xef\\x1f"
    at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.32.1.jar!/:1.32.1]
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478) ~[grpc-stub-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.32.1.jar!/:1.32.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    ... 1 more

Aslo, as part of upgrade, i deleted old data (and added -v 3 flag)
I am now seeing X5 times the number of these errors!

We recently had this PR that went in master, with which I hope you won’t see the readTs less than minTs errors. But, this change is in master not in 20.03.5.

Do let us know if you get any other errors too. Also, please send over the logs for your new testing.
Since, I wasn’t able to reproduce it, would be great if you could share your testing code. That would help us know what exactly is it that you are trying.

Thanks, Ill email u the app

1 Like

weird… as u can see from our script we are running 20.03.5…
can u re-confirm that this logging is not in this version?

No, the logging will be there. I was just saying that this error would now be seen less frequently in master as we have merged a PR regarding the same issue. The fix is only in master, not in v20.03.5 is what my point was. The logs will be there in 20.03.5

ah, ok. so to put another way, that PR has a good change of solving my issue?
when can i expect to see a 20.03.6 with this fix incluced?
thanks

Yes, that PR has a change that can fix your issue.
I am not sure about if it would go to 20.03.6 or when will 20.03.6 be released.

Tagging @Paras for the same.

That PR would go to 20.03.6, and it should be released in the second half of next month.

2 Likes

@abhimanyusinghgaur Could you kindly confirm whether this issue was fixed in the v20.03.6 ?
Actually we again see the same exception occurring even in version v20.07.2.

Hi @sreenathsp, I just checked, the PR for reducing readTs less than minTs errors is part of both v20.03.6 and v20.07.2
Note that it only reduces the frequency of that error, it doesn’t completely remove the error.

@sreenathsp: our current recommendation is to retry when you encounter these errors as these are transient issues. Could you elaborate how often you encounter them or the client API you are using and if retrying would be an acceptable solution in your case?