After a successful upgrade to Dgraph v1.1, we rolled out an upsert function that was working locally on a small dataset, but it’s failing in production.
I’m getting this exception on almost all upserts:
java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:173) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_212]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592) ~[?:1.8.0_212]
… 5 more
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:212) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: Uid: [834751] cannot be greater than lease: [0]
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_212]
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:180) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: Uid: [834751] cannot be greater than lease: [0]
at io.grpc.Status.asRuntimeException(Status.java:533) ~[functions.jar:?]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675) ~[functions.jar:?]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[functions.jar:?]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[functions.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
I read through some of the other discussion threads, and they all seem to be associated with bulk imports, not upserts or live mutations.
This error is usually raised when you try to add mutations with specific UIDs that Dgraph has not allocated yet. For example, this mutation would not work, because 0xbeef is a hard-coded UID:
<0xbeef> <name> "My name" .
but this one would work, because the blank node lets Dgraph assign the UID:
_:blankNode <name> "My name" .
So the issue seems specific to the contents of your upsert. I would make sure you are not trying to add mutations with specific UIDs that Dgraph does not know about.
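For comparison, here is a minimal sketch of an upsert with the dgraph4j client in which every UID comes from the query block rather than being hard-coded. The host name, predicate names, and lookup value are assumptions for illustration:

import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Connect to an Alpha (9080 is the default gRPC port; "alpha1" is an assumed host).
ManagedChannel channel = ManagedChannelBuilder.forAddress("alpha1", 9080).usePlaintext().build();
DgraphClient dgraphClient = new DgraphClient(DgraphGrpc.newStub(channel));

// The query block binds p to an existing node; uid(p) in the mutation reuses it,
// so no UID outside Zero's lease ever appears in the request.
String query = "query { p as var(func: eq(productId, 42)) }";
Mutation mu = Mutation.newBuilder()
    .setSetNquads(ByteString.copyFromUtf8("uid(p) <color> \"red\" ."))
    .build();
Request request = Request.newBuilder()
    .setQuery(query)
    .addMutations(mu)
    .setCommitNow(true)
    .build();

Transaction txn = dgraphClient.newTransaction();
txn.doRequest(request); // commits immediately because of setCommitNow(true)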
The only UIDs I’m using in the upsert are from querying Dgraph, so that can’t be the case unless there’s a serious bug in the bulk loader or something.
Here are the steps to reproduce the upsert (quoting my comments in my other post).
First, run this alter on the schema:
type Products {
products: [Product]
}
type Product {
productId: int
options: [Option]
}
type Option {
optionId: int
color: string
}
<collectionId>: int @index(int) .
<color>: string .
<optionId>: int @index(int) .
<options>: [uid] .
<productId>: int @index(int) .
<products>: [uid] .
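For completeness, this is roughly how we apply that alter through dgraph4j (a sketch; dgraphClient is an already-connected client and schemaString holds the block above):

import io.dgraph.DgraphProto.Operation;

// Push the type definitions and predicate schema above in a single alter.
Operation op = Operation.newBuilder()
    .setSchema(schemaString)
    .build();
dgraphClient.alter(op); // blocks until the schema change is applied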
What do you mean? I’m a bit confused whether you are running the upserts in bulk, or you bulk-loaded the data before running the upserts.
Well, if you hit this situation after doing a bulk load, it might mean you’re using a different Zero instance. You should use the same Zero instance that was used during the bulk load, because that instance holds the record of the allocated/mapped UIDs. If you use a Zero instance started from scratch, this error will happen because no UIDs have been allocated there.
You are confirming my theory. When you bulk-load, you use a Zero instance to allocate UIDs, and that information lives in that instance only. So you should not eliminate it.
Do the following: increase the UID lease via Zero’s HTTP API, e.g. /assign?what=uids&num=1000000000 (the value must fit in a uint64, so pick something comfortably above your highest UID), just in case.
Then export the data again, and run the bulk load again without deleting the Zero instance.
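If you want to script that lease bump instead of calling it by hand, a minimal sketch (the Zero host name and the lease size are assumptions; 6080 is Zero’s default HTTP port):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Ask Zero to extend the UID lease well beyond the highest UID in the data.
URL url = new URL("http://zero1:6080/assign?what=uids&num=1000000000");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
    System.out.println(in.readLine()); // Zero replies with the newly leased UID range
}
conn.disconnect();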
How would I avoid eliminating the Zero instance if we need to upgrade the cluster to v1.1?
Are we going to need to roll back the entire cluster and recover from our backup, re-perform the export, re-upgrade the nodes, and re-perform the import?
Or, are you saying that we should just try exporting and re-importing with our current cluster in its current state?
I still don’t understand how we would upgrade a cluster to v1.1 without replacing all of the Zero containers with ones running the upgraded version, given that we perform the export from an Alpha node and then run the import against the Zero leader.
From the old version you just export the data, nothing more (see the sketch of the export call after these steps). Then take the cluster running the old version down.
Do the bulk load with the new version.
When the bulk load finishes, don’t remove the Zero instance. Remove only the Alphas (you should not have had any Alphas running during the bulk load anyway) or Ratel (though there is no need to remove Ratel).
Now start the rest of the cluster against the Zero instance left over from the bulk load.
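For the export in the first step, you can hit an Alpha’s HTTP endpoint; a minimal sketch, assuming the default HTTP port 8080 and an assumed host name:

import java.net.HttpURLConnection;
import java.net.URL;

// GET /admin/export on any Alpha (v1.0/v1.1) asks the group to write an export
// to each Alpha's export directory; "alpha1" is an assumed host name.
URL url = new URL("http://alpha1:8080/admin/export");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
System.out.println("export response code: " + conn.getResponseCode());
conn.disconnect();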
Thanks. I think we might have accidentally run the bulk import before we upgraded the Zero nodes. We will try removing the data directories, re-deploying the Zero nodes, re-running the bulk loader, re-copying the data to the p directories, and re-starting the Alpha nodes.
When you start the standalone image, it logs a warning that you should not use it in production. You could, but you would need to edit the image and add scripts to make it work with the bulk loader, if that’s the case.