Production Upsert - UNKNOWN: Uid: [834751] cannot be greater than lease: [0]

After a successful upgrade to Dgraph v1.1, we rolled out an upsert function that was working locally on a small dataset, but it’s failing in production.
I’m getting this exception:

java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:173) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_212]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592) ~[?:1.8.0_212]
… 5 more
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:212) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNKNOWN: Uid: [834751] cannot be greater than lease: [0]
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_212]
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:180) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: Uid: [834751] cannot be greater than lease: [0]
at io.grpc.Status.asRuntimeException(Status.java:533) ~[functions.jar:?]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675) ~[functions.jar:?]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[functions.jar:?]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[functions.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

on almost all upserts.

I read through some of the other discussion threads, and they all seem to be associated with bulk imports, not upserts or live mutations.

Any ideas?

This is caused by the same code that was discussed here: Upsert with multiple UIDs

This error usually occurs when you try to add a mutation that references a specific UID that Dgraph has not allocated yet. For example, this mutation would not work:

<0xbeef> <name> "My name" .

but this one would:

_:blankNode <name> "My name" .

So the issue seems specific to the contents of your upsert. I would make sure you are not trying to add mutations with specific UIDs that Dgraph does not yet know about.
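The difference can be reproduced quickly against a local v1.1 Alpha over HTTP (a sketch; port 8080 and the `commitNow` query parameter are assumptions about a default deployment):

```shell
# Mutation referencing an unallocated explicit UID. If 0xbeef has never been
# assigned by Zero, this is expected to fail with a
# "cannot be greater than lease" error.
curl -s -H 'Content-Type: application/rdf' \
  'localhost:8080/mutate?commitNow=true' \
  -d '{ set { <0xbeef> <name> "My name" . } }'

# The same triple with a blank node. Dgraph allocates the UID itself,
# so this succeeds.
curl -s -H 'Content-Type: application/rdf' \
  'localhost:8080/mutate?commitNow=true' \
  -d '{ set { _:blankNode <name> "My name" . } }'
```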

The only UIDs I’m using in the upsert are from querying Dgraph, so that can’t be the case unless there’s a serious bug in the bulk loader or something.

Here are the steps to reproduce the upsert (quoting my comments in my other post).

First, run this alter on the schema:

type Products { 
    products: [Product] 
} 
type Product { 
    productId: int 
    options: [Option] 
} 
type Option { 
    optionId: int 
    color: string 
}
<collectionId>: int @index(int) .
<color>: string .
<optionId>: int @index(int) .
<options>: [uid] .
<productId>: int @index(int) .
<products>: [uid] .

Then, run this mutate:

{
  "set":[ {
    "uid": "_:products",
    "dgraph.type": "Products",
    "collectionId": 1,
    "products": [
      {
        "dgraph.type": "Product",
        "uid": "_:product",
        "productId": 19610626,
        "options": [
          {
            "dgraph.type": "Option",
            "uid": "_:option",
            "optionId": 32661491,
            "color": "red"
          }
        ]
      }
    ]
}]
}

Then, run this Java code:

String query = "{\n" +
            "  getVals(func: has(products)) {\n" +
            "    productsUid as uid\n" +
            "    products @filter(eq(productId, 19610626)) {\n" +
            "      productUid  as uid\n" +
            "      options @filter(eq(optionId, 32661491)) {\n" +
            "        optionUid as uid\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";

DgraphProto.Mutation mu =
            DgraphProto.Mutation.newBuilder()
                    // Note: setSetNquads replaces the whole field on the builder,
                    // so all N-Quads must be passed in a single call rather than
                    // one call per quad (otherwise only the last quad survives).
                    .setSetNquads(ByteString.copyFromUtf8(
                            "uid(productsUid) <products> uid(productUid) .\n"
                                    + "uid(productsUid) <dgraph.type> \"Products\" .\n"
                                    + "uid(productUid) <productId> \"19610626\" .\n"
                                    + "uid(productUid) <options> uid(optionUid) .\n"
                                    + "uid(productUid) <dgraph.type> \"Product\" .\n"
                                    + "uid(optionUid) <color> \"blue\" .\n"
                                    + "uid(optionUid) <dgraph.type> \"Option\" .\n"
                                    + "uid(optionUid) <optionId> \"32661491\" ."))
                    .build();
    Map<String, String> uidsMap;
    try(DgraphConnection dgraphConnection = DGraphQueryHelper.createDgraphClient(false, context)){
        Transaction txn = dgraphConnection.getDgraphClient().newTransaction();
        try{
            DgraphProto.Request request = DgraphProto.Request.newBuilder()
                    .setQuery(query)
                    .addMutations(mu)
                    .setCommitNow(true)
                    .build();
            DgraphProto.Response res = txn.doRequest(request);
            uidsMap = res.getUidsMap();
            String exactOutput = res.getJson().toStringUtf8();
        } finally {
            txn.discard();
        }

    } catch (Exception e){
        logger.error("The message on the exception is: " + e.getMessage());
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        e.printStackTrace(pw);
        logger.error("Showing stack trace: " + sw.toString());
        throw e;
    }

Has anyone else successfully run an upsert after a bulk import?

When we run this query from prod Ratel:

upsert {
  query {
    getVals(func: has(products)) {
      productsUid as uid
      products @filter(eq(productId, 19610626)) {
        productUid as uid
        options @filter(eq(optionId, 32661491)) {
          optionUid as uid
        }
      }
    }
  }

  mutation {
    set {
      uid(productsUid) <products> uid(productUid) .
      uid(productsUid) <dgraph.type> "Products" .
      uid(productUid) <productId> "19610626" .
      uid(productUid) <options> uid(optionUid) .
      uid(productUid) <dgraph.type> "Product" .
      uid(optionUid) <dgraph.type> "Option" .
      uid(optionUid) <optionId> "32661491" .
    }
  }
}

we get the same error.

However, when we run it in our test environment (which we also ran a bulk import on), we get a different result:

What do you mean? I’m a bit confused about whether you are running a “bulk” of upserts or whether you bulk-loaded before running the upsert.

Well, if you hit this situation after doing a bulk load, it might mean you’re using a different Zero instance. You should use the same Zero instance that was used during the bulk load, because that instance holds the record of the allocated/mapped UIDs. If you use a Zero instance started from scratch, this error will happen because no UIDs have been allocated (the lease is 0).
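You can check what Zero thinks the lease is via its HTTP /state endpoint (a sketch; 6080 is Zero's default HTTP port, and the exact JSON field name may vary between versions):

```shell
# A Zero started from scratch after the bulk load will report a UID lease
# of 0, which is exactly what the "cannot be greater than lease: [0]"
# error is complaining about.
curl -s localhost:6080/state | grep -o '"maxLeaseId":[0-9]*'
```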

Here are the steps that we ran:

  1. Connected to an Alpha node via docker-standalone exec -it dgraph-alpha /bin/bash
  2. Ran export via curl localhost:8082/admin/export
  3. Exited container
  4. Copied export directory to shared directory
  5. Backed up data files from Alpha and Zero nodes by copying to different shared directory
  6. Stopped Ratel nodes
  7. Stopped Alpha nodes
  8. Stopped Zero nodes
  9. Removed Ratel, Alpha, and Zero containers
  10. Deleted all data files
  11. Started Zero nodes
  12. Identified leader node from logs
  13. Ran gunzip on the exported schema
  14. Edited uid to [uid] in all instances in the unzipped schema file
  15. Copied the exported schema file and data file to the Dgraph Zero leader node
  16. Ran this command on the Dgraph Zero leader node:
      dgraph bulk -f /dgraph/export/dgraph.r1233498.u1021.2005/g01.rdf.gz -s /dgraph/export/dgraph.r1233498.u1021.2005/g01.schema --map_shards=2 --reduce_shards=1 --http localhost:8002 --zero=localhost:5082
  17. Copied the p directory generated by the bulk loader to all of the Alpha data directories
  18. Started the Alpha containers
  19. Started the Ratel container
  20. Attempted to run the mutation in Ratel using the upsert query above.

Are you saying that we need to connect Ratel to a specific instance in the cluster?

That’s the issue that I just talked about.

I’m still not following.

We were following the upgrade procedure to upgrade from an older version of Dgraph.

From what we can tell, the import was successful. We can see our data in the cluster. We just can’t run the mutation.

When you say that we’re using another Zero instance, what do you mean by that? We upgraded the entire cluster.

You are confirming my theory. When you bulk load, you use an instance of Zero to allocate UIDs. That information lives in that instance only, so you should not remove it.

Do the following: increase the UID lease via Zero’s HTTP API, e.g.
/assign?what=uids&num=1000000000 (just in case; pick a value comfortably above the highest UID in your data, and one that fits in a uint64).
Export again, and redo the bulk load process without deleting the Zero instance.
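Concretely, that lease bump looks like this (a sketch; 6080 is Zero's default HTTP port, so adjust it to your deployment, which appears to use non-default ports):

```shell
# Ask Zero to extend its UID lease. The num value is an assumption; it only
# needs to exceed the highest UID present in the bulk-loaded data
# (e.g. 834751 from the error message).
curl -s 'localhost:6080/assign?what=uids&num=1000000000'
```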

How would I avoid eliminating the Zero instance if we need to upgrade the cluster to v1.1?

Are we going to need to roll back the entire cluster and recover from our backup, re-perform the export, re-upgrade the nodes, and re-perform the import?

Or, are you saying that we should just try exporting and re-importing with our current cluster in its current state?

I’m still not understanding how we would upgrade a cluster to v1.1 without replacing all of the Zero containers with ones running the upgraded version, given that we perform the export from an Alpha node and then import from the Zero leader.

Wait… Are you saying that we should not perform the bulk load until we have already upgraded the cluster?

The upgrade is:

  • From the old version, you just export the data. Nothing more. Then bring the cluster running the old version down.
  • Do the bulk load with the new version.
  • When the bulk load finishes, don’t remove the Zero instance. Remove only the Alphas (you should not have any Alphas running at this point) or Ratel (though there’s no need to remove Ratel).
  • Now start the rest of the cluster with the Zero instance from the bulk load.
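Put together, the sequence looks roughly like this (a sketch only; the paths, ports, and container names are assumptions about a typical setup, not a definitive script):

```shell
# 1. Export from the old cluster, then shut it down (keep the export files).
curl -s localhost:8080/admin/export
docker stop old-alpha old-zero

# 2. Start a v1.1 Zero and run the bulk load against that same Zero.
dgraph zero --my=localhost:5080 &
dgraph bulk -f export/g01.rdf.gz -s export/g01.schema \
    --reduce_shards=1 --zero=localhost:5080

# 3. Keep that Zero running. Copy the bulk loader's output (out/0/p)
#    into each Alpha's data directory, then start the Alphas.
cp -r out/0/p /data/alpha1/
dgraph alpha --my=localhost:7080 --zero=localhost:5080 &
```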

If you need to automate this, you can get inspiration from this repo https://github.com/MichelDiz/Dgraph-Bulk-Script


Thanks. I think we might have accidentally run the bulk import before we upgraded the Zero nodes. We will try removing the data directories, re-deploying the Zero nodes, re-running the bulk loader, re-copying the data to the p directories, and re-starting the Alpha nodes.

Oh!! You’re using the standalone image! Never use the standalone image in production, please! There’s a log message warning about this.

It’s okay if you’re just testing and exporting from it. But if you use this image to import or do anything else, it can be tricky.


Do you have a link to the log? Do you know what kinds of issues it can cause? I’ll need to share this information with my team.

When you start the standalone image, it prints a log message saying that you should not use it in production. You could, but you would need to edit the image and add scripts to make it work with the bulk load, if that’s the case.


We ran through the entire process again, and things are working now. Thanks for the help!

