It can’t be, otherwise it would happen to me too. However, I am running it locally via binaries, not Docker, for my tests. So if it only happens on your side, it’s probably something related to Docker or to your config; that’s why I asked about the port.
In the diagram, blue nodes are existing nodes, and red nodes/edges represent nodes that can be created or edited via an upsert.
My concern is that it appears the entire root node is getting locked by all of the upsert cases (as per the diagram). This behavior would be equivalent to locking an entire table when only specific rows need to be edited or added, as in the diagram below:
Can someone please confirm whether this is the case? If it is, then it’s a serious architectural/design flaw in Dgraph that needs to be corrected urgently.
Dgraph does not do locking for transactions. Instead, it executes the mutations and, at commit time, checks whether there was a conflict. To perform these checks, we do create conflict keys. If all the nodes you’re operating on are connected to one root node in your dataset, it’s possible that multiple concurrently executing transactions can conflict with each other.
Adding the @upsert directive leads to broader conflict keys, because it is designed to include indices in conflict detection as well.
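For example, the directive is applied in the schema; a minimal sketch of setting it with the Java client, assuming an existing `DgraphClient` and an illustrative predicate name and index type:

```
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Operation;

class SchemaSetup {
  // Sets @upsert (together with an index) on an illustrative predicate.
  static void applySchema(DgraphClient client) {
    String schema = "productId: string @index(exact) @upsert .";
    client.alter(Operation.newBuilder().setSchema(schema).build());
  }
}
```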
No, that should not be the case. Can you share your schema?
Initially, we did not use the @upsert directive on any field in the schema. After we added the @upsert directive to our predicates, we actually noticed a decrease in the ratio of transactions throwing the TxnConflictException, from approximately 80% of all transactions to approximately 60%.
Most of the incoming messages for the upsert involve different ProductId values, so it’s very surprising that we would see so many transaction conflicts.
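To make the pattern concrete, here is a simplified sketch of such a ProductId-keyed upsert with the Java client. It is illustrative only (predicate names, field values, and the commit-now choice are assumptions), not our exact code:

```
import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;

class ProductUpsert {
  // Upsert keyed on a unique productId; predicate/field names are illustrative.
  static void upsertProduct(DgraphClient client, String productId, String name) {
    String query = String.format(
        "query { p as var(func: eq(productId, \"%s\")) }", productId);
    Mutation mu = Mutation.newBuilder()
        .setSetNquads(ByteString.copyFromUtf8(
            "uid(p) <productId> \"" + productId + "\" .\n"
                + "uid(p) <name> \"" + name + "\" ."))
        .build();
    Request req = Request.newBuilder()
        .setQuery(query)
        .addMutations(mu)
        .setCommitNow(true)   // commit as part of this request
        .build();
    Transaction txn = client.newTransaction();
    try {
      txn.doRequest(req);
    } finally {
      txn.discard();          // no-op if the request already committed
    }
  }
}
```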
If Dgraph meets our performance requirements, we plan to run it at scale. Our initial production requirements are approximately 1,700 transactions per second with near-real-time latency (preferably under 300 ms), and we plan to process significantly more transactions per second (one to two orders of magnitude more) if Dgraph passes our phase 1 tests. So meeting our scaling requirements will be critical for our use cases.
You mentioned:
If all the nodes you’re operating on are connected to one root node in your dataset, it’s possible that multiple concurrently executing transactions can conflict with each other.
(where the integer and string values are provided by incoming messages.)
That dropped my transaction error rate to 20.4% and reduced the average latency from ~300 ms to 41 ms.
I then deleted the Products node from the Dgraph schema.
After waiting a while, I started my function again and noticed that the transaction error rate (`java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry`) went back up to 57.1%.
So, I’m at a loss for what could be going on here…
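For context, a common way to handle these aborts with the Java client is to retry the whole transaction. A minimal sketch follows; the retry count, the linear backoff, and the use of a commit-now request are illustrative assumptions:

```
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;
import io.dgraph.TxnConflictException;

class RetryingUpsert {
  // Retry a commit-now request when the transaction is aborted due to a conflict.
  static void runWithRetry(DgraphClient client, Request req, int maxRetries)
      throws InterruptedException {
    for (int attempt = 1; ; attempt++) {
      Transaction txn = client.newTransaction();
      try {
        txn.doRequest(req);
        return;
      } catch (RuntimeException e) {
        // The conflict may arrive directly or wrapped (e.g. in a CompletionException).
        boolean conflict = e instanceof TxnConflictException
            || e.getCause() instanceof TxnConflictException;
        if (!conflict || attempt >= maxRetries) {
          throw e;
        }
        Thread.sleep(50L * attempt);  // back off a little before retrying
      } finally {
        txn.discard();
      }
    }
  }
}
```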
```
java.lang.RuntimeException: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:173) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_212]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_212]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: The doRequest encountered an execution exception:
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592) ~[?:1.8.0_212]
… 5 more
Caused by: java.lang.RuntimeException: The doRequest encountered an execution exception:
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:212) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999658649ns
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_212]
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:180) ~[functions.jar:?]
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_212]
… 5 more
Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999658649ns
at io.grpc.Status.asRuntimeException(Status.java:533) ~[functions.jar:?]
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:442) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700) ~[functions.jar:?]
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[functions.jar:?]
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[functions.jar:?]
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:507) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:627) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:515) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:686) ~[functions.jar:?]
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:675) ~[functions.jar:?]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[functions.jar:?]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[functions.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
```
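Separately from the conflict aborts, that trace shows a call hitting a roughly 10-second gRPC deadline. One way to control the per-call deadline with the Java client is a gRPC client interceptor; a minimal sketch, where the host, port, and the 30-second value are illustrative assumptions:

```
import java.util.concurrent.TimeUnit;

import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.MethodDescriptor;

class ClientWithDeadline {
  // Build a client whose gRPC calls each get a fresh deadline.
  static DgraphClient build() {
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 9080).usePlaintext().build();

    ClientInterceptor perCallDeadline = new ClientInterceptor() {
      @Override
      public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
          MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        // Apply the deadline when the call starts, not when the stub is built.
        return next.newCall(method, callOptions.withDeadlineAfter(30, TimeUnit.SECONDS));
      }
    };

    DgraphGrpc.DgraphStub stub =
        DgraphGrpc.newStub(channel).withInterceptors(perCallDeadline);
    return new DgraphClient(stub);
  }
}
```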
I restarted the jaeger-collector nodes, and they’re giving me:
{"level":"info","ts":1573584984.721451,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1573584984.7215617,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1573584984.721596,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
2019/11/12 18:56:24 gocql: unable to dial control conn 10.1.2.3: dial tcp 10.1.2.3:9042: connect: connection refused
{"level":"fatal","ts":1573584984.7342262,"caller":"collector/main.go:91","msg":"Failed to init storage factory","error":"gocql: unable to create session: control: unable to connect to initial hosts: dial tcp 10.1.2.3:9042: connect: connection refused","stacktrace":"main.main.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/collector/main.go:91\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:762\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/collector/main.go:182\nruntime.main\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/proc.go:200"}```
Also, I restarted the `dgraph-zero` nodes, and there's nothing in their logs about Jaeger at all.
I was able to successfully deploy a new cluster to another environment, so it seems like something is in a bad state. I'm not sure what else to check.
So… it turns out that all we needed to do was stop Dgraph, remove the Dgraph Docker containers, and redeploy them, and Jaeger started working in production.