Concurrent upserts creating aborted transactions

We added parallelism to the transaction discussed here: Upsert with multiple UIDs - #43 by devinbost

Out of 1248 transactions, 1001 of them threw this exception:

Showing stack trace: java.lang.RuntimeException: java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:173)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:208)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
… 5 more
Caused by: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
… 7 more

To quote the link above, this is the upsert:

upsert {
  query {
    getVals(func: has(products)) {
      productsUid as uid
      products @filter(eq(productId, 19610626)) {
        productUid as uid
        options @filter(eq(optionId, 32661491)) {
          optionUid as uid
        }
      }
    }
  }

  mutation {
    set {
      uid(productsUid) <products> uid(productUid) .
      uid(productsUid) <dgraph.type> "Products" .
      uid(productUid) <productId> "19610626" .
      uid(productUid) <options> uid(optionUid) .
      uid(productUid) <dgraph.type> "Product" .
      uid(optionUid) <color> "blue" .
      uid(optionUid) <dgraph.type> "Option" .
      uid(optionUid) <optionId> "32661491" .
    }
  }
}

Is the entire Products collection getting locked and preventing concurrent writes to its children?

What could be causing this issue?

Hard to tell; you should troubleshoot it with Jaeger: https://docs.dgraph.io/deploy/#examining-traces-with-jaeger


What are the steps for getting the traces to show up in Jaeger? We spun up Jaeger in our cluster and started up Dgraph Zero and Alpha to use Jaeger’s collector endpoint. However, we’re not seeing any additional services show up in the Jaeger UI. Are there additional steps we need to take to see the Dgraph data in Jaeger?

Try this:

./jaeger-all-in-one

./dgraph zero --jaeger.collector=http://localhost:14268 --expose_trace
./dgraph alpha --lru_mb 8000 --jaeger.collector=http://localhost:14268 --expose_trace


We still see only the jaeger-query service listed.

Is there a way to troubleshoot the connection? Jaeger is running on a different node in the cluster.

I ran curl http://localhost:14269/metrics | grep jaeger_collector_spans (as per the Jaeger troubleshooting doc), and it appears that Jaeger isn’t receiving any information from Dgraph.

I just tried moving Jaeger to one of the nodes that’s running Dgraph Zero and Dgraph Alpha, but I’m still not seeing any data reach Jaeger.

If you run ./jaeger-all-in-one, it prints logs, and in those you can check the collector port to use with Dgraph. In general it is 14268, or whatever it prints as "http-port".
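
For example, the line to look for in the all-in-one output is something like this (the port may differ in your setup):

{"level":"info","caller":"all-in-one/main.go:260","msg":"Starting jaeger-collector HTTP server","http-port":14268}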

It looks like we’re using the correct port (14268).

Here are my logs:

2019/10/30 23:09:52 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined
{"level":"info","ts":1572476992.1371222,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572476992.1405323,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572476992.1406393,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1572476992.140683,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1572476992.149204,"caller":"memory/factory.go:56","msg":"Memory storage initialized","configuration":{"MaxTraces":0}}
{"level":"info","ts":1572476992.1587417,"caller":"grpc/builder.go:108","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1572476992.1590357,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.159284,"caller":"grpc/clientconn.go:248","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1593347,"caller":"grpc/resolver_conn_wrapper.go:113","msg":"ccResolverWrapper: sending update to cc: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1599479,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1602342,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc000424040, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1604977,"caller":"all-in-one/main.go:199","msg":"Starting agent"}
{"level":"info","ts":1572476992.160641,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1572476992.1610487,"caller":"grpc/clientconn.go:1139","msg":"grpc: addrConn.createTransport failed to connect to {127.0.0.1:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:14250: connect: connection refused\". Reconnecting…","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1611376,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc000424040, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1676862,"caller":"all-in-one/main.go:242","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1572476992.1677802,"caller":"grpcserver/grpc_server.go:102","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1572476992.1678443,"caller":"all-in-one/main.go:260","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1572476992.1678705,"caller":"querysvc/query_service.go:130","msg":"Archive storage not created","reason":"archive storage not supported"}
{"level":"info","ts":1572476992.1679666,"caller":"all-in-one/main.go:342","msg":"Archive storage not initialized"}
{"level":"info","ts":1572476992.1687665,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572476992.168798,"caller":"app/server.go:135","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1572476992.1687727,"caller":"app/server.go:112","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1572476992.1688309,"caller":"app/server.go:125","msg":"Starting GRPC server","port":16686}
{"level":"info","ts":1572476992.203854,"caller":"all-in-one/main.go:298","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}
{"level":"info","ts":1572478360.8977518,"caller":"flags/service.go:145","msg":"Shutting down"}
{"level":"info","ts":1572478360.8978343,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1572478360.8982806,"caller":"flags/service.go:153","msg":"Shutdown complete"}
2019/10/30 23:32:59 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined
{"level":"info","ts":1572478379.9698474,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572478379.973902,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572478379.9739938,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1572478379.974023,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1572478379.9798672,"caller":"memory/factory.go:56","msg":"Memory storage initialized","configuration":{"MaxTraces":0}}
{"level":"info","ts":1572478379.9880934,"caller":"grpc/builder.go:108","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1572478379.988265,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.988538,"caller":"grpc/clientconn.go:248","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9885778,"caller":"grpc/resolver_conn_wrapper.go:113","msg":"ccResolverWrapper: sending update to cc: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.988817,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9888725,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc0003c40b0, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9895947,"caller":"all-in-one/main.go:199","msg":"Starting agent"}
{"level":"info","ts":1572478379.9895508,"caller":"grpc/clientconn.go:1139","msg":"grpc: addrConn.createTransport failed to connect to {127.0.0.1:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:14250: connect: connection refused\". Reconnecting…","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9897275,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc0003c40b0, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9897609,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1572478379.993895,"caller":"all-in-one/main.go:242","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1572478379.9939818,"caller":"grpcserver/grpc_server.go:102","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1572478379.9940424,"caller":"all-in-one/main.go:260","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1572478379.9940631,"caller":"querysvc/query_service.go:130","msg":"Archive storage not created","reason":"archive storage not supported"}
{"level":"info","ts":1572478379.9941854,"caller":"all-in-one/main.go:342","msg":"Archive storage not initialized"}
{"level":"info","ts":1572478379.9947898,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572478379.9948037,"caller":"app/server.go:112","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1572478379.9948165,"caller":"app/server.go:135","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1572478379.9948158,"caller":"app/server.go:125","msg":"Starting GRPC server","port":16686}
{"level":"info","ts":1572478380.0205536,"caller":"all-in-one/main.go:298","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}

It’s not affected by port offset, is it?

I just started the cluster (Zero and Alpha) with the flags and the right port, and it just works. I did nothing else.

In a cluster with multiple Zero and Alpha nodes, where should Jaeger be run?

It depends on what you want: all of them, or only one.

Instances simply send data to Jaeger; there are no specific rules for clusters or groups. For each instance you will see that instance's own activity, not anything aggregated by group. So I would recommend pointing all of them at Jaeger.

Meanwhile, I added the @upsert directive to all of the predicates in my schema (except the ones for [uid] since I don’t know if those can have the directive), but I’m still getting a very large quantity of these messages:
java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
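
For reference, the indexed predicates now look roughly like this in the schema (a sketch; the types and index tokenizers shown are just illustrative, and the [uid] edges are the ones I left without the directive):

productId: int @index(int) @upsert .
optionId: int @index(int) @upsert .
color: string @index(exact) @upsert .
products: [uid] .
options: [uid] .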

If we can’t establish parallelism, then my company will not be able to use Dgraph going forward.
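
For now, the only workaround we have is to catch the aborted transaction and retry the whole upsert, roughly along these lines (a rough sketch against dgraph4j's synchronous client; the method name, retry count, and backoff values are just ones we picked, not anything prescribed by the client):

import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;
import io.dgraph.Transaction;
import io.dgraph.TxnConflictException;

public class UpsertRetry {

  private static final int MAX_RETRIES = 5; // arbitrary; tune for your workload

  // Runs the upsert block (query + set n-quads) and retries when the
  // transaction is aborted because of a conflict with a concurrent write.
  public static void upsertWithRetry(DgraphClient client, String query, String nquads)
      throws InterruptedException {
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      Transaction txn = client.newTransaction();
      try {
        Mutation mu = Mutation.newBuilder()
            .setSetNquads(ByteString.copyFromUtf8(nquads))
            .build();
        Request request = Request.newBuilder()
            .setQuery(query)
            .addMutations(mu)
            .setCommitNow(true)
            .build();
        txn.doRequest(request);
        return; // committed successfully
      } catch (RuntimeException e) {
        // The conflict can arrive wrapped (RuntimeException -> CompletionException
        // -> TxnConflictException), as in the stack trace above, so walk the causes.
        if (!isConflict(e) || attempt == MAX_RETRIES) {
          throw e;
        }
        Thread.sleep(100L * attempt); // simple linear backoff; value is arbitrary
      } finally {
        txn.discard();
      }
    }
  }

  private static boolean isConflict(Throwable t) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (c instanceof TxnConflictException) {
        return true;
      }
    }
    return false;
  }
}

That at least lets the batch finish, but with 1001 of 1248 transactions aborting, retrying is mostly wasted work.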

I ran tcpdump on port 14268, and there is no traffic, so it appears that Dgraph is not sending the data to that port.

Please share the whole cluster config and machine stats so I can understand your context.

Dunno what it can be; I can run it just fine. It must be something very specific in your context.

Have you started the cluster with the --jaeger.collector flag? Did you use the right port the first time?

We’re running Dgraph Zero and Alpha on a 5 node VM cluster.
Each node is running an instance of Dgraph Zero and an instance of Dgraph Alpha. We’re running the production Dgraph containers.

Here is the script we run to start Dgraph Zero as a Docker container on the host named app12 (app08 is our leader):

/usr/local/bin/dgraph-zero :

#!/bin/bash

. /usr/local/bin/docker-functions

DATA=/opt/data/dgraph-zero
IMAGE=docker.*obfuscated*.com/dgraph/dgraph:v1.1.0
OFFSET=2
REPLICAS=3
ZERO_HOST=app08.*obfuscated*.com
ZERO_PORT=5080
ZERO_OFFSET=`expr ${ZERO_PORT} + ${OFFSET}`
IDX=5
JAEGER_COLLECTOR=http://app08.*obfuscated*.com:14268

export DOCKER_RUN=" -d \
--name dgraph-zero \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v ${DATA}:/dgraph \
${IMAGE} dgraph zero \
--replicas ${REPLICAS} \
--port_offset ${OFFSET} \
--my $(hostname -f):${ZERO_OFFSET} \
--jaeger.collector=${JAEGER_COLLECTOR} \
--expose_trace \
--idx ${IDX}"

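# Zeros other than the first (IDX > 1) join the cluster through the Zero running on ZERO_HOST.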
if [ ${IDX} -gt 1 ]; then
    DOCKER_RUN+=" --peer ${ZERO_HOST}:${ZERO_OFFSET}"
fi

run "$@"

Here is the script we run to start Dgraph Alpha as a docker container:

/usr/local/bin/dgraph-alpha :

#!/bin/bash

. /usr/local/bin/docker-functions

DATA=/opt/data/dgraph-alpha
IMAGE=docker.*obfuscated*.com/dgraph/dgraph:v1.1.0
OFFSET=2
ZERO_HOST=app08.*obfuscated*.com
ZERO_PORT=5080
ALPHA_PORT=7080
LRU_MB=20000
ALPHA_OFFSET=`expr ${ALPHA_PORT} + ${OFFSET}`
ZERO_OFFSET=`expr ${ZERO_PORT} + ${OFFSET}`
JAEGER_COLLECTOR=http://app08.*obfuscated*.com:14268

export DOCKER_RUN=" -d \
--name dgraph-alpha \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v ${DATA}:/dgraph \
${IMAGE} dgraph alpha \
--port_offset ${OFFSET} \
--lru_mb ${LRU_MB} \
--zero ${ZERO_HOST}:${ZERO_OFFSET} \
--jaeger.collector=${JAEGER_COLLECTOR} \
--expose_trace \
--my $(hostname -f):${ALPHA_OFFSET}"

run "$@"

We’re running Jaeger jaegertracing/all-in-one:1.14.0 on the app08 host.
The script we’re running to start Jaeger is here:

/usr/local/bin/jaeger :

#!/bin/bash
. /usr/local/bin/docker-functions

export DOCKER_RUN=" -d \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v /etc/jaeger:/opt/jaeger/config \
-e "COLLECTOR_ZIPKIN_HTTP_PORT=9411" \
docker.*obfuscated*.com/jaegertracing/all-in-one:1.14.0"

run "$@" 

That Jaeger instance is only running on app08. We have Dgraph Zero and Alpha running on all instances in the cluster (app08 - app12).

The hosts are running CentOS 7 (Core).
The Linux Kernel is: Linux 3.10.0-957.10.1.el7.x86_64
The architecture is: x86-64

Is there any additional information that will be helpful?

Are you exposing port 14268 in Docker?

Below is a comment I made in another issue - try it for your case.

If you have a lot of RAM available, try running your cluster with these flags (just for this situation):

--badger.tables=ram
--badger.vlog=mmap

dgraph alpha -h

  --badger.tables string  [ram, mmap, disk]
      Specifies how Badger LSM tree is stored. Option sequence consume most to least RAM
      while providing best to worst read performance respectively. (default "mmap")
  --badger.vlog string  [mmap, disk]
      Specifies how Badger Value log is stored. mmap consumes more RAM, but provides
      better performance. (default "mmap")
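
For example, reusing the alpha invocation from earlier in this thread, the command would become something like:

./dgraph alpha --lru_mb 8000 --badger.tables=ram --badger.vlog=mmap --jaeger.collector=http://localhost:14268 --expose_trace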

What I have indicated is a way of assessing whether there could be an I/O bottleneck.

I would also recommend that you run an actual cluster where each group has its own resources. A single instance with a slow HDD tends to perform poorly, but if you have multiple instances with their own resources, even with slow HDDs and well-defined groups, there is no performance loss.

Also, Dgraph/BadgerDB works better with SSD/NVMe.

In general, Docker can introduce an I/O bottleneck.


Yes, we’re exposing port 14268 in Docker.
I actually just configured and deployed an entire Jaeger cluster with a Jaeger Agent on each Dgraph host, several Jaeger Collector nodes, and a Jaeger Query node with the Cassandra backend in an attempt to resolve any possible I/O issues.
However, we are still not seeing any Dgraph data appear in the Jaeger cluster.

I think there is a bug in Dgraph.