Concurrent upserts creating aborted transactions

java-client
mutation

(Devin Bost) #1

We added parallelism to the transaction discussed here: Upsert with multiple UIDs

Out of 1248 transactions, 1001 of them threw this exception:
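For reference, the fan-out looked roughly like the following (a hypothetical sketch using plain CompletableFutures; `upsertOne` is a stand-in for the java-client upsert call and just echoes its input here):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelUpserts {
    // Stand-in for the real java-client upsert; here it just echoes the id.
    static String upsertOne(int id) {
        return "txn-" + id;
    }

    public static void main(String[] args) {
        // Submit all upserts concurrently on the common ForkJoinPool,
        // mirroring how the 1248 transactions were fanned out.
        List<CompletableFuture<String>> futures = IntStream.range(0, 1248)
            .mapToObj(i -> CompletableFuture.supplyAsync(() -> upsertOne(i)))
            .collect(Collectors.toList());
        long done = futures.stream().map(CompletableFuture::join).count();
        System.out.println(done); // 1248
    }
}
```

With a real upsert in place of `upsertOne`, any two of these futures that touch the same nodes race, which is where the aborts below come from.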

Showing stack trace:

java.lang.RuntimeException: java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
at io.dgraph.AsyncTransaction.lambda$doRequest$2(AsyncTransaction.java:173)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
at io.dgraph.DgraphAsyncClient.lambda$runWithRetries$2(DgraphAsyncClient.java:208)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
… 5 more
Caused by: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
… 7 more

To quote the link above, this is the upsert:

upsert {
  query {
    getVals(func: has(products)) {
      productsUid as uid
      products @filter(eq(productId, 19610626)) {
        productUid as uid
        options @filter(eq(optionId, 32661491)) {
          optionUid as uid
        }
      }
    }
  }

  mutation {
    set {
      uid(productsUid) <products> uid(productUid) .
      uid(productsUid) <dgraph.type> "Products" .
      uid(productUid) <productId> "19610626" .
      uid(productUid) <options> uid(optionUid) .
      uid(productUid) <dgraph.type> "Product" .
      uid(optionUid) <color> "blue" .
      uid(optionUid) <dgraph.type> "Option" .
      uid(optionUid) <optionId> "32661491" .
    }
  }
}

Is the entire Products collection getting locked and preventing concurrent writes to its children?


(Devin Bost) #2

What could be causing this issue?


(Michel Conrado) #3

Hard to tell; you should troubleshoot it with Jaeger: https://docs.dgraph.io/deploy/#examining-traces-with-jaeger


(Devin Bost) #4

What are the steps for getting the traces to show up in Jaeger? We spun up Jaeger in our cluster and started up Dgraph Zero and Alpha to use Jaeger’s collector endpoint. However, we’re not seeing any additional services show up in the Jaeger UI. Are there additional steps we need to take to see the Dgraph data in Jaeger?


(Michel Conrado) #5

Try this:

./jaeger-all-in-one

./dgraph zero --jaeger.collector=http://localhost:14268 --expose_trace
./dgraph alpha --lru_mb 8000 --jaeger.collector=http://localhost:14268 --expose_trace


(Devin Bost) #6

We still see only the jaeger-query service listed.

Is there a way to troubleshoot the connection? Jaeger is running on a different node in the cluster.
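One quick way to rule out basic reachability from a Dgraph host to the collector is a plain TCP probe (a throwaway sketch; "jaeger-host" is a placeholder for whatever host your --jaeger.collector flag points at):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ProbePort {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "jaeger-host" is hypothetical; substitute the collector's hostname.
        System.out.println(reachable("jaeger-host", 14268, 2000));
    }
}
```

If this returns false from the Dgraph hosts, the problem is network-level (firewall, binding, DNS) rather than anything in Dgraph's tracing code.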


(Devin Bost) #7

I ran curl http://localhost:14269/metrics | grep jaeger_collector_spans (as per the Jaeger troubleshooting doc), and it appears that Jaeger isn’t receiving any information from Dgraph.


(Devin Bost) #8

I just tried moving Jaeger to one of the nodes that’s running Dgraph Zero and Dgraph Alpha, but I’m still not seeing any data reach Jaeger.


(Michel Conrado) #9

If you run ./jaeger-all-in-one, it will print logs, and in those you can check the collector port to use in Dgraph. In general it is 14268, or whatever it prints as "http-port".


(Devin Bost) #10

It looks like we’re using the correct port (14268).

Here are my logs:

2019/10/30 23:09:52 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined
{"level":"info","ts":1572476992.1371222,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572476992.1405323,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572476992.1406393,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1572476992.140683,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1572476992.149204,"caller":"memory/factory.go:56","msg":"Memory storage initialized","configuration":{"MaxTraces":0}}
{"level":"info","ts":1572476992.1587417,"caller":"grpc/builder.go:108","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1572476992.1590357,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.159284,"caller":"grpc/clientconn.go:248","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1593347,"caller":"grpc/resolver_conn_wrapper.go:113","msg":"ccResolverWrapper: sending update to cc: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1599479,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1602342,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc000424040, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1604977,"caller":"all-in-one/main.go:199","msg":"Starting agent"}
{"level":"info","ts":1572476992.160641,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1572476992.1610487,"caller":"grpc/clientconn.go:1139","msg":"grpc: addrConn.createTransport failed to connect to {127.0.0.1:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1611376,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc000424040, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1572476992.1676862,"caller":"all-in-one/main.go:242","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1572476992.1677802,"caller":"grpcserver/grpc_server.go:102","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1572476992.1678443,"caller":"all-in-one/main.go:260","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1572476992.1678705,"caller":"querysvc/query_service.go:130","msg":"Archive storage not created","reason":"archive storage not supported"}
{"level":"info","ts":1572476992.1679666,"caller":"all-in-one/main.go:342","msg":"Archive storage not initialized"}
{"level":"info","ts":1572476992.1687665,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572476992.168798,"caller":"app/server.go:135","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1572476992.1687727,"caller":"app/server.go:112","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1572476992.1688309,"caller":"app/server.go:125","msg":"Starting GRPC server","port":16686}
{"level":"info","ts":1572476992.203854,"caller":"all-in-one/main.go:298","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}
{"level":"info","ts":1572478360.8977518,"caller":"flags/service.go:145","msg":"Shutting down"}
{"level":"info","ts":1572478360.8978343,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1572478360.8982806,"caller":"flags/service.go:153","msg":"Shutdown complete"}
2019/10/30 23:32:59 maxprocs: Leaving GOMAXPROCS=32: CPU quota undefined
{"level":"info","ts":1572478379.9698474,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1572478379.973902,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1572478379.9739938,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1572478379.974023,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1572478379.9798672,"caller":"memory/factory.go:56","msg":"Memory storage initialized","configuration":{"MaxTraces":0}}
{"level":"info","ts":1572478379.9880934,"caller":"grpc/builder.go:108","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1572478379.988265,"caller":"grpc/clientconn.go:242","msg":"parsed scheme: \"\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.988538,"caller":"grpc/clientconn.go:248","msg":"scheme \"\" not registered, fallback to default scheme","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9885778,"caller":"grpc/resolver_conn_wrapper.go:113","msg":"ccResolverWrapper: sending update to cc: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.988817,"caller":"base/balancer.go:76","msg":"base.baseBalancer: got new resolver state: {[{127.0.0.1:14250 0 }] }","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9888725,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc0003c40b0, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9895947,"caller":"all-in-one/main.go:199","msg":"Starting agent"}
{"level":"info","ts":1572478379.9895508,"caller":"grpc/clientconn.go:1139","msg":"grpc: addrConn.createTransport failed to connect to {127.0.0.1:14250 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9897275,"caller":"base/balancer.go:130","msg":"base.baseBalancer: handle SubConn state change: 0xc0003c40b0, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1572478379.9897609,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1572478379.993895,"caller":"all-in-one/main.go:242","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"info","ts":1572478379.9939818,"caller":"grpcserver/grpc_server.go:102","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1572478379.9940424,"caller":"all-in-one/main.go:260","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1572478379.9940631,"caller":"querysvc/query_service.go:130","msg":"Archive storage not created","reason":"archive storage not supported"}
{"level":"info","ts":1572478379.9941854,"caller":"all-in-one/main.go:342","msg":"Archive storage not initialized"}
{"level":"info","ts":1572478379.9947898,"caller":"healthcheck/handler.go:130","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1572478379.9948037,"caller":"app/server.go:112","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1572478379.9948165,"caller":"app/server.go:135","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1572478379.9948158,"caller":"app/server.go:125","msg":"Starting GRPC server","port":16686}
{"level":"info","ts":1572478380.0205536,"caller":"all-in-one/main.go:298","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}

It’s not affected by port offset, is it?


(Michel Conrado) #11

I just started the cluster (Zero and Alpha) with the flags and the right port, and it just works. I did nothing else.


(Devin Bost) #12

In a cluster with multiple Zero and Alpha nodes, where should Jaeger be run?


(Michel Conrado) #13

It depends on what you want: all of them or only one.

Instances simply send data to Jaeger; there are no specific rules for clusters or groups. For each instance you will see that instance's specific activities, not group-level aggregates. So I would recommend including all of them.


(Devin Bost) #14

Meanwhile, I added the @upsert directive to all of the predicates in my schema (except the [uid] ones, since I don't know if those can have the directive), but I'm still getting a very large number of these messages:
java.util.concurrent.CompletionException: io.dgraph.TxnConflictException: Transaction has been aborted. Please retry
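Aborts like this are Dgraph's expected signal under write-write conflict; the client is meant to retry them. A minimal retry-with-backoff wrapper might look like this (a sketch, not the java-client API: the maxAttempts value, the backoff curve, and catching bare RuntimeException are illustrative; real code should retry only io.dgraph.TxnConflictException):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public final class RetryUpsert {
    // Runs a transaction-shaped task, retrying when it aborts with a conflict.
    static <T> T withRetries(Supplier<T> txn, int maxAttempts) throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return txn.get();
            } catch (RuntimeException e) {
                last = e; // real code: rethrow unless it is a TxnConflictException
                if (attempt == maxAttempts) break;
                // exponential backoff with jitter so retries don't re-collide
                Thread.sleep((1L << attempt) * 10 + ThreadLocalRandom.current().nextLong(20));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] conflicts = {2}; // simulate two aborts before success
        String result = withRetries(() -> {
            if (conflicts[0]-- > 0) throw new RuntimeException("Transaction has been aborted. Please retry");
            return "committed";
        }, 5);
        System.out.println(result); // committed
    }
}
```

Worth noting: as I understand it, the @upsert directive adds conflict detection on indexed predicates, so it tends to increase aborts rather than reduce them; retrying, or partitioning the work so concurrent transactions touch disjoint nodes, is what actually brings the failure count down.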


(Devin Bost) #15

If we can’t establish parallelism, then my company will not be able to use Dgraph going forward.


(Devin Bost) #16

I ran tcpdump on port 14268, and there is no traffic, so it appears that Dgraph is not sending the data to that port.


(Michel Conrado) #17

Please share the whole cluster configs and machine stats so I can understand your context.

I don't know what it can be; I can run it just fine. It must be something very specific to your context.

Have you started the cluster with the --jaeger.collector flag? Did you use the right port the first time?


(Devin Bost) #18

We’re running Dgraph Zero and Alpha on a 5 node VM cluster.
Each node is running an instance of Dgraph Zero and an instance of Dgraph Alpha. We’re running the production Dgraph containers.

Here is the script we run to start Dgraph Zero as a docker container on the host named app12. (app08 is our leader):

/usr/local/bin/dgraph-zero :

#!/bin/bash

. /usr/local/bin/docker-functions

DATA=/opt/data/dgraph-zero
IMAGE=docker.*obfuscated*.com/dgraph/dgraph:v1.1.0
OFFSET=2
REPLICAS=3
ZERO_HOST=app08.*obfuscated*.com
ZERO_PORT=5080
ZERO_OFFSET=`expr ${ZERO_PORT} + ${OFFSET}`
IDX=5
JAEGER_COLLECTOR=http://app08.*obfuscated*.com:14268

export DOCKER_RUN=" -d \
--name dgraph-zero \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v ${DATA}:/dgraph \
${IMAGE} dgraph zero \
--replicas ${REPLICAS} \
--port_offset ${OFFSET} \
--my $(hostname -f):${ZERO_OFFSET} \
--jaeger.collector=${JAEGER_COLLECTOR} \
--expose_trace \
--idx ${IDX}"

if [ ${IDX} -gt 1 ]; then
    DOCKER_RUN+=" --peer ${ZERO_HOST}:${ZERO_OFFSET}"
fi

run "$@"

Here is the script we run to start Dgraph Alpha as a docker container:

/usr/local/bin/dgraph-alpha :

#!/bin/bash

. /usr/local/bin/docker-functions

DATA=/opt/data/dgraph-alpha
IMAGE=docker.*obfuscated*.com/dgraph/dgraph:v1.1.0
OFFSET=2
ZERO_HOST=app08.*obfuscated*.com
ZERO_PORT=5080
ALPHA_PORT=7080
LRU_MB=20000
ALPHA_OFFSET=`expr ${ALPHA_PORT} + ${OFFSET}`
ZERO_OFFSET=`expr ${ZERO_PORT} + ${OFFSET}`
JAEGER_COLLECTOR=http://app08.*obfuscated*.com:14268

export DOCKER_RUN=" -d \
--name dgraph-alpha \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v ${DATA}:/dgraph \
${IMAGE} dgraph alpha \
--port_offset ${OFFSET} \
--lru_mb ${LRU_MB} \
--zero ${ZERO_HOST}:${ZERO_OFFSET} \
--jaeger.collector=${JAEGER_COLLECTOR} \
--expose_trace \
--my $(hostname -f):${ALPHA_OFFSET}"

run "$@"

We’re running Jaeger jaegertracing/all-in-one:1.14.0 on the app08 host.
The script we’re running to start Jaeger is here:

/usr/local/bin/jaeger :

#!/bin/bash
. /usr/local/bin/docker-functions

export DOCKER_RUN=" -d \
--memory-swappiness=0 \
--stop-timeout=300 \
--restart=unless-stopped \
--net=host \
-v /etc/jaeger:/opt/jaeger/config \
-e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
docker.*obfuscated*.com/jaegertracing/all-in-one:1.14.0"

run "$@" 

That Jaeger instance is only running on app08. We have Dgraph Zero and Alpha running on all instances in the cluster (app08 - app12).

The hosts are running CentOS 7 (Core).
The Linux Kernel is: Linux 3.10.0-957.10.1.el7.x86_64
The architecture is: x86-64

Is there any additional information that will be helpful?


(Michel Conrado) #19

Are you exposing the port 14268 in Docker?

The comment below is one I made in another issue; try it for your case.

If you have a lot of RAM available, try running your cluster with these flags (just for this situation):

--badger.tables=ram
--badger.vlog=mmap

dgraph alpha -h

  --badger.tables string   [ram, mmap, disk] Specifies how Badger LSM tree is stored. Option sequence consume most to least RAM while providing best to worst read performance respectively. (default "mmap")
  --badger.vlog string     [mmap, disk] Specifies how Badger Value log is stored. mmap consumes more RAM, but provides better performance. (default "mmap")

What I have indicated is a way of assessing whether it could be an I/O bottleneck.

I would then recommend that you run an actual cluster where each group has its own resources. A single instance with a slow HDD tends to have poor performance, but if you have multiple instances with their own resources and well-defined groups, there is no performance loss even with slow HDDs.

Also, Dgraph/BadgerDB works better with SSD/NVMe.

In general, Docker can introduce an I/O bottleneck.


(Devin Bost) #20

Yes, we’re exposing port 14268 in Docker.
I actually just configured and deployed an entire Jaeger cluster with a Jaeger Agent on each Dgraph host, several Jaeger Collector nodes, and a Jaeger Query node with the Cassandra backend in an attempt to resolve any possible I/O issues.
However, we are still not seeing any Dgraph data appear in the Jaeger cluster.

I think there is a bug in Dgraph.