Very slow and aborted Upserts

We read records from Kafka (~50 million events) and want to upsert them into Dgraph.
Our approach is this:
We use the Java client (dgraph4j) and send JSON upserts (up to 500 customers per request):

query {
  cust0 as var(func: eq(Customer.id, "ABC12"))
  cust1 as var(func: eq(Customer.id, "XYZ12"))
}
[ {
	"uid": "uid(cust0)",
	"cust.id": "ABC12",
	"cust.name": "Horst",
	"cust.lastname": "Müller",
	"dgraph.type": "Customer",
	"Customer.address": {
		"uid": "uid(cust0)",
		"Address.plz": "98755",
		"Address.ort": "London",
		"Address.strasse": "Main Street 1",
		"dgraph.type": "Address"
	},
	"Customer.contactdata": {
		"uid": "uid(cust0)",
		"Contactdata.text": "123",
		"Contactdata.typ": "CELLPHONE",
		"dgraph.type": "Contactdata"
	}
}, 
{
	"uid": "uid(cust1)",
	"Customer.id": "ABC12",
	"Customer.vorname": "Horst",
	"Customer.nachname": "Müller",
	"dgraph.type": "Customer",
	"Customer.adressen": {
		"uid": "uid(cust1)",
		"Address.plz": "12345",
		"Address.ort": "Paris",
		"Address.strasse": "Rue 1",
		"dgraph.type": "Address"
	},
	"Customer.contactdata": {
		"uid": "uid(cust1)",
		"Contactdata.text": "123",
		"Contactdata.typ": "MOBILE",
		"dgraph.type": "Contactdata"
	}
}]
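
For context, a simplified sketch of how such a query block and JSON array could be assembled per batch (illustrative only — the helper names and the use of Jackson are placeholders, not our exact production code; the address and contact-data predicates are omitted here):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

class UpsertBatchBuilder {

  // One "custN as var(...)" block per customer in the batch.
  static String buildUpsertQuery(List<String> customerIds) {
    StringBuilder q = new StringBuilder("query {\n");
    for (int i = 0; i < customerIds.size(); i++) {
      q.append(String.format("  cust%d as var(func: eq(Customer.id, \"%s\"))%n", i, customerIds.get(i)));
    }
    return q.append("}").toString();
  }

  // One JSON object per customer, referencing the matching uid(custN) variable.
  static String buildMutationJson(List<String> customerIds) throws Exception {
    List<Map<String, Object>> records = new ArrayList<>();
    for (int i = 0; i < customerIds.size(); i++) {
      Map<String, Object> cust = new LinkedHashMap<>();
      cust.put("uid", "uid(cust" + i + ")");
      cust.put("Customer.id", customerIds.get(i));
      cust.put("dgraph.type", "Customer");
      // ... name, address and contact-data predicates as in the example above
      records.add(cust);
    }
    return new ObjectMapper().writeValueAsString(records); // Jackson used here for serialization
  }
}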

We send the Request like this (query and mutation from above):

import com.google.protobuf.ByteString;
import io.dgraph.AsyncTransaction;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.DgraphProto.Request;

Request request = Request.newBuilder()
        .setQuery(query)
        .addMutations(Mutation.newBuilder()
            .setSetJson(ByteString.copyFromUtf8(mutationJson))
            .build())
        .setCommitNow(true)
        .build();

AsyncTransaction aTxn = asyncClient.newTransaction();
try {
  aTxn.doRequest(request).join(); // wait for the commit-now upsert to finish
} finally {
  aTxn.discard();
}

We have up to 500 upsert records per Tx, and we have limited our application to sending only 2 parallel Txs to Dgraph.
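
A minimal sketch of how such a 2-Tx limit could look around the async client (illustrative only, using a plain java.util.concurrent.Semaphore; not our exact code):

import java.util.concurrent.Semaphore;

import io.dgraph.AsyncTransaction;
import io.dgraph.DgraphAsyncClient;
import io.dgraph.DgraphProto.Request;

class BoundedUpserter {
  private final DgraphAsyncClient asyncClient;
  private final Semaphore inFlight = new Semaphore(2); // at most 2 Txs in flight

  BoundedUpserter(DgraphAsyncClient asyncClient) {
    this.asyncClient = asyncClient;
  }

  void submit(Request request) throws InterruptedException {
    inFlight.acquire(); // blocks the producer side while 2 Txs are pending
    AsyncTransaction txn = asyncClient.newTransaction();
    txn.doRequest(request).whenComplete((response, error) -> {
      txn.discard();      // same cleanup as in the snippet above
      inFlight.release(); // allow the next batch
    });
  }
}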

At the start we see average timings of about 20 ms per record, but the Txs get slower and slower and finally abort.

We use the official standalone Helm chart (dgraph-single) with no modifications.

Any ideas how we could improve the throughput? We need to insert 50 million customers (the live and bulk loaders are not options for us), and we are already facing problems with fewer than 50k customers.

This is an example of the latency after ~50 Txs:

Statistics for 500 records:
parsing_ns: 1464296520
processing_ns: 125619898671
encoding_ns: 38998
assign_timestamp_ns: 62248811
total_ns: 127348347613
avg: 254 ms/req

For comparison: at the beginning the numbers were much better:

Statistics for 500 records:
parsing_ns: 472924261
processing_ns: 18814226338
encoding_ns: 53539
assign_timestamp_ns: 82250874
total_ns: 19375671982
avg: 38 ms/req
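
The field names above match the Latency object that Dgraph returns with each response; a simplified sketch of how such a per-batch summary could be collected with the Java client (illustrative, not our exact code — the recordsInBatch bookkeeping is just an assumption about how the 500-record windows are counted):

import io.dgraph.DgraphProto.Latency;
import io.dgraph.DgraphProto.Response;

// Accumulates the server-side latency that Dgraph reports with every response.
class LatencyStats {
  private long parsingNs, processingNs, encodingNs, assignTimestampNs, totalNs;
  private long records;

  void record(Response response, int recordsInBatch) {
    Latency l = response.getLatency();
    parsingNs         += l.getParsingNs();
    processingNs      += l.getProcessingNs();
    encodingNs        += l.getEncodingNs();
    assignTimestampNs += l.getAssignTimestampNs();
    totalNs           += l.getTotalNs();
    records           += recordsInBatch;
  }

  void print() {
    System.out.printf("Statistics for %d records:%n", records);
    System.out.printf("parsing_ns: %d%n", parsingNs);
    System.out.printf("processing_ns: %d%n", processingNs);
    System.out.printf("encoding_ns: %d%n", encodingNs);
    System.out.printf("assign_timestamp_ns: %d%n", assignTimestampNs);
    System.out.printf("total_ns: %d%n", totalNs);
    // e.g. 127348347613 ns / 500 records ≈ 254 ms
    System.out.printf("avg: %d ms/req%n", totalNs / records / 1_000_000);
  }
}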

Our monitoring tools are showing >450k application goroutines in dgraph-alpha at peak! Alpha used up to 6.4 GB of heap.

Hey @Frank, which version of Dgraph are you using?

v20.11.3

Our monitoring tools also show that the “GO managed memory” (poolname=Stack, pid=27) increases from 7 MB to 1.9 GB and then, after the Txs are aborted, falls back to 7 MB.

Also:
go runtime system call count: 10k
Go to C (cgo) call count: 9k
Parked worker threads: 4 (instead of 136 during times with no update Txs)
Global goroutine run queue size: 9k