When writing data, Dgraph takes up too much memory

Problem

I use a Java client to update data in Dgraph at about 1200 mutations per second. After a while, the machine becomes unresponsive and the load average reaches 500.
Dgraph already holds 800 million nodes and 1.6 billion edges.
The following is the memory usage

Cluster deployment

The HA cluster runs on 3 hosts.
The replica count is 3, and each host runs three Alphas.
Each machine has a 32-core CPU, 196 GB of memory, and a 1 TB SSD.

Dgraph Metadata

dgraph version
Dgraph version   : v20.11.0-rc5
Dgraph codename  : tchalla
Dgraph SHA-256   : 95d845ecec057813d1a3fc94394ba1c18ada80f584120a024c19d0db668ca24e
Commit SHA-1     : b65a8b10c
Commit timestamp : 2020-12-14 19:09:28 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

Hey @zzl221000, thanks for reporting. Can you please provide the following to help us better understand what is going on?

  1. The alpha logs
  2. The zero logs
  3. Memory profile of alpha/zero. You can take a profile by running: curl http://localhost:8080/debug/pprof/heap --output heap.out

The following is the log at that time:

alpha3-leader.log (68.6 KB)
alpha2-leader.log (89.0 KB)
alpha1-leader.log (60.9 KB)
zero-leader.log (18.2 KB)

The faulty node has been restarted, so its heap profile is no longer available.

ping @ahsan

Adding a heap profile:
alpha-max-memory-heap (267.8 KB)
This is the heap of the Alpha that currently uses the most resources.

For now, I am using Docker's memory limit mechanism to restart Alpha automatically so that the host keeps running.

Hey @zzl221000, can you show us your Java program? There is a known issue with doing too many txn.mutate calls. If you have the following pattern

txn = newTxn()
for .... {
    ...
    txn.mutate(...)
}
txn.commit()

then the time taken to complete the operation is on the order of N^2.
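One way to avoid issuing many separate mutate calls inside a single transaction is to concatenate the N-Quads and send them as one payload per mutation. The following is only a minimal sketch under that assumption: NQuadBatcher and its batch method are hypothetical helpers, not part of the Dgraph Java client, and whether batching removes the slowdown described above is not confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the Dgraph client): groups N-Quad lines
// into newline-joined payloads so that each payload needs only one mutate call.
final class NQuadBatcher {
    private NQuadBatcher() {}

    // Splits the input into chunks of at most maxPerBatch lines and joins
    // each chunk with newlines, preserving the original order.
    static List<String> batch(List<String> nquads, int maxPerBatch) {
        if (maxPerBatch <= 0) {
            throw new IllegalArgumentException("maxPerBatch must be positive");
        }
        List<String> payloads = new ArrayList<>();
        for (int start = 0; start < nquads.size(); start += maxPerBatch) {
            int end = Math.min(start + maxPerBatch, nquads.size());
            payloads.add(String.join("\n", nquads.subList(start, end)));
        }
        return payloads;
    }
}
```

Each returned payload would then be handed to a single mutation, so the number of mutate calls drops from N to roughly N / maxPerBatch.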

If you can show us your java program, we can verify if you’re seeing the same issue.

Hey @ibrahim, here is part of the code I use to write data to Dgraph:

return Task.fromCompletionStage("do_write", () -> {
    AsyncTransaction txn = dgraphClient.newTransaction();
    DgraphProto.Mutation.Builder builder = DgraphProto.Mutation.newBuilder().setCommitNow(true);
    if (RLStringUtils.hasText(set)) {
        builder.setSetNquads(ByteString.copyFromUtf8(set));
    }
    if (RLStringUtils.hasText(del)) {
        builder.setDelNquads(ByteString.copyFromUtf8(del));
    }
    return txn.mutate(builder.build()).thenApply(response -> {
        txn.discard();
        return response.getUidsCount();
    });
})

I think I understand. To increase application throughput, I run the whole process asynchronously, which probably results in too many concurrent mutation calls. Is there a way to fix it?
@ibrahim
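A common, general way to keep asynchronous writes from piling up is to cap the number of operations in flight with a semaphore. The sketch below only illustrates that idea and is not a confirmed fix for this issue: BoundedAsync is a hypothetical helper, the limit is an arbitrary value to tune, and the supplier passed to submit stands in for the txn.mutate(...) call shown above. Note that submit blocks the calling thread while no permit is free, which may not suit every task framework.

```java
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper (not part of the Dgraph client): limits how many
// asynchronous operations may be running at the same time.
final class BoundedAsync {
    private final Semaphore permits;

    BoundedAsync(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Waits for a free permit, starts the operation, and returns the permit
    // when the operation completes (successfully or with an error).
    <T> CompletionStage<T> submit(Supplier<CompletionStage<T>> operation) {
        permits.acquireUninterruptibly();
        try {
            return operation.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            // The operation failed to start, so give the permit back here.
            permits.release();
            throw e;
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a limiter such as new BoundedAsync(64), each write would be wrapped as limiter.submit(() -> txn.mutate(mutation)), so the producer slows down once 64 mutations are pending instead of queueing an unbounded number of requests.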