zzl221000
(JimZhang)
January 15, 2021, 6:36am
1
Problem
I use a Java client to update data in Dgraph, at about 1,200 mutations per second. After a while, the machine becomes unresponsive and the load average reaches 500.
Dgraph already holds 800 million nodes and 1.6 billion edges.
The following is the memory usage
Cluster deployment
The HA cluster runs on 3 hosts.
The replica count is 3, and each host runs three Alphas.
Each machine has a 32-core CPU, 196G memory, and 1T SSD.
Dgraph Metadata
dgraph version
Dgraph version : v20.11.0-rc5
Dgraph codename : tchalla
Dgraph SHA-256 : 95d845ecec057813d1a3fc94394ba1c18ada80f584120a024c19d0db668ca24e
Commit SHA-1 : b65a8b10c
Commit timestamp : 2020-12-14 19:09:28 +0530
Branch : HEAD
Go version : go1.15.5
jemalloc enabled : true
ahsan
(Ahsan Barkati)
January 15, 2021, 11:08am
2
Hey @zzl221000, thanks for reporting. Could you please provide the following to help us better understand what is going on?
The alpha logs
The zero logs
A memory profile of alpha/zero. You can take one by running: curl http://localhost:8080/debug/pprof/heap --output heap.out
zzl221000
(JimZhang)
January 15, 2021, 2:39pm
3
The following is the log at that time:
alpha3-leader.log (68.6 KB)
alpha2-leader.log (89.0 KB)
alpha1-leader.log (60.9 KB)
zero-leader.log (18.2 KB)
The faulty node has been restarted, so its heap profile is no longer available.
zzl221000
(JimZhang)
January 19, 2021, 3:34pm
6
Adding the heap profile:
alpha-max-memory-heap (267.8 KB)
This is the heap of the alpha that currently uses the most resources.
For now, I am using Docker’s memory limit mechanism to restart alpha automatically so that the host keeps running.
ibrahim
(Ibrahim Jarif)
January 20, 2021, 10:19am
7
Hey @zzl221000, can you show us your Java program? There is a known issue with making too many txn.mutate calls in one transaction. If you have the following pattern:
txn = newTxn()
for .... {
    ...
    txn.mutate(...)
}
txn.commit()
then the time taken to complete the operation is on the order of N^2.
If you can show us your java program, we can verify if you’re seeing the same issue.
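One possible way to avoid many small mutate calls inside a single transaction is to group the N-Quads first and send each group in one call. The sketch below only shows the grouping step; `NQuadBatcher`, `toBatches`, and the batch size are hypothetical names chosen for illustration and are not part of dgraph4j.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of dgraph4j): groups N-Quad lines so that
// each group can be sent in a single mutate call instead of one call per line.
final class NQuadBatcher {
    private NQuadBatcher() {}

    /** Splits the lines into chunks of at most batchSize, each joined by newlines. */
    static List<String> toBatches(List<String> nquads, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batches = new ArrayList<>();
        for (int start = 0; start < nquads.size(); start += batchSize) {
            int end = Math.min(start + batchSize, nquads.size());
            batches.add(String.join("\n", nquads.subList(start, end)));
        }
        return batches;
    }
}
```

Each resulting string could then be passed to a mutation builder through `setSetNquads(ByteString.copyFromUtf8(batch))`, so that N lines cost roughly N / batchSize mutate calls. A suitable batch size is an assumption that would need to be measured against the cluster.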
zzl221000
(JimZhang)
January 20, 2021, 12:15pm
8
Hey @ibrahim, here is part of the code I use to write data to Dgraph:
return Task.fromCompletionStage("do_write", () -> {
    AsyncTransaction txn = dgraphClient.newTransaction();
    DgraphProto.Mutation.Builder builder = DgraphProto.Mutation.newBuilder().setCommitNow(true);
    if (RLStringUtils.hasText(set)) {
        builder.setSetNquads(ByteString.copyFromUtf8(set));
    }
    if (RLStringUtils.hasText(del)) {
        builder.setDelNquads(ByteString.copyFromUtf8(del));
    }
    return txn.mutate(builder.build()).thenApply(response -> {
        txn.discard();
        return response.getUidsCount();
    });
})
zzl221000
(JimZhang)
January 20, 2021, 12:26pm
9
I think I understand the problem now. To increase application throughput, I run the whole process asynchronously, so there are probably too many mutation calls in flight at once. Is there a way to fix it?
@ibrahim
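If the cause is indeed too many asynchronous mutations in flight at once, one common client-side mitigation is to cap concurrency with a semaphore. This is a minimal sketch under that assumption, not a confirmed fix for this issue; `BoundedAsync`, `submit`, and `availableSlots` are hypothetical names, and the supplier stands in for a call such as `txn.mutate(...)`.

```java
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: allows at most maxInFlight async calls to run at once.
final class BoundedAsync {
    private final Semaphore permits;

    BoundedAsync(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller until a slot is free, then starts the async call. */
    <T> CompletionStage<T> submit(Supplier<CompletionStage<T>> call) {
        permits.acquireUninterruptibly();
        try {
            // Free the slot when the call completes, whether it succeeded or failed.
            return call.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            // The call never started, so give the slot back immediately.
            permits.release();
            throw e;
        }
    }

    /** Number of calls that could start right now without blocking. */
    int availableSlots() {
        return permits.availablePermits();
    }
}
```

The write task above could be wrapped as `limiter.submit(() -> txn.mutate(builder.build()))`. Because `submit` blocks the producing thread when the limit is reached, it applies backpressure to the writer rather than queueing unbounded work; the right limit is an assumption to tune against the observed load.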