Have you tried reproducing the issue with the latest release?
Already on the latest release.
What is the hardware spec (RAM, OS)?
40 GB of RAM on Debian 9
Steps to reproduce the issue (command/config used to run Dgraph).
I have a 3-node cluster on 3 machines (a zero + an alpha on each node), running with docker-compose.
I am adding data through the GraphQL API, in my case a User model with a friendship relation to itself.
The schema of the User model is something like this:
type User {
  id: String! @id
  username: String @search(by: [hash])
  friends: [User]
}
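For reference, users are added with the generated addUser mutation; a minimal sketch of such a mutation for this schema (the ids, usernames, and nesting are made up for illustration, not taken from my actual data):

mutation {
  addUser(input: [{
    id: "u1"
    username: "alice"
    friends: [{ id: "u2", username: "bob" }]
  }]) {
    user {
      id
    }
  }
}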
Expected behaviour and actual result.
Each addUser mutation increases the alpha's memory usage significantly, so after fewer than 1,000 mutations the alpha container crashes with an “out of memory” error.
@mbande, could you provide details on how you are adding the users to the cluster? Is it curl, a dgraph client, or something else?
Also, as I understand it, this is a 3-container setup, with each container running 1 zero and 1 alpha.
Please also share the docker-compose or equivalent you used to bring up the cluster. I am looking to see if we can reproduce this on our end.
It seems that memory consumption is directly related to input size.
When I add a user with all of their friends (e.g. 500 friends) in one mutation, memory usage grows without bound.
But when I add the user's friends chunk by chunk (e.g. in chunks of 10), memory usage levels off at some point.
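To clarify what I mean by chunking, a rough sketch of one chunk (illustrative only; the filter and ids are made up, and I am assuming the generated updateUser mutation is what attaches each chunk of friends):

mutation {
  updateUser(input: {
    filter: { id: { eq: "u1" } }
    set: { friends: [{ id: "u100" }, { id: "u101" }] }
  }) {
    user {
      id
    }
  }
}

One such mutation is sent per chunk of ~10 friends, instead of a single mutation carrying all 500.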
In that issue, the write-ahead log grows with the mutations. Maybe bigger mutations have a bigger impact on the size of the write-ahead log, or make it harder to clean up. I’ll try to reproduce the issue.
Yeah, it appears to be that. I see similar behavior in 20.07.1-rc1 with each mutation having 5000 unique nodes. But in the head of the release/v20.07 branch the memory consumption is more stable and does not seem to increase linearly with the number of mutations. In this branch we are loading the write-ahead log in mmap mode by default.
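For context, at the Badger level this roughly corresponds to the value-log loading-mode option. A minimal sketch against the badger v2 API (this is not the exact code Dgraph uses to open its write-ahead-log store, and the directory path is made up):

package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	// MemoryMap mmaps the value log rather than reading it through regular
	// file I/O; options.FileIO would keep it on disk and read on demand.
	opts := badger.DefaultOptions("./w").
		WithValueLogLoadingMode(options.MemoryMap)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}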
I checked out the release/v20.07 branch and built it, then bind-mounted the generated dgraph binary into the docker container.
The new container is much, much better in memory usage. Thank you all, and keep up the good work!
Memory usage history after about 2 hours of continuous mutations (graph attached):
As you can see, the memory leak prevents alpha from running for long; it will eventually crash with OOM. @martinmr, is there any flag that can improve this, e.g. --badger.vlog=disk or something like that?
Or should I just let alpha crash and then restart it?
@mbande It would be great if you could try this on master. We have made a lot of memory optimizations, including manual allocations, since Go’s memory management is not aggressive enough.
I’ll try to do a longer test on my own. I didn’t see a big memory increase yesterday but I also didn’t run the cluster for a long period of time.
Ran the cluster on master using more mutations and I can reproduce the behavior. It looks like the issue is with zero. Below is the output of the memory profile:
go tool pprof -source_path /home/martinmr/go/src/github.com/dgraph-io -trim_path / http://localhost:6180/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6180/debug/pprof/heap
Saved profile in /home/martinmr/pprof/pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.032.pb.gz
File: dgraph
Build ID: df33deaa2cbf50e81c6b75cafb82e0beea5e4429
Type: inuse_space
Time: Sep 16, 2020 at 1:02pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2014.61MB, 99.54% of 2023.94MB total
Dropped 82 nodes (cum <= 10.12MB)
Showing top 10 nodes out of 18
flat flat% sum% cum cum%
1931.40MB 95.43% 95.43% 1931.40MB 95.43% github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit
83.20MB 4.11% 99.54% 83.20MB 4.11% github.com/dgraph-io/badger/v2/skl.newArena
0 0% 99.54% 83.20MB 4.11% github.com/dgraph-io/badger/v2.Open
0 0% 99.54% 83.20MB 4.11% github.com/dgraph-io/badger/v2/skl.NewSkiplist
0 0% 99.54% 83.20MB 4.11% github.com/dgraph-io/dgraph/dgraph/cmd.Execute
0 0% 99.54% 1935.45MB 95.63% github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Server).CommitOrAbort
0 0% 99.54% 1935.45MB 95.63% github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Server).commit
0 0% 99.54% 83.20MB 4.11% github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
0 0% 99.54% 83.20MB 4.11% github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
0 0% 99.54% 1935.45MB 95.63% github.com/dgraph-io/dgraph/protos/pb._Zero_CommitOrAbort_Handler
(pprof) list commit
Total: 1.98GB
ROUTINE ======================== github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit in dgraph/dgraph/cmd/zero/oracle.go
1.89GB 1.89GB (flat, cum) 95.43% of Total
. . 128: ki, err := strconv.ParseUint(k, 36, 64)
. . 129: if err != nil {
. . 130: glog.Errorf("Got error while parsing conflict key %q: %v\n", k, err)
. . 131: continue
. . 132: }
1.89GB 1.89GB 133: o.keyCommit[ki] = src.CommitTs // CommitTs is handed out before calling this func.
. . 134: }
. . 135: return nil
. . 136:}
. . 137:
. . 138:func (o *Oracle) currentState() *pb.OracleDelta {
The keyCommit map seems to grow without bound. @mrjn, I believe you already noticed this issue before. Any ideas on how to deal with it?
@mbande Feel free to try to reproduce the issue on your own, but I think we have enough info on our side. I believe this is an issue regardless of the size of the transactions, but smaller transactions might just mask it better.
So I looked into how this map is populated. It is populated with each transaction and purged when a snapshot happens. Snapshots are controlled via the --snapshot_after flag, which takes the number of transactions after which to take a snapshot. When your mutations are very big, this map fills up faster and takes longer to be purged, which is what @mbande observed.
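To illustrate the pattern described above, here is a simplified sketch (this is not Dgraph's actual Oracle code; the names and the purge trigger are stand-ins):

package main

import "fmt"

type oracleSketch struct {
	keyCommit     map[uint64]uint64 // conflict key -> commit timestamp
	commitsSeen   int
	snapshotAfter int // plays the role of the --snapshot_after setting
}

func (o *oracleSketch) commit(commitTs uint64, conflictKeys []uint64) {
	// Every committed transaction adds all of its conflict keys to the map.
	for _, k := range conflictKeys {
		o.keyCommit[k] = commitTs
	}
	o.commitsSeen++
	// The map is only dropped when a "snapshot" happens, so between snapshots
	// memory grows roughly with (keys per mutation) x (snapshotAfter).
	if o.commitsSeen >= o.snapshotAfter {
		o.keyCommit = make(map[uint64]uint64)
		o.commitsSeen = 0
	}
}

func main() {
	o := &oracleSketch{keyCommit: map[uint64]uint64{}, snapshotAfter: 5000}
	o.commit(1, []uint64{10, 11, 12})
	fmt.Println(len(o.keyCommit)) // 3
}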
So there are two workarounds:
1. Use smaller mutations. Smaller mutations are recommended anyway, as they reduce the chance of conflicts.
2. Change --snapshot_after to a smaller value so that the map is cleared more often.
@martinmr The documented default value for --snapshot_after is 5000 transactions, but I have run more GraphQL mutations than that with no decrease in memory usage. Is this expected?
We could implement an mmap-based balanced binary search tree instead of using a Go map. I’m doing something similar for xidmap, implementing a trie in my branch mrjn/trie.
@martinmr: Would be a fun project for you to pick up – brush up on your CS fundamentals.
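For anyone curious about the idea, here is a rough in-memory illustration of a trie keyed by uint64 (this is not the mrjn/trie branch and is not mmap-backed; it only sketches the kind of structure that could replace the Go map):

package main

import "fmt"

// trieNode is one node of a fixed-depth binary trie over the 64 bits of a key.
type trieNode struct {
	children [2]*trieNode
	value    uint64
	set      bool
}

type trie struct{ root *trieNode }

func newTrie() *trie { return &trie{root: &trieNode{}} }

// put walks the key from the most significant bit to the least significant
// bit, creating nodes as needed, and stores the value at the leaf.
func (t *trie) put(key, value uint64) {
	n := t.root
	for i := 63; i >= 0; i-- {
		bit := (key >> uint(i)) & 1
		if n.children[bit] == nil {
			n.children[bit] = &trieNode{}
		}
		n = n.children[bit]
	}
	n.value, n.set = value, true
}

// get follows the same bit path and reports whether a value was stored.
func (t *trie) get(key uint64) (uint64, bool) {
	n := t.root
	for i := 63; i >= 0; i-- {
		bit := (key >> uint(i)) & 1
		if n.children[bit] == nil {
			return 0, false
		}
		n = n.children[bit]
	}
	return n.value, n.set
}

func main() {
	t := newTrie()
	t.put(42, 7) // e.g. conflict key -> commit timestamp
	v, ok := t.get(42)
	fmt.Println(v, ok) // 7 true
}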