Seems to be a memory leak

Report a Dgraph Bug

What version of Dgraph are you using?

20.07.1-rc1

Have you tried reproducing the issue with the latest release?

already the latest release

What is the hardware spec (RAM, OS)?

40 GB of RAM on Debian 9

Steps to reproduce the issue (command/config used to run Dgraph).

I have a 3-node cluster on 3 machines (a Zero + an Alpha on each node), running with docker-compose.
I am trying to add some data through the GraphQL API, in my case a User model with a friendship relation to itself.
The schema of the User type is something like this:

type User {
    id: String! @id
    username: String @search(by: [hash])
    friends: [User]
}

Expected behaviour and actual result.

Each addUser mutation increases Alpha's memory usage significantly, so that after fewer than 1000 mutations the Alpha container crashes with an “out of memory” error.

@mbande, could you provide details on how you are adding the users to the cluster? Is it curl, a Dgraph client, or something else?

Also, as I understand, this is a 3-node setup, with each node running 1 Zero and 1 Alpha.
Please also share the docker-compose file or equivalent you used to bring up the cluster. I am looking to see whether we can reproduce this on our end.

@Paras I’m using Python requests like this:

import requests

# Assumed endpoint: Alpha's GraphQL API on the HTTP port mapped in the compose file below.
dgraph_alpha_url = 'http://localhost:8080/graphql'

def run_graphql(query, variables=None):
    # Standard GraphQL-over-HTTP request body.
    request_body = {'query': query}
    if variables:
        request_body['variables'] = variables
    result = requests.post(
        url=dgraph_alpha_url,
        headers={'Content-type': 'application/json'},
        json=request_body
    )
    result.raise_for_status()
    return result.json()
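
A typical call then looks roughly like this. This is only a sketch: the addUser mutation and AddUserInput shape are assumed to be the ones Dgraph generates for the User type above, and the sample ids/usernames are made up:

ADD_USER_MUTATION = """
mutation addUser($input: [AddUserInput!]!) {
  addUser(input: $input) {
    numUids
  }
}
"""

# Made-up sample data: one user with 500 friends sent in a single mutation,
# which is the pattern that makes Alpha's memory grow quickly (see below).
friends = [{'id': f'friend-{i}', 'username': f'friend_{i}'} for i in range(500)]
new_user = {'id': 'user-1', 'username': 'alice', 'friends': friends}

run_graphql(ADD_USER_MUTATION, variables={'input': [new_user]})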

and this is docker-compose.yml (this one is from the dgraph-2 node):

version: "3.2"
services:
  zero:
    image: dgraph/dgraph:v20.07.1-rc1
    volumes:
      - /data/dgraph/v20.07:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    command: dgraph zero --my=dgraph-2:5080 --replicas 1 --peer dgraph-1 --idx=2
  alpha:
    image: dgraph/dgraph:v20.07.1-rc1
    volumes:
      - /data/dgraph/v20.07:/dgraph
    ports:
      - 7080:7080
      - 8080:8080
      - 9080:9080
    depends_on:
      - zero
    command: dgraph alpha --my=dgraph-2:7080 --lru_mb=8192 --zero=dgraph-1:5080,dgraph-2:5080,dgraph-3:5080

It seems that memory consumption is directly related to input size.
When I add a user with all of their friends (e.g. 500 friends) in one mutation, memory usage grows endlessly,
but when I add the user's friends chunk by chunk (e.g. in chunks of 10), memory usage levels off at some point.
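
A minimal sketch of the chunked approach, reusing the run_graphql helper above. The updateUser mutation and filter shapes are assumed to follow Dgraph's generated API for this schema, and the chunk size and data are arbitrary:

UPDATE_USER_MUTATION = """
mutation updateUser($input: UpdateUserInput!) {
  updateUser(input: $input) {
    numUids
  }
}
"""

def add_friends_in_chunks(user_id, friends, chunk_size=10):
    # Instead of sending all friends in one huge mutation, link them to the
    # user a few at a time; with this pattern memory usage stays bounded.
    for start in range(0, len(friends), chunk_size):
        chunk = friends[start:start + chunk_size]
        run_graphql(UPDATE_USER_MUTATION, variables={'input': {
            'filter': {'id': {'eq': user_id}},
            'set': {'friends': chunk},
        }})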

@Paras @martinmr Can you guys replicate this? If you can, @ibrahim and team can look at it in tomorrow morning’s session.

This appears similar to this issue: https://dgraph.atlassian.net/browse/DGRAPH-1898

In that issue, the write-ahead log grows with the mutations. Maybe bigger mutations have a bigger impact on the size of the write-ahead log, or make it harder to clean up. I’ll try to reproduce the issue.

Yeah, it appears to be that. I see similar behavior in 20.07.1-rc1 with each mutation having 5000 unique nodes. But at the head of the release/v20.07 branch, memory consumption is more stable and does not seem to increase linearly with the number of mutations. In that branch we are loading the write-ahead log in mmap mode by default.

@martinmr I added --badger.vlog=mmap to the Alpha config, but the problem still exists. Even setting --badger.vlog=disk does not help.

That’s setting the value for the p directory. The latest commit on that branch has a way to set the value for the w directory.

I checked out the release/v20.07 branch and built it, then bind-mounted the generated dgraph binary into the Docker container.
The new container is much, much better in memory usage. Thank you all and keep up the good work :slight_smile:

Memory usage history after about 2 hours of continuous mutations:
Screenshot from 2020-09-16 14-36-30
As you can see, the memory leak would prevent Alpha from running for a long time, and it will eventually crash with OOM.
@martinmr, is there any flag that can improve this, e.g. --badger.vlog=disk or something like that?
Or should I just let Alpha crash and then restart it?

I checked both --badger.vlog=disk and --badger.tables=disk as Alpha flags, without any sign of improvement.

Did you try the default --badger.tables=mmap,mmap?

Can you provide us with a memory profile? See instructions at https://dgraph.io/docs/howto/retrieving-debug-information/#memory-profile
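
If it is easier, the profile can also be fetched with the same Python requests setup used in the reproduction script and then analyzed offline. This is a sketch; it assumes Alpha's HTTP port 8080 from the compose file above and the standard /debug/pprof/heap endpoint:

import requests

# Fetch a heap profile from Alpha and save it for offline analysis.
resp = requests.get('http://localhost:8080/debug/pprof/heap')
resp.raise_for_status()
with open('alpha.heap.pb.gz', 'wb') as f:
    f.write(resp.content)

# Then inspect it locally with: go tool pprof alpha.heap.pb.gz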

Would be good to run this on master, because we have all the jemalloc optimizations there.

@mbande It would be great if you could try this in master. We have made a lot of memory optimizations including manual allocations since golang’s memory management is not aggressive enough.

Once this PR (build(jemalloc): Install Jemalloc if missing via Makefile. by jarifibrahim · Pull Request #6463 · dgraph-io/dgraph · GitHub) is merged, running make install will download jemalloc and compile Dgraph with it.

I’ll try to do a longer test on my own. I didn’t see a big memory increase yesterday but I also didn’t run the cluster for a long period of time.

I ran the cluster on master using more mutations and I can reproduce the behavior. It looks like the issue is with Zero. Below is the output of the memory profile:

go tool pprof -source_path /home/martinmr/go/src/github.com/dgraph-io -trim_path / http://localhost:6180/debug/pprof/heap

Fetching profile over HTTP from http://localhost:6180/debug/pprof/heap
Saved profile in /home/martinmr/pprof/pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.032.pb.gz
File: dgraph
Build ID: df33deaa2cbf50e81c6b75cafb82e0beea5e4429
Type: inuse_space
Time: Sep 16, 2020 at 1:02pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2014.61MB, 99.54% of 2023.94MB total
Dropped 82 nodes (cum <= 10.12MB)
Showing top 10 nodes out of 18
      flat  flat%   sum%        cum   cum%
 1931.40MB 95.43% 95.43%  1931.40MB 95.43%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit
   83.20MB  4.11% 99.54%    83.20MB  4.11%  github.com/dgraph-io/badger/v2/skl.newArena
         0     0% 99.54%    83.20MB  4.11%  github.com/dgraph-io/badger/v2.Open
         0     0% 99.54%    83.20MB  4.11%  github.com/dgraph-io/badger/v2/skl.NewSkiplist
         0     0% 99.54%    83.20MB  4.11%  github.com/dgraph-io/dgraph/dgraph/cmd.Execute
         0     0% 99.54%  1935.45MB 95.63%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Server).CommitOrAbort
         0     0% 99.54%  1935.45MB 95.63%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Server).commit
         0     0% 99.54%    83.20MB  4.11%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
         0     0% 99.54%    83.20MB  4.11%  github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
         0     0% 99.54%  1935.45MB 95.63%  github.com/dgraph-io/dgraph/protos/pb._Zero_CommitOrAbort_Handler
(pprof) list commit
Total: 1.98GB
ROUTINE ======================== github.com/dgraph-io/dgraph/dgraph/cmd/zero.(*Oracle).commit in dgraph/dgraph/cmd/zero/oracle.go
    1.89GB     1.89GB (flat, cum) 95.43% of Total
         .          .    128:           ki, err := strconv.ParseUint(k, 36, 64)
         .          .    129:           if err != nil {
         .          .    130:                   glog.Errorf("Got error while parsing conflict key %q: %v\n", k, err)
         .          .    131:                   continue
         .          .    132:           }
    1.89GB     1.89GB    133:           o.keyCommit[ki] = src.CommitTs // CommitTs is handed out before calling this func.
         .          .    134:   }
         .          .    135:   return nil
         .          .    136:}
         .          .    137:
         .          .    138:func (o *Oracle) currentState() *pb.OracleDelta {

The keyCommit map seems to grow without bound. @mrjn, I believe you have already noticed this issue before. Any ideas on how to deal with it?

@mbande Feel free to try to reproduce the issue on your own but I think we have enough info on our side. I believe this is an issue regardless of the size of the transactions but smaller transactions might just mask this issue better.

So I looked into how this map is populated. It’s populated with each transaction and it’s purged when a snapshot happens. Snapshots are controlled via the --snapshot_after flag which takes the number of transactions at which to take a snapshot. When your mutations are very big, this map fills up faster and takes more time to be purged, which is what @mbande observed.

So there are two workarounds:

  • Have smaller mutations. Smaller mutations are recommended anyway, as they reduce the chances of conflicts.
  • Change --snapshot_after to a smaller value so that the map is cleared more often (see the example below).
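
For example, the second workaround just means appending the flag to the Alpha command in the compose file shown earlier; the value 1000 here is only an illustration, pick whatever matches your write rate:

    command: dgraph alpha --my=dgraph-2:7080 --lru_mb=8192 --zero=dgraph-1:5080,dgraph-2:5080,dgraph-3:5080 --snapshot_after=1000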

@martinmr The documented default value for --snapshot_after is 5000 transactions, but I have done many more GraphQL mutations than that with no decrease in memory usage. Is this expected?

We could implement an mmap-based balanced binary search tree instead of using a Go map. I’m doing something similar for xidmap, implementing a trie in my branch mrjn/trie.

@martinmr: Would be a fun project for you to pick up – brush up on your CS fundamentals.

I upgraded to the master version and started it with the default options, with no substantial change in the memory usage pattern:
Screenshot from 2020-09-21 10-03-21