Preventing OOM on alpha when doing large queries

I have an alpha instance running in Docker with 41 GB available. I notice that after running a few large queries, memory usage keeps ticking up. After about an hour it rises to 35 GB+ and never goes back down. Is there a way to release this memory? This usage is what docker stats reports, by the way. Eventually the process gets killed for OOM.

If you can run the heap profiler when it is at 35 GB, that would help identify what the issue might be. Also, which version of Dgraph are you using?
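In case it helps, here is a minimal sketch of grabbing that heap profile over HTTP and saving it to a file you can attach here. It assumes the alpha's HTTP port is the default 8080 and that the Go pprof endpoints are reachable under /debug/pprof/; adjust the address if your setup differs. The same profile can also be pulled directly with go tool pprof http://localhost:8080/debug/pprof/heap.

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetch the heap profile from a running alpha and save it to disk so it
// can be shared or inspected with `go tool pprof pprof.heap.pb.gz`.
func main() {
	resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("pprof.heap.pb.gz")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote pprof.heap.pb.gz")
}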

I’m using v20.03.1

This is at 22.43 GB (docker stats).

From docker stats:

Hi @nodeworks,

As per the profile, we are seeing too many SST files at level 0. Would it be possible for you to share the alpha logs with us?

alpha_logs.txt (40.5 KB)

zero_logs.txt (4.5 KB)

debug_vars.txt (11.2 KB)

Here is some more detailed info from Dynatrace. It's almost at 90% memory usage and not coming back down at all.

And then lastly, the Go-managed memory for the dgraph process:


Hey @nodeworks, thanks for providing the logs and the information above. From the profile, I can see that most of the memory is taken by SSTables.

         .          .    222:
         .          .    223:	switch opts.LoadingMode {
         .          .    224:	case options.LoadToRAM:
         .          .    225:		if _, err := t.fd.Seek(0, io.SeekStart); err != nil {
         .          .    226:			return nil, err
         .          .    227:		}
   43.66GB    43.66GB    228:		t.mmap = make([]byte, t.tableSize)
         .          .    229:		n, err := t.fd.Read(t.mmap)
         .          .    230:		if err != nil {
         .          .    231:			// It's OK to ignore fd.Close() error because we have only read from the file.
         .          .    232:			_ = t.fd.Close()
         .          .    233:			return nil, y.Wrapf(err, "Failed to load file into RAM")
         .          .    234:		}
         .          .    235:		if n != t.tableSize {
         .          .    236:			return nil, errors.Errorf("Failed to read all bytes from the file."+
         .          .    237:				"Bytes in file: %d Bytes actually Read: %d", t.tableSize, n)
         .          .    238:		}
         .          .    239:	case options.MemoryMap:
         .          .    240:		t.mmap, err = y.Mmap(fd, false, fileInfo.Size())
         .          .    241:		if err != nil {
         .          .    242:			_ = fd.Close()
         .          .    243:			return nil, y.Wrapf(err, "Unable to map file: %q", fileInfo.Name())
         .          .    244:		}

Since we open the raft WAL (the w directory) badger tables with the LoadToRAM option, I think the memory is coming from there. Can you tell us your w directory size?
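For context on what LoadToRAM means here: it is one of badger's table-loading modes. The sketch below is purely illustrative (it assumes badger v2's options API and a placeholder ./w path; it is not how Dgraph itself wires this up), but it shows where that option is set when opening a badger DB. LoadToRAM copies every SST fully into heap memory, which is the allocation visible in the profile above, while MemoryMap lets the OS page the table files in and out instead.

package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	// LoadToRAM copies every SST fully into the Go heap; MemoryMap mmaps
	// the files instead, so the OS can page them in and out as needed.
	opts := badger.DefaultOptions("./w").
		WithTableLoadingMode(options.MemoryMap) // the profile above shows the LoadToRAM path
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatalf("opening badger: %v", err)
	}
	defer db.Close()
}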
It would also be great if you could run dgraph debug on your w directory; I want to see your snapshot index and last index. You can run the debug tool using the command below:

dgraph debug -w <path to w directory>

You should see something like the output below:

rids: map[1:true]
gids: map[1:true]
Iterating with Raft Id = 1 Groupd Id = 1

Snapshot Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:1234875 Term:2 XXX_unrecognized:[]}
Snapshot Alpha: {Context:id:1 group:1 addr:"localhost:7080"  Index:1234875 ReadTs:1645291 Done:false SinceTs:0 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Hardstate: {Term:4 Vote:1 Commit:1244600 XXX_unrecognized:[]}
Checkpoint: 1244600
Last Index: 1244600 . Num Entries: 9725 .

I have the following directory structure and sizes:
out/0/p – 281MB
w - 340K
zw - 112K

Running the dgraph debug -w w/ -p out/0/p/ command, I get this log file:

Hey @nodeworks, dgraph debug cannot be run against a running Dgraph instance.

Your w directory size is very small. The profile you shared suggests that most of the RAM is taken by SSTs, and I had thought it was taken by the w directory's badger SST tables.

Let me think about what else could be causing this.

pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz (52.3 KB)

Thank you @nodeworks for discussing the above issue over the call.

Somehow I mixed up the profile you shared earlier with some other profile and drew the wrong conclusions.

Just summarizing the things we discussed on the call:

  • The usage reported by the heap profile is very different from what docker stats reports. We also saw in Grafana that the in-use heap is much smaller than RES (which is not decreasing). This is something we have seen in the past: the Go runtime is not very aggressive about returning freed memory to the OS (see the sketch after this list). We are continuously working to improve Dgraph's memory usage.
  • The queries you ran had aggregations, and those might be reading more data and driving memory usage up.
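To make that first point concrete, here is a small, self-contained sketch (illustrative only, not Dgraph code) of the gap between in-use heap and resident memory: it prints the Go runtime's own accounting and then forces idle pages back to the OS with debug.FreeOSMemory.

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// printMem reports the runtime's own heap accounting: HeapInuse is roughly
// what a heap profile shows, HeapIdle is freed-but-retained memory, and
// HeapReleased is what has actually been handed back to the OS.
func printMem(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-20s HeapInuse=%dMB HeapIdle=%dMB HeapReleased=%dMB Sys=%dMB\n",
		label, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
}

func main() {
	// Allocate roughly 1 GiB, then drop the only reference to it.
	buf := make([][]byte, 1024)
	for i := range buf {
		buf[i] = make([]byte, 1<<20)
	}
	printMem("after alloc")

	buf = nil // let the GC reclaim the slices
	runtime.GC()
	printMem("after GC") // HeapInuse drops, but most memory is merely idle

	debug.FreeOSMemory() // force the runtime to return idle pages to the OS
	printMem("after FreeOSMemory")
}

Until that release happens (the runtime does it lazily in the background), RES as shown by docker stats can stay high even though the heap profile reports far less in-use memory.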

I will look at the new profile you shared and see if we can optimise some things there.