Context
Several customers and users reported that Dgraph was still holding memory even when their cluster was idle. The issue appears to be related to the Go runtime not releasing that memory to the operating system, because it may try to reuse it.
This is not an issue with Dgraph itself. This is a known and common question/issue in the Go community:
Purpose of the tests
We wanted to test how the Go runtime releases memory when the environment variable GODEBUG=madvdontneed=1 is set, and whether it has any downside for Dgraph. Based on the results and comments, we will decide whether to advise our users to set this environment variable.
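When running a comparison like this, it helps to confirm that the variable actually reached the process. The following is a minimal sketch, not anything Dgraph ships: `godebugValue` is a hypothetical helper that parses the comma-separated `name=value` format of GODEBUG. Letting the last occurrence of a key win is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// godebugValue returns the value of key in a GODEBUG-style string
// ("name=value,name=value") and whether the key was present.
// Hypothetical helper for illustration; this sketch lets the last
// occurrence of a key win.
func godebugValue(godebug, key string) (string, bool) {
	value, found := "", false
	for _, pair := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(pair), "=", 2)
		if len(parts) == 2 && parts[0] == key {
			value, found = parts[1], true
		}
	}
	return value, found
}

func main() {
	v, ok := godebugValue(os.Getenv("GODEBUG"), "madvdontneed")
	if !ok {
		fmt.Println("madvdontneed is not set; the runtime default applies")
		return
	}
	fmt.Printf("madvdontneed=%s\n", v)
}
```

Running it as `GODEBUG=madvdontneed=1 go run main.go` should report the value that was passed in.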
Test setup
The test was performed today (2020-06-22) using Dgraph compiled from the master branch:
Dgraph version : v2.0.0-rc1-425-gad0914a0c
Dgraph SHA-256 : 13bd1272550ca1d20d8126ca2f8058d4b7c958a22eadcbf330b6df35b5f531d5
Commit SHA-1 : ad0914a0c
Commit timestamp : 2020-06-22 13:20:38 +0530
Branch : master
Go version : go1.14.4
Create a cluster
Bulk load a large dataset
Take backups (alpha uses enough memory during backups for our test case)
Test1: GODEBUG=madvdontneed=1 (turned on)
This test is divided into two parts:
Before the red line, the chart shows alpha memory usage with backups taken every 4 minutes; after the red line, it shows memory usage with backups taken every 10 minutes.
Test2: GODEBUG=madvdontneed=0 (turned off)
This test was run for a shorter period, as the results were already visible after a few backups.
Results:
We should be careful about this change, since it could have significant performance implications. In theory, madvdontneed should cause a lot of page faults. @omar Do you have any performance metrics? How long did the backup process take with the flag enabled and with it disabled?
Let me explain a bit about what the flag you changed does.
Madvise - The memory usage advisor
The madvise(2) system call (see the Linux manual page) is used to advise the kernel about how it should treat the memory used by a process (the Go runtime in this case).
The golang runtime uses MADV_FREE by default
This means the GC marks freed memory spans as FREE, but it is up to the kernel to decide whether to reclaim that memory. The kernel will take the memory back only if it needs it. So if I’m running only Dgraph on a 64 GB machine that has 30 GB of idle memory, the kernel might still not reclaim it, since no other process is requesting memory. This is done to improve performance: the kernel can take the memory away when it needs it, while the process (the Go runtime) can hold on to it until someone else needs it.
The change you made, GODEBUG=madvdontneed=1 (turned on), advises the kernel to treat the marked spans as MADV_DONTNEED, which means the kernel takes these spans away immediately. If you were to look at the RSS of the process, you’d see the memory usage drop instantly. This helps when you want to free up the memory right away, but it causes page faults when the runtime needs memory again, since we gave away everything it was holding. In theory, this should affect performance by a large margin.
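To make the MADV_DONTNEED behaviour concrete, here is a small Linux-only sketch that calls madvise directly through Go’s `syscall` package instead of going through the runtime. It is not what the Go runtime does internally; `dontNeedDemo` is a hypothetical helper that dirties a private anonymous mapping, gives it back with MADV_DONTNEED, and touches it again. The second access is served by a page fault, and the kernel hands back a zero-filled page.

```go
package main

import (
	"fmt"
	"syscall"
)

// dontNeedDemo maps size bytes of private anonymous memory, writes to all
// of it, then advises the kernel with MADV_DONTNEED. It returns the first
// byte as read before and after the advice. Linux-only sketch.
func dontNeedDemo(size int) (before, after byte, err error) {
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Munmap(mem)

	// Dirty every byte so the pages are resident and count toward RSS.
	for i := range mem {
		mem[i] = 0xAB
	}
	before = mem[0]

	// Tell the kernel these pages are no longer needed.
	err = syscall.Madvise(mem, syscall.MADV_DONTNEED)
	if err != nil {
		return before, 0, err
	}

	// Touching the range again triggers a page fault; for a private
	// anonymous mapping the kernel supplies a fresh zero-filled page.
	after = mem[0]
	return before, after, nil
}

func main() {
	before, after, err := dontNeedDemo(64 << 20) // 64 MiB
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("before=%#x after=%#x\n", before, after)
}
```

The page fault on the second access is the cost being discussed here: memory that is released eagerly has to be faulted back in when the process needs it again.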
The 1st test (with GODEBUG=madvdontneed=1) has been running for 5+ hours, and the 2nd test (with GODEBUG=madvdontneed=0) has been running for 1+ hour.
If needed I can leave the test running for more time and provide charts accordingly.
If needed, I can run another test where I set GODEBUG=madvdontneed=0 and run a parallel process (not Dgraph) that requires a lot of memory. That way we can verify whether the kernel reclaims some memory from Dgraph to use it for the other process.
Did you mean GODEBUG=madvdontneed=1 (turned on)? I’m wondering if it was a typo, as the memory usage does not drop at all when the underlying environment variable is turned off (=0).
The important bit is how long the backup took. The flag will release the memory, but my concern is how much faster or slower it makes everything.
Yes, sorry, there was a typo. I’ll fix my comment. MADV_DONTNEED means freeing up the memory instantly, so GODEBUG=madvdontneed=1 will instantly free up the memory.
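A parallel process for that kind of test could look roughly like the sketch below. It is only an illustration: the `-mib` and `-hold` flags and the `allocateResident` helper are names made up for this example. It allocates memory in 1 MiB chunks and writes to every page so that the memory is actually resident, which is what puts pressure on the kernel.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
	"time"
)

const chunkSize = 1 << 20 // 1 MiB

// allocateResident allocates mib chunks of 1 MiB each and writes one byte
// per page so the kernel has to back every page with physical memory.
func allocateResident(mib int) [][]byte {
	page := os.Getpagesize()
	chunks := make([][]byte, 0, mib)
	for i := 0; i < mib; i++ {
		c := make([]byte, chunkSize)
		for off := 0; off < len(c); off += page {
			c[off] = 1
		}
		chunks = append(chunks, c)
	}
	return chunks
}

func main() {
	mib := flag.Int("mib", 64, "MiB to allocate and touch")
	hold := flag.Duration("hold", 0, "how long to keep the memory before exiting")
	flag.Parse()

	chunks := allocateResident(*mib)
	fmt.Printf("holding %d MiB for %s\n", len(chunks), *hold)
	time.Sleep(*hold)
	runtime.KeepAlive(chunks) // keep the memory reachable until here
}
```

For example, `go run hog.go -mib 30000 -hold 10m` would try to hold roughly 30 GB for ten minutes while the alpha’s RSS is being watched. Pick a size that fits the machine.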
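The two metrics in question, elapsed time and page faults, can be captured together. The sketch below is an assumption-laden illustration rather than how the tests in this thread were instrumented: `measure` and `cost` are hypothetical names, and `getrusage` only reports faults for the calling process, so it shows the metric itself and does not instrument a running alpha. The workload in `main` is a stand-in that touches 64 MiB of fresh memory.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// cost is what a piece of work cost in wall time and page faults.
type cost struct {
	Elapsed     time.Duration
	MinorFaults int64
	MajorFaults int64
}

// measure runs work and reports its duration and the page faults taken by
// this process while it ran, using getrusage(RUSAGE_SELF).
func measure(work func()) (cost, error) {
	var before, after syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &before); err != nil {
		return cost{}, err
	}
	start := time.Now()
	work()
	elapsed := time.Since(start)
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &after); err != nil {
		return cost{}, err
	}
	return cost{
		Elapsed:     elapsed,
		MinorFaults: int64(after.Minflt) - int64(before.Minflt),
		MajorFaults: int64(after.Majflt) - int64(before.Majflt),
	}, nil
}

func main() {
	// Stand-in workload: allocate 64 MiB and touch one byte every 4 KiB.
	c, err := measure(func() {
		buf := make([]byte, 64<<20)
		for i := 0; i < len(buf); i += 4096 {
			buf[i] = 1
		}
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("elapsed=%s minor=%d major=%d\n", c.Elapsed, c.MinorFaults, c.MajorFaults)
}
```

Comparing these numbers for the same workload under GODEBUG=madvdontneed=0 and GODEBUG=madvdontneed=1 would give a first indication of whether the theoretical page-fault cost shows up in practice.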
Go 1.16 will revert to MADV_DONTNEED by default (https://go-review.googlesource.com/c/go/+/267100/). So, we’ll do the same in Dgraph, starting with Dgraph v20.11 and the next patch releases (e.g., v20.07.3, v20.03.7).