In our case, handling about 6M nodes requires 30GB of memory on each of the 3 Alphas running in the cluster. This is too much, and it keeps increasing as the amount of data grows.
As shown in the picture below, the peak was at about 6M nodes in the database. Memory dropped when I deleted data from the database, which shows that memory usage keeps climbing as more data is added.
Is there any possibility that someone could take a look at this and do memory profiling or testing in your test environment?
This is a show stopper for us, as we will have far more data than 6M nodes, and needing 200GB of RAM to handle only 20M nodes would be too expensive and make no sense.
In addition, this might be useful: I noticed that mutations themselves consume memory. In the picture you will see point A, where memory dropped when I turned mutations off (so, same amount of data, same number of queries).
Point B happened when I performed a rolling restart of all Alpha nodes.
Can you try setting the environment variable GODEBUG=madvdontneed=1 when running the Dgraph binaries?
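(For reference, if you launch the Alphas from your own tooling rather than straight from a shell or a Kubernetes manifest, a minimal Go sketch of a wrapper that adds that variable to the child process environment could look like the following; the binary path and arguments are placeholders, not the actual flags you use.)

```go
// Hypothetical wrapper: launch dgraph with GODEBUG=madvdontneed=1 added to
// the inherited environment. Adjust the binary path and arguments to match
// how you actually start your Alphas.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("dgraph", "alpha") // placeholder arguments
	cmd.Env = append(os.Environ(), "GODEBUG=madvdontneed=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```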
I asked around in the Gophers performance Slack channel and was pointed to this open issue about memory not being released by the Go runtime:
Go is releasing memory to the OS, but that isn't reflected in the resident set size calculations. You can check the estimated amount of memory counted as LazyFree by looking at /proc/<pid>/smaps (a small sketch for summing those fields follows the quote below). The Linux documentation for LazyFree memory says this:
The memory isn’t freed immediately with madvise(). It’s freed in memory pressure if the memory is clean.
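As a rough illustration (not an official tool, and Linux-only), a small Go sketch like the one below sums the LazyFree fields in /proc/<pid>/smaps for a given PID, which gives an estimate of how much memory has been returned with madvise but not yet reclaimed by the kernel:

```go
// Rough, Linux-only sketch: sum the LazyFree fields in /proc/<pid>/smaps
// to estimate memory returned via madvise but not yet reclaimed.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: lazyfree <pid>")
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/smaps")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var totalKB int64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines of interest look like: "LazyFree:        1234 kB"
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "LazyFree:" {
			if kb, err := strconv.ParseInt(fields[1], 10, 64); err == nil {
				totalKB += kb
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("LazyFree total: %d kB (~%.1f MB)\n", totalKB, float64(totalKB)/1024)
}
```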
Below, you’ll see the memory charts for the same workload run against a regular dgraph alpha (blue line) and against a GODEBUG=madvdontneed=1 dgraph alpha (orange line). The process memory of the orange line goes down.
After a day of running with GODEBUG=madvdontneed=1, it looks like nothing has changed: the memory that Kubernetes reports as used is still about 6-7 GB higher than what the Alpha nodes actually use.
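For what it’s worth, the gap described here is roughly the difference between what the Go runtime counts as in use and the resident set size the kernel reports (which is roughly what Kubernetes measures). A small, generic Go sketch that prints both from inside a process (not Dgraph-specific, and Linux-only for the VmRSS part) would be:

```go
// Generic sketch: compare the Go runtime's heap accounting with the RSS the
// kernel reports for this process. HeapReleased pages may still show up in
// VmRSS until the kernel reclaims them.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"runtime"
	"strings"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapInuse:    %d MB\n", ms.HeapInuse>>20)
	fmt.Printf("HeapIdle:     %d MB\n", ms.HeapIdle>>20)
	fmt.Printf("HeapReleased: %d MB\n", ms.HeapReleased>>20)

	f, err := os.Open("/proc/self/status")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			fmt.Println(sc.Text()) // e.g. "VmRSS:    123456 kB"
		}
	}
}
```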
With v1.1.1 it looked like it was behaving correctly (not perfectly, but correctly). For instance, you had to be careful about which queries you ran, because a heavy query could really kill a node; Dgraph does not have any mechanism to block or stop a query that takes a lot of memory.
After we upgraded to v1.2.0, we started facing a new issue. It looks like at certain intervals Dgraph does something in the background that takes a lot of memory, and our nodes simply hit their limits, which again kills the nodes.