Dgraph alpha node running out of memory

Report a Dgraph Bug

What version of Dgraph are you using?


We are using v20.03.4 in production, but we have also tried the master-branch Docker image as well as the newer v20.03 releases.

Have you tried reproducing the issue with the latest release?


Yes

What is the hardware spec (RAM, OS)?


In production, where the issue was observed, we were running 3 Dgraph alpha nodes and 1 zero node, each with 256 GB of memory, on CentOS 7.

Steps to reproduce the issue (command/config used to run Dgraph).


dgraph version information:

Dgraph version   : v2.0.0-rc1-726-g0c7f9a426
Dgraph codename  :
Dgraph SHA-256   : be551d773bca16181fb117444c2153ac1651bf0460577b4b321f0f5f70b40867
Commit SHA-1     : 0c7f9a426
Commit timestamp : 2020-09-17 11:27:36 -0700
Branch           : master
Go version       : go1.14.1
jemalloc enabled : false


docker-compose.yml file:

version: "3.2"
services:
  zero:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    container_name: dg_zero
    logging:
      options:
        max-size: ${DOCKER_MAX_LOG_SIZE}
    volumes:
      - type: bind
        source: /mnt/tmp/dgraph
        target: /dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - net3
    restart: on-failure
    command: dgraph zero --my=zero:5080 --rebalance_interval ${DGRAPH_REBALANCE_INTERVAL} --v ${DGRAPH_LOG_LEVEL}
  alpha1:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    container_name: dg_alpha1
    logging:
      options:
        max-size: ${DOCKER_MAX_LOG_SIZE}
    volumes:
      - type: bind
        source: /mnt/tmp/dgraph
        target: /dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - net3
    restart: on-failure
    command: dgraph alpha --my=alpha1:7080 --lru_mb=${DGRAPH_LRU_MB} --zero=zero:5080 --idx=1 -p p1 -w w1 --v ${DGRAPH_LOG_LEVEL} --whitelist 0.0.0.0/0
  alpha2:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    container_name: dg_alpha2
    logging:
      options:
        max-size: ${DOCKER_MAX_LOG_SIZE}
    volumes:
      - type: bind
        source: ${DGRAPH_DATA_PATH}
        target: /dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - net3
    restart: on-failure
    command: dgraph alpha --my=alpha2:7081 --lru_mb=${DGRAPH_LRU_MB} --zero=zero:5080 -o 1 --idx=2 -p p2 -w w2 --v ${DGRAPH_LOG_LEVEL} --whitelist 0.0.0.0/0
  alpha3:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    container_name: dg_alpha3
    logging:
      options:
        max-size: ${DOCKER_MAX_LOG_SIZE}
    volumes:
      - type: bind
        source: ${DGRAPH_DATA_PATH}
        target: /dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - net3
    restart: on-failure
    command: dgraph alpha --my=alpha3:7082 --lru_mb=${DGRAPH_LRU_MB} --zero=zero:5080 -o 2 --idx=3 -p p3 -w w3 --v ${DGRAPH_LOG_LEVEL} --whitelist 0.0.0.0/0
  ratel:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    container_name: dg_ratel
    logging:
      options:
        max-size: ${DOCKER_MAX_LOG_SIZE}
    volumes:
      - type: bind
        source: ${DGRAPH_DATA_PATH}
        target: /dgraph
    ports:
      - 18000:8000
    networks:
      - net3
    command: dgraph-ratel

networks:
  net3:

.env file:

DOCKER_MAX_LOG_SIZE=50m

DGRAPH_VERSION=master
DGRAPH_LRU_MB=2048
DGRAPH_REBALANCE_INTERVAL=8m
DGRAPH_LOG_LEVEL=2
DGRAPH_DATA_PATH=/mnt/tmp/dgraph

Expected behavior and actual result.


We were expecting to be able to generate a large graph and then continually mutate and query it. In production the graph was created using the bulk loader, and the 3 alpha nodes were initialized from that data. Once the alpha nodes were started, we expected to be able to mutate and query the graph without issue. We created a stress test to reproduce the memory issues we were observing in production.

Using the docker-compose.yml file and associated .env file above, along with our dgraph-stress-test code (GitHub - JoelWesleyReed/dgraph-stress-test: Test dgraph scalability by loading synthetic data.), to generate a large graph eventually results in one of the alpha nodes being killed by the OS due to excessive memory usage. We were able to capture a pprof heap file shortly before the OS killed the process; it is included below. If the stress test is stopped and Dgraph is restarted, alpha memory utilization is initially low, but as queries are executed via Ratel it rises again until Dgraph becomes unresponsive and an alpha is eventually killed by the OS.


In our production stack each alpha node has 256 GB of memory available, and Dgraph seems to be consuming all of it even though the heap profile does not report more than ~100 GB of usage. From our Prometheus/Grafana data we can see that go_memstats_heap_sys_bytes reports ~214 GB right before the alpha node crashes. Observing memory through htop and the process_resident_memory_bytes metric, RES does not line up with the heap profile: it is much higher (~221 GB) and consumes most of the system's resources. We have been unable to determine what is causing so much memory to be used during mutations. Each mutation has ~1000 quads, many of which reference an upsert query included in the transaction.
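
For reference, a minimal sketch of how a heap profile like the one below can be pulled from a running alpha (Dgraph alpha serves Go's pprof endpoints on its HTTP port; the address assumes alpha1's 8080 port mapping from the compose file above, and the output filename is arbitrary):

package main

import (
	"io"
	"net/http"
	"os"
)

// Fetches a heap profile from the alpha's pprof endpoint and writes it to a
// local file for later inspection with `go tool pprof`.
func main() {
	resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("alpha1.heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
	// Inspect afterwards with: go tool pprof alpha1.heap.pb.gz
}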


pprof top20 and list:

(pprof) top20
Showing nodes accounting for 78801.12MB, 99.20% of 79436.92MB total
Dropped 170 nodes (cum <= 397.18MB)
Showing top 20 nodes out of 24
      flat  flat%   sum%        cum   cum%
75612.27MB 95.19% 95.19% 75612.27MB 95.19%  github.com/dgraph-io/dgraph/posting.(*List).Uids.func1
 1431.72MB  1.80% 96.99%  2181.75MB  2.75%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
  750.03MB  0.94% 97.93%   750.03MB  0.94%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
  733.22MB  0.92% 98.85%   733.22MB  0.92%  github.com/dgraph-io/dgraph/protos/pb.(*UidBlock).Unmarshal
  240.38MB   0.3% 99.16%   973.60MB  1.23%  github.com/dgraph-io/dgraph/protos/pb.(*UidPack).Unmarshal
   32.50MB 0.041% 99.20%  1452.66MB  1.83%  github.com/dgraph-io/dgraph/posting.(*pIterator).next
    0.50MB 0.00063% 99.20%   974.10MB  1.23%  github.com/dgraph-io/dgraph/protos/pb.(*PostingList).Unmarshal
    0.50MB 0.00063% 99.20%  2182.79MB  2.75%  github.com/dgraph-io/badger/v2/table.OpenTable
         0     0% 99.20%   974.10MB  1.23%  github.com/dgraph-io/badger/v2.(*Item).Value
         0     0% 99.20%  2182.79MB  2.75%  github.com/dgraph-io/badger/v2.newLevelsController.func1
         0     0% 99.20%  2181.75MB  2.75%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).XXX_Unmarshal
         0     0% 99.20%  2181.75MB  2.75%  github.com/dgraph-io/badger/v2/table.(*Table).initBiggestAndSmallest
         0     0% 99.20%  2181.75MB  2.75%  github.com/dgraph-io/badger/v2/table.(*Table).initIndex
         0     0% 99.20%  2181.75MB  2.75%  github.com/dgraph-io/badger/v2/table.(*Table).readTableIndex
         0     0% 99.20% 77064.92MB 97.01%  github.com/dgraph-io/dgraph/posting.(*List).Uids
         0     0% 99.20% 77064.92MB 97.01%  github.com/dgraph-io/dgraph/posting.(*List).iterate
         0     0% 99.20%   974.60MB  1.23%  github.com/dgraph-io/dgraph/posting.(*List).readListPart
         0     0% 99.20%   989.64MB  1.25%  github.com/dgraph-io/dgraph/posting.(*pIterator).moveToNextPart
         0     0% 99.20%   989.64MB  1.25%  github.com/dgraph-io/dgraph/posting.(*pIterator).moveToNextValidPart
         0     0% 99.20%   974.10MB  1.23%  github.com/dgraph-io/dgraph/posting.unmarshalOrCopy
(pprof) list Uids.func1
Total: 77.58GB
ROUTINE ======================== github.com/dgraph-io/dgraph/posting.(*List).Uids.func1 in dgraph/posting/list.go
   73.84GB    73.84GB (flat, cum) 95.19% of Total
         .          .   1108:		return out, nil
         .          .   1109:	}
         .          .   1110:
         .          .   1111:	err := l.iterate(opt.ReadTs, opt.AfterUid, func(p *pb.Posting) error {
         .          .   1112:		if p.PostingType == pb.Posting_REF {
   73.84GB    73.84GB   1113:			res = append(res, p.Uid)
         .          .   1114:		}
         .          .   1115:		return nil
         .          .   1116:	})
         .          .   1117:	l.RUnlock()
         .          .   1118:	if err != nil {
(pprof) quit

Thanks for the detailed report.

From the memory profile, it looks like the issue is with the query stage, not the mutation stage. What's the query you are using for the upserts? Queries that are too general will generate a lot of results. I took a quick look at the code and the query seemed to be using dgraph.type, which is not a good choice if there are a lot of nodes with the same type. Is that correct? If there are a lot of nodes to query, appending the results will use a lot of memory. Is it possible to change the upsert query to a more specific one?

We have been addressing memory issues in Dgraph and we’ll take a look at this one. But a smarter/smaller upsert query should improve things.

The discrepancies between what Go reports and what the OS reports are a known issue. The Go runtime holds on to a lot of memory even after it's no longer needed, presumably so it can reuse it instead of asking the OS for more memory later. Unfortunately there aren't many options or flags we can use to change that behavior.
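
As a rough illustration (this is not Dgraph code), the counters involved look like this in Go: HeapSys is what go_memstats_heap_sys_bytes exports, and debug.FreeOSMemory is about the only programmatic way to ask the runtime to hand memory back to the OS early.

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	// HeapSys is heap memory the runtime has obtained from the OS (exported
	// as go_memstats_heap_sys_bytes); HeapInuse is what currently holds
	// objects. The gap between them is memory Go keeps around for reuse,
	// minus HeapReleased, which has already been returned to the OS.
	fmt.Printf("HeapSys=%dMB HeapInuse=%dMB HeapReleased=%dMB\n",
		ms.HeapSys>>20, ms.HeapInuse>>20, ms.HeapReleased>>20)

	// FreeOSMemory forces a GC and asks the runtime to return as much memory
	// to the OS as possible; whether RES actually drops is up to the OS.
	debug.FreeOSMemory()
}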


Thanks for your help.

Below is a small example of what is included in each Dgraph transaction. It has 5 upsert queries, whereas the default full-size transaction would have 51 similar upsert queries. The upsert queries are used to ensure that a node with a given type and name attribute is unique; each upsert query should return either 0 or 1 result. I have also included the schema below in case it is of interest.

Example transaction (upserts and quads):

# Upsert Query
query {
	qu0(func: eq(name, "Node-0.1")) @filter(type(Node1)) {
		upsert_id_2 as uid
	}
	qu1(func: eq(name, "Node-0.3")) @filter(type(Node3)) {
		upsert_id_4 as uid
	}
	qu2(func: eq(name, "Node-1.0")) @filter(type(Node0)) {
		upsert_id_1 as uid
	}
	qu3(func: eq(name, "Node-0.2")) @filter(type(Node2)) {
		upsert_id_3 as uid
	}
	qu4(func: eq(name, "Node-0.0")) @filter(type(Node0)) {
		upsert_id_0 as uid
	}
}

# Set Quads
uid(upsert_id_0) <dgraph.type> "Node0" .
uid(upsert_id_0) <name> "Node-0.0" .
uid(upsert_id_0) <pred0> "pppppppppppppppppppp" .
uid(upsert_id_0) <pred1> "LLLLLLLLLLLLLLLLLLLL" .
uid(upsert_id_1) <dgraph.type> "Node0" .
uid(upsert_id_1) <name> "Node-1.0" .
uid(upsert_id_0) <NEXT> uid(upsert_id_1) .
uid(upsert_id_0) <LINK0> uid(upsert_id_0) .
uid(upsert_id_0) <LINK1> uid(upsert_id_2) .
uid(upsert_id_0) <LINK2> uid(upsert_id_3) .
uid(upsert_id_0) <LINK3> uid(upsert_id_4) .
uid(upsert_id_2) <dgraph.type> "Node1" .
uid(upsert_id_2) <name> "Node-0.1" .
uid(upsert_id_2) <pred0> "nnnnnnnnnnnnnnnnnnnn" .
uid(upsert_id_2) <pred1> "ffffffffffffffffffff" .
uid(upsert_id_2) <LINK0> uid(upsert_id_0) .
uid(upsert_id_2) <LINK1> uid(upsert_id_2) .
uid(upsert_id_2) <LINK2> uid(upsert_id_3) .
uid(upsert_id_2) <LINK3> uid(upsert_id_4) .
uid(upsert_id_3) <dgraph.type> "Node2" .
uid(upsert_id_3) <name> "Node-0.2" .
uid(upsert_id_3) <pred0> "gggggggggggggggggggg" .
uid(upsert_id_3) <pred1> "DDDDDDDDDDDDDDDDDDDD" .
uid(upsert_id_3) <LINK0> uid(upsert_id_0) .
uid(upsert_id_3) <LINK1> uid(upsert_id_2) .
uid(upsert_id_3) <LINK2> uid(upsert_id_3) .
uid(upsert_id_3) <LINK3> uid(upsert_id_4) .
uid(upsert_id_4) <dgraph.type> "Node3" .
uid(upsert_id_4) <name> "Node-0.3" .
uid(upsert_id_4) <pred0> "ssssssssssssssssssss" .
uid(upsert_id_4) <pred1> "cccccccccccccccccccc" .
uid(upsert_id_4) <LINK0> uid(upsert_id_0) .
uid(upsert_id_4) <LINK1> uid(upsert_id_2) .
uid(upsert_id_4) <LINK2> uid(upsert_id_3) .
uid(upsert_id_4) <LINK3> uid(upsert_id_4) .

Schema:

type Node0 {
	name
	pred0
	pred1
	NEXT
	LINK0
	LINK1
	LINK2
	LINK3
}

type Node1 {
	name
	pred0
	pred1
	NEXT
	LINK0
	LINK1
	LINK2
	LINK3
}

type Node2 {
	name
	pred0
	pred1
	NEXT
	LINK0
	LINK1
	LINK2
	LINK3
}

type Node3 {
	name
	pred0
	pred1
	NEXT
	LINK0
	LINK1
	LINK2
	LINK3
}

name: string @index(term) .
pred0: string @index(hash) .
pred1: string @index(hash) .
NEXT: [uid] .
LINK0: [uid] .
LINK1: [uid] .
LINK2: [uid] .
LINK3: [uid] .

Yeah, this is definitely a really big transaction. Not only does each query have to traverse the whole name index to find the right value, we then also have to do the filtering. And you have to do this 51 times.

Right now I am unsure whether the filter will try to read all the nodes of the given type and then intersect them with the results from the eq query, or just do the filtering on the results from the eq query. If the first option is what's happening, then we can certainly optimize it to avoid querying all the nodes of a certain type. @pawan, do you know what Dgraph does for a query like this?

Ideally upsert queries are meant to be small, otherwise the mutations will take a while and since mutations have to be applied in linearized fashion, you’d end up slowing down the whole cluster. Even if we fix the OOM issues and optimize the queries, this kind of logic would be much faster if it was implemented in your application. If you have a map of node names to Dgraph uids then you would not have to run the upsert queries and can use mutations directly. Right now, you have to get the uid every time you do a mutation and end up repeating a lot of work. With something like a hash map in your application you’d only have to run the work to find the right uid once, and can retrieve the uid in constant time for subsequent mutations.
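
To make that concrete, here is a minimal sketch of such an application-side cache, assuming Go and the dgo v2 client; the predicate names mirror the stress-test schema, and the create path is simplified (no retry or conflict handling):

package main

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

// uidCache remembers the Dgraph uid for each node name so an upsert query is
// never needed once a name has been seen.
type uidCache struct {
	dg     *dgo.Dgraph
	byName map[string]string
}

// getOrCreate returns the cached uid for name, creating the node on a miss.
func (c *uidCache) getOrCreate(ctx context.Context, name, typ string) (string, error) {
	if uid, ok := c.byName[name]; ok {
		return uid, nil // constant-time lookup, no query against the cluster
	}
	nq := fmt.Sprintf("_:n <name> %q .\n_:n <dgraph.type> %q .", name, typ)
	resp, err := c.dg.NewTxn().Mutate(ctx, &api.Mutation{
		SetNquads: []byte(nq),
		CommitNow: true,
	})
	if err != nil {
		return "", err
	}
	uid := resp.Uids["n"] // uid assigned to the blank node _:n
	c.byName[name] = uid
	return uid, nil
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	cache := &uidCache{
		dg:     dgo.NewDgraphClient(api.NewDgraphClient(conn)),
		byName: map[string]string{},
	}
	uid, err := cache.getOrCreate(context.Background(), "Node-0.0", "Node0")
	if err != nil {
		panic(err)
	}
	fmt.Println("Node-0.0 ->", uid)
}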

I think it will read the index key for filtering here. That is because of the code at dgraph/worker/task.go at master · dgraph-io/dgraph · GitHub. This happens because the number of tokens (which in this case is just 1) is equal to the number of uids (which would also just be 1 here, for Node-0.1). I agree we shouldn't be reading the index key here and should just fetch the value and filter on it.

OK. What should the optimization be, then? I am thinking that we should always look at the data key if the number of uids is one (or below some other low threshold). This would cover a lot of cases, since eq queries are often used to fetch a single uid. I think the problem here is that the dgraph.type index has a lot of uids when the type is shared by a lot of nodes in the graph. For illustration only, and not Dgraph's actual code, the heuristic I have in mind is sketched below.
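
package heuristic

// maxUidsForDataKey is a hypothetical threshold: one uid, or some other low
// number, as described above.
const maxUidsForDataKey = 1

// useDataKeys reports whether, for a query like eq(name, ...) @filter(type(...)),
// it would be cheaper to check the type on each candidate uid's data key than
// to intersect with the (potentially huge) dgraph.type index.
func useDataKeys(numCandidateUids int) bool {
	return numCandidateUids <= maxUidsForDataKey
}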

@pawan, what do you think?