[K8s / DevOps] Need to investigate OOMs of pods in the Helm chart setup

Moved from GitHub dgraph/4384

Posted by prashant-shahi:

What version of Dgraph are you using?

master and v1.1.0

What is the hardware spec (RAM, OS)?

3-node GKE cluster
45 GB memory

Steps to reproduce the issue (command/config used to run Dgraph).

Go to contrib/config/kubernetes/helm.

cd contrib/config/kubernetes/helm
helm install my-release ./ --set alpha.service.type="LoadBalancer"

Run JS Flock and point its endpoint at the Dgraph Alpha LoadBalancer IP.
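For reference, a minimal sketch of fetching the Alpha endpoint for Flock. The service name my-release-dgraph-alpha is an assumption; check kubectl get svc for the actual name generated by the release, and pass the IP to Flock however its README specifies.

# Hypothetical service name; the real one comes from the Helm release.
ALPHA_IP=$(kubectl get svc my-release-dgraph-alpha \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Point Flock at ${ALPHA_IP}:9080"   # 9080 is Alpha's default gRPC port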

Expected behaviour and actual result.

Pods get OOM-killed repeatedly after the JS Flock app has been running for a while.
The chart does not set any CPU or memory limits on the pods. It is worth investigating what is causing the memory growth.
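As a starting point for the investigation, a rough sketch of confirming the OOM kills and seeing which pods are consuming the memory. Pod and node names below are placeholders, and kubectl top requires metrics-server or GKE monitoring to be enabled.

kubectl get pods                          # high restart counts hint at repeated OOM kills
kubectl describe pod my-release-dgraph-alpha-0 | grep -A 3 "Last State"   # look for Reason: OOMKilled
kubectl top pods                          # current CPU and memory per pod
kubectl describe node <node-name> | grep -A 10 "Allocated resources"      # node-level pressure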

hackintoshrao commented :

@prashant-shahi : Can you add the monitoring metrics too?

prashant-shahi commented :

@hackintoshrao Sure.

Here are some of the metrics from the GKE dashboard.

I have also been collecting memory profiles periodically. Ping me if anyone needs them.
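For anyone who wants to collect the same profiles, a minimal sketch, assuming the Alpha pod is named my-release-dgraph-alpha-0 and that Alpha exposes the Go pprof handlers on its HTTP port (8080 by default):

# Forward the Alpha HTTP port locally.
kubectl port-forward pod/my-release-dgraph-alpha-0 8080:8080 &
# Save a heap snapshot with a timestamp so profiles can be compared over time.
curl -s http://localhost:8080/debug/pprof/heap -o heap-$(date +%s).pprof
# Inspect the biggest in-use allocations.
go tool pprof -top heap-<timestamp>.pprof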

For the Helm charts, we definitely want to set resource limits on the pods; otherwise a pod can consume enough memory to disrupt node services like the kubelet and taint the node so that other pods cannot be scheduled on it.
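As an illustration, one way limits could be applied at install or upgrade time, assuming the chart exposes a standard resources block under the alpha values; the exact value paths and sensible numbers depend on the chart's values.yaml and the cluster:

helm upgrade my-release ./ \
  --set alpha.service.type="LoadBalancer" \
  --set alpha.resources.requests.memory=8Gi \
  --set alpha.resources.limits.memory=32Gi \
  --set alpha.resources.limits.cpu=4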

I also wonder about the behavior when using NEGs (container-native load balancing) instead of the LoadBalancer service type. This can be configured through service annotations.
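A rough sketch of what that annotation could look like on GKE, assuming the same hypothetical service name as above and a standalone NEG on the Alpha HTTP port; whether the chart lets you pass service annotations through its values is something to check:

# cloud.google.com/neg is GKE's annotation for container-native load balancing.
kubectl annotate service my-release-dgraph-alpha \
  cloud.google.com/neg='{"exposed_ports": {"8080": {}}}'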