Dgraph crash loop on AWS

We’re currently in the process of pivoting from GCP to AWS, and struggling to get dgraph running properly on AWS.

We have a relatively small dataset (60M rdf, but plan on expanding it to ~150M rdf after the migration). We are deploying on kubernetes, using the helm chart, with 1 dgraph group, 3 replicas. Dgraph version 20.03.1.

alpha: 40GB io1-fast-retain storage class (EBS)
zero: 4GB io1-fast-retain storage class (EBS)

We’ve been having issues with the dgraph pods crashing and then finally getting into a restart loop when loading data. We have tried loading data both with the live loader and with our internal data generation script (directly inserting to dgraph in batches of 5_000 with the py client).

The live loader reports a healthy load and then after about 40 minutes prints Killed and exits. When we use our internal loader, eventually one or more of the pods crash, and then continue to crash after they are restarted. We see the following in the logs:

zero:

I0604 23:04:00.515506      20 log.go:34] raft.node: 3 elected leader 2 at term 4
W0604 23:07:08.266729      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 257ms. Breakdown: [{disk 257ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:25:34.598864      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 288ms. Breakdown: [{disk 288ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:33:46.928453      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 418ms. Breakdown: [{disk 0s} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:35:51.495267      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 385ms. Breakdown: [{disk 385ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:37:53.830954      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 221ms. Breakdown: [{disk 221ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 00:09:01.627581      20 node.go:674] [0x3] Read index context timed out
W0605 00:10:42.361160      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 351ms. Breakdown: [{disk 351ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 00:18:57.302604      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 292ms. Breakdown: [{advance 292ms} {disk 0s} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 00:31:13.819653      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 400ms. Breakdown: [{advance 400ms} {disk 0s} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 00:38:52.627478      20 node.go:674] [0x3] Read index context timed out
W0605 00:38:55.807380      20 pool.go:254] Connection lost with dgraph-live-dgraph-alpha-0.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080. Error: rpc error: code = Unavailable desc = transport is closing
W0605 00:48:19.630822      20 node.go:674] [0x3] Read index context timed out
W0605 00:50:25.627825      20 node.go:674] [0x3] Read index context timed out
W0605 01:12:19.534424      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 524ms. Breakdown: [{proposals 524ms} {disk 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:12:20.310197      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 382ms. Breakdown: [{disk 382ms} {proposals 1ms} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:20:32.594602      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 356ms. Breakdown: [{disk 356ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:32:47.306791      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 224ms. Breakdown: [{advance 212ms} {disk 12ms} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 01:38:58.913003      20 raft.go:733] Raft.Ready took too long to process: Timer Total: 203ms. Breakdown: [{proposals 203ms} {disk 0s} {advance 0s}]. Num entries: 0. MustSync: false
I0605 01:47:36.127451      20 log.go:34] 3 is starting a new election at term 4

alpha:

E0605 18:30:03.489600      15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:09.477349      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:30:19.577293      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
I0605 18:30:28.077278      15 draft.go:1307] Found 1 old transactions. Acting to abort them.
I0605 18:30:28.088283      15 draft.go:1268] TryAbort 1 txns with start ts. Error: <nil>
I0605 18:30:28.088301      15 draft.go:1284] TryAbort: No aborts found. Quitting.
I0605 18:30:28.088308      15 draft.go:1310] Done abortOldTransactions for 1 txns. Error: <nil>
W0605 18:30:29.677443      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:30:31.490179      15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:39.777415      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:30:44.977219      15 node.go:420] Unable to send message to peer: 0x3. Error: EOF
W0605 18:30:49.877320      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:30:59.490827      15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:59.977333      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:31:10.077329      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:31:20.177292      15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:31:27.491457      15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
I0605 18:31:28.077215      15 draft.go:1307] Found 1 old transactions. Acting to abort them.
I0605 18:31:28.088069      15 draft.go:1268] TryAbort 1 txns with start ts. Error: <nil>

another alpha on restart:

Dgraph version   : v20.03.1
Dgraph SHA-256   : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1     : c201611d6
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch           : HEAD
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0605 18:30:36.903089      18 run.go:609] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I0605 18:30:36.903138      18 run.go:610] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:dgraph-live-dgraph-alpha-1.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080 ZeroAddr:[dgraph-live-dgraph-zero-0.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080] RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-06-05 18:30:36.226962605 +0000 UTC m=+0.089461880 LudicrousMode:false BadgerKeyFile:}
I0605 18:30:36.903188      18 run.go:611] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: BadgerCompressionLevel:3 WALDir:w MutationsMode:0 AuthToken: AllottedMemory:2048 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
I0605 18:30:36.907167      18 server_state.go:75] Setting Badger Compression Level: 3
I0605 18:30:36.907185      18 server_state.go:84] Setting Badger table load option: mmap
I0605 18:30:36.907191      18 server_state.go:96] Setting Badger value log load option: mmap
I0605 18:30:36.907202      18 server_state.go:141] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 18:30:38.793573      18 log.go:34] All 6 tables opened in 1.73s
I0605 18:30:38.805952      18 log.go:34] Replaying file id: 0 at offset: 957849371
I0605 18:30:49.142937      18 log.go:34] Replay took: 10.336955479s
I0605 18:30:49.145182      18 log.go:34] Replaying file id: 1 at offset: 0
I0605 18:31:20.287500      18 log.go:34] Replay took: 31.142296612s
I0605 18:31:20.293362      18 log.go:34] Replaying file id: 2 at offset: 0
I0605 18:31:48.684231      18 log.go:34] Replay took: 28.390477856s
I0605 18:31:48.685961      18 log.go:34] Replaying file id: 3 at offset: 0
I0605 18:31:58.447189      18 log.go:34] Replay took: 9.760887038s
I0605 18:31:58.450924      18 server_state.go:75] Setting Badger Compression Level: 3
I0605 18:31:58.451229      18 server_state.go:84] Setting Badger table load option: mmap
I0605 18:31:58.451471      18 server_state.go:96] Setting Badger value log load option: mmap
I0605 18:31:58.451775      18 server_state.go:160] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 18:32:04.256014      18 log.go:34] 43 tables out of 83 opened in 3.026s
I0605 18:32:07.326065      18 log.go:34] 67 tables out of 83 opened in 6.096s
I0605 18:32:09.507096      18 log.go:34] All 83 tables opened in 8.277s
I0605 18:32:09.628863      18 log.go:34] Replaying file id: 73 at offset: 0
I0605 18:32:27.423486      18 log.go:34] Replay took: 17.794596998s
I0605 18:32:27.434057      18 log.go:34] Replaying file id: 74 at offset: 0
I0605 18:32:55.694783      18 log.go:34] Replay took: 28.260207343s
I0605 18:32:55.698983      18 log.go:34] Replaying file id: 75 at offset: 0
I0605 18:33:07.481943      18 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0605 18:33:07.482013      18 log.go:34] Running for level: 0
I0605 18:33:11.063700      18 log.go:34] Replay took: 15.364686825s
I0605 18:33:11.065335      18 log.go:34] Replaying file id: 76 at offset: 0
I0605 18:33:17.924071      18 log.go:34] Replay took: 6.858697639s
I0605 18:33:17.924636      18 log.go:34] Replaying file id: 77 at offset: 0
I0605 18:33:24.894635      18 log.go:34] Replay took: 6.969981227s
I0605 18:33:24.897840      18 log.go:34] Replaying file id: 78 at offset: 0
I0605 18:33:31.694015      18 log.go:34] Replay took: 6.796152436s
I0605 18:33:31.695474      18 log.go:34] Replaying file id: 79 at offset: 0
I0605 18:33:38.523017      18 log.go:34] Replay took: 6.827319391s
I0605 18:33:38.529853      18 log.go:34] Replaying file id: 80 at offset: 0
I0605 18:33:51.051489      18 log.go:34] Replay took: 12.521373342s
I0605 18:33:51.052336      18 log.go:34] Replaying file id: 81 at offset: 0
I0605 18:33:52.044519      18 log.go:34] Replay took: 991.928658ms
I0605 18:33:54.088433      18 groups.go:107] Current Raft Id: 0x1
I0605 18:33:54.103640      18 run.go:480] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0605 18:33:54.103678      18 run.go:481] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0605 18:33:54.103709      18 run.go:512] gRPC server started.  Listening on port 9080
I0605 18:33:54.103722      18 run.go:513] HTTP server started.  Listening on port 8080
I0605 18:33:54.103746      18 worker.go:96] Worker listening at address: [::]:7080
I0605 18:33:54.192151      18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-0.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.303595      18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-1.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.317525      18 groups.go:135] Connected to group zero. Assigned group: 0
I0605 18:33:54.317546      18 groups.go:137] Raft Id after connection to Zero: 0x1
I0605 18:33:54.317601      18 pool.go:160] CONNECTING to dgraph-live-dgraph-alpha-0.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080
I0605 18:33:54.317627      18 pool.go:160] CONNECTING to dgraph-live-dgraph-alpha-2.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080
I0605 18:33:54.317673      18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-2.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.317741      18 draft.go:200] Node ID: 0x1 with GroupID: 1
I0605 18:33:54.318455      18 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc1481c7b00 Applied:115 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x282e510 DisableProposalForwarding:false}
I0605 18:33:56.470893      18 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:115 Term:2 XXX_unrecognized:[]}
I0605 18:33:56.470933      18 node.go:317] Found hardstate: {Term:52 Vote:0 Commit:6907 XXX_unrecognized:[]}
I0605 18:33:59.115242      18 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0605 18:33:59.115275      18 admin.go:520] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.
I0605 18:34:04.121342      18 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0605 18:34:04.121386      18 admin.go:520] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.

We’re trying to figure out what to try next. It seems possible that the volumes are not fast enough, but we have no logs or data that make this clear. Another option would be splitting the cluster into multiple groups.

Three questions:
(1) How do we know if we need faster volumes?
(2) Does adding more groups to spread out predicates make the cluster more stable?
(3) SaaS is checked on your roadmap on GitHub for the first half of this year… is it out yet?

Thanks!

Hey @jake-nyquist, Welcome to Dgraph!
While I take a look at this, can you try once with the latest release: Dgraph version 20.03.3. It would help us narrow down the issue.

Will do, rolling out the upgrade now.

Great, also it would be awesome if you could share your helm overrides (values.yaml) and stateful sets for alpha and zero.

I feel the volumes you specified are fine.
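
One way to put some numbers behind that, using only the zero logs posted above: the `Raft.Ready took too long` warnings carry a per-phase breakdown, so a small script can count how often `disk` is the slowest phase. This is a rough sketch that assumes the log format shown in this thread; the function name is made up for illustration, and durations in other formats (for example fractional seconds) are simply skipped.

```python
import re
from collections import Counter

# Matches entries such as "{disk 257ms}" or "{proposals 0s}" in the Breakdown list.
PHASE = re.compile(r"\{(\w+) (\d+)(ms|s)\}")


def slowest_phases(log_lines):
    """Count, per phase, how many slow Raft.Ready warnings that phase dominated."""
    counts = Counter()
    for line in log_lines:
        if "Raft.Ready took too long" not in line:
            continue
        phases = {}
        for name, value, unit in PHASE.findall(line):
            # Normalize everything to milliseconds.
            phases[name] = int(value) * (1000 if unit == "s" else 1)
        # Skip warnings where every phase is reported as 0.
        if phases and max(phases.values()) > 0:
            counts[max(phases, key=phases.get)] += 1
    return counts


sample = [
    "W0604 23:07:08.266729 20 raft.go:733] Raft.Ready took too long to process: "
    "Timer Total: 257ms. Breakdown: [{disk 257ms} {proposals 0s} {advance 0s}].",
    "W0605 00:18:57.302604 20 raft.go:733] Raft.Ready took too long to process: "
    "Timer Total: 292ms. Breakdown: [{advance 292ms} {disk 0s} {proposals 0s}].",
]
print(slowest_phases(sample))
```

If `disk` dominates most of the warnings, that points toward storage latency; if the slow phase varies, as it does in parts of the log above, the cause is less likely to be the volumes alone.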

Don’t know, let me find out! :slight_smile:

It is :wink: ! But it is in closed beta :shushing_face: Tagging @zhenni and @dereksfoster99 who can help you with that. :slight_smile:

deployment_dgraph_values.yaml (9.1 KB) This is the values.yaml that we’re running.

How did you create your EKS cluster?

We’re using eksctl… 3 m5ad.xlarge instances in us-east-1, one instance in each availability zone. Trying to get dgraph nodes to be 1 per az.

By the way did upgrade help? Also can you give me the cluster logs?

It still seems to be crashing here:

[image]

What do you think the memory requirements are? I will try to pull the cluster logs out of CloudWatch.

Here are some of the logs, but missing previous ones because of the restarts.
zero0.log (20.9 KB) zero1.log (21.2 KB) zero2.log (27.7 KB) alpha0.log (34.3 KB) alpha2.log (72.9 KB)

Hello.

Assuming you installed this using:

HELM_RELEASE_NAME=my-dgraph
helm install $HELM_RELEASE_NAME -f deployment_dgraph_values.yaml dgraph/dgraph

It looks like you are running with the default values. I noticed that deployment_dgraph_values.yaml doesn’t actually set any values: everything is nested under a dgraph key, so none of the chart defaults are overridden. That is probably not what you intended.
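
To illustrate the nesting problem, here is a small Python sketch. The dictionaries are simplified, hypothetical stand-ins for the chart's values and an overrides file, not the actual contents of either, and the function name is invented for this example.

```python
def unrecognized_top_level_keys(overrides, chart_defaults):
    """Return override keys that do not exist at the top level of the chart's values."""
    return sorted(key for key in overrides if key not in chart_defaults)


# Hypothetical, trimmed-down shapes; real files would be loaded with a YAML parser.
chart_defaults = {"alpha": {"replicaCount": 3}, "zero": {"replicaCount": 3}}
nested_overrides = {"dgraph": {"alpha": {"replicaCount": 3}}}
flat_overrides = {"alpha": {"replicaCount": 3}}

print(unrecognized_top_level_keys(nested_overrides, chart_defaults))  # ['dgraph']
print(unrecognized_top_level_keys(flat_overrides, chart_defaults))    # []
```

Any key reported by the first call is one the chart templates never read, so everything nested beneath it has no effect on the deployment.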

For the EKS cluster, with eksctl, what is the command line or cluster.yaml that you are using? For example, I have spun up a cluster before using these values:

For instance size, we typically use i3.large (source: https://dgraph.io/docs/deploy/#using-kubernetes), so that could be a good one to start with. Typically, 16 GB should be sufficient, at least until your needs grow.

I am curious what is happening from the Kubernetes resource perspective, so it would help to get the describe output, especially for any failing pods, to see if it provides further information, e.g.

kubectl describe sts/$HELM_RELEASE_NAME-alpha
kubectl describe sts/$HELM_RELEASE_NAME-zero
kubectl describe pod/$HELM_RELEASE_NAME-alpha-{0..2}
kubectl describe pod/$HELM_RELEASE_NAME-zero-{0..2}

This would let us see any events, such as a lack of resources, that might have caused the node to fail.
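
As a rough illustration of what to look for in that output, the sketch below pulls the `Reason:` fields out of saved `kubectl describe pod` text. It assumes the usual `Key:   value` layout of describe output, and the function name is made up for this example.

```python
def termination_reasons(describe_text):
    """Collect the values of every 'Reason:' field in kubectl describe output."""
    reasons = []
    for line in describe_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Reason":
            reasons.append(value.strip())
    return reasons


# Shortened example of the container state section of a describe output.
sample = """\
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Last State:     Terminated
      Reason:       OOMKilled
"""
print(termination_reasons(sample))  # ['OOMKilled', 'OOMKilled']
```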

I will pull the helm command in a moment… we have modified this slightly because we are deploying dgraph to two namespaces on the same cluster.

We’re seeing the following in the control plane logs:

6m24s       Warning   Evicted                  pod/dgraph-dev-dgraph-alpha-1           The node was low on resource: memory. Container dgraph-dev-dgraph-alpha was using 4484260Ki, which exceeds its request of 100Mi.

As you can see in the chart above, Dgraph is consuming ~7 GB and then appears to be getting evicted. We’re going to try setting a hard limit of 6 GB, which will perhaps help Kubernetes schedule the pod properly and prevent evictions.
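
To put the eviction message in perspective, here is a quick sketch that converts the two quantities it mentions into bytes and compares them. It only handles plain integers and the binary suffixes that appear in this thread (Ki, Mi, Gi, Ti), and the helper name is invented for this example.

```python
# Binary suffixes used by Kubernetes memory quantities in this thread.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_bytes(quantity):
    """Convert a quantity such as '100Mi' or '4484260Ki' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


usage = to_bytes("4484260Ki")   # what the alpha container was using
request = to_bytes("100Mi")     # what the pod had requested
print(f"usage is {usage / request:.1f}x the memory request")
```

With a request that far below real usage, the scheduler has very little information about how much memory the alpha actually needs on a node.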

Ok, we applied a new resource limit, and kube is no longer evicting the alphas. Now, we are seeing restarts.

alpha-1

kubectl describe pod/dgraph-dev-dgraph-alpha-1
Name:           dgraph-dev-dgraph-alpha-1
Namespace:      dev
Priority:       0
Node:           ip-10-145-60-127.ec2.internal/10.145.60.127
Start Time:     Fri, 05 Jun 2020 15:10:25 -0700
Labels:         app=dgraph
                chart=dgraph-0.0.4
                component=alpha
                controller-revision-hash=dgraph-dev-dgraph-alpha-999655658
                release=dgraph-dev
                statefulset.kubernetes.io/pod-name=dgraph-dev-dgraph-alpha-1
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/path: /debug/prometheus_metrics
                prometheus.io/port: 8080
                prometheus.io/scrape: true
Status:         Running
IP:             10.145.48.133
Controlled By:  StatefulSet/dgraph-dev-dgraph-alpha
Containers:
  dgraph-dev-dgraph-alpha:
    Container ID:  docker://e1be335d428c1d8ab23c89e5db598c8066fed449268faf79815199a543223318
    Image:         docker.io/dgraph/dgraph:v20.03.3
    Image ID:      docker-pullable://dgraph/dgraph@sha256:1497b8eda8141857906a9b1412615f457e6a92fbf645276a9b5813fbf3342f19
    Ports:         7080/TCP, 8080/TCP, 9080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      bash
      -c
      set -ex
      dgraph alpha --my=$(hostname -f):7080 --lru_mb 2048 --zero dgraph-dev-dgraph-zero-0.dgraph-dev-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080

    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 05 Jun 2020 15:29:26 -0700
      Finished:     Fri, 05 Jun 2020 15:33:12 -0700
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 05 Jun 2020 15:22:54 -0700
      Finished:     Fri, 05 Jun 2020 15:27:52 -0700
    Ready:          False
    Restart Count:  5
    Limits:
      memory:  7Gi
    Requests:
      cpu:     2
      memory:  4Gi
    Environment:
      POD_NAMESPACE:  dev (v1:metadata.namespace)
    Mounts:
      /dgraph from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zkmwp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-dgraph-dev-dgraph-alpha-1
    ReadOnly:   false
  default-token-zkmwp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zkmwp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  role=primary
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                 From                                    Message
  ----     ------                  ----                ----                                    -------
  Normal   NotTriggerScaleUp       27m (x2 over 27m)   cluster-autoscaler                      pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 max limit reached, 2 node(s) had taints that the pod didn't tolerate
  Normal   NotTriggerScaleUp       25m (x5 over 26m)   cluster-autoscaler                      pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached, 2 node(s) had taints that the pod didn't tolerate, 1 node(s) had no available volume zone
  Normal   NotTriggerScaleUp       23m (x3 over 27m)   cluster-autoscaler                      pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 2 node(s) had taints that the pod didn't tolerate, 1 max limit reached
  Normal   NotTriggerScaleUp       23m (x15 over 27m)  cluster-autoscaler                      pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) had taints that the pod didn't tolerate, 1 node(s) had no available volume zone, 1 max limit reached
  Warning  FailedScheduling        23m (x5 over 27m)   default-scheduler                       0/5 nodes are available: 1 node(s) didn't match node selector, 2 Insufficient cpu, 2 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               23m                 default-scheduler                       Successfully assigned dev/dgraph-dev-dgraph-alpha-1 to ip-10-145-60-127.ec2.internal
  Normal   SuccessfulAttachVolume  23m                 attachdetach-controller                 AttachVolume.Attach succeeded for volume "pvc-3c5dad2c-bc2d-4f64-9c74-9d6310d055d3"
  Normal   Pulled                  10m (x5 over 22m)   kubelet, ip-10-145-60-127.ec2.internal  Container image "docker.io/dgraph/dgraph:v20.03.3" already present on machine
  Normal   Created                 10m (x5 over 22m)   kubelet, ip-10-145-60-127.ec2.internal  Created container dgraph-dev-dgraph-alpha
  Normal   Started                 10m (x5 over 22m)   kubelet, ip-10-145-60-127.ec2.internal  Started container dgraph-dev-dgraph-alpha
  Warning  BackOff                 14s (x14 over 17m)  kubelet, ip-10-145-60-127.ec2.internal  Back-off restarting failed container

zero-1

Name:           dgraph-dev-dgraph-zero-1
Namespace:      dev
Priority:       0
Node:           ip-10-145-60-127.ec2.internal/10.145.60.127
Start Time:     Fri, 05 Jun 2020 12:22:46 -0700
Labels:         app=dgraph
                chart=dgraph-0.0.4
                component=zero
                controller-revision-hash=dgraph-dev-dgraph-zero-58b4c564bc
                release=dgraph-dev
                statefulset.kubernetes.io/pod-name=dgraph-dev-dgraph-zero-1
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/path: /debug/prometheus_metrics
                prometheus.io/port: 6080
                prometheus.io/scrape: true
Status:         Running
IP:             10.145.33.6
Controlled By:  StatefulSet/dgraph-dev-dgraph-zero
Containers:
  dgraph-dev-dgraph-zero:
    Container ID:  docker://5ba1f21307c1b6c5591583d2cf3e85bdf1788b9f07a77cebc4a6ff83ca661707
    Image:         docker.io/dgraph/dgraph:v20.03.3
    Image ID:      docker-pullable://dgraph/dgraph@sha256:1497b8eda8141857906a9b1412615f457e6a92fbf645276a9b5813fbf3342f19
    Ports:         5080/TCP, 6080/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      bash
      -c
      set -ex
      [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
       ordinal=${BASH_REMATCH[1]}
       idx=$(($ordinal + 1))
       if [[ $ordinal -eq 0 ]]; then
         exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 5
       else
         exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-dev-dgraph-zero-0.dgraph-dev-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 5
       fi 
      
    State:          Running
      Started:      Fri, 05 Jun 2020 12:22:54 -0700
    Ready:          True
    Restart Count:  0
    Requests:
      memory:  100Mi
    Environment:
      POD_NAMESPACE:  dev (v1:metadata.namespace)
    Mounts:
      /dgraph from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zkmwp (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-dgraph-dev-dgraph-zero-1
    ReadOnly:   false
  default-token-zkmwp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zkmwp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  role=primary
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

alpha stateful set:

Name:               dgraph-dev-dgraph-alpha
Namespace:          dev
CreationTimestamp:  Thu, 04 Jun 2020 13:48:30 -0700
Selector:           app=dgraph,chart=dgraph-0.0.4,component=alpha,release=dgraph-dev
Labels:             app=dgraph
                    chart=dgraph-0.0.4
                    component=alpha
                    heritage=Tiller
                    release=dgraph-dev
Annotations:        <none>
Replicas:           3 desired | 3 total
Update Strategy:    RollingUpdate
Pods Status:        3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=dgraph
                chart=dgraph-0.0.4
                component=alpha
                release=dgraph-dev
  Annotations:  prometheus.io/path: /debug/prometheus_metrics
                prometheus.io/port: 8080
                prometheus.io/scrape: true
  Containers:
   dgraph-dev-dgraph-alpha:
    Image:       docker.io/dgraph/dgraph:v20.03.3
    Ports:       7080/TCP, 8080/TCP, 9080/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      bash
      -c
      set -ex
      dgraph alpha --my=$(hostname -f):7080 --lru_mb 2048 --zero dgraph-dev-dgraph-zero-0.dgraph-dev-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080
      
    Limits:
      memory:  7Gi
    Requests:
      cpu:     2
      memory:  4Gi
    Environment:
      POD_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /dgraph from datadir (rw)
  Volumes:
   datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir
    ReadOnly:   false
Volume Claims:
  Name:          datadir
  StorageClass:  io1-fast-retain
  Labels:        <none>
  Annotations:   volume.alpha.kubernetes.io/storage-class=anything
  Capacity:      40Gi
  Access Modes:  [ReadWriteOnce]
Events:
  Type     Reason               Age                  From                    Message
  ----     ------               ----                 ----                    -------
  Warning  RecreatingFailedPod  56m (x145 over 22h)  statefulset-controller  StatefulSet dev/dgraph-dev-dgraph-alpha is recreating failed Pod dgraph-dev-dgraph-alpha-0
  Normal   SuccessfulDelete     45m (x194 over 23h)  statefulset-controller  delete Pod dgraph-dev-dgraph-alpha-1 in StatefulSet dgraph-dev-dgraph-alpha successful
  Normal   SuccessfulDelete     40m (x8 over 23h)    statefulset-controller  delete Pod dgraph-dev-dgraph-alpha-2 in StatefulSet dgraph-dev-dgraph-alpha successful
  Warning  RecreatingFailedPod  34m (x194 over 23h)  statefulset-controller  StatefulSet dev/dgraph-dev-dgraph-alpha is recreating failed Pod dgraph-dev-dgraph-alpha-1
  Normal   SuccessfulDelete     28m (x149 over 22h)  statefulset-controller  delete Pod dgraph-dev-dgraph-alpha-0 in StatefulSet dgraph-dev-dgraph-alpha successful
  Normal   SuccessfulCreate     23m (x142 over 25h)  statefulset-controller  create Pod dgraph-dev-dgraph-alpha-0 in StatefulSet dgraph-dev-dgraph-alpha successful

crash-looping alpha-1 logs:

++ hostname -f
+ dgraph alpha --my=dgraph-dev-dgraph-alpha-1.dgraph-dev-dgraph-alpha-headless.dev.svc.cluster.local:7080 --lru_mb 2048 --zero dgraph-dev-dgraph-zero-0.dgraph-dev-dgraph-zero-headless.dev.svc.cluster.local:5080
[Decoder]: Using assembly version of decoder
[Sentry] 2020/06/05 22:35:58 Integration installed: ContextifyFrames
[Sentry] 2020/06/05 22:35:58 Integration installed: Environment
[Sentry] 2020/06/05 22:35:58 Integration installed: Modules
[Sentry] 2020/06/05 22:35:58 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
[Sentry] 2020/06/05 22:35:58 Integration installed: ContextifyFrames
[Sentry] 2020/06/05 22:35:58 Integration installed: Environment
[Sentry] 2020/06/05 22:35:58 Integration installed: Modules
[Sentry] 2020/06/05 22:35:58 Integration installed: IgnoreErrors
I0605 22:35:58.417064      17 init.go:99] 

Dgraph version   : v20.03.3
Dgraph SHA-256   : 08424035910be6b6720570427948bab8352a0b5a6d59a0d20c3ec5ed29533121
Commit SHA-1     : fa3c19120
Commit timestamp : 2020-06-02 16:47:25 -0700
Branch           : HEAD
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit https://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0605 22:35:58.417546      17 run.go:608] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I0605 22:35:58.417582      17 run.go:609] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr:dgraph-dev-dgraph-alpha-1.dgraph-dev-dgraph-alpha-headless.dev.svc.cluster.local:7080 ZeroAddr:[dgraph-dev-dgraph-zero-0.dgraph-dev-dgraph-zero-headless.dev.svc.cluster.local:5080] RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-06-05 22:35:58.133536044 +0000 UTC m=+0.013440182 LudicrousMode:false BadgerKeyFile:}
I0605 22:35:58.417623      17 run.go:610] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: BadgerCompressionLevel:3 WALDir:w MutationsMode:0 AuthToken: AllottedMemory:2048 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
I0605 22:35:58.417693      17 server_state.go:75] Setting Badger Compression Level: 3
I0605 22:35:58.417712      17 server_state.go:84] Setting Badger table load option: mmap
I0605 22:35:58.417717      17 server_state.go:96] Setting Badger value log load option: mmap
I0605 22:35:58.417723      17 server_state.go:141] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x28325d0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 KeepBlockIndicesInCache:false KeepBlocksInCache:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 22:35:58.512423      17 log.go:34] All 3 tables opened in 87ms
I0605 22:35:58.675073      17 log.go:34] Replaying file id: 1320 at offset: 8242372
I0605 22:35:58.675631      17 log.go:34] Replay took: 522.537µs
I0605 22:35:58.676094      17 log.go:34] Replaying file id: 1321 at offset: 0
I0605 22:35:58.737192      17 log.go:34] Replay took: 61.072561ms
I0605 22:35:58.737710      17 log.go:34] Replaying file id: 1322 at offset: 0
I0605 22:35:58.807857      17 log.go:34] Replay took: 70.119278ms
I0605 22:35:58.808547      17 log.go:34] Replaying file id: 1323 at offset: 0
I0605 22:35:58.868399      17 log.go:34] Replay took: 59.813756ms
I0605 22:35:58.868984      17 log.go:34] Replaying file id: 1324 at offset: 0
I0605 22:35:58.919768      17 log.go:34] Replay took: 50.757499ms
I0605 22:35:58.920311      17 log.go:34] Replaying file id: 1325 at offset: 0
I0605 22:35:58.973864      17 log.go:34] Replay took: 53.529235ms
I0605 22:35:58.974393      17 log.go:34] Replaying file id: 1326 at offset: 0
I0605 22:35:59.112355      17 log.go:34] Replay took: 137.920994ms
I0605 22:35:59.112676      17 server_state.go:75] Setting Badger Compression Level: 3
I0605 22:35:59.112691      17 server_state.go:84] Setting Badger table load option: mmap
I0605 22:35:59.112697      17 server_state.go:96] Setting Badger value log load option: mmap
I0605 22:35:59.112707      17 server_state.go:164] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x28325d0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 KeepBlockIndicesInCache:true KeepBlocksInCache:true managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 22:36:00.048454      17 log.go:34] All 91 tables opened in 909ms
I0605 22:36:00.063875      17 log.go:34] Replaying file id: 23 at offset: 0
I0605 22:36:01.315087      17 log.go:34] Got compaction priority: {level:1 score:1.2208955474197865 dropPrefix:[]}
I0605 22:36:01.315258      17 log.go:34] Running for level: 1
I0605 22:36:01.595335      17 log.go:34] Got compaction priority: {level:1 score:1.1243503205478191 dropPrefix:[]}
I0605 22:36:01.595616      17 log.go:34] Running for level: 1
I0605 22:36:08.627094      17 log.go:34] LOG Compact 1->2, del 5 tables, add 4 tables, took 7.311798658s
I0605 22:36:08.627315      17 log.go:34] Compaction for level: 1 DONE
I0605 22:36:08.627400      17 log.go:34] Got compaction priority: {level:1 score:1.0796349346637726 dropPrefix:[]}
I0605 22:36:08.627552      17 log.go:34] Running for level: 1
I0605 22:36:11.573709      17 log.go:34] Replay took: 11.509798048s
I0605 22:36:11.575815      17 log.go:34] Replaying file id: 25 at offset: 0
I0605 22:36:13.977067      17 log.go:34] LOG Compact 1->2, del 7 tables, add 7 tables, took 12.381274342s
I0605 22:36:13.977216      17 log.go:34] Compaction for level: 1 DONE
I0605 22:36:16.903854      17 log.go:34] Replay took: 5.327982588s
I0605 22:36:16.904719      17 log.go:34] Replaying file id: 26 at offset: 0
I0605 22:36:17.066554      17 log.go:34] LOG Compact 1->2, del 6 tables, add 5 tables, took 8.43892995s
I0605 22:36:17.066619      17 log.go:34] Compaction for level: 1 DONE
I0605 22:36:17.066825      17 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0605 22:36:17.066857      17 log.go:34] Running for level: 0
I0605 22:36:24.540751      17 log.go:34] Replay took: 7.636001835s
I0605 22:36:24.546334      17 log.go:34] Replaying file id: 28 at offset: 0
I0605 22:36:29.129834      17 log.go:34] Replay took: 4.583456501s
I0605 22:36:29.130446      17 log.go:34] Replaying file id: 29 at offset: 0
I0605 22:36:33.364181      17 log.go:34] LOG Compact 0->1, del 15 tables, add 11 tables, took 16.297298796s
I0605 22:36:33.364247      17 log.go:34] Compaction for level: 0 DONE
I0605 22:36:33.364282      17 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0605 22:36:33.364310      17 log.go:34] Running for level: 0
I0605 22:36:33.594709      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:33.769234      17 log.go:34] Replay took: 4.638753754s
I0605 22:36:33.769948      17 log.go:34] Replaying file id: 30 at offset: 0
I0605 22:36:34.602232      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:35.589766      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:36.591415      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:37.282908      17 log.go:34] Replay took: 3.51294179s
I0605 22:36:37.283441      17 log.go:34] Replaying file id: 32 at offset: 0
I0605 22:36:37.588969      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:38.589098      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:39.588473      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:40.588861      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:41.588809      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:42.590654      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:43.588529      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:44.588765      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:45.588622      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:46.589379      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:47.588576      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:48.588775      17 log.go:34] Got compaction priority: {level:1 score:1.0463079996407032 dropPrefix:[]}
I0605 22:36:48.969306      17 log.go:34] LOG Compact 0->1, del 16 tables, add 12 tables, took 15.604972375s
I0605 22:36:48.969375      17 log.go:34] Compaction for level: 0 DONE
I0605 22:36:48.969405      17 log.go:34] Got compaction priority: {level:1 score:1.1713070794939995 dropPrefix:[]}
I0605 22:36:48.969469      17 log.go:34] Running for level: 1
I0605 22:36:49.588466      17 log.go:34] Got compaction priority: {level:1 score:1.0756612867116928 dropPrefix:[]}
I0605 22:36:49.588575      17 log.go:34] Running for level: 1
I0605 22:36:53.624815      17 log.go:34] Replay took: 16.341356021s
I0605 22:36:53.626226      17 log.go:34] Replaying file id: 33 at offset: 0
I0605 22:36:59.661144      17 log.go:34] LOG Compact 1->2, del 6 tables, add 5 tables, took 10.072529789s
I0605 22:36:59.661242      17 log.go:34] Compaction for level: 1 DONE
I0605 22:36:59.661270      17 log.go:34] Got compaction priority: {level:0 score:1.8 dropPrefix:[]}
I0605 22:36:59.661307      17 log.go:34] Running for level: 0
I0605 22:37:00.060919      17 log.go:34] Replay took: 6.434672385s
I0605 22:37:00.065517      17 log.go:34] Replaying file id: 34 at offset: 0
I0605 22:37:02.622775      17 log.go:34] LOG Compact 1->2, del 7 tables, add 6 tables, took 13.653276632s
I0605 22:37:02.623321      17 log.go:34] Compaction for level: 1 DONE
I0605 22:37:02.623619      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:03.314549      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:04.314390      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:05.313986      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:05.847783      17 log.go:34] Replay took: 5.78197229s
I0605 22:37:05.849232      17 log.go:34] Replaying file id: 35 at offset: 0
I0605 22:37:06.315342      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:07.314154      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:08.314644      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:09.316532      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:10.313992      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:11.146590      17 log.go:34] Replay took: 5.297328412s
I0605 22:37:11.147670      17 log.go:34] Replaying file id: 36 at offset: 0
I0605 22:37:11.313941      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:12.314003      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:13.313949      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:14.313937      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:15.313961      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:16.313984      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:17.047058      17 log.go:34] Replay took: 5.899285557s
I0605 22:37:17.048969      17 log.go:34] Replaying file id: 37 at offset: 0
I0605 22:37:17.315118      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:18.313960      17 log.go:34] Got compaction priority: {level:1 score:1.0207197666168213 dropPrefix:[]}
I0605 22:37:18.875483      17 log.go:34] LOG Compact 0->1, del 19 tables, add 12 tables, took 19.214139767s
I0605 22:37:18.875558      17 log.go:34] Compaction for level: 0 DONE
I0605 22:37:18.875589      17 log.go:34] Got compaction priority: {level:1 score:1.1544597744941711 dropPrefix:[]}
I0605 22:37:18.878380      17 log.go:34] Running for level: 1
I0605 22:37:19.314242      17 log.go:34] Got compaction priority: {level:1 score:1.0579145476222038 dropPrefix:[]}
I0605 22:37:19.314427      17 log.go:34] Running for level: 1
[Sentry] 2020/06/05 22:37:27 ModuleIntegration wasn't able to extract modules: module integration failed
[Sentry] 2020/06/05 22:37:27 Sending fatal event [1841064565464d7bae2df7acfd25dd70] to o318308.ingest.sentry.io project: 1805390
2020/06/05 22:37:27 Unable to replay logfile. Path=p/000037.vlog. Error=read p/000037.vlog: cannot allocate memory
During db.vlog.open
github.com/dgraph-io/badger/v2/y.Wrapf
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/y/error.go:82
github.com/dgraph-io/badger/v2.Open
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/db.go:381
github.com/dgraph-io/badger/v2.OpenManaged
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/managed_db.go:26
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:167
github.com/dgraph-io/dgraph/worker.InitServerState
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:57
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:612
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:90
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
main.main
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
runtime.main
	/usr/local/go/src/runtime/proc.go:203
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
Error while creating badger KV posting store
github.com/dgraph-io/dgraph/x.Checkf
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:168
github.com/dgraph-io/dgraph/worker.InitServerState
	/ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:57
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:612
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:90
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
main.main
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
runtime.main
	/usr/local/go/src/runtime/proc.go:203
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
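
To get a quick read on a startup log like the one above, the value log replay times can be totalled and the memory allocation failure picked out. This is only a minimal sketch: it assumes the log lines look exactly like the ones pasted here (single-unit durations such as `61.072561ms`), and `summarize_alpha_log` is a hypothetical helper name, not part of Dgraph or Badger.

```python
import re

# Matches Badger's "Replay took: <duration>" lines as they appear in the log above.
# Only single-unit durations (µs, ms, s) are handled; this is an assumption.
REPLAY_RE = re.compile(r"Replay took: ([0-9.]+)(µs|ms|s)\b")
UNIT_SECONDS = {"µs": 1e-6, "ms": 1e-3, "s": 1.0}


def summarize_alpha_log(lines):
    """Return (total replay time in seconds, lines reporting allocation failures)."""
    total_replay = 0.0
    alloc_failures = []
    for line in lines:
        match = REPLAY_RE.search(line)
        if match:
            total_replay += float(match.group(1)) * UNIT_SECONDS[match.group(2)]
        if "cannot allocate memory" in line:
            alloc_failures.append(line.strip())
    return total_replay, alloc_failures


sample = [
    "I0605 22:35:58.675631      17 log.go:34] Replay took: 522.537µs",
    "I0605 22:35:58.737192      17 log.go:34] Replay took: 61.072561ms",
    "I0605 22:36:11.573709      17 log.go:34] Replay took: 11.509798048s",
    "2020/06/05 22:37:27 Unable to replay logfile. Path=p/000037.vlog. "
    "Error=read p/000037.vlog: cannot allocate memory",
]
total, failures = summarize_alpha_log(sample)
```

Running it over the full pod log (for example the output of `kubectl logs`) shows how long the alpha spends replaying before it hits the allocation error.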

For the data size, we recommend 16GB or 32GB of memory.

Could you get a memory profile:

It looks like with the current limits:

It looks like there’s not enough memory, as the pods are being OOMKilled:

and Sentry log

We will provision larger instances over the weekend and grab a memory profile from those instances.
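
For anyone following along, pulling a heap profile from a running alpha could look roughly like the sketch below. It assumes the alpha serves Go's standard pprof handlers under `/debug/pprof` on its HTTP port (8080 unless a port offset is set) and that the port is reachable, for example through a `kubectl port-forward`. The function names and the output filename are hypothetical.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def heap_profile_url(host="localhost", port=8080):
    """Build the URL of the heap profile endpoint (assumed to be /debug/pprof/heap)."""
    return urlunsplit(("http", f"{host}:{port}", "/debug/pprof/heap", "", ""))


def save_heap_profile(path="alpha-heap.pb.gz", host="localhost", port=8080, timeout=30.0):
    """Download the heap profile and write it to `path` for later inspection."""
    with urlopen(heap_profile_url(host, port), timeout=timeout) as resp:
        data = resp.read()
    with open(path, "wb") as out:
        out.write(data)
    return path


# Example (requires a reachable alpha, so it is left commented out):
# save_heap_profile("alpha-1-heap.pb.gz")
```

The saved file can then be opened with `go tool pprof alpha-1-heap.pb.gz` and attached to the thread.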

I’m VERY interested in the SAAS too. @zhenni & @dereksfoster99

@Mentioum please DM me your email address and I will add you to our waiting list. :slight_smile: