Cluster fails to accept connection requests, weird things in logs

bug

(Charles Lanahan) #1

Through regular usage, our Dgraph cluster became unresponsive, and its logs look different than usual. It seems that none of the other zeros or alphas can connect to dgraph-zero-0, and this is causing some kind of contention. We restarted the cluster and the issue persisted. We can recreate the cluster, but we wanted to see how to troubleshoot this type of thing in the future and understand why it happened.

Tails of logs

curl to the /state endpoint

```
{"errors":[{"code":"Error","message":"context deadline exceeded"}]}
```
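For anyone scripting this check, here is a minimal Python sketch that fetches `/state` with a client-side timeout and summarizes the body. The function names are made up for illustration, and the "ok" branch assumes the `zeros` map shape shown later in this thread. It only tells an error response from a normal one; it says nothing about why Zero timed out.

```python
import json
import urllib.request


def summarize_state_body(body: str) -> str:
    """Return a one-line summary of a /state response body."""
    doc = json.loads(body)
    errors = doc.get("errors")
    if errors:
        # Same shape as the error in this post: a list of {"code", "message"}.
        return "error: " + "; ".join(e.get("message", "") for e in errors)
    zeros = doc.get("zeros", {})
    leaders = [z.get("addr") for z in zeros.values() if z.get("leader")]
    return f"ok: {len(zeros)} zero(s) listed, leader(s): {leaders}"


def fetch_state(base_url: str, timeout: float = 5.0) -> str:
    """GET <base_url>/state and summarize it.

    urlopen raises on connection failures, timeouts, and non-2xx statuses;
    those are left to the caller in this sketch.
    """
    with urllib.request.urlopen(base_url + "/state", timeout=timeout) as resp:
        return summarize_state_body(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # The exact body returned in this post.
    print(summarize_state_body(
        '{"errors":[{"code":"Error","message":"context deadline exceeded"}]}'
    ))
```

Against a live Zero this would be called as `fetch_state("http://localhost:6080")`, using the same host and port as the `/state` request later in the thread.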

Creating this topic as requested by @hackintoshrao after his help troubleshooting on Slack: https://dgraph.slack.com/archives/C13LH03RR/p1564408224283800

(Charles Lanahan) #2

Updates:

Restarting the cluster doesn’t seem to fix the issue (as it seemed to fix other issues in the past). More troubleshooting steps taken by the team are below.

Elie 12:10 PM

I was querying the DB. Everything was fine and suddenly it crashed. The query wasn’t requesting a lot of info. Just mentioning this in case it is useful.

Paul 12:14 PM

Not sure what happened. But, from log:

dgraph-alpha-0 1/1 Running 0 16d
dgraph-alpha-1 1/1 Running 0 177m
dgraph-alpha-2 1/1 Running 0 176m
dgraph-ratel-8599875574-69m2f 1/1 Running 0 3h4m
dgraph-zero-0 1/1 Running 0 177m
dgraph-zero-1 1/1 Running 0 16d
dgraph-zero-2 1/1 Running 0 177m

12:15 PM

So, most of the pods refreshed around 3 or 4 hours ago.

/health returns nothing.
I used “/usr/local/bin/kubectl delete --all pods --namespace=default”

12:48 PM

Now, all of the pods are re-started.

12:48 PM

Can someone help check whether it is working or not?

12:53 PM

Still, /health for all 3 alphas returns 503.
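A rough sketch of how the `/health` check across the alphas could be scripted in Python. The host names follow the addresses in the `/state` output in this thread. Port 8080 is an assumption (the default Alpha HTTP port, which this thread does not confirm for this deployment), and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Host names follow the addresses reported by /state in this thread.
ALPHA_HOSTS = [
    f"dgraph-alpha-{i}.dgraph-alpha.default.svc.cluster.local" for i in range(3)
]


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 503 land here; the code is still useful.
        return err.code
    except OSError:
        # Connection refused, DNS failure, or timeout (URLError is an OSError).
        return None


def unhealthy_alphas(hosts, status_of=http_status, port=8080):
    """Map each host whose /health is not 200 to the status it returned."""
    result = {}
    for host in hosts:
        code = status_of(f"http://{host}:{port}/health")
        if code != 200:
            result[host] = code
    return result


if __name__ == "__main__":
    for host, code in unhealthy_alphas(ALPHA_HOSTS).items():
        print(host, "->", code)
```

`status_of` is a parameter only so the loop can be exercised without a cluster. A `None` value means the pod did not answer at all, which is a different symptom from a 503.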

12:53 PM

state for zero-0:

12:53 PM

localhost:6080/state

```
{
    "counter": "3060",
    "groups": {
        "1": {
            "members": {
                "1": {
                    "id": "1",
                    "groupId": 1,
                    "addr": "dgraph-alpha-1.dgraph-alpha.default.svc.cluster.local:7080",
                    "lastUpdate": "1562964393"
                },
                "2": {
                    "id": "2",
                    "groupId": 1,
                    "addr": "dgraph-alpha-2.dgraph-alpha.default.svc.cluster.local:7080",
                    "leader": true,
                    "lastUpdate": "1564404867"
                },
                "3": {
                    "id": "3",
                    "groupId": 1,
                    "addr": "dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080"
                }
            },
            "tablets": {
                "B": {
                    "groupId": 1,
                    "predicate": "B"
                },
                "K": {
                    "groupId": 1,
                    "predicate": "K"
                },
                "__predicate__": {
                    "groupId": 1,
                    "predicate": "__predicate__"
                },
                "_predicate_": {
                    "groupId": 1,
                    "predicate": "_predicate_",
                    "space": "28969510318"
                },
                "connectionCount": {
                    "groupId": 1,
                    "predicate": "connectionCount",
                    "space": "101403260"
                },
                "dgraph.group.acl": {
                    "groupId": 1,
                    "predicate": "dgraph.group.acl",
                    "space": "39"
                },
                "dgraph.password": {
                    "groupId": 1,
                    "predicate": "dgraph.password",
                    "space": "37"
                },
                "dgraph.user.group": {
                    "groupId": 1,
                    "predicate": "dgraph.user.group",
                    "space": "43"
                },
                "dgraph.xid": {
                    "groupId": 1,
                    "predicate": "dgraph.xid",
                    "space": "36"
                },
                "first": {
                    "groupId": 1,
                    "predicate": "first",
                    "space": "1075104768"
                },
                "k": {
                    "groupId": 1,
                    "predicate": "k",
                    "space": "418196427"
                },
                "kvalue": {
                    "groupId": 1,
                    "predicate": "kvalue",
                    "space": "2845039814"
                },
                "last": {
                    "groupId": 1,
                    "predicate": "last",
                    "space": "1074753534"
                },
                "lnkdns": {
                    "groupId": 1,
                    "predicate": "lnkdns",
                    "space": "34848071"
                },
                "name": {
                    "groupId": 1,
                    "predicate": "name",
                    "space": "342999430"
                },
                "profileUrl": {
                    "groupId": 1,
                    "predicate": "profileUrl",
                    "space": "137552039"
                },
                "raw": {
                    "groupId": 1,
                    "predicate": "raw",
                    "space": "1164637358"
                },
                "rawUrl": {
                    "groupId": 1,
                    "predicate": "rawUrl",
                    "space": "133276891"
                },
                "relatedPerson": {
                    "groupId": 1,
                    "predicate": "relatedPerson",
                    "space": "4928043531"
                },
                "relationship": {
                    "groupId": 1,
                    "predicate": "relationship",
                    "space": "2627918240"
                },
                "relationshipsubtype": {
                    "groupId": 1,
                    "predicate": "relationshipsubtype",
                    "space": "6287191971"
                },
                "relationshiptype": {
                    "groupId": 1,
                    "predicate": "relationshiptype",
                    "space": "6061156018"
                },
                "sk": {
                    "groupId": 1,
                    "predicate": "sk",
                    "space": "820229922"
                },
                "skvalue": {
                    "groupId": 1,
                    "predicate": "skvalue",
                    "space": "1213685113"
                },
                "t": {
                    "groupId": 1,
                    "predicate": "t",
                    "space": "2516093355"
                },
                "ts_cuid": {
                    "groupId": 1,
                    "predicate": "ts_cuid"
                },
                "ts_identifier": {
                    "groupId": 1,
                    "predicate": "ts_identifier"
                },
                "tvalue": {
                    "groupId": 1,
                    "predicate": "tvalue",
                    "space": "104"
                },
                "type": {
                    "groupId": 1,
                    "predicate": "type",
                    "space": "14316408120"
                },
                "validAsOf": {
                    "groupId": 1,
                    "predicate": "validAsOf",
                    "space": "1103574432"
                }
            },
            "snapshotTs": "19452",
            "checksum": "8025261322048805177"
        }
    },
    "zeros": {
        "1": {
            "id": "1",
            "addr": "dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080"
        },
        "3": {
            "id": "3",
            "addr": "dgraph-zero-2.dgraph-zero.default.svc.cluster.local:5080",
            "leader": true
        }
    },
    "maxLeaseId": "30320000",
    "maxTxnTs": "30000",
    "maxRaftId": "3",
    "cid": "a0b4514d-1491-41cb-807d-17bd42af9cf7"
}
```
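To make a dump like this easier to scan, here is a hedged Python sketch that lists membership oddities in a parsed `/state` document. It compares the zero pod names from the pod listing with the hosts in the `zeros` map, and flags alpha members whose `lastUpdate` is missing or much older than the newest one in the group. Treating `lastUpdate` as a Unix timestamp in seconds is an assumption based on how the values look. The function name and the one-hour threshold are hypothetical.

```python
def membership_findings(state, expected_zeros, stale_after=3600):
    """List membership oddities in a parsed Zero /state document.

    expected_zeros: pod names you expect to see, e.g. from `kubectl get pods`.
    stale_after: gap in seconds (assumes lastUpdate is a Unix timestamp).
    """
    findings = []

    # Zeros: compare the pod names we expect with the hosts Zero reports.
    listed = {z["addr"].split(".")[0] for z in state.get("zeros", {}).values()}
    for name in expected_zeros:
        if name not in listed:
            findings.append(f"zero {name} is not listed")

    # Alphas: flag members with a missing or comparatively old lastUpdate.
    for gid, group in state.get("groups", {}).items():
        members = list(group.get("members", {}).values())
        stamps = [int(m["lastUpdate"]) for m in members if "lastUpdate" in m]
        newest = max(stamps, default=0)
        for m in members:
            host = m["addr"].split(".")[0]
            if "lastUpdate" not in m:
                findings.append(f"group {gid}: {host} has no lastUpdate")
            elif newest - int(m["lastUpdate"]) > stale_after:
                lag = newest - int(m["lastUpdate"])
                findings.append(f"group {gid}: {host} lastUpdate lags by {lag}s")
    return findings


# Trimmed copy of the /state output above (tablets and counters omitted).
SUFFIX_A = ".dgraph-alpha.default.svc.cluster.local:7080"
SUFFIX_Z = ".dgraph-zero.default.svc.cluster.local:5080"
SAMPLE = {
    "groups": {"1": {"members": {
        "1": {"id": "1", "addr": "dgraph-alpha-1" + SUFFIX_A,
              "lastUpdate": "1562964393"},
        "2": {"id": "2", "addr": "dgraph-alpha-2" + SUFFIX_A,
              "leader": True, "lastUpdate": "1564404867"},
        "3": {"id": "3", "addr": "dgraph-alpha-0" + SUFFIX_A},
    }}},
    "zeros": {
        "1": {"id": "1", "addr": "dgraph-zero-0" + SUFFIX_Z},
        "3": {"id": "3", "addr": "dgraph-zero-2" + SUFFIX_Z, "leader": True},
    },
}

if __name__ == "__main__":
    pods = ["dgraph-zero-0", "dgraph-zero-1", "dgraph-zero-2"]
    for line in membership_findings(SAMPLE, pods):
        print(line)
```

Run against the dump above, this reports that dgraph-zero-1 is absent from the `zeros` map, that dgraph-alpha-0 has no `lastUpdate`, and that dgraph-alpha-1 trails the newest member by 1440474 seconds (a bit over 16 days). Whether any of these explains the outage is not something the dump alone can confirm.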