Diverged replicas

We are running Dgraph 1.0.11 on Kubernetes, and our configuration is the same as https://github.com/dgraph-io/dgraph/blob/master/contrib/config/kubernetes/helm/values.yaml (without shardReplicaCount: 5). Some time ago we noticed unusual system behavior, and it turned out that the same query returns different results on different nodes, which means our replicas have diverged. How is this possible? What is the best way to fix it now and get all the nodes back into the same, correct state? And how can we prevent similar situations in the future?

Hi,

Could you share the query that returns different results, along with an example dataset?

Thanks

Hi, here is that info.

Example query:

{
  result (func: has(childOf)) {
    count(uid)
  }
}

On one Alpha we get count: 1344, and on the other two count: 719.
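For reference, this is roughly how we compared the replicas (a minimal sketch, not our exact tooling): we port-forward each Alpha pod separately and send the same query to its HTTP endpoint, since going through the Kubernetes service would load-balance the requests. The pod addresses, local ports, and the raw-body /query behavior of Dgraph 1.0.x are assumptions here.

import requests

QUERY = """
{
  result (func: has(childOf)) {
    count(uid)
  }
}
"""

# Assumed local port-forwards, e.g. `kubectl port-forward pod/dgraph-alpha-0 8080:8080`.
ALPHAS = {
    "alpha-0": "http://localhost:8080",
    "alpha-1": "http://localhost:8081",
    "alpha-2": "http://localhost:8082",
}

for name, base_url in ALPHAS.items():
    # Dgraph 1.0.x accepts the query as the raw POST body on /query.
    resp = requests.post(f"{base_url}/query", data=QUERY)
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])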

Also, while querying for:

{
  result (func: has(childOf)) {
    uid
    created
    childOf {
      uid
      created
    }
  }
}

we get strange entries that contain only a uid and none of the requested predicates, like this:

{
  "data": {
    "result": [
      {
        "uid": "0x31409",
        "created": "2020-01-06T15:38:16.276Z",
        "childOf": [
          {
            "uid": "0x31408",
            "created": "2019-02-19T12:38:58.874Z"
          }
        ]
      },
      {
        "uid": "0x3140a",
        "created": "2020-01-06T15:41:31.918Z",
        "childOf": [
          {
            "uid": "0x31408",
            "created": "2019-02-19T12:38:58.874Z"
          }
        ]
      },
      {
        "uid": "0x31430"
      },
      {
        "uid": "0x31458"
      },
      {
        "uid": "0x31459"
      },
      {
        "uid": "0x35b85",
        "created": "2020-01-07T19:33:02.038Z",
        "childOf": [
          {
            "uid": "0x314ff",
            "created": "2020-01-06T22:01:19.028Z"
          }
        ]
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 14974,
      "processing_ns": 23365794,
      "encoding_ns": 31544993
    },
    "txn": {
      "start_ts": 1570337
    }
  }
}
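To narrow down whether those predicates are missing from the data itself or only from the has(childOf) index, a sketch like the following (same assumptions as above about addresses and the /query endpoint) looks up one of the uid-only nodes directly on every replica:

import requests

# Look up one of the "empty" nodes by uid instead of going through has(childOf).
CHECK = """
{
  check(func: uid(0x31430)) {
    uid
    created
    childOf {
      uid
    }
  }
}
"""

ALPHAS = ["http://localhost:8080", "http://localhost:8081", "http://localhost:8082"]

for base_url in ALPHAS:
    resp = requests.post(f"{base_url}/query", data=CHECK)
    resp.raise_for_status()
    print(base_url, resp.json()["data"]["check"])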

Also, we think it may be related to an Alpha node crash that we observed around the same time. It looks like the Alphas had a lot of trouble electing a leader; messages exactly like these repeated in the broken Alpha's logs for around 10 minutes:

I0114 16:12:13.585220       1 node.go:83] 1 is starting a new election at term 55
I0114 16:12:13.585249       1 node.go:83] 1 became pre-candidate at term 55
I0114 16:12:13.585256       1 node.go:83] 1 received MsgPreVoteResp from 1 at term 55
I0114 16:12:13.585398       1 node.go:83] 1 [logterm: 55, index: 1308004] sent MsgPreVote request to 2 at term 55
I0114 16:12:13.585523       1 node.go:83] 1 [logterm: 55, index: 1308004] sent MsgPreVote request to 3 at term 55

After that it became a follower and received a snapshot:

I0114 16:25:49.966787       1 predicate.go:86] Snapshot writes DONE. Sending ACK
I0114 16:25:49.966963       1 predicate.go:91] Populated snapshot with 400447 keys.
I0114 16:25:49.968977       1 draft.go:702] ---> Retrieve snapshot: OK.

Then it seemingly went back to normal, but it now holds the broken data:

I0114 16:25:50.055968       1 draft.go:319] Skipping snapshot at 1308343, because found one at 1308343
I0114 16:27:34.958684       1 draft.go:323] Creating snapshot at index: 1308394. ReadTs: 1568685.
I0114 16:30:05.627445       1 draft.go:323] Creating snapshot at index: 1308454. ReadTs: 1568727.
I0114 16:31:09.866920       1 stream.go:255] Rolling up Sent 27502 keys
I0114 16:31:09.872337       1 draft.go:836] Rollup on disk done. Rolling up 27502 keys in LRU cache now

This is a very old issue that is fixed in the latest releases. Just upgrade your cluster.
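Since a rolling upgrade would carry the diverged state along, one way to end up with every replica in the same, correct state is to export, stand up an upgraded cluster, and re-import. A minimal sketch, assuming a port-forward to a healthy Alpha and the /admin/export endpoint that Dgraph 1.0 exposes:

import requests

# Assumed port-forward to any healthy Alpha.
ALPHA = "http://localhost:8080"

# 1. Trigger an export; the .rdf.gz and schema files land in the Alphas' export directory.
resp = requests.get(f"{ALPHA}/admin/export")
resp.raise_for_status()
print(resp.text)

# 2. Stand up a fresh cluster on a current Dgraph release and re-import the
#    exported files with `dgraph live` or `dgraph bulk`.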


Thank you, @MichelDiz.