Diverged replicas

We are running Dgraph 1.0.11 on Kubernetes, and our configuration is the same as https://github.com/dgraph-io/dgraph/blob/master/contrib/config/kubernetes/helm/values.yaml (without shardReplicaCount: 5). Some time ago we noticed unusual system behavior, and it turned out that the same query returns different results on different nodes, which means our replicas have diverged. How is this possible? What is the best way to fix it now and get all the nodes back into the same, correct state? And how can we prevent similar situations in the future?

Hi,

Could you share the query that returns different results, along with an example dataset?

Thanks

Hi, here is that info.

Example query:

{
  result (func: has(childOf)) {
    count(uid)
  }
}

On one Alpha we get count: 1344, and on the other two count: 719.
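For reference, this is roughly how we compared the replicas (a minimal sketch, not our exact tooling): we port-forward each Alpha pod separately and send the same query to its HTTP endpoint, since going through the Kubernetes service would load-balance the requests. The pod addresses, local ports, and the raw-body /query behavior of Dgraph 1.0.x are assumptions here.

import requests

QUERY = """
{
  result (func: has(childOf)) {
    count(uid)
  }
}
"""

# Assumed local port-forwards, e.g. `kubectl port-forward pod/dgraph-alpha-0 8080:8080`.
ALPHAS = {
    "alpha-0": "http://localhost:8080",
    "alpha-1": "http://localhost:8081",
    "alpha-2": "http://localhost:8082",
}

for name, base_url in ALPHAS.items():
    # Dgraph 1.0.x accepts the query as the raw POST body on /query.
    resp = requests.post(f"{base_url}/query", data=QUERY)
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])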

Also, while querying for:

{
  result (func: has(childOf)) {
    uid
    created
    childOf {
      uid
      created
    }
  }
}

we get strange entries that contain only a uid and none of the requested predicates, like this:

{
  "data": {
    "result": [
      {
        "uid": "0x31409",
        "created": "2020-01-06T15:38:16.276Z",
        "childOf": [
          {
            "uid": "0x31408",
            "created": "2019-02-19T12:38:58.874Z"
          }
        ]
      },
      {
        "uid": "0x3140a",
        "created": "2020-01-06T15:41:31.918Z",
        "childOf": [
          {
            "uid": "0x31408",
            "created": "2019-02-19T12:38:58.874Z"
          }
        ]
      },
      {
        "uid": "0x31430"
      },
      {
        "uid": "0x31458"
      },
      {
        "uid": "0x31459"
      },
      {
        "uid": "0x35b85",
        "created": "2020-01-07T19:33:02.038Z",
        "childOf": [
          {
            "uid": "0x314ff",
            "created": "2020-01-06T22:01:19.028Z"
          }
        ]
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 14974,
      "processing_ns": 23365794,
      "encoding_ns": 31544993
    },
    "txn": {
      "start_ts": 1570337
    }
  }
}
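To narrow down whether those predicates are missing from the data itself or only from the has(childOf) index, a sketch like the following (same assumptions as above about addresses and the /query endpoint) looks up one of the uid-only nodes directly on every replica:

import requests

# Look up one of the "empty" nodes by uid instead of going through has(childOf).
CHECK = """
{
  check(func: uid(0x31430)) {
    uid
    created
    childOf {
      uid
    }
  }
}
"""

ALPHAS = ["http://localhost:8080", "http://localhost:8081", "http://localhost:8082"]

for base_url in ALPHAS:
    resp = requests.post(f"{base_url}/query", data=CHECK)
    resp.raise_for_status()
    print(base_url, resp.json()["data"]["check"])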

Also, we think it may be related to an Alpha node crash that we observed around the same time. It looks like the Alphas had a lot of trouble electing a leader; messages exactly like these repeated in the broken Alpha's logs for around 10 minutes:

I0114 16:12:13.585220       1 node.go:83] 1 is starting a new election at term 55
I0114 16:12:13.585249       1 node.go:83] 1 became pre-candidate at term 55
I0114 16:12:13.585256       1 node.go:83] 1 received MsgPreVoteResp from 1 at term 55
I0114 16:12:13.585398       1 node.go:83] 1 [logterm: 55, index: 1308004] sent MsgPreVote request to 2 at term 55
I0114 16:12:13.585523       1 node.go:83] 1 [logterm: 55, index: 1308004] sent MsgPreVote request to 3 at term 55

After that it became a follower and received a snapshot:

I0114 16:25:49.966787       1 predicate.go:86] Snapshot writes DONE. Sending ACK
I0114 16:25:49.966963       1 predicate.go:91] Populated snapshot with 400447 keys.
I0114 16:25:49.968977       1 draft.go:702] ---> Retrieve snapshot: OK.

Then it seemingly went back to normal, but it now holds the broken data:

I0114 16:25:50.055968       1 draft.go:319] Skipping snapshot at 1308343, because found one at 1308343
I0114 16:27:34.958684       1 draft.go:323] Creating snapshot at index: 1308394. ReadTs: 1568685.
I0114 16:30:05.627445       1 draft.go:323] Creating snapshot at index: 1308454. ReadTs: 1568727.
I0114 16:31:09.866920       1 stream.go:255] Rolling up Sent 27502 keys
I0114 16:31:09.872337       1 draft.go:836] Rollup on disk done. Rolling up 27502 keys in LRU cache now

This is a very old issue that is fixed in the latest releases. Just upgrade your cluster.
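Since a rolling upgrade would carry the diverged state along, one way to end up with every replica in the same, correct state is to export, stand up an upgraded cluster, and re-import. A minimal sketch, assuming a port-forward to a healthy Alpha and the /admin/export endpoint that Dgraph 1.0 exposes:

import requests

# Assumed port-forward to any healthy Alpha.
ALPHA = "http://localhost:8080"

# 1. Trigger an export; the .rdf.gz and schema files land in the Alphas' export directory.
resp = requests.get(f"{ALPHA}/admin/export")
resp.raise_for_status()
print(resp.text)

# 2. Stand up a fresh cluster on a current Dgraph release and re-import the
#    exported files with `dgraph live` or `dgraph bulk`.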


Thank you, @MichelDiz.