Transient null values during reads in healthy clusters

Moved from GitHub dgraph/4546

Posted by aphyr:

What version of Dgraph are you using?


Have you tried reproducing the issue with the latest release?

Yes, 1.1.1-65-g2851e2d9a is the latest development release.

What is the hardware spec (RAM, OS)?

A 5-node EC2 m4.large cluster.

Steps to reproduce the issue (command/config used to run Dgraph).

With Jepsen 0b6b5f858ae04053a014c4fea1c21b7b762db2d2, try

lein run test --username admin --nodes-file ~/nodes --concurrency 2n --local-binary dgraph-1.1.1-65-g2851e2d9a --time-limit 600 -w bank --test-count 5

Expected behaviour and actual result.

In bank tests, even without any nemesis activity, Dgraph appears to return null for some (but not all) values in read queries. Nulls appear randomly distributed through time. For instance, take this test run, where in roughly 600 seconds, we observed 52 nulls (45 in reads, 7 in transfers). Because transfers crash when given a null balance for an account, this bug did not manifest as logical state corruption; only transient read errors.

For instance, here, account 1’s key came back null, rather than 1:

{:type :ok, :f :read, :process 14, :time 31349508708, :value {0 21, 1 5, 2 4, 3 33, 4 9, 5 7, 6 7, 7 14}, :index 9220}
{:type :invoke, :f :read, :process 14, :time 31349651962, :index 9221}
{:type :ok, :f :read, :process 10, :time 31350933004, :value {nil 5, 0 21, 2 4, 3 33, 4 9, 5 7, 6 7, 7 14}, :message {:type :unexpected-key, :unexpected (nil), :span-id "SpanId{spanId=bb30948dedadace5}", :trace-id "TraceId{traceId=8511be855f530c9ed1a36e3f707fe7ee}"}, :error :checker-violation, :index 9222}

Here, account 1 had a null balance:

{:type :ok, :f :read, :process 21, :time 72493654542, :value {0 12, 1 nil, 2 42, 3 6, 4 3, 5 1, 6 20, 7 5}, :message {:type :nil-balance, :nils {1 nil}, :span-id "SpanId{spanId=450a9e72f6296204}", :trace-id "TraceId{traceId=c180b44ac232ce5629e9472c5cb21ca3}"}, :error :checker-violation, :index 24694}

aphyr commented :

This might be the same issue as #4540, where we saw null values with tablet moves.

aphyr commented :

In #4534, we saw identical behavior, but it was specifically associated with tablet moves. These might have the same underlying cause–or maybe they’re separate issues which manifest in the same way.

manishrjain commented :

This was caused by a new commit: Perform incremental rollups instead of rollups at snapshot (#4410) · dgraph-io/dgraph@42b2a60 · GitHub . I’ve reverted that commit. Bank test is back to passing. @danielmai can verify.

danielmai commented :

I ran these tests with the release/v1.1 branch (e18986f1c3feb0acfba15423827639f813eadbaf) that will be released as v1.1.2. It does not contain 42b2a60.

Dgraph version   : v1.1.1-56-ge18986f1c
Dgraph SHA-256   : 36eecf05e86803b38a4a77da4a2410b11a7ec8d1c558d661a8024e4d0914d919
Commit SHA-1     : e18986f1c
Commit timestamp : 2020-01-11 17:49:16 -0800
Branch           : release/v1.1
Go version       : go1.13.5

This build passes 10 individual 600-second test runs of the bank test with no nemesis activity.

aphyr commented :

Note to self: not a dup of #4534 or #4540!

manishrjain commented :

#4540 is caused by the same new commit as mentioned above. Reverting that commit fixes both the issues.

#4534 was fixed already by Fix move tablet Jepsen test failure by manishrjain · Pull Request #4496 · dgraph-io/dgraph · GitHub, but not released yet.

aphyr commented :

I am so, SO sorry to report this, but we’re still getting transient null values in 1.1.1-59-g191232226. This one happened with just ring network partitions:

manishrjain commented :

This might be a potential fix: Throw errors returned by retrieveValuesAndFacets (#4966) · dgraph-io/dgraph@c44e7e1 · GitHub

Can we run and test again?