Transient null values during reads in healthy clusters

diggy · January 10, 2020, 8:31pm

Moved from GitHub dgraph/4546

Posted by aphyr:

What version of Dgraph are you using?

1.1.1-65-g2851e2d9a.

Have you tried reproducing the issue with the latest release?

Yes, 1.1.1-65-g2851e2d9a is the latest development release.

What is the hardware spec (RAM, OS)?

A 5-node EC2 m4.large cluster.

Steps to reproduce the issue (command/config used to run Dgraph).

With Jepsen 0b6b5f858ae04053a014c4fea1c21b7b762db2d2, try

lein run test --username admin --nodes-file ~/nodes --concurrency 2n --local-binary dgraph-1.1.1-65-g2851e2d9a --time-limit 600 -w bank --test-count 5

Expected behaviour and actual result.

In bank tests, even without any nemesis activity, Dgraph appears to return null for some (but not all) values in read queries. Nulls appear randomly distributed through time. For instance, take this test run, where in roughly 600 seconds, we observed 52 nulls (45 in reads, 7 in transfers). Because transfers crash when given a null balance for an account, this bug did not manifest as logical state corruption; only transient read errors.

For instance, here, account 1’s key came back null, rather than 1:

{:type :ok, :f :read, :process 14, :time 31349508708, :value {0 21, 1 5, 2 4, 3 33, 4 9, 5 7, 6 7, 7 14}, :index 9220}
{:type :invoke, :f :read, :process 14, :time 31349651962, :index 9221}
{:type :ok, :f :read, :process 10, :time 31350933004, :value {nil 5, 0 21, 2 4, 3 33, 4 9, 5 7, 6 7, 7 14}, :message {:type :unexpected-key, :unexpected (nil), :span-id "SpanId{spanId=bb30948dedadace5}", :trace-id "TraceId{traceId=8511be855f530c9ed1a36e3f707fe7ee}"}, :error :checker-violation, :index 9222}

Here, account 1 had a null balance:

{:type :ok, :f :read, :process 21, :time 72493654542, :value {0 12, 1 nil, 2 42, 3 6, 4 3, 5 1, 6 20, 7 5}, :message {:type :nil-balance, :nils {1 nil}, :span-id "SpanId{spanId=450a9e72f6296204}", :trace-id "TraceId{traceId=c180b44ac232ce5629e9472c5cb21ca3}"}, :error :checker-violation, :index 24694}

diggy · January 10, 2020, 8:48pm

aphyr commented :

This might be the same issue as #4540, where we saw null values with tablet moves.

diggy · January 10, 2020, 8:59pm

aphyr commented :

In #4534, we saw identical behavior, but it was specifically associated with tablet moves. These might have the same underlying cause–or maybe they’re separate issues which manifest in the same way.

diggy · January 12, 2020, 12:46am

manishrjain commented :

This was caused by a new commit: Perform incremental rollups instead of rollups at snapshot (#4410) · dgraph-io/dgraph@42b2a60 · GitHub . I’ve reverted that commit. Bank test is back to passing. @danielmai can verify.

diggy · January 12, 2020, 6:40pm

danielmai commented :

I ran these tests with the release/v1.1 branch (e18986f1c3feb0acfba15423827639f813eadbaf) that will be released as v1.1.2. It does not contain 42b2a60.

Dgraph version   : v1.1.1-56-ge18986f1c
Dgraph SHA-256   : 36eecf05e86803b38a4a77da4a2410b11a7ec8d1c558d661a8024e4d0914d919
Commit SHA-1     : e18986f1c
Commit timestamp : 2020-01-11 17:49:16 -0800
Branch           : release/v1.1
Go version       : go1.13.5

This build passes 10 individual 600-second test runs of the bank test with no nemesis activity.

20200112T084553.000Z.zip

diggy · January 13, 2020, 11:18pm

aphyr commented :

Note to self: not a dup of #4534 or #4540!

diggy · January 14, 2020, 2:22am

manishrjain commented :

#4540 is caused by the same new commit as mentioned above. Reverting that commit fixes both the issues.

#4534 was fixed already by Fix move tablet Jepsen test failure by manishrjain · Pull Request #4496 · dgraph-io/dgraph · GitHub, but not released yet.

diggy · January 31, 2020, 12:51am

aphyr commented :

I am so, SO sorry to report this, but we’re still getting transient null values in 1.1.1-59-g191232226. This one happened with just ring network partitions:

20200130T220054.000Z.zip

diggy · March 23, 2020, 9:33pm

manishrjain commented :

This might be a potential fix: Throw errors returned by retrieveValuesAndFacets (#4966) · dgraph-io/dgraph@c44e7e1 · GitHub

Can we run and test again?

Topic		Replies	Views
Yet more missing values with move-tablet Dgraph dgraph , status:accepted , kind:bug , area:testing:jepsen	1	555	January 16, 2020
Read skew leading to logical data corruption with tablet moves Dgraph dgraph , status:accepted , kind:bug , area:testing:jepsen	8	554	January 29, 2020
Dgraph transactions violated snapshot isolation Dgraph	4	1404	February 6, 2023
Transactions can read stale values Dgraph kind:question , kind:bug	0	470	July 7, 2022
Dgraph Jepsen test results released Dgraph	6	1324	October 14, 2019

Transient null values during reads in healthy clusters

What version of Dgraph are you using?

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

Steps to reproduce the issue (command/config used to run Dgraph).

Expected behaviour and actual result.

Related topics