Yet more missing values with move-tablet

Moved from GitHub dgraph/4575

Posted by aphyr:

What version of Dgraph are you using?

1.1.1-56-ge18986f1c

Have you tried reproducing the issue with the latest release?

1.1.1-56-ge189861fc is yesterday’s dev build; I haven’t started with today’s yet.

What is the hardware spec (RAM, OS)?

5-node LXC Jepsen cluster, on a 48-way Xeon, 128GB of ram.

Steps to reproduce the issue (command/config used to run Dgraph).

This appears to be an infrequent error. Spurious nulls were extremely common in 1.1.1 and subsequent development builds, but reproducing it with ge189 appears more difficult. With Jepsen b7c7bdc5f8476e009d591927060e7e8e786f015a, try:

lein run test --workload bank --nemesis move-tablet --time-limit 600 --test-count 20

Expected behaviour and actual result.

In this test run, Dgraph returned a single, transient nil value in response to a read of all accounts during what looks like a tablet move. I’m wondering if there could be a rare race condition here? Something that slips through before the move-ts checks kick in and prevent reads from executing on a group that doesn’t host a tablet?

; The value of account 0 is ~$22; the total of all accounts is $100
{:type :ok, :f :read, :process 109, :time 323160578607, :value {0 22, 1 2, 2 14, 3 3, 4 11, 5 10, 6 32, 7 6}, :index 31059}
{:type :invoke, :f :read, :process 109, :time 323160720211, :index 31060}
{:type :fail, :f :transfer, :value {:from 4, :to 0, :amount 1}, :process 105, :time 323229577276, :error :conflict, :index 31061}
{:type :invoke, :f :transfer, :value {:from 7, :to 3, :amount 1}, :process 105, :time 323229862994, :index 31062}
{:type :fail, :f :transfer, :value {:from 7, :to 6, :amount 5}, :process 103, :time 323245905458, :error :conflict, :index 31063}
{:type :ok, :f :transfer, :value {:from 5, :to 1, :amount 2}, :process 106, :time 323246165119, :index 31064}
{:type :invoke, :f :transfer, :value {:from 7, :to 1, :amount 1}, :process 103, :time 323246360212, :index 31065}
{:type :invoke, :f :transfer, :value {:from 7, :to 4, :amount 3}, :process 106, :time 323246525060, :index 31066}
{:type :fail, :f :transfer, :value {:from 0, :to 3, :amount 3}, :process 104, :time 323364199257, :error :conflict, :index 31067}
{:type :invoke, :f :read, :process 104, :time 323364634847, :index 31068}
; We fail to observe a value for account 0; the observed total is only 78
{:type :ok, :f :read, :process 109, :time 323371946833, :value {0 nil, 1 2, 2 14, 3 3, 4 11, 5 10, 6 32, 7 6}, :message {:type :nil-balance, :nils {0 nil}, :span-id "SpanId{spanId=1c8e896d950daad6}", :trace-id "TraceId{traceId=cbd2f037e1e266644d3bb3964b5b43ee}"}, :error :checker-violation, :index 31069}
{:type :invoke, :f :read, :process 109, :time 323372719597, :index 31070}
{:type :fail, :f :read, :process 107, :time 323459240076, :error :tablet-not-served-by-instance, :index 31071}
{:type :invoke, :f :transfer, :value {:from 3, :to 5, :amount 1}, :process 107, :time 323459733773, :index 31072}
; Reads fail because the tablet isn't served by the instance we're talking to
{:type :fail, :f :read, :process 100, :time 323463996489, :error :tablet-not-served-by-instance, :index 31073}
{:type :invoke, :f :transfer, :value {:from 6, :to 0, :amount 3}, :process 100, :time 323464587264, :index 31074}
{:type :ok, :f :transfer, :value {:from 7, :to 3, :amount 1}, :process 105, :time 323521941137, :index 31075}
{:type :invoke, :f :read, :process 105, :time 323522250533, :index 31076}
{:type :fail, :f :transfer, :value {:from 7, :to 4, :amount 1}, :process 102, :time 323542895467, :error :conflict, :index 31077}
{:type :invoke, :f :read, :process 102, :time 323543269600, :index 31078}
{:type :fail, :f :transfer, :value {:from 3, :to 5, :amount 1}, :process 107, :time 323569064832, :error :conflict, :index 31079}
{:type :invoke, :f :transfer, :value {:from 5, :to 0, :amount 5}, :process 107, :time 323569537170, :index 31080}
; Dgraph prevents a transaction from executing on a tablet that's no longer held
{:type :info, :f :read, :process 108, :time 323641205824, :exception {:via [{:type io.grpc.StatusRuntimeException, :message "UNKNOWN: : StartTs: 17843 is from before MoveTs: 17853 for pred: \"amount_2\"", ...

aphyr commented :

I haven’t been able to see this on 1.1.1-57-g4abba793a–maybe that helps us narrow down a potential cause?