Dgraph transactions violated snapshot isolation

We are reporting snapshot isolation anomalies found in Dgraph. In particular, Dgraph transactions have read values that had already been overwritten.


Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph version : v21.12.0

Dgraph codename : zion

Dgraph SHA-256 : 078c75df9fa1057447c8c8afc10ea57cb0a29dfb22f9e61d8c334882b4b4eb37

Commit SHA-1 : d62ed5f15

Commit timestamp : 2021-12-02 21:20:09 +0530

Branch : HEAD

Go version : go1.17.3

jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

Yes.

What is the hardware spec (RAM, OS)?

  • Spec: Aliyun ecs.c6e.large cloud VM

  • OS: Ubuntu 20.04 LTS

  • Environment: Docker dgraph/dgraph:latest

  • RAM: 4G

Steps to reproduce the issue (command/config used to run Dgraph).

We are using docker-compose to run Dgraph [1] (see the list of links at the end for reference).

We set up the database to simulate a key-value store, using uid as the key [2].


val: int .

type KV {
  val
}

The initial values are first inserted into the database [3]. We then spawn a number of threads (sessions) that perform random reads and writes against the database and record the results [4]. The values written to a single uid are unique, so every read can be traced back to the write that produced it. We then use a verifier [5] to check whether the results contain violations of snapshot isolation (SI).
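The shape of this workload can be sketched roughly as follows. This is not the dbcop generator from [4]; the names `Op` and `generate_history` and all parameters are hypothetical, and the sketch only plans the operations without talking to a database. The point it illustrates is the unique-value-per-uid rule, which is what lets a verifier match each read to a single write.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Op:
    kind: str              # "R" for a read, "W" for a write
    uid: int               # the key being accessed
    value: Optional[int]   # value to write; None for reads until executed


def generate_history(n_sessions: int, txns_per_session: int,
                     ops_per_txn: int, n_keys: int,
                     seed: int = 0) -> List[List[List[Op]]]:
    """Plan random reads/writes: history[session][txn] is a list of Ops.

    Assumption: value 0 is reserved for the initial insert, so per-uid
    counters start at 1 and never repeat a value for the same uid.
    """
    rng = random.Random(seed)
    next_value = {uid: 1 for uid in range(n_keys)}
    history = []
    for _ in range(n_sessions):
        session = []
        for _ in range(txns_per_session):
            txn = []
            for _ in range(ops_per_txn):
                uid = rng.randrange(n_keys)
                if rng.random() < 0.5:
                    txn.append(Op("R", uid, None))
                else:
                    txn.append(Op("W", uid, next_value[uid]))
                    next_value[uid] += 1
            session.append(txn)
        history.append(session)
    return history
```

In a real run, each session would execute its transactions in order on its own thread and fill in the values returned by the reads.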

The complete script to run the tests can be found at [6].

Please note that the chance of reproducing this varies with the deployment: on the cloud VM listed in the hardware spec section, the anomaly occurs in almost every run, while it takes about 30 minutes to find an anomaly on a laptop with 16 GB of memory and a 6-core CPU.

Expected behaviour and actual result.

As per the docs [7], Dgraph should support snapshot isolation, and all commits made before a transaction starts should be visible to it. However, we have found violations of SI in our tests. An instance is shown below:

Each transaction in this graph is identified by a pair (session id, transaction id). Within a session, transactions with smaller ids are executed before those with larger ones. We use R(uid, value) and W(uid, value) to denote reads and writes. The start_ts of each transaction is also included in the graph. The edges in this graph represent the ordering of transactions. There are three kinds: session order (SO, because transactions in a session are executed one after another), write-read order (WR, meaning a value is written by one transaction and read by another), and write-write order (WW, meaning two transactions have written to the same uid, so they cannot execute concurrently under SI due to the write conflict).
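To make the role of these edges concrete, here is a minimal sketch of cycle detection over such a graph. It is not how the verifier in [5] works internally, and `find_cycle` and the toy edge list are hypothetical. It checks only a necessary condition: a cycle made of SO, WR, and WW edges cannot occur under SI, while a full SI check involves more than this.

```python
from collections import defaultdict


def find_cycle(edges):
    """Return one cycle as a list of transactions, or None if acyclic.

    `edges` is an iterable of (src, dst, kind) tuples, where src and dst
    are (session id, transaction id) pairs and kind is "SO", "WR" or "WW".
    """
    graph = defaultdict(list)
    for src, dst, _kind in edges:
        graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # Back edge: the cycle is the path from nxt to node, closed.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Toy example with made-up transaction ids (not the history from this report).
example_edges = [
    ((1, 1), (2, 1), "WR"),   # (2, 1) reads a value written by (1, 1)
    ((2, 1), (2, 2), "SO"),   # same session, executed in order
    ((2, 2), (1, 1), "WW"),   # (2, 2) is ordered before (1, 1) on a shared uid
]
print(find_cycle(example_edges))
```

Because the direction of a WW edge is not known in advance, a checker has to consider both orders for each pair of conflicting writers; a violation is reported only when every choice leads to a cycle.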

In this graph, transaction (9, 249) reads uid=457, value=2, which is written by (4, 167) and (10, 471). Regardless of which one commits first, this constitutes a violation of SI (shown in the graph as the two cycles). Note that it’s not possible for (9, 429) to start before (10, 471) commits because there is a path (10, 471) -> (10, 471) -> (1, 43) -> (9, 429). Judging by their timestamps, it appears that (10, 471) should have overwritten the value of uid=457, but the stale value is read by (9, 249).

The database logs and dump are attached in [8].


[1]: https://github.com/amnore/dbcop/blob/master/docker/dgraph/docker-compose.yml

[2]: https://github.com/amnore/dbcop/blob/d7f5e745ec0d24d259abaec7fbd1465ba588573b/examples/dgraph.rs#L105

[3]: https://github.com/amnore/dbcop/blob/a3bf2ea810de088d6057eec8d9f4b083d4085f57/examples/dgraph.rs#L122

[4]: https://github.com/amnore/dbcop/blob/d7f5e745ec0d24d259abaec7fbd1465ba588573b/examples/dgraph.rs#L56

[5]: https://github.com/amnore/CobraVerifier

[6]: https://github.com/amnore/dbcop/blob/master/script/test-dgraph.sh

[7]: https://dgraph.io/docs/design-concepts/consistency-model/#sidebar

[8]: https://1drv.ms/u/s!Ao9rNU5eah0xqlL8BM5yq_LILUs6

Out of curiosity, have you run the same test suite on the previous version (21.03.2)? I understand this is the version being used in Dgraph’s cloud offering since 21.12 is known to have many unresolved bugs.

Hi David,

We’ve also found anomalous histories with 21.03.2; see the issue reported here.

Do I get this right: Dgraph transactions can/will break SI when running in a self-hosted cloud environment, and such a database will almost certainly lose data integrity during write races?

How does Dgraph Cloud avoid this issue? Does Dgraph Cloud avoid this issue?

For the resolution of this issue, see "Dgraph transactions violated causal consistency" (Issue #8146 in dgraph-io/dgraph on GitHub). In short, there was a bug in the test script: it created a new transaction for each read, which is not reflected in the diagrams posted here.