Transactions on Dgraph violated Causal Consistency

We found transactional causal consistency (TCC) anomalies on a Dgraph cluster. According to link 1 (see the following comment), Dgraph provides SNAPSHOT ISOLATION, which is stronger than TCC. See the bottom of this post for TCC details.

In our experiment we set up the cluster on a local machine with 3 server nodes. Here is the configuration information:

Dgraph Version == 21.03.2

server_node = 3
client_num(session_num) = 2
client_stub_num = 1
txn_per_session = 10
operation_per_txn = 10
key_number = 20
key_distribution = uniform

We use a simple table schema containing only key-value pairs. Here is the initialized table:

KEY  VALUE
0    0
1    0
2    0
...  ...
19   0

Note that, for each write on a key, the value (generated by the workload generator) is unique.
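The configuration above can be mirrored by a small generator. This is a hypothetical sketch, not the actual data_generation.py: the session/txn/op counts and the uniform key distribution come from the configuration posted, while the 50/50 read-write mix and the fixed seed are our own assumptions.

```python
import random

def generate_workload(sessions=2, txns_per_session=10, ops_per_txn=10,
                      keys=20, seed=0):
    """Generate per-session transaction workloads.

    Keys are drawn uniformly (key_distribution = uniform) and every write
    gets a globally unique value, so any value read back identifies
    exactly one writing transaction.
    """
    rng = random.Random(seed)
    next_value = 1  # 0 is the initial value of every key
    history = []
    for _ in range(sessions):
        session = []
        for _ in range(txns_per_session):
            txn = []
            for _ in range(ops_per_txn):
                key = rng.randrange(keys)
                if rng.random() < 0.5:            # assumed 50/50 read-write mix
                    txn.append(("r", key, None))  # value observed at run time
                else:
                    txn.append(("w", key, next_value))
                    next_value += 1
            session.append(txn)
        history.append(session)
    return history

history = generate_workload()
writes = [op for sess in history for txn in sess for op in txn if op[0] == "w"]
assert len({v for _, _, v in writes}) == len(writes)  # write values are unique
```

The uniqueness of write values is what makes the write-read (wr) relation below recoverable from a history.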

One anomaly was found involving five transactions from two sessions, where r/w(A,B) denotes a read/write of value B on key A:

session 0              session 1

txn4                   txn11
w(19,3)                r(10,3)

txn5                   txn13
w(10,3)                w(0,22)
                       w(19,25)

txn6
r(0,22)
r(19,3)

txn4 and txn6 have a "write-read" order on key 19, denoted txn4 ->wr txn6; txn13 and txn6 have a write-read order on key 0, i.e., txn13 ->wr txn6.

To satisfy transactional causal consistency, txn13 must be ordered before txn4: we already know txn4 ->wr txn6 and txn13 ->wr txn6, txn13 also wrote key 19 (w(19,25)), and yet txn6 read the value of key 19 written by txn4. So txn13's write to key 19 must be installed before txn4's, giving the commit order txn13 ->co txn4 (see the bottom for the definition of commit order). However, txn4 can also reach txn13 via txn4 ->so txn5 ->wr txn11 ->so txn12 ->so txn13 (txn12 is an intermediate transaction of session 1, not shown above). Hence there is a cycle between txn13 and txn4, which violates TCC.
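The cycle argument can be checked mechanically: treat transactions as nodes, the so/wr/co relations as directed edges, and search for a cycle. The sketch below is our own illustration (not the run_oopsla_graph checker); the edge list simply encodes the relations named above.

```python
from collections import defaultdict

# Transactions as nodes; edges labelled by origin: session order (so),
# write-read (wr), and the commit order (co) forced by txn6's reads.
edges = {
    ("txn4", "txn5"): "so",
    ("txn5", "txn6"): "so",
    ("txn11", "txn12"): "so",
    ("txn12", "txn13"): "so",
    ("txn4", "txn6"): "wr",   # txn6 read w(19,3) written by txn4
    ("txn5", "txn11"): "wr",  # txn11 read w(10,3) written by txn5
    ("txn13", "txn6"): "wr",  # txn6 read w(0,22) written by txn13
    ("txn13", "txn4"): "co",  # txn13's write to key 19 must precede txn4's
}

graph = defaultdict(list)
for src, dst in edges:
    graph[src].append(dst)

def find_cycle(graph):
    """Return one cycle as a node list (first == last), or None.

    Iterative DFS with the classic white/grey/black colouring: a back
    edge to a grey node reveals a cycle, which is rebuilt via parents.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # default WHITE
    parent = {}
    for root in list(graph):
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if colour[nxt] == GREY:  # back edge: cycle found
                    cycle, cur = [nxt], node
                    while cur != nxt:
                        cycle.append(cur)
                        cur = parent[cur]
                    cycle.append(nxt)
                    return cycle[::-1]
                if colour[nxt] == WHITE:
                    parent[nxt] = node
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                colour[node] = BLACK
                stack.pop()
    return None

print(find_cycle(graph))  # a cycle through txn4, txn5, txn11, txn12, txn13
```

Running this finds exactly the cycle described in the text: txn4 ->so txn5 ->wr txn11 ->so txn12 ->so txn13 ->co txn4.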

*** Causal Consistency ***
The causal consistency we checked is defined in the SOSP'11 paper "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS". In that paper the authors define a consistency model called causal+, which is slightly stronger than causal consistency because it adds convergent conflict handling. Another paper, from POPL'17 ("On Verifying Causal Consistency"), also defines this property:
"There is a total order between non-causally dependent operations and each site can execute operations only in that order (when it sees them). Therefore, a site is not allowed to revise its ordering of non-causally dependent operations, and all sites execute in the same order the operations that are visible to them." Also, causal+ consistency is strictly weaker than SNAPSHOT ISOLATION.

*** Ordering Transactions ***
The "wr" (write-read) order relates two transactions that contain, respectively, a write and a read of the same value on the same key; the transaction with the write operation must happen before the transaction containing the read operation.
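Because every written value is unique, wr edges can be recovered mechanically from a history alone. The helper below is our illustrative sketch (not part of run_verification.py), applied to the five transactions of the counterexample, with w(19,25) attributed to txn13 as the commit-order argument requires:

```python
def wr_edges(txns):
    """Derive write-read (wr) edges from a history.

    txns maps a transaction id to a list of (op, key, value) tuples.
    Each read of (key, value) is matched to the unique transaction
    that wrote that value to that key.
    """
    writer_of = {}
    for tid, ops in txns.items():
        for op, key, value in ops:
            if op == "w":
                assert (key, value) not in writer_of  # write values are unique
                writer_of[(key, value)] = tid
    edges = set()
    for tid, ops in txns.items():
        for op, key, value in ops:
            if op == "r" and (key, value) in writer_of:
                writer = writer_of[(key, value)]
                if writer != tid:  # ignore reads of a txn's own writes
                    edges.add((writer, tid))
    return edges

txns = {
    "txn4":  [("w", 19, 3)],
    "txn5":  [("w", 10, 3)],
    "txn6":  [("r", 0, 22), ("r", 19, 3)],
    "txn11": [("r", 10, 3)],
    "txn13": [("w", 0, 22), ("w", 19, 25)],
}
print(sorted(wr_edges(txns)))
# [('txn13', 'txn6'), ('txn4', 'txn6'), ('txn5', 'txn11')]
```

These are exactly the edges txn4 ->wr txn6, txn13 ->wr txn6, and txn5 ->wr txn11 used in the cycle argument above.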


link 1: dgraph.io/docs/design-concepts/consistency-model/

For the full dataset, please visit github.com/20211202na/dgraph_data/blob/main/data.txt.


I don’t understand the issue. Can you share the test code you used, so that we can run it and reproduce this?


Hi Daniel, I have uploaded the code at github.com/20211202na/dgraph_data/blob/main/data_generation.py.


Hi Daniel, below is the well-formatted counterexample (5 txns from 2 sessions) shown in the post:

How do you run your example and reproduce the issue?

  1. We use link_a (see the following comment) to handle the client-side logic (e.g., generating txn workloads).

  2. For each run, “histories” of txns (values read and/or written) for all clients are also collected and printed.

  3. We then invoke the checker (the function run_oopsla_graph) in link_b to verify if the history violates transactional causal consistency.

Note that, since we are doing random testing (workloads are generated probabilistically), it is hard (or even impossible) to reproduce a specific anomaly. However, with the setup posted we observed 5 to 8 violating histories per 100 histories, so anomalies should manifest given a sufficiently large number of runs. Additionally, more concurrency is expected to yield more anomalies, e.g., more clients, fewer keys, more txns, or more ops per txn.

Please let me know if you have any questions! Thanks!

Links in the above reply:

link_a: github.com/20211202na/dgraph_data/blob/main/data_generation.py

link_b: github.com/20211202na/dgraph_data/blob/main/run_verification.py

Hi Daniel, not sure if you can still see all our posts/comments in this thread; many of them got hidden. How are we supposed to provide the links (to the code, the references, etc.) under such a policy? Please help us out! Thanks!

@dmai In fact, after further investigation, this counterexample even violates “read atomicity” (RA), coined by Bailis et al. in “Scalable Atomic Visibility with RAMP Transactions” (SIGMOD’14). RA is even weaker than transactional causal consistency, which is in turn weaker than Snapshot Isolation.

The attached picture visualizes why our example violates RA (the red cycle). See Fig. 2(b) in https://dl.acm.org/doi/10.1145/3360591 for the formal definition; t1, t3, and t2 there map to txn4, txn6, and txn13 in our case, respectively.

PS: txn11 in our example has only one operation, r(10,3). Sorry for the bad formatting in previous comments.

For the resolution of this issue, see “Dgraph transactions violated causal consistency” (dgraph-io/dgraph issue #8146 on GitHub). In short, there was a bug in the test script that created a new transaction for each read, which is not reflected in the diagrams posted here.