“Transaction has been aborted” occurred because of draft ts

Mrliu8023 · February 28, 2024, 11:18am

Report a Dgraph Client Bug

Transactions are aborted after Dgraph has been running for a period of time.

Some other info:
I added some logging to the source code and found that the reason is:

What are the possible reasons for this problem? Is there any way to solve this problem?

What Dgraph client (and version) are you using?

(put “x” in the box to select)

Dgo
PyDgraph
Dgraph4J
Dgraph-js
Dgraph-js-http
Dgraph.NET

Version:
V230

What version of Dgraph are you using?

v23.1.0

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

Steps to reproduce the issue (command/config used to run Dgraph).

Expected behaviour and actual result.

Damon · February 28, 2024, 7:12pm

Typically this message from a client means that concurrent modifications were made, so one of them was aborted. There are a couple of ways to address this:

First, be aware that an abort error is retry-able and the intent is for the client to re-submit or not based on business requirements (e.g. retry if it is ok to write the data even though involved data changed on the server during the update).

Second, if this is happening a lot, review your code to see if the same values on the same entities (values or edges) are being updated by multiple concurrent threads. The aborts will be more frequent if the server is overloaded or slow, or transactions are very large, because those situations extend the time when transactions overlap and one may be rejected. So also monitor the server for CPU or IO overload and transaction durations.

Mrliu8023 · February 29, 2024, 3:41am

There should be no conflict with the increment command. This situation seems to occur after doing a lot of “Alter” schema operations.

Damon · February 29, 2024, 1:29pm

I am not deeply familiar with the processes that occur after an alter command, but particularly if you changed indexes there will be a period of updates to indexes. If your incoming transactional requests conflict with those updates, that might (unsure) cause a transaction conflict where the same index structure is updated both by the reindex process and by your transaction (e.g. increment), resulting in a conflict and an abort.

I would start by correlating the abort timestamps with Dgraph logs which may show relevant updates that take place in the background after an alter command. Is there substantial overlap with some process? Do the logs tell you which activity and predicate or index is involved?

Mrliu8023 · March 1, 2024, 2:14am

When the Oracle performs conflict detection, it never reaches the per-key (predicate) check: the conflict is reported directly by the timestamp check.

github.com

dgraph-io/dgraph/blob/3754e876e0a5c9889a7410b4d8d3a0751c10432d/dgraph/cmd/zero/oracle.go#L81


      
          func (o *Oracle) updateStartTxnTs(ts uint64) {
          	o.Lock()
          	defer o.Unlock()
          	o.startTxnTs = ts
          	o.keyCommit.Reset()
          }
          
          // TODO: This should be done during proposal application for Txn status.
          func (o *Oracle) hasConflict(src *api.TxnContext) bool {
          	// This transaction was started before I became leader.
          	if src.StartTs < o.startTxnTs {
          		return true
          	}
          	for _, k := range src.Keys {
          		ki, err := strconv.ParseUint(k, 36, 64)
          		if err != nil {
          			glog.Errorf("Got error while parsing conflict key %q: %v\n", k, err)
          			continue
          		}
          		if last := o.keyCommit.Get(ki); last > src.StartTs {
          			return true

Mrliu8023 · March 1, 2024, 2:51am

It’s strange. In a stand-alone deployment, the snapshot’s CheckpointTs > zero.server.next[Num_TXN_TS]

Damon · March 6, 2024, 4:38pm

I have not seen this before, but the code comment suggests that the zero that is checking for conflicting transactions is unable to ensure consistency because it became the leader during the transaction.
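To make that concrete, here is a simplified, self-contained model of the check in the snippet quoted earlier in the thread. It is only a sketch: the `oracle` type uses a plain map instead of the real key-commit structure, keys are passed as integers rather than base-36 strings, and the final `return false` is an assumption, because the quoted snippet is cut off before the end of the function.

```go
package main

import "fmt"

// oracle is a simplified, hypothetical model of the Zero oracle.
type oracle struct {
	startTxnTs uint64
	keyCommit  map[uint64]uint64 // conflict key -> last commit ts
}

// updateStartTxnTs mirrors the quoted function: it records the new
// start timestamp and forgets all previously tracked key commits.
func (o *oracle) updateStartTxnTs(ts uint64) {
	o.startTxnTs = ts
	o.keyCommit = map[uint64]uint64{}
}

// hasConflict mirrors the order of checks in the quoted function.
func (o *oracle) hasConflict(startTs uint64, keys []uint64) bool {
	// Timestamp check: the transaction started before startTxnTs.
	if startTs < o.startTxnTs {
		return true
	}
	// Per-key check: some key was committed after this txn started.
	for _, k := range keys {
		if last, ok := o.keyCommit[k]; ok && last > startTs {
			return true
		}
	}
	return false // assumption: not visible in the truncated snippet
}

func main() {
	o := &oracle{keyCommit: map[uint64]uint64{}}
	o.updateStartTxnTs(150)

	// A transaction with no conflicting keys still aborts if it
	// started before startTxnTs.
	fmt.Println(o.hasConflict(100, nil)) // prints "true"
	// A transaction that started later passes the timestamp check.
	fmt.Println(o.hasConflict(200, nil)) // prints "false"
}
```

In this model a transaction can be aborted without any overlapping writes, which matches the observation that the conflict is reported before any keys are examined.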

If the issue is due to the zero leader changing, you should see logs about the “election” “candidates” and “became leader” for all zeros. If the leadership is changing, other than very rarely, that indicates a serious problem where nodes cannot talk to each other, or are too overloaded to process the heartbeats they use to monitor one another.
