Status on production blockers?

Hey guys,

It’s been disheartening to see how Dgraph is being maintained and how early adopters’ feedback is being considered. A couple of high-importance issues, reported more than 30 days ago, are not even being considered for the upcoming releases.

https://github.com/dgraph-io/dgraph/issues/2221
https://github.com/dgraph-io/dgraph/issues/2326
https://github.com/dgraph-io/dgraph/issues/2134

Because of these issues, users end up writing a ton of code to synchronize writes and reads! A small example: how multiple writers need to write to Dgraph 1.0.4:

// mutate performs a mutation, retrying on transaction aborts.
func (d *Client) mutate(ctx context.Context, m *api.Mutation) error {
	d.lock.Lock()
	defer d.lock.Unlock()
	var err error
	for retry := 0; retry < maxRetries; retry++ {
		_, err = d.client.NewTxn().Mutate(ctx, m)
		if err == y.ErrAborted {
			// Back off before retrying the aborted transaction.
			<-time.After(time.Second)
			continue
		}
		return err // nil on success, or a non-retryable error
	}
	return err
}

Can anyone from the #user:dgraph team give an update on the status of any of these issues?

My colleague at Amazon tried Dgraph last year and was very happy with the community and its pace of development. We are evaluating it again to see whether it’s ready for production or not. I would prefer third-party testimonials.
Dgraph is avoiding answering the question.

I went through the GitHub issues and it’s alarming: data can be lost, and there are ACID bugs.

I don’t see any benchmarks. There is a very old blog post on how it’s faster than Neo4j, but I couldn’t find any benchmarks against the 1.0 release. I have been through a couple of Discuss posts. It would be great if the Dgraph team could share the biggest data set they have tested with and the corresponding benchmarks.

@Amar255 Those bugs you’re seeing are part of the distributed system testing by Jepsen.
http://jepsen.io/

You can see the full list by Kyle (author of Jepsen) here, including the issues that we have closed out. https://github.com/dgraph-io/dgraph/issues?q=is%3Aissue+author%3Aaphyr+is%3Aclosed

So, the remaining bugs that indicate data loss or other issues will be resolved soon. We decided to use GitHub for the Jepsen testing so that the issues are transparent to our user base. But these should not be a cause for alarm – these are largely extreme edge cases, and solving them makes Dgraph more robust.

@akshaydeo, we’re currently only prioritizing bugs, not improvements or features. Slightly more code at the user end to retry aborted transactions isn’t such a big issue that we need to address it immediately. If there’s any bug that you need me to look into, I can do that.

"With server-side ordering, @upsert schemas, no crashes or network faults, roughly 10 inserts/sec, and no updates or deletes, Dgraph can occasionally (once every five hours or so) lose successfully inserted records."
That doesn’t seem like an extreme edge case to me. I will wait for the third-party (Jepsen) test results, then.

I think it’s linked to one bug in shard moves. Anyway, we will be fixing these bugs over the next couple of weeks, so stay tuned.

@mrjn Leaving the external id part aside, retrying aborted transactions does not solve the issue!

See the following test case:

func TestDgraph_TransactionAbortedIssue(t *testing.T) {
	d, err := setup()
	if err != nil {
		t.Fatal("error while setting up", err)
	}
	defer clearSetup(d)
	var latch sync.WaitGroup
	// 100 goroutines each write 100 batches that touch the same two nodes.
	for i := 0; i < 100; i++ {
		latch.Add(1)
		go func() {
			defer latch.Done()
			for j := 0; j < 100; j++ {
				var q []*Quad
				q = append(q, NewQuad("1", "dog", "name", "jarvis"))
				q = append(q, NewQuad("2", "dog", "name", "polo"))
				q = append(q, NewQuad("2", "dog", "color", "white"))
				q = append(q, NewQuad("1", "dog", "color", "black"))
				if err := d.Add(context.TODO(), q...).Error(); err != nil {
					// t.Fatal must not be called from a goroutine; use t.Error.
					t.Error("error while putting", err)
					return
				}
				<-time.After(1 * time.Millisecond)
			}
		}()
	}
	latch.Wait()
}

Here, d.Add is a mutation that inserts the quads.

This fails every time.

The core issue:

Dgraph fails when more than one transaction tries to update the same node concurrently, which is a basic expectation of a modern database.
Refer: https://github.com/dgraph-io/dgraph/issues/2221

One of the side issues of this is https://github.com/dgraph-io/dgraph/issues/2326. We have to keep waiting until everything syncs back before we can start querying the DB again.

You know what the 100% solution to it is? An RWLock and synchronous writes to Dgraph. Building that is the easy part, but it’s really not what we (the users of Dgraph) want.

And these are definitely not features; these are bugs. Correct me if I’m wrong at any point!

I’ve been using Dgraph in prod for more than 6 months and have started looking at other options. I am ready to work closely with your devs to reproduce all the issues, as we have simple test cases like the one above.

Cheers!

Hey @akshaydeo,

These sorts of reproducible example test cases are great. Can you please file a GitHub issue? I’ll mark it as a bug and resolve it over the next couple of weeks. You can rest assured that all bugs are going to be resolved quickly.

Just merged my fork with v1.0.5. The transaction-aborted issue is now handled with retries (exponential backoff + 30 retries).

I am quite happy with this for now, as I don’t have to serialize my mutations anymore (considering the other advantages of Dgraph) :slight_smile:

Secondly, I have not observed the “Timestamp released” issue anymore. I suspect it occurs in a clustered setup when the leader goes down. I’ll be testing that soon and will update here!

Sorry for being pain in the ass @mrjn :smiley:


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.