Mutation Size - Limits & Best Practice

I’m reposting this from Slack. I am curious about best practices for mutations with Dgraph. I have a large batch process that generates a lot of triples, and I’ve run into an issue when uploading the whole dataset as a single mutation.

It is generally bad practice, regardless of the database, to do such large insertions in a single transaction. Is there a mutation size limit? And is there a recommended mutation size?


We use a batch size of 1000 for dgraph live, which is a good size. The right size depends on how much memory your instance has: the whole batch has to be held in memory, so if you make it too large your instance might go OOM.
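
For example, with the live loader you can tune both the batch size and the concurrency. A hedged example (flag names taken from dgraph live --help on a 1.0.x build, so please verify them on your version):

dgraph live -r data.rdf.gz --zero localhost:5080 --batch 1000 --conc 10

--batch controls how many N-Quads go into each mutation and --conc how many requests are in flight at once; lowering either reduces peak memory pressure on the server.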

Could you please share some logs for the issue you created on GitHub: Server crash & corruption with large mutations · Issue #2238 · dgraph-io/dgraph?

I can absolutely share the logs. What specific files do you need? I was just running the test locally with fabricated data so you can have whatever you’d like.

Was this a crash? Was there a crash log trace? Also, what do you mean by the database getting corrupted?

I was not able to reproduce this, so it seems like your container was OOM killed.

The server instance (running in Docker) terminated. The mutation takes quite a while, and then the server instance just terminates (I assume it crashes). Afterwards, the database is not usable for anything.

I have just finished a loader process that maps IDs (similar to your bulk loader) and partitions my data into roughly 1000 triples per request. The database (server) was able to complete quite a few requests but then crashed (terminated).

Here are the relevant logs from the server process:

2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: xid
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: type
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: msisdn
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: id
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: to
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: from
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: amount
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: net
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: at
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: Transfer
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: To
2018/03/21 21:28:14 draft.go:669: Writing snapshot at index: 243, applied mark: 253
2018/03/21 21:28:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:243 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:28:44 draft.go:669: Writing snapshot at index: 1043, applied mark: 1053
2018/03/21 21:28:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1043 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:29:14 draft.go:669: Writing snapshot at index: 1801, applied mark: 1811
2018/03/21 21:29:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1801 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:29:44 draft.go:669: Writing snapshot at index: 2041, applied mark: 2051
2018/03/21 21:29:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2041 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:30:14 draft.go:669: Writing snapshot at index: 2265, applied mark: 2275
2018/03/21 21:30:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2265 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:30:44 draft.go:669: Writing snapshot at index: 2385, applied mark: 2395
2018/03/21 21:30:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2385 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:31:14 draft.go:669: Writing snapshot at index: 2481, applied mark: 2492
2018/03/21 21:31:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2481 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:31:44 draft.go:669: Writing snapshot at index: 2633, applied mark: 2643
2018/03/21 21:31:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2633 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:32:01 groups.go:316: Asking if I can serve tablet for: dummy

I don’t know what “dummy” is, as I don’t have a predicate named “dummy” in my data. I can certainly provide the p/w/zw directories as well.
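
For context, the partitioning side of my loader looks roughly like the following sketch, using the dgo client. The loadTriples helper and the sample predicates are stand-ins for my generation step, not the real code:

package main

import (
	"context"
	"log"
	"strings"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

const batchSize = 1000 // triples per mutation

// loadTriples stands in for my data-generation step; it returns N-Quad lines.
func loadTriples() []string {
	return []string{
		`_:t1 <type> "Transfer" .`,
		`_:t1 <amount> "42.0" .`,
	}
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	triples := loadTriples()
	ctx := context.Background()
	for start := 0; start < len(triples); start += batchSize {
		end := start + batchSize
		if end > len(triples) {
			end = len(triples)
		}
		// Each chunk is its own transaction, committed immediately.
		_, err := dg.NewTxn().Mutate(ctx, &api.Mutation{
			SetNquads: []byte(strings.Join(triples[start:end], "\n")),
			CommitNow: true,
		})
		if err != nil {
			log.Fatalf("chunk starting at triple %d: %v", start, err)
		}
	}
}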

Meanwhile, the original problem (posting one large mutation) looks like this:

2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: xid
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: type
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: msisdn
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: id
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: to
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: from
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: amount
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: net
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: at
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: Transfer
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: To

In both cases, the server process terminates and leaves the database files in some inconsistent state.

I am interested in getting the directories and seeing how the database becomes unusable. Could you share them with me by email at pawan AT dgraph DOT io?

Also, you can probably try increasing the memory allocated to Docker for Mac, as discussed in What is Docker container exit code 137? · Issue #21083 · moby/moby.
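
To confirm it was the OOM killer rather than a crash inside Dgraph, you can inspect the exited container (substitute your container name). Exit code 137 together with OOMKilled=true means the kernel killed the process:

docker inspect -f '{{.State.OOMKilled}} exit={{.State.ExitCode}}' dgraph-server-1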

I tried increasing the memory and changed the way I run the server via Docker. That seems to have helped, but I still believe there is some kind of OOM problem. I have tested the server and it seems to have recovered from the crash when I restart the container now, so the “corruption” may have been more an artifact of the way I ran it via Docker and tried to restart it.

I changed my bulk loader process to allow me to resume, and I was able to restart the crashed Docker container for the server and finish the import. Given the small batch size of each transaction, that makes me wonder if memory is not being reclaimed fast enough: I hit the server in rapid succession as each chunk is ready.
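
The resume logic is simple. These helpers slot into the loader sketch above (the checkpoint file name is illustrative): the chunk loop starts at readCheckpoint() instead of 0 and calls writeCheckpoint(start + batchSize) after each successful Mutate, and chunks are now sent strictly sequentially instead of as fast as they are generated.

package main

import (
	"io/ioutil"
	"strconv"
	"strings"
)

const checkpointFile = "loader.checkpoint" // illustrative name

// readCheckpoint returns the offset of the first chunk that has not
// been committed yet; 0 means start from the beginning.
func readCheckpoint() int {
	b, err := ioutil.ReadFile(checkpointFile)
	if err != nil {
		return 0
	}
	n, _ := strconv.Atoi(strings.TrimSpace(string(b)))
	return n
}

// writeCheckpoint records the next chunk offset. It is written only
// after a successful commit, so a crash can at worst re-send one chunk.
func writeCheckpoint(nextStart int) {
	_ = ioutil.WriteFile(checkpointFile, []byte(strconv.Itoa(nextStart)), 0644)
}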

Running zero, server, and Ratel as follows seemed to help in my local testing:

Zero:

docker run -it -p 5080:5080 -p 6080:6080 -v ~/workspace/dgraph/data/zero:/dgraph --name dgraph-zero dgraph/dgraph dgraph zero --my=docker.for.mac.localhost:5080

Server:

docker run -it -p 7080:7080 -p 8080:8080 -v ~/workspace/dgraph/data/server-1:/dgraph --name dgraph-server-1 dgraph/dgraph dgraph server --memory_mb 4096 --zero docker.for.mac.localhost:5080

Ratel:

docker run -it -p 8000:8000 --name dgraph-ratel dgraph/dgraph dgraph-ratel

Note: the ‘docker.for.mac.localhost’ host is essential if you are trying to run this locally on your Mac laptop! It might be good to add some notes to the documentation about testing on a Mac laptop.
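
If you want the memory ceiling to be explicit at the container level as well, docker run takes a --memory flag; 6g here is just my guess at a comfortable cap above --memory_mb 4096, not a tested value:

docker run -it -p 7080:7080 -p 8080:8080 --memory=6g -v ~/workspace/dgraph/data/server-1:/dgraph --name dgraph-server-1 dgraph/dgraph dgraph server --memory_mb 4096 --zero docker.for.mac.localhost:5080

With that cap set, an out-of-memory kill shows up clearly as exit code 137 on this specific container.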


Oh, and I increased the amount of memory Docker had on the Mac as well…
