I’m reposting from slack … I am curious about best practices for mutations with dgraph. I have a large batch process that generates a lot of triples. I’ve run into an issue with uploading the whole dataset as a single mutation.
It is generally a bad practice, regardless of database, to have such large insertions in a single transaction. Is there a mutation size limit? Also, is there a recommended mutation size?
We use a batch size of 1000 for dgraph live, which is a good size. The practical limit depends on the memory your instance has: the whole batch has to be loaded into memory, so if you try too large a batch size your instance might go OOM.
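The batching advice above can be sketched as a simple chunking helper (a minimal sketch of my own; the 1000-triple batch size matches the advice here, but the N-Quad strings and function name are illustrative, not a Dgraph client API):

```python
def chunk_nquads(nquads, batch_size=1000):
    """Yield successive batches of at most batch_size N-Quad lines,
    so each mutation stays small enough to fit in server memory."""
    batch = []
    for nq in nquads:
        batch.append(nq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Example: 2500 fabricated triples split into batches of 1000, 1000, 500.
triples = ['_:n%d <amount> "1" .' % i for i in range(2500)]
sizes = [len(b) for b in chunk_nquads(triples)]
```

Each yielded batch would then be sent as one mutation in its own transaction, rather than posting all 2500 triples at once.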
I can absolutely share the logs. What specific files do you need? I was just running the test locally with fabricated data, so you can have whatever you’d like.
The server instance (running in Docker) terminated. The mutation takes quite a while, and then the server instance just terminates (I assume it crashes). Afterwards, the database is not usable for anything.
I have just finished a loader process that maps ids (similar to your bulk loader) and partitions my data into roughly 1000 triples per request. The database (server) was able to complete quite a few requests but then crashed (terminated):
Here are the relevant logs from the server process:
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: xid
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: type
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: msisdn
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: id
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: to
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: from
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: amount
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: net
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: at
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: Transfer
2018/03/21 21:28:05 groups.go:316: Asking if I can serve tablet for: To
2018/03/21 21:28:14 draft.go:669: Writing snapshot at index: 243, applied mark: 253
2018/03/21 21:28:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:243 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:28:44 draft.go:669: Writing snapshot at index: 1043, applied mark: 1053
2018/03/21 21:28:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1043 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:29:14 draft.go:669: Writing snapshot at index: 1801, applied mark: 1811
2018/03/21 21:29:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1801 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:29:44 draft.go:669: Writing snapshot at index: 2041, applied mark: 2051
2018/03/21 21:29:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2041 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:30:14 draft.go:669: Writing snapshot at index: 2265, applied mark: 2275
2018/03/21 21:30:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2265 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:30:44 draft.go:669: Writing snapshot at index: 2385, applied mark: 2395
2018/03/21 21:30:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2385 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:31:14 draft.go:669: Writing snapshot at index: 2481, applied mark: 2492
2018/03/21 21:31:14 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2481 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:31:44 draft.go:669: Writing snapshot at index: 2633, applied mark: 2643
2018/03/21 21:31:44 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:2633 Term:2 XXX_unrecognized:[]}, len(data): 27
2018/03/21 21:32:01 groups.go:316: Asking if I can serve tablet for: dummy
I don’t know what “dummy” is, as I don’t have a predicate named “dummy” in my data. I can certainly provide the p/w/zw data as well.
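The id-mapping loader I described above might be sketched roughly like this (my own hypothetical sketch, not Dgraph’s bulk-loader code; the `xid` and `amount` predicate names mirror the ones in the logs, and the blank-node labeling scheme is an assumption):

```python
def to_nquads(records):
    """Map each record's external id to a Dgraph blank-node label and
    emit N-Quad set mutations; the server assigns the real uids on
    commit. Reusing the same blank-node label within one mutation
    makes all triples for a record attach to the same node."""
    seen = {}  # external id -> blank-node label
    nquads = []
    for rec in records:
        if rec["id"] not in seen:
            seen[rec["id"]] = "_:x%d" % len(seen)
            # Store the external id under the xid predicate for later lookup.
            nquads.append('%s <xid> "%s" .' % (seen[rec["id"]], rec["id"]))
        node = seen[rec["id"]]
        nquads.append('%s <amount> "%s" .' % (node, rec["amount"]))
    return nquads

quads = to_nquads([{"id": "t1", "amount": 10}, {"id": "t1", "amount": 20}])
```

In the real loader the resulting N-Quads are then partitioned into roughly 1000-triple requests, one transaction each.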
Meanwhile, the original problem (a large mutation post) looks like this:
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: xid
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: type
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: msisdn
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: id
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: to
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: from
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: amount
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: net
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: at
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: Transfer
2018/03/21 21:43:07 groups.go:316: Asking if I can serve tablet for: To
In both cases, the server process terminates and leaves the database files in some inconsistent state.
I am interested in getting the directories and seeing how the database becomes unusable. Could you share them with me at my email, pawan AT dgraph DOT io?
I tried increasing the memory and changed the way I run the server via Docker. That seems to have helped, but I still believe there is some kind of OOM problem. I have tested the server, and it seems to have recovered from the crash when I restart the container now, so the “corruption” may have been more an issue of the way I ran it via Docker and tried to restart.
I changed my bulk loader process to allow me to resume, and I was able to restart the crashed Docker container for the server and finish the import. Given the small batch size for each transaction, that makes me wonder whether memory is not being reclaimed fast enough; I hit the server in rapid succession as each chunk is ready.
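The resume behavior I added can be approximated with a checkpoint file that records the index of the last committed batch (a rough sketch under my own assumptions; the real loader’s file layout and `send` function are not shown here):

```python
import os

CHECKPOINT = "loader.checkpoint"

def load_checkpoint():
    """Return the index of the next batch to send (0 if starting fresh)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip())
    return 0

def save_checkpoint(next_batch):
    """Record progress only after the server commits a batch, so a
    crashed server (or loader) can be restarted without re-sending
    already-committed data."""
    with open(CHECKPOINT, "w") as f:
        f.write(str(next_batch))

def run(batches, send):
    """Send each remaining batch (e.g. one ~1000-triple mutation plus
    commit), checkpointing after every successful send."""
    start = load_checkpoint()
    for i in range(start, len(batches)):
        send(batches[i])
        save_checkpoint(i + 1)
```

If the container dies mid-import, restarting the loader picks up from the first uncommitted batch instead of replaying everything.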
Running the zero, server, and ratel as follows seemed to help my local testing:
Zero:
docker run -it -p 5080:5080 -p 6080:6080 -v ~/workspace/dgraph/data/zero:/dgraph --name dgraph-zero dgraph/dgraph dgraph zero --my=docker.for.mac.localhost:5080
Server:
docker run -it -p 7080:7080 -p 8080:8080 -v ~/workspace/dgraph/data/server-1:/dgraph --name dgraph-server-1 dgraph/dgraph dgraph server --memory_mb 4096 --zero docker.for.mac.localhost:5080
Ratel:
docker run -it -p 8000:8000 --name dgraph-ratel dgraph/dgraph dgraph-ratel
Note: the ‘docker.for.mac.localhost’ host is essential if you are trying to run it locally on your Mac laptop! It might be good to add some notes to the documentation about testing on a Mac laptop.