Bulk loader crashes during reduce phase

Report a Dgraph Bug

The Dgraph bulk loader crashes during the reduce phase with newer versions of dgraph (20.03.4 and 20.07.0), but works with the previous version, 20.03.3. The reported error is essentially Request size offset X is bigger than maximum offset Y. This seems related to badger, since 20.03.3, which works, uses a different badger version (from go.mod):

  • 20.03.3 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200528205344-e7b6e76f96e8
  • 20.03.4 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200718033852-37ee16d8ad1c
  • 20.07.0 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200718033852-37ee16d8ad1c

All of these appear to be older, pre-release versions of badger; the latest badger release is v2.0.3.

What version of Dgraph are you using?

  • 20.03.3 - works successfully
  • 20.03.4 - fails
  • 20.07.0 - fails

Have you tried reproducing the issue with the latest release?

Yes, and it fails there as well.

What is the hardware spec (RAM, OS)?

64 cores, 256 GB memory, running CentOS Linux release 7.8.2003 (kernel: 3.10.0-1127.13.1.el7.x86_64)

Steps to reproduce the issue (command/config used to run Dgraph).

Run in a Docker swarm stack (with a dgraph-zero server), configured so that the bulk loader reports the following configuration when it starts. (/fusion-dir is a volume mount.)

{
	"DataFiles": "/fusion-dir/rdf",
	"DataFormat": "",
	"SchemaFile": "/fusion-dir/rdf/output.schema",
	"GqlSchemaFile": "",
	"OutDir": "/fusion-dir/alphas",
	"ReplaceOutDir": true,
	"TmpDir": "/fusion-dir/tmp",
	"NumGoroutines": 24,
	"MapBufSize": 134217728,
	"SkipMapPhase": false,
	"CleanupTmp": true,
	"NumReducers": 1,
	"Version": false,
	"StoreXids": false,
	"ZeroAddr": "dgraph-zero:5080",
	"HttpAddr": "localhost:8080",
	"IgnoreErrors": false,
	"CustomTokenizers": "",
	"NewUids": false,
	"Encrypted": false,
	"MapShards": 3,
	"ReduceShards": 3,
	"BadgerKeyFile": "",
	"BadgerCompressionLevel": 1
}

Expected behaviour and actual result.

Using 20.03.3 the bulk loader succeeds in about 2.5 hours. Using the newer versions it fails early in the reduce phase with the following error:

2020/07/29 22:28:41 Request size offset 18905107590 is bigger than maximum offset 4294967295
github.com/dgraph-io/badger/v2.(*valueLog).validateWrites
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/value.go:1381
github.com/dgraph-io/badger/v2.(*valueLog).write
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/value.go:1413
github.com/dgraph-io/badger/v2.(*StreamWriter).Write
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/stream_writer.go:143
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).startWriting
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:336
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
github.com/dgraph-io/dgraph/x.Check
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).startWriting
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:336
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373

Hey @jgoodall! We recently added a validation check in badger to ensure we don’t create huge requests.
https://github.com/dgraph-io/badger/commit/d981f47d93d3029db4aebd6777863f231d8f719c

You have a very large request size of 18905107590 bytes (about 18 GB), which badger wouldn’t be able to handle.
Would you be able to share the data that you are using in the bulk loader? I’d like to run it on my end to understand why we’re creating such big request batches.

Also, even if v20.03.3 appears to work, such a big batch of data would cause the uint32 offset to overflow, and you wouldn’t be able to read the data back.
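
To illustrate the kind of check involved, here is a simplified Go sketch (not badger’s actual code; validateWriteOffset is a made-up helper) of a write-batch validation that rejects anything pushing the offset past math.MaxUint32:

package main

import (
	"fmt"
	"math"
)

// validateWriteOffset sketches the idea behind the new validation check:
// value log offsets fit in a uint32, so a single write batch must not push
// the end offset past math.MaxUint32 (4294967295).
func validateWriteOffset(currentOffset, batchSize uint64) error {
	if end := currentOffset + batchSize; end > math.MaxUint32 {
		return fmt.Errorf("Request size offset %d is bigger than maximum offset %d",
			end, uint64(math.MaxUint32))
	}
	return nil
}

func main() {
	// An ~18 GB batch, like the one in the crash above, fails even from offset 0.
	fmt.Println(validateWriteOffset(0, 18905107590))
}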

cc @balaji

We have 64 RDF files, each about 460 MB.

  1. Is there some parameter I should be setting to avoid these kinds of large requests?
  2. I am not sure I can share the data, but how would you take delivery of something that large?

Thanks.

  1. Is there some parameter I should be setting to avoid these kinds of large requests?

I don’t think that’s possible via the command-line flags; this is something we handle internally in dgraph. @ashishgoswami @harshil_goel, is there some parameter that could help here?

  2. I am not sure I can share the data, but how would you take delivery of something that large?

Actually, I found a way to reproduce the issue. I’ll drop you an email if I need your data.

We’ll keep this issue updated.

Great, thanks. Since it works with the previous version, this is not a showstopper, but we were hoping to upgrade because the newer version fixes a shortest-path query bug that affected us.

You can bulk load data using v20.03.3 and then switch to v20.03.4. They’re compatible.

Any update on reproducing this issue?

Hey @jgoodall,

It would be great if you could privately share the data so we can attempt to reproduce this issue. We can give you access to our Google Drive so you can upload it there. @dmai will follow up.

Hey @jgoodall, I just shared a Google Drive folder with you where you can upload the data.

I can’t share the data. In a previous message @ibrahim said he was able to reproduce the issue without it…

Yes, but the crash I was seeing was caused by an issue with my own code (not the original bug you saw). We have identified a possible fix for this issue and @harshil_goel is working on it, but since we cannot reproduce the crash, we cannot verify the fix.

@jgoodall would you be able to help us test it? I can share the branch with you once the PR is ready.

Yes! We can absolutely help test it!

Exactly the same error here.

It occurs in v20.07.1-rc1 but works fine in v20.07.1.
Also, I found that the schema has a big impact: if I drop all reverse indexes, I can bulk load successfully.
(PS: my machine has 128 GB RAM and 32 CPUs.)

relation1: [uid] .
relation2: [uid] .

success

relation1: [uid] @count @reverse .
relation2: [uid] @count @reverse .

OOM

But if I can’t count edges, how can I do my calculations?

I think Dgraph is not truly built for big data.
Everything is fine with small test data, say 500 MB, but for our big dataset the bulk loader consumes a lot of memory, for example in the xidmap handling and in the reduce step.

Hi @jgoodall, @jokk33

We have a potential fix for the issue you are having with the bulk loader. The PR is the following one:

Basically, we found that we were writing a big list (the posting list for a key) directly to disk, which could cause an overflow. Now we batch the list before writing it.
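
As a rough sketch of the batching idea (simplified Go, not the actual code from the PR; splitIntoBatches is a made-up helper):

// splitIntoBatches cuts one large posting-list buffer into smaller chunks so
// that no single write request approaches the 4 GB uint32 offset limit.
// Each chunk can then be written as its own request.
func splitIntoBatches(data []byte, maxBatchSize int) [][]byte {
	var batches [][]byte
	for len(data) > 0 {
		n := maxBatchSize
		if n > len(data) {
			n = len(data)
		}
		batches = append(batches, data[:n])
		data = data[n:]
	}
	return batches
}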

It would be great if you could compile it and test it against your dataset; let us know if you need help compiling it.

Your feedback will be valuable to us.

Thanks,
Omar Ayoubi

Hi @jgoodall, @jokk33

I’d like to touch base and check whether you’ve had a chance to test out the fix I mentioned above.

Let us know if you need help compiling and building the branch; happy to help.

Again, your feedback is valuable to us.

Best,
Omar Ayoubi

We also got this error…

On a Linux server with 128 GB of memory, I have tried the bulk loader with this patch multiple times. It fails with an out-of-memory error each time. Nothing else of significance is running on the machine. The exact command I am running now is (the data will be distributed to a three-node alpha cluster after loading):

./dgraph bulk --zero=localhost:5080 --schema=./rdf/output.schema --files=./rdf --map_shards=3 --reduce_shards=3 --num_go_routines=2

My last attempt failed with no error message:

[00:12:22-0400] REDUCE 04h49m54s 2.65% edge_count:161.6M edge_speed:1.228M/sec plist_count:135.9M plist_speed:1.033M/sec. Num Encoding: 23

I am trying to run it again, but I am not sure what other options I should be adjusting to get it to finish loading. It is getting farther into the reduce phase than it originally did, so this problem may be fixed, but that is hard to confirm without a run that completes.

I tried again to load with dgraph bulk from the branch in the PR and it again failed with out of memory, but I believe that PR did solve this specific issue, so I think it is good to merge. Maybe @jokk33 or @BlankRain can test. (I don’t have access right now to the larger server where I originally got the error, so I can’t test in the same environment as the original bug.)

We’ve fixed the bulk loader memory issues in master. @ibrahim can set you up with a binary, or you can compile it yourself; you would need to compile with jemalloc, though.