Bulk loader crashes during reduce phase

Report a Dgraph Bug

The Dgraph bulk loader crashes during the reduce phase with the newer versions of Dgraph (20.03.4 and 20.07.0), but works with the previous version, 20.03.3. The reported error is essentially Request size offset X is bigger than maximum offset Y. This seems to be related to badger, since 20.03.3, which works, pins a different version (from go.mod):

  • 20.03.3 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200528205344-e7b6e76f96e8
  • 20.03.4 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200718033852-37ee16d8ad1c
  • 20.07.0 - github.com/dgraph-io/badger/v2 v2.0.1-rc1.0.20200718033852-37ee16d8ad1c

All of these appear to be pre-release versions of badger, whose latest release is v2.0.3.
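For reference, here is a minimal Go sketch (not part of Dgraph; the go.mod path and the use of golang.org/x/mod/modfile are just assumptions for illustration) that prints which badger version a Dgraph checkout pins:

// Sketch: print the badger/v2 require line from a go.mod file.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/mod/modfile"
)

func main() {
	// Assumes we run from the root of a dgraph checkout.
	data, err := os.ReadFile("go.mod")
	if err != nil {
		panic(err)
	}
	f, err := modfile.Parse("go.mod", data, nil)
	if err != nil {
		panic(err)
	}
	for _, r := range f.Require {
		if strings.HasPrefix(r.Mod.Path, "github.com/dgraph-io/badger") {
			fmt.Println(r.Mod.Path, r.Mod.Version)
		}
	}
}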

What version of Dgraph are you using?

  • 20.03.3 - works successfully
  • 20.03.4 - fails
  • 20.07.0 - fails

Have you tried reproducing the issue with the latest release?

Yes; it fails with the latest release (20.07.0) as well.

What is the hardware spec (RAM, OS)?

64 cores, 256 GB of memory, running CentOS Linux release 7.8.2003 (kernel: 3.10.0-1127.13.1.el7.x86_64)

Steps to reproduce the issue (command/config used to run Dgraph).

The bulk loader runs in a Docker Swarm stack (alongside a dgraph-zero server) and reports the following configuration when it starts (/fusion-dir is a volume mount):

{
	"DataFiles": "/fusion-dir/rdf",
	"DataFormat": "",
	"SchemaFile": "/fusion-dir/rdf/output.schema",
	"GqlSchemaFile": "",
	"OutDir": "/fusion-dir/alphas",
	"ReplaceOutDir": true,
	"TmpDir": "/fusion-dir/tmp",
	"NumGoroutines": 24,
	"MapBufSize": 134217728,
	"SkipMapPhase": false,
	"CleanupTmp": true,
	"NumReducers": 1,
	"Version": false,
	"StoreXids": false,
	"ZeroAddr": "dgraph-zero:5080",
	"HttpAddr": "localhost:8080",
	"IgnoreErrors": false,
	"CustomTokenizers": "",
	"NewUids": false,
	"Encrypted": false,
	"MapShards": 3,
	"ReduceShards": 3,
	"BadgerKeyFile": "",
	"BadgerCompressionLevel": 1
}
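For the record, the MapBufSize above works out to 128 MiB. A small, purely illustrative Go snippet (the struct fields simply mirror the JSON in this post) that decodes the relevant fields and prints that:

// Sketch: decode the posted bulk loader config and report MapBufSize in MiB.
package main

import (
	"encoding/json"
	"fmt"
)

type bulkConfig struct {
	MapBufSize   int64 `json:"MapBufSize"`
	MapShards    int   `json:"MapShards"`
	ReduceShards int   `json:"ReduceShards"`
	NumReducers  int   `json:"NumReducers"`
}

func main() {
	raw := []byte(`{"MapBufSize":134217728,"MapShards":3,"ReduceShards":3,"NumReducers":1}`)
	var c bulkConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	// 134217728 bytes = 128 MiB per map buffer.
	fmt.Printf("MapBufSize: %d bytes (%d MiB)\n", c.MapBufSize, c.MapBufSize>>20)
	fmt.Printf("MapShards: %d, ReduceShards: %d, NumReducers: %d\n",
		c.MapShards, c.ReduceShards, c.NumReducers)
}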

Expected behaviour and actual result.

Using 20.03.3, the bulk loader succeeds in about 2.5 hours. With the newer versions it fails early in the reduce phase with the following error:

2020/07/29 22:28:41 Request size offset 18905107590 is bigger than maximum offset 4294967295
github.com/dgraph-io/badger/v2.(*valueLog).validateWrites
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/value.go:1381
github.com/dgraph-io/badger/v2.(*valueLog).write
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/value.go:1413
github.com/dgraph-io/badger/v2.(*StreamWriter).Write
	/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200718033852-37ee16d8ad1c/stream_writer.go:143
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).startWriting
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:336
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
github.com/dgraph-io/dgraph/x.Check
	/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).startWriting
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:336
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373

Hey @jgoodall! We recently added a validation check in badger to ensure we don’t create huge requests.

You have a very large request size, 18905107590 bytes (≈ 18 GB), which badger wouldn’t be able to handle.
Would you be able to share the data that you are using in the bulk loader? I’d like to run it on my end to understand why we’re creating such big request batches.

Also, even though v20.03.3 appears to work, such a big batch of data would cause the uint32 value-log offset to overflow, and you wouldn’t be able to read the data back.
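For anyone following along, a quick back-of-the-envelope check in Go (not badger code, just arithmetic based on the numbers in the log above) showing why that request size can’t fit in a uint32 offset:

// Sketch: compare the reported request size against the uint32 offset ceiling.
package main

import (
	"fmt"
	"math"
)

func main() {
	const requestSize uint64 = 18905107590   // from the error message above
	const maxOffset = uint64(math.MaxUint32) // 4294967295, the reported maximum

	fmt.Printf("request: %d bytes (~%.1f GiB)\n", requestSize, float64(requestSize)/(1<<30))
	fmt.Printf("max value-log offset: %d bytes (~%.1f GiB)\n", maxOffset, float64(maxOffset)/(1<<30))
	fmt.Println("overflows uint32:", requestSize > maxOffset)
}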

cc @balaji


We have 64 RDF files, each about 460 MB.

  1. Is there some parameter I should be setting to avoid these kinds of large requests?
  2. I am not sure I can share the data, but how would you take delivery of something that large?

Thanks.

  1. Is there some parameter I should be setting to avoid these kinds of large requests?

I don’t think that’s possible via the command-line flags. This is something we do internally in dgraph. @ashishgoswami @harshil_goel is there some parameter that can help here?

  2. I am not sure I can share the data, but how would you take delivery of something that large?

Actually, I found a way to reproduce the issue. I’ll drop you an email if I need your data.

We’ll keep this issue updated.

Great, thanks. Since it works with the previous version this is not a showstopper, but we were hoping to upgrade because the newer version fixes a shortest-path query bug that affected us.

You can bulk load data using v20.03.3 and then switch to v20.03.4. They’re compatible.

Any update on reproducing this issue?
