Dgraph bulk load panics due to buffer size exceeded


Report a Dgraph Bug

What version of Dgraph are you using?

v21.03.1

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

CentOS Linux release 7.9.2009
8-core / 512 GB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

Docker configuration for the Dgraph bulk loader:

...
    environment:
      DGRAPH_BULK_ZERO: dgraph-zero:5080
      DGRAPH_BULK_SCHEMA: /x/output.dql.schema
      DGRAPH_BULK_GRAPHQL_SCHEMA: /x/output.graphql.schema
      DGRAPH_BULK_FILES: /x/rdf
      DGRAPH_BULK_TMP: /x/tmp
      DGRAPH_BULK_OUT: /x/alphas
      DGRAPH_BULK_MAP_SHARDS: 3
      DGRAPH_BULK_MAPOUTPUT_MB: 128
      DGRAPH_BULK_PARTITION_MB: 8
      DGRAPH_BULK_NUM_GO_ROUTINES: 16
      DGRAPH_BULK_REDUCE_SHARDS: 3
      DGRAPH_BULK_REDUCERS: 1
      DGRAPH_BULK_IGNORE_ERRORS: "false"
      DGRAPH_BULK_REPLACE_OUT: "true"
    command: dgraph bulk
...

Expected behaviour and actual result.

I am getting the following error during the reduce phase of the bulk load:

[01:35:17Z] REDUCE 11h09m34s 14.29% edge_count:14.10G edge_speed:1.150M/sec plist_count:871.6M plist_speed:71.05k/sec. Num Encoding MBs: 0. jemalloc: 960 MiB
panic: z.Buffer max size exceeded: 68719476736 offset: 68719476708 grow: 50
goroutine 49259 [running]:
github.com/dgraph-io/ristretto/z.(*Buffer).Grow(0xc1691fc000, 0x32)
	/go/pkg/mod/github.com/dgraph-io/ristretto@v0.0.4-0.20210504190834-0bf2acd73aa3/z/buffer.go:180 +0x6f3
github.com/dgraph-io/ristretto/z.(*Buffer).SliceAllocate(0xc1691fc000, 0x2e, 0x3f, 0xc0e244fc40, 0x1f)
	/go/pkg/mod/github.com/dgraph-io/ristretto@v0.0.4-0.20210504190834-0bf2acd73aa3/z/buffer.go:266 +0x3d
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).Next(0xc00b37d3e0, 0xc1691fc000, 0xc0e244fc40, 0x1f, 0x20)
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:206 +0xf0
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).reduce.func2(0xc0004b06c0, 0xc0dbcff518, 0xc0cc3f6000, 0x5297, 0x6000, 0xc00028d540, 0xc0dccfed20)
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:486 +0x239
created by github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).reduce
	/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:476 +0x388

I changed the WithMaxSize(64 << 30) value at dgraph/cmd/bulk/reduce.go:436 to 64 << 32:

func getBuf(dir string) *z.Buffer {
	return z.NewBuffer(64<<20, "Reducer.GetBuf").
		WithAutoMmap(1<<30, filepath.Join(dir, bufferDir)).
		WithMaxSize(64 << 32)
}

and then I get this error:

panic: z.Buffer max size exceeded: 274877906944 offset: 274877906922 grow: 50
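For reference, the two limits printed in the panic messages match the two shift expressions: 64 << 30 is 64 GiB and 64 << 32 is 256 GiB. The standalone snippet below only checks that arithmetic. It is not part of Dgraph, and the helper name gib is made up for illustration.

```go
package main

import "fmt"

// gib converts a byte count to whole GiB (hypothetical helper, illustration only).
func gib(n int64) int64 { return n >> 30 }

func main() {
	const original int64 = 64 << 30 // limit in the unmodified getBuf
	const raised int64 = 64 << 32   // limit after the first modification

	fmt.Println(original, gib(original)) // 68719476736 64
	fmt.Println(raised, gib(raised))     // 274877906944 256
}
```

In both runs the buffer was within a few dozen bytes of the cap when the panic fired, so raising the limit fourfold only moved the failure point.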

It looks like the data might be heavily skewed in some way, which may be what is causing this issue.

I then modified the function at dgraph/cmd/bulk/reduce.go:436 as shown below, and my bulk load completed.

func getBuf(dir string) *z.Buffer {
	return z.NewBuffer(64<<20, "Reducer.GetBuf").
		WithAutoMmap(1<<30, filepath.Join(dir, bufferDir)).
		WithMaxSize(0)
}
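A likely reason WithMaxSize(0) gets past the panic is that the size check is only applied when the configured maximum is greater than zero. That is an assumption drawn from the panic message format and the behaviour reported here, not a quote of ristretto's code. The sketch below models such a check with a hypothetical function exceedsMax and feeds it the numbers from the two panics.

```go
package main

import "fmt"

// exceedsMax is a hypothetical stand-in for the buffer's size check.
// Assumption: a maximum of 0 means "no limit", so the check is skipped.
func exceedsMax(maxSz, offset, grow int64) bool {
	return maxSz > 0 && offset+grow > maxSz
}

func main() {
	// Values taken from the first panic (64 GiB limit).
	fmt.Println(exceedsMax(68719476736, 68719476708, 50)) // true
	// Values taken from the second panic (256 GiB limit).
	fmt.Println(exceedsMax(274877906944, 274877906922, 50)) // true
	// Same offset with the limit disabled.
	fmt.Println(exceedsMax(0, 274877906922, 50)) // false
}
```

With the cap removed, nothing in this check stops the buffer from growing, so the directory passed to WithAutoMmap presumably needs enough free space to hold it.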

We are getting the same error from the bulk loader with the Zion release (v21.12). Is there a configuration option to increase the buffer size to get past this point?

I don’t think there is a config option for this. You might have to modify the code.

The fix was straightforward. Modify the following in dgraph/cmd/bulk/reduce.go:

func getBuf(dir string) *z.Buffer {
	return z.NewBuffer(64<<20, "Reducer.GetBuf").
		WithAutoMmap(1<<30, filepath.Join(dir, bufferDir)).
		WithMaxSize(0)
}

Should we submit a pull request?

Sure. Go for it and ping me.