After bulk load finishes, the output directory sizes are very uneven and alpha starts slowly

I expected the size of each directory to be roughly uniform. Is my bulk JSON misconfigured?
When a directory is too large, the alpha startup time is long:
the first group [out/0/p (314G)] takes nearly an hour to start up.
Can you give me some advice on how to make it start faster?

dgraph version

Dgraph version   : v1.0.11-18-g4ca67ca
Commit SHA-1     : 4ca67ca
Commit timestamp : 2018-12-24 18:53:10 -0800
Branch           : master
Go version       : go1.11

data size:

nodes: 0.65 billion
edges: 0.83 billion

bulk load config:

{
    "map_shards":100,
    "num_go_routines":50,
    "r":"/dgraph_tmp/rdf_data",
    "mapoutput_mb":64,
    "reduce_shards":6,
    "shufflers":6,
    "s":"/dgraph/schema/acn.schema",
    "http":"0.0.0.0:8180",
    "zero":"xxx.xxx.xxx.xxx:5080"
}
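
These keys are just the dgraph bulk flag names. Assuming the JSON above is saved as bulk.json (the file name is only illustrative), it should be possible to pass it to the loader through Dgraph's --config option instead of spelling out each flag:

dgraph bulk --config bulk.json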

bulk out:

314G    out/0
95G     out/1
54G     out/2
12K     out/3
12K     out/4
12K     out/5
462G    out/

alpha config:

{
    ...
    "badger.tables": "mmap", 
    "badger.vlog": "disk", 
    "bindall": true, 
    "idx": 1, 
    "lru_mb": 4096, 
    ...
}

Hey @relunctance,

Would it be possible for you to provide us with your dataset, so we can investigate?

CC: @MichelDiz

Sorry, I can't give you the actual data we use.
But I wrote an example program that builds very similar data (a sketch of it is included after the schema below).
You can generate it yourself, which also avoids copying the data over the network.

demo dataset

  • before you load, you should assign the max uid (e.g. 650000000):
curl -s "http://you.zero.leader:6080/assign?what=uids&num=650000000"

usage:

go run main.go --help

-c int
    how many goroutines to run (default 100)
-maxuid int
    the uid random boundary value (default 650000000)
-num int
    how many RDF lines to build (default 10000)
-type string
    which type to build; either 'nodes' or 'edges' (default "nodes")

build nodes:

go run main.go -c 1000 -type nodes

<0x399816484>   <md5>   "15af151e0a8900a6a5910e9ab7c3d1d9"      .
<0x88727215>    <md5>   "34aa20a185bad5c4978ef146ac22d8e8"      .
<0x123765589>   <md5>   "9f9ddcdf6d549db46604f293aeeb7056"      .
<0x202065029>   <md5>   "bcf23280fc6e057e8b174550f69064aa"      .
<0x150284490>   <md5>   "2dfa255dc0794fbfc06e9896a7ab7c8b"      .
<0x452831330>   <md5>   "32a09108de57850e15ed8714307e2ee8"      .
<0x467763304>   <md5>   "eec11dcb1b1df6f7d0054520c8b8e128"      .
...

build edges:

go run main.go -c 1000 -type edges

<0x143545272>   <parent>        <0x75408763>    .
<0x61168611>    <parent>        <0x616076829>   .
<0x232395722>   <parent>        <0x34282357>    .
<0x269891389>   <parent>        <0x507092845>   .
<0x48371838>    <parent>        <0x366437176>   .
...

schema:

md5: string @index(hash) .
parent: uid @reverse @count .
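
For reference, here is a minimal sketch of such a generator in Go. It is only a sketch: the flags and output format follow the usage above, but deriving the md5 value from the uid is my assumption, and the real main.go may differ.

package main

import (
	"bufio"
	"crypto/md5"
	"flag"
	"fmt"
	"math/rand"
	"os"
	"sync"
)

func main() {
	c := flag.Int("c", 100, "how many goroutines to run")
	maxuid := flag.Int64("maxuid", 650000000, "the uid random boundary value")
	num := flag.Int("num", 10000, "how many RDF lines to build")
	typ := flag.String("type", "nodes", "which type to build; either 'nodes' or 'edges'")
	flag.Parse()

	lines := make(chan string, 1024)
	var wg sync.WaitGroup
	perWorker := *num / *c

	for i := 0; i < *c; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			for j := 0; j < perWorker; j++ {
				uid := r.Int63n(*maxuid) + 1
				if *typ == "edges" {
					// <child uid> <parent> <parent uid> .
					lines <- fmt.Sprintf("<%#x>\t<parent>\t<%#x>\t.\n", uid, r.Int63n(*maxuid)+1)
				} else {
					// <uid> <md5> "..." .  (value derived from the uid -- an assumption)
					sum := md5.Sum([]byte(fmt.Sprintf("%d", uid)))
					lines <- fmt.Sprintf("<%#x>\t<md5>\t%q\t.\n", uid, fmt.Sprintf("%x", sum))
				}
			}
		}(int64(i))
	}
	go func() { wg.Wait(); close(lines) }()

	// A single writer drains the channel so output lines are never interleaved.
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	for l := range lines {
		out.WriteString(l)
	}
}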

Hey, guys,
I have found that bulk shards the data by predicate.
In this case, group[0]'s predicate is parent,
group[1]'s predicate is _predicate_,
and group[2]'s predicate is md5.
bulk allocates all data for the same predicate to a single group; that is what makes one group so big.

Most of our actual data is concentrated in a few very large single predicates. As things are, an alpha serving such a large group cannot work well, which means I can't solve this problem by adding more machines.

Can dgraph bulk automatically split the same predicate across groups?
In addition, has the group functionality mentioned in the documentation been finished?

We currently do NOT split up a single predicate across groups. Each predicate must lie fully within a group. We have thought about potentially splitting up a predicate across groups, but that’s not a priority for now.

In fact, _predicate_ itself will be removed once we have a stronger type system, because it is expensive to keep and not transactionally sound.

Thanks for your reply.
It is a great pity that dgraph does not have a good solution for very large single predicates right now.
If predicate performance could be improved at the expense of transactions, we would be happy to accept that trade-off.
I'm looking forward to dgraph being able to process data at this scale.
Can this requirement be added to the 2019 roadmap?


I have the same problem. I have 18 billion edges and 8 billion RDFs; when I start up the dgraph alpha, it takes nearly an hour. Is there a good solution?

I have a similar problem when one of my predicates is really large. I wanted to run my Dgraph cluster on a lot of small servers (16GB RAM), but I had to change my strategy and switch from horizontal scaling to a kind of hybrid scaling with fewer, larger servers (64GB RAM, i.e. vertical scaling).

At present, it can only be solved by improving the performance of a single machine.
Splitting a predicate across groups is going to be a very complicated job.
As you can see, though, the team is great.