Problem with dgraph bulk -r command

The first time, I ran:
dgraph bulk -r /data/Mydata/ -s /data/goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080

The speed was very slow:

MAP 02h18m36s rdf_count:309.1M rdf_speed:37.17k/sec edge_count:1.044G edge_speed:125.6k/sec
MAP 02h18m37s rdf_count:309.2M rdf_speed:37.17k/sec edge_count:1.044G edge_speed:125.5k/sec
MAP 02h18m38s rdf_count:309.2M rdf_speed:37.17k/sec edge_count:1.044G edge_speed:125.5k/sec
MAP 02h18m39s rdf_count:309.2M rdf_speed:37.17k/sec edge_count:1.044G edge_speed:125.5k/sec
MAP 02h18m40s rdf_count:309.2M rdf_speed:37.16k/sec edge_count:1.044G edge_speed:125.5k/sec
MAP 02h18m41s rdf_count:309.2M rdf_speed:37.16k/sec edge_count:1.044G edge_speed:125.5k/sec

But when I ran
dgraph bulk -r /data/Mydata/all.rdf -s /data/goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080

edge_speed:800k/sec

Why is that? And how can I improve the speed?

Well, why didn’t you name the RDF file in the first command?

In the docs you can see how it’s done: https://docs.dgraph.io/deploy#bulk-loader
dgraph bulk -r goldendata.rdf.gz -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080

Are you trying to load an RDF of your own? If so, why are you using the sample schema, “goldendata.schema”?

Please share your machine’s specifications so we can get a better idea.

Cheers

My data contains 2000 RDF files. Can I use the dgraph bulk command to import them one by one?

Wow, not sure. @pawan, @mrjn, can you take a look at this?

@ogreso Are you sure all these RDFs are compatible? There are many RDF standards used for ontological controls, and not all of them are compatible with Dgraph.

If there’s a significant slowdown, it could be:

  1. You were running some other program, which interfered with the bulk loader. This could be a download, which would affect the disk throughput and hence the loader. (There’s a quick check sketched just after this list.)
  2. There’s something about having 2000 files that might not fit well with the loader. We haven’t tested with a directory containing that many RDF files; I don’t see why it would be a problem, but who knows.
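
For point 1, a quick check (a rough sketch; it assumes a Linux box with the sysstat and iotop packages installed, which isn’t stated in the thread) is to watch disk activity while the loader runs:

iostat -xd 2   # per-device throughput and %util, refreshed every 2 seconds
sudo iotop -o  # show only the processes currently doing I/O

If some other process is saturating the disk, %util stays high even while the loader’s own rdf_speed drops.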

I’d say check for 1 first. Retry and see if you still hit the issue. If you do, you could try doing a CPU profile, if you know how to do that (over HTTP): Profiling Go Programs - The Go Programming Language
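
As a rough sketch of what that could look like, assuming the --http address given above (localhost:8000) serves the standard Go net/http/pprof endpoints:

go tool pprof 'http://localhost:8000/debug/pprof/profile?seconds=30'   # collect a 30-second CPU profile

Inside the interactive pprof session, top (and web, if graphviz is installed) shows where the CPU time is going.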

Otherwise, you could try merging these files to reduce their number. Merging could be done on Linux with bash, via zcat and gzip.
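
A minimal sketch of that, assuming the 2000 inputs are gzipped .rdf.gz files under /data/Mydata/ (use cat instead of zcat if they are plain .rdf):

zcat /data/Mydata/*.rdf.gz | gzip > /data/all.rdf.gz   # decompress each file in turn, recompress as one stream

The merged file can then be passed to -r in place of the directory.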

The bulk loader is really meant to be used in one go.

Update

I’m looking at the code, and I don’t really see anything here that wouldn’t work well with an increased number of files. In fact, we ensure that we don’t create as many goroutines as there are files, by using a throttle. So CPU performance should be the same as with one file.

Thank you all~
Finally, I merged the 2000 files into a single 171G RDF file, but when I ran dgraph bulk, disk usage grew like crazy: 1.1TiB of space has been used so far, and the 171G of data still hasn’t finished importing. How much disk space does a 171G RDF file need to import normally? @mrjn

I’m afraid the program will crash once the disk fills up, but all the data is in the tmp directory. I’ve spent about 24 hours importing data, and it may eventually fail due to insufficient disk space. Could you modify the import process in the future to sync the tmp folder’s progress to the out folder, so that data that has already been processed doesn’t have to be imported again if there isn’t enough disk space? @mrjn

Not something I can tell you off the top of my head. You will have to experiment to figure that out.
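
One way to experiment (a sketch; the directory names assume the loader’s defaults, so adjust them if you passed --tmp or --out) is to watch how the scratch and output directories grow during a run:

watch -n 60 'du -sh tmp out'   # print the size of the tmp and out directories every minute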

If your map phase is done, you could pass --skip_map_phase to avoid redoing that part of the work. We don’t have any checkpointing for the map phase itself, though that’s something we could potentially build in the future.
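
For example, a sketch of the rerun, reusing the flags and paths from the commands earlier in this thread (illustrative, not verified against your setup):

dgraph bulk -r /data/Mydata/all.rdf -s /data/goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080 --skip_map_phase

This assumes the map output from the previous run is still in the tmp directory.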
