The first time, I ran: dgraph bulk -r /data/Mydata/ -s /data/goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
Well, why didn't you name the RDF file in the first command?
In the docs (https://docs.dgraph.io/deploy#bulk-loader) you can see that you need to point it at the RDF file: dgraph bulk -r goldendata.rdf.gz -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
Are you trying to load an RDF file of your own? If so, why are you using the sample schema, “goldendata.schema”?
Please share your machine's specifications so we have a better idea of what's going on.
@ogreso Are you sure all these RDFs are compatible? There are many RDF standards used for ontological controls, and not all of them are compatible with Dgraph.
1. You were running some other program, which interfered with the bulk loader. This could be a download, which would affect disk throughput and hence the loader.
2. There's something about having 2000 files which might not sit well with the loader. We haven't tested with a directory containing that many RDF files; I don't see why it should be a problem, but who knows.
For 1, I'd say retry and see if you still hit the issue. If you do, you could try CPU profiling, if you know how to do that (using the HTTP endpoint): Profiling Go Programs - The Go Programming Language
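As a rough sketch, assuming the standard Go pprof handlers are served on the address you passed via --http (port 8000 in the command above), you could grab and inspect a profile like this:

```sh
# Capture a 30-second CPU profile from the bulk loader's HTTP address
# (assumes the standard Go /debug/pprof endpoints are exposed on the --http port).
curl -o cpu.pprof 'http://localhost:8000/debug/pprof/profile?seconds=30'

# Inspect it; `top` and `list` inside the pprof session show where CPU time goes.
go tool pprof $(which dgraph) cpu.pprof
```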
Otherwise, you could try merging these files to reduce their number. Merging can be done on Linux with bash, via zcat and gzip; see the sketch below.
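Something along these lines, assuming the files are gzipped RDF under /data/Mydata/ (adjust the paths and extension to your setup):

```sh
# Decompress all the gzipped RDF files, concatenate them, and recompress
# into a single file that can be fed to `dgraph bulk -r`.
zcat /data/Mydata/*.rdf.gz | gzip > /data/merged.rdf.gz
```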
Bulk loader is really meant to be used in one go.
Update
I'm looking at the code, and I don't really see anything there that wouldn't work well with an increased number of files. In fact, we ensure that we don't create as many goroutines as there are files, by using a throttle. So CPU performance should be the same as with one file.
thank you all~
Finally, I merged the 2000 files into a single 171GB RDF file, but when I ran dgraph bulk, disk usage grew like crazy: 1.1TiB of space has been used so far, and the 171GB of data still hasn't finished importing. How much disk space does a 171GB RDF file normally need to import? @mrjn
I'm afraid the program will crash once the disk fills up, but all the data is in the tmp directory. I've spent about 24 hours importing data, and it may eventually fail due to insufficient disk space. Could you change the import process in the future so that the tmp folder is synchronized with the out folder as it goes? That would guarantee that data which has already been processed doesn't have to be imported again if the disk runs out of space. @mrjn
Not something I can tell you off the top of my head. You'll have to experiment to figure that out.
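One way to experiment, assuming the loader's working directories are the tmp and out folders mentioned above (the names and locations depend on your flags and working directory), is to watch their growth against the remaining disk space while the load runs:

```sh
# Check the bulk loader's working directory sizes and free disk space every minute
# (./tmp, ./out, and /data are assumptions; point these at your actual paths).
watch -n 60 'du -sh ./tmp ./out; df -h /data'
```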
If your map phase is done, you could pass --skip_map_phase to avoid redoing that part of the work. We don't have any checkpointing for the map phase itself, though that's potentially something we could build in the future.
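A minimal sketch of such a re-run, assuming you start it from the same working directory with the tmp directory left intact (the file name merged.rdf.gz is a placeholder, and the other flag values are just the ones used earlier in this thread):

```sh
# Re-run the bulk load reusing the already-written map output,
# skipping the map phase and going straight to reduce.
dgraph bulk -r merged.rdf.gz -s goldendata.schema \
  --skip_map_phase \
  --map_shards=4 --reduce_shards=2 \
  --http localhost:8000 --zero=localhost:5080
```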