The first time, I ran: dgraph bulk -r /data/Mydata/ -s /data/goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
Well, why didn't you name the RDF file in the first command?
In the docs (https://docs.dgraph.io/deploy#bulk-loader) you can see that you need to point it at the RDF file: dgraph bulk -r goldendata.rdf.gz -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
Are you trying to load an RDF file of your own? If so, why are you using the sample schema, “goldendata.schema”?
Please share your machine's specifications so we have a better idea of what's going on.
@ogreso Are you sure all these RDFs are compatible? There are many RDF standards used for ontological controls, and not all of them are compatible with Dgraph.
1. You were running some other program, which interfered with the bulk loader. This could be a download, which would affect disk throughput and hence the loader.
2. There's something about having 2000 files which might not sit well with the loader. We haven't tested with a directory containing that many RDF files; I don't see why it should be a problem, but who knows.
For 1, I'd say retry and see if you still hit the issue. If you do, you could try CPU profiling, if you know how to do that (using the HTTP endpoint): Profiling Go Programs - The Go Programming Language
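As a rough sketch, assuming the standard Go pprof handlers are served on the address you passed via --http (port 8000 in the command above), you could grab and inspect a profile like this:

```sh
# Capture a 30-second CPU profile from the bulk loader's HTTP address
# (assumes the standard Go /debug/pprof endpoints are exposed on the --http port).
curl -o cpu.pprof 'http://localhost:8000/debug/pprof/profile?seconds=30'

# Inspect it; `top` and `list` inside the pprof session show where CPU time goes.
go tool pprof $(which dgraph) cpu.pprof
```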
Otherwise, you could try merging these files to reduce their number. Merging can be done on Linux with bash, via zcat and gzip; see the sketch below.
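Something along these lines, assuming the files are gzipped RDF under /data/Mydata/ (adjust the paths and extension to your setup):

```sh
# Decompress all the gzipped RDF files, concatenate them, and recompress
# into a single file that can be fed to `dgraph bulk -r`.
zcat /data/Mydata/*.rdf.gz | gzip > /data/merged.rdf.gz
```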
Bulk loader is really meant to be used in one go.
Update
I'm looking at the code, and I don't really see anything there that wouldn't work well with an increased number of files. In fact, we ensure that we don't create as many goroutines as there are files, by using a throttle. So CPU performance should be the same as with one file.
thank you all~
Finally, I merged the 2000 files into a single 171GB RDF file, but when I ran dgraph bulk, disk usage grew like crazy: 1.1TiB of space has been used so far, and the 171GB of data still hasn't finished importing. How much disk space does a 171GB RDF file normally need to import? @mrjn
I'm afraid the program will crash once the disk fills up, but all the data is in the tmp directory. I've spent about 24 hours importing data, and it may eventually fail due to insufficient disk space. Could you change the import process in the future so that the tmp folder is synchronized with the out folder as it goes? That would guarantee that data which has already been processed doesn't have to be imported again if the disk runs out of space. @mrjn
Not something I can tell you off the top of my head. You'll have to experiment to figure that out.
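One way to experiment, assuming the loader's working directories are the tmp and out folders mentioned above (the names and locations depend on your flags and working directory), is to watch their growth against the remaining disk space while the load runs:

```sh
# Check the bulk loader's working directory sizes and free disk space every minute
# (./tmp, ./out, and /data are assumptions; point these at your actual paths).
watch -n 60 'du -sh ./tmp ./out; df -h /data'
```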
If your map phase is done, you could pass --skip_map_phase to avoid redoing that part of the work. We don't have any checkpointing for the map phase itself, though that's potentially something we could build in the future.
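A minimal sketch of such a re-run, assuming you start it from the same working directory with the tmp directory left intact (the file name merged.rdf.gz is a placeholder, and the other flag values are just the ones used earlier in this thread):

```sh
# Re-run the bulk load reusing the already-written map output,
# skipping the map phase and going straight to reduce.
dgraph bulk -r merged.rdf.gz -s goldendata.schema \
  --skip_map_phase \
  --map_shards=4 --reduce_shards=2 \
  --http localhost:8000 --zero=localhost:5080
```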