When programmatically inserting mutations, there is the concept of a transaction, and blank nodes from the mutations are recognised within that transaction. What does this translate to when using the Bulk Loader? Is there something corresponding to a transaction? Are blank nodes recognised between triples in any way?
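For example, given an input file like the following (a made-up snippet, not my real data), the question is whether the repeated labels _:alice and _:bob are resolved to the same two nodes:

```
# Hypothetical N-Quads input for the Bulk Loader; the same blank-node
# labels appear in several triples.
_:alice <name>  "Alice" .
_:alice <knows> _:bob .
_:bob   <name>  "Bob" .
```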
I am aware of the --xidmap option, but have been unable to use it; it consumes way too much memory. As I understand it, it is a disk-based cache and should be able to run with limited RAM usage. Whenever I run the Bulk Loader with the --xidmap option, it consumes all available memory and eventually crashes.
This post discusses a possible --limitMemory option. There is no such option in the current version, right?
With or without such an option, there seems to be a problem with offloading cached IDs to disk so that memory usage can be limited. Is there perhaps a known bug here?
That --xidmap option does precisely what we need, but we can’t use it.
There’s no transaction in the Bulk Loader. It is used only once to populate the cluster.
A blank node is just an identifier, but you can store it in the node itself by using the --store_xids flag.
That xid can then be used in upsert blocks and also in the Live Loader (see the sketch after the help output below).
```
➜ ~ dgraph bulk -h | grep xid
      --store_xids      Generate an xid edge for each node.
      --xidmap string   Directory to store xid to uid mapping
➜ ~ dgraph live -h | grep xid
  -U, --upsertPredicate string   run in upsertPredicate mode. the value would be used to store blank nodes as an xid
  -x, --xidmap string            Directory to store xid to uid mapping
```
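To make that workflow concrete, here is a rough sketch (the schema line, the predicate name xid, and the sample values are assumptions, not taken from this thread): after a bulk load with --store_xids, each node carries an xid edge holding its original blank-node label, and later mutations can reuse that node through an upsert block instead of creating a new one.

```
# Hypothetical schema entry; eq() below needs an index on xid:
#   xid: string @index(exact) .

upsert {
  query {
    # Find the node whose xid edge (written by --store_xids) matches the
    # original blank-node label. Whether the stored value keeps the "_:"
    # prefix should be verified against your Dgraph version.
    q(func: eq(xid, "alice")) {
      u as uid
    }
  }

  mutation {
    set {
      # Reuse the matched uid instead of minting a new node.
      uid(u) <email> "alice@example.com" .
    }
  }
}
```

Based on the help text above, dgraph live -U xid appears to automate the same pattern for blank nodes during a live load.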
Are you suggesting that --xidmap works differently when combined with --store_xids, or are you recommending using --store_xids instead of --xidmap and then doing upserts?
Should I interpret your answer as meaning that the --xidmap option won’t work with larger data sets?
No, but it depends. OOMs are normal; they happen when you don’t have a sense of the data-to-resources relation, i.e. when you don’t know how many resources that particular dataset needs.
For example, the dataset in the blog post Loading close to 1M edges/sec into Dgraph - Dgraph Blog was around 150 GB. I don’t remember the exact size, but it was around that. With that in mind, a 150 GB dataset should not go OOM easily with the configuration mentioned in the blog post.
But sure, some limits (e.g. capping memory usage, adding a cool-down, and so on) would be good for avoiding it, although the processing time would increase.
We have 400 GB of RAM available. When using the Bulk Loader without the --xidmap option, it is well behaved and uses only a fraction of that. With the --xidmap option, memory consumption never stops growing and the process eventually crashes. That seems like faulty behaviour to me.
How much memory does the --xidmap option need to function properly? Is there some metric, such as bytes per unique blank node?
Just updated to the very latest version and tried this again.
With the --xidmap option, the MAP phase runs at half the speed compared to not using it. It quickly claims 100 GB, and within 40 minutes it had consumed the entire 400 GB and crashed.
Without the --xidmap option, it initially claims 50 GB and grows much more slowly. Right now I can’t tell you the maximum memory level it reaches, but it doesn’t crash.