Bulk Loader transaction scope (blank node identification)

When inserting mutations programmatically, there is the concept of a transaction, and blank nodes in the mutations are recognised within that transaction. What does this translate to when using the Bulk Loader? Is there something corresponding to a transaction? Are blank nodes recognised in any way between triples?

I am aware of the --xidmap option, but have been unable to use it because it consumes far too much memory. As I understand it, it is a disk-based cache and should be able to run with limited RAM. Whenever I run the Bulk Loader with the --xidmap option, it consumes all available memory and eventually crashes.

This post discusses a possible option, --limitMemory. There is no such option in the current version, right?

With or without such an option, there seems to be a problem with offloading cached IDs to disk so that memory use can be limited. Is there perhaps a known bug here?

That --xidmap option does precisely what we need, but we can’t use it.

There’s no transaction in the Bulk Loader. It is used only once to populate the cluster.

A blank node is just an identifier, but you can store it on the node itself by using the --store_xids flag.
The stored xid can then be used in upsert queries (upsert blocks) and also in the Live Loader.

➜  ~ dgraph bulk -h | grep xid
      --store_xids                       Generate an xid edge for each node.
      --xidmap string                    Directory to store xid to uid mapping

➜  ~ dgraph live -h | grep xid
  -U, --upsertPredicate string       run in upsertPredicate mode. the value would be used to store blank nodes as an xid
  -x, --xidmap string                Directory to store xid to uid mapping

Try to use --store_xids.
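
As background for the flags above, here is a rough sketch of the idea behind an xid to uid mapping: each blank-node label gets a uid the first time it is seen, and every later occurrence of the same label resolves to that same uid. This is a toy in-memory model written in Python, not Dgraph's implementation. The names XidMap, assign, and resolve are made up for illustration, and the triples are invented sample data.

```python
# Illustrative only: a toy xid -> uid map, NOT Dgraph's actual implementation.
from itertools import count


class XidMap:
    """Assigns one uid per external identifier (xid), e.g. a blank-node label."""

    def __init__(self, first_uid=1):
        self._uids = {}                # xid -> uid; one entry per unique xid
        self._next = count(first_uid)  # hypothetical uid allocator

    def assign(self, xid):
        """Return the uid for xid, allocating a new one on first sight."""
        uid = self._uids.get(xid)
        if uid is None:
            uid = next(self._next)
            self._uids[xid] = uid
        return uid

    def __len__(self):
        return len(self._uids)


def resolve(triples, xidmap):
    """Replace blank-node terms ("_:label") with uids; leave other terms alone.

    Simplification: any term starting with "_:" is treated as a blank node.
    """
    def term(t):
        return xidmap.assign(t) if t.startswith("_:") else t

    return [(term(s), p, term(o)) for s, p, o in triples]


# Two separate batches share one map, so "_:bob" resolves to the same uid in both.
xm = XidMap()
first_batch = [("_:alice", "name", "Alice"), ("_:alice", "friend", "_:bob")]
second_batch = [("_:bob", "name", "Bob")]
resolved = resolve(first_batch, xm) + resolve(second_batch, xm)
print(resolved)
print(len(xm), "unique blank nodes tracked")
```

In this toy model the map holds one entry per unique blank-node label, so its size grows with the number of distinct labels in the input. How Dgraph stores and evicts these entries is not shown here.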

Are you suggesting that --xidmap works differently when combined with --store_xids? Or are you recommending using --store_xids instead of --xidmap and then doing upserts?

Should I interpret your answer as meaning that the --xidmap option won’t work with larger data sets?

This, yes.

No, but it depends. OOMs are normal; they happen when you don’t know how much resource a particular dataset needs.

For example, the dataset in the blog post "Loading close to 1M edges/sec into Dgraph" (Dgraph Blog) was around 150GB; I don’t remember the exact size. With that in mind, a 150GB dataset should not go OOM easily with the configuration mentioned in the blog post.

But sure, some limits (e.g. a cap on memory usage, a cool-down, and so on) could help avoid it, though the processing time would increase.

We have 400G RAM available. When using the bulk loader without the --xidmap option it is well behaved and only uses a fraction of that. With the --xidmap option memory consumption never stops growing, and the process eventually crashes. That seems like faulty behaviour to me.

How much memory does the --xidmap option need to function properly? Is there some metric bytes per unique blank node?

Not sure.

Let me ping @Anurag and @ibrahim. Maybe it needs refactoring to use jemalloc or something.

Just updated to the very latest version and tried this again.

With the --xidmap option, the MAP phase runs at half the speed it does without it. It quickly claims 100G, and within 40 minutes it had consumed the entire 400G and crashed.

Without the --xidmap option it initially claims 50G, and it grows much slower. Right now I can’t tell you the max memory level it reaches – but it doesn’t crash.