Bulk Loader transaction scope (blank node identification)

apete · January 29, 2021, 4:57pm

When programmatically inserting mutations, there is the concept of a transaction, and the blank nodes from the mutations are recognised within that transaction. What does this translate to when using the Bulk Loader? Is there something corresponding to a transaction? Are blank nodes in any way recognised between triples?

I am aware of the --xidmap option, but have been unable to use it because it consumes far too much memory. As I understand it, it is a disk-based cache and should be able to run with limited RAM usage. Whenever I run the bulk loader with the --xidmap option, it consumes all available memory and eventually crashes.

This post discusses a possible option, --limitMemory. There is no such option in the current version, right?

With or without such an option, there seems to be a problem with offloading cached IDs to disk so that memory can be limited. Is there perhaps a known bug here?

That --xidmap option does precisely what we need, but we can’t use it.

MichelDiz · January 29, 2021, 5:33pm

There’s no transaction in the Bulk Loader. It is used only once to populate the cluster.

A blank node is just an identifier, but you can store it on the node itself by using the --store_xids flag.
The stored xid can then be used in upsert blocks and also in the Live Loader.

➜  ~ dgraph bulk -h | grep xid
      --store_xids                       Generate an xid edge for each node.
      --xidmap string                    Directory to store xid to uid mapping


➜  ~ dgraph live -h | grep xid
  -U, --upsertPredicate string       run in upsertPredicate mode. the value would be used to store blank nodes as an xid
  -x, --xidmap string                Directory to store xid to uid mapping

Try to use --store_xids.
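
As a rough mental model only (this is not Dgraph's implementation, and `ToyXidMap` and `resolve` are hypothetical names made up for illustration), an xid to uid mapping can be pictured as a lookup table that hands out one uid per distinct blank node label, so the same label appearing in different triples resolves to the same node. A minimal Python sketch under those assumptions:

```python
from typing import Dict, List, Tuple, Union

Triple = Tuple[str, str, str]
Resolved = Tuple[int, str, Union[int, str]]


class ToyXidMap:
    """Hypothetical in-memory stand-in for an xid -> uid mapping."""

    def __init__(self, first_uid: int = 1) -> None:
        self._next_uid = first_uid
        self._uids: Dict[str, int] = {}

    def assign(self, xid: str) -> int:
        """Return the uid for xid, allocating a new one the first time it is seen."""
        if xid not in self._uids:
            self._uids[xid] = self._next_uid
            self._next_uid += 1
        return self._uids[xid]

    def __len__(self) -> int:
        # Number of distinct labels currently held in the table.
        return len(self._uids)


def resolve(triples: List[Triple], xidmap: ToyXidMap) -> List[Resolved]:
    """Replace blank node labels (strings starting with "_:") with uids."""
    resolved: List[Resolved] = []
    for subject, predicate, obj in triples:
        s = xidmap.assign(subject)
        # Objects are either literal values or references to other blank nodes.
        o: Union[int, str] = xidmap.assign(obj) if obj.startswith("_:") else obj
        resolved.append((s, predicate, o))
    return resolved


triples = [
    ("_:alice", "name", "Alice"),
    ("_:bob", "name", "Bob"),
    ("_:alice", "friend", "_:bob"),  # same labels as above, so same uids
]
xidmap = ToyXidMap()
resolved = resolve(triples, xidmap)
```

In this toy model the table holds one entry per distinct label for the whole load, which is why its size depends on the number of unique blank nodes rather than on the number of triples.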

apete · January 29, 2021, 5:58pm

Are you suggesting that --xidmap works differently when combined with --store_xids? Or are you recommending using --store_xids instead of --xidmap and then doing upserts?

Should I interpret your answer as meaning that the --xidmap option won’t work with larger data sets?

MichelDiz · January 29, 2021, 8:08pm

This, yes.

No, but it depends. OOMs are normal; they happen when you don’t have a sense of how your data size relates to your resources, that is, when you don’t know how many resources that particular dataset needs.

For example, the data in the blog post “Loading close to 1M edges/sec into Dgraph” (Dgraph Blog) was around 150GB. I don’t remember the exact size, but it was around that. With that in mind, a dataset of 150GB should not go OOM easily with the configuration mentioned in the blog post.

But sure, some limits (e.g. capping memory usage, adding a cool-down, and so on) could help avoid it. The processing time would increase, though.

apete · January 31, 2021, 11:10am

We have 400G RAM available. When using the bulk loader without the --xidmap option it is well behaved and only uses a fraction of that. With the --xidmap option memory consumption never stops growing, and the process eventually crashes. That seems like faulty behaviour to me.

How much memory does the --xidmap option need to function properly? Is there some metric bytes per unique blank node?

MichelDiz · January 31, 2021, 4:43pm

Not sure.

Let me ping @Anurag and @ibrahim. Maybe it needs a refactoring to use jemalloc or something.

apete · February 1, 2021, 9:20am

Just updated to the very latest version and tried this again.

With the --xidmap option, the MAP phase runs at half the speed compared to not using it. It quickly claims 100G, and within 40 minutes it had consumed the entire 400G and crashed.

Without the --xidmap option it initially claims 50G, and it grows much slower. Right now I can’t tell you the max memory level it reaches – but it doesn’t crash.
