Bulk loader xidmap memory optimization

While bulk loading a dataset with a large number of unique xids, the bulk loader runs out of memory in the map phase. We keep a sharded map (XidMap) of xids to their corresponding uids so that multiple instances of the same blank node get the same uid. The combined memory required by the sharded map and all the xids grows significantly with the number of unique xids.

We propose adding a command-line option to the bulk loader, --limitMemory, which lets the user cap the memory used by the bulk loader's xidmap. It works by dumping the xidMap onto disk using Badger. We would prompt users to use this mode if the input file exceeds a certain size threshold. The option is only intended for large datasets with a high number of unique xids.

Implementation Details

The current xidMap is used from a single function, AssignUid, which takes an xid as input and returns the corresponding uid. It first reads the map under a read lock; if the value exists, it returns it. If it doesn't exist, it acquires a write lock, allocates a new uid, and assigns it to the map.
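A minimal sketch of this read-lock-then-write-lock pattern, using illustrative names (shard, uidMap, nextUid) rather than the actual bulk loader code:

package xidmap

import "sync"

// shard is an illustrative stand-in for one xidMap shard.
type shard struct {
	sync.RWMutex
	uidMap  map[string]uint64
	nextUid uint64
}

// AssignUid sketches the existing flow: check under a read lock first, then
// take the write lock, re-check, and allocate a new uid on a miss.
func (s *shard) AssignUid(xid string) uint64 {
	s.RLock()
	uid, ok := s.uidMap[xid]
	s.RUnlock()
	if ok {
		return uid
	}

	s.Lock()
	defer s.Unlock()
	if uid, ok := s.uidMap[xid]; ok { // re-check after taking the write lock
		return uid
	}
	s.nextUid++ // in the real loader, uids come from a leased range
	s.uidMap[xid] = s.nextUid
	return s.nextUid
}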

We propose adding a bloom filter, a Ristretto cache, and a Badger instance to this xidMap. Badger lets us dump excess xids onto disk, while the bloom filter and cache reduce the number of read calls we need to make to Badger. Once the length of uidMap (the in-memory Go map for xids) exceeds a certain threshold, we add its keys to the bloom filter and flush its entries to the Ristretto cache and finally to Badger.

When we need to read, we first check the uidMap, then the bloom filter, then the cache, and finally Badger.

Write

for uidMap := range badgerWriteCh {
	// Acquire lock.
	// Store the uidMap entries in the cache.
	// Store the uidMap entries in badger.
	memPool.Put(uidMap)
}

if len(uidMap) > 1000000 { // trigger point
	badgerWriteCh <- uidMap

	for key := range uidMap {
		// Add the key to the bloom filter.
	}

	uidMap = memPool.Get()
}
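For concreteness, here is a hedged sketch of what the writer goroutine could look like with actual Badger (assuming the v2 WriteBatch API) and Ristretto calls. The function name, key encoding, and error handling are assumptions for illustration, not the final implementation:

package xidmap

import (
	"encoding/binary"
	"log"
	"sync"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/ristretto"
)

// runBadgerWriter drains badgerWriteCh, mirroring each flushed map into the
// ristretto cache, persisting it to badger, and then recycling the map.
func runBadgerWriter(db *badger.DB, cache *ristretto.Cache,
	badgerWriteCh <-chan map[string]uint64, memPool *sync.Pool) {

	for uidMap := range badgerWriteCh {
		wb := db.NewWriteBatch()
		for xid, uid := range uidMap {
			var val [8]byte
			binary.BigEndian.PutUint64(val[:], uid)

			// Keep the pair in the cache so most future reads skip badger.
			cache.Set(xid, uid, 1)

			if err := wb.Set([]byte(xid), val[:]); err != nil {
				log.Fatal(err)
			}
		}
		if err := wb.Flush(); err != nil {
			log.Fatal(err)
		}

		// Clear the map and return it to the pool for reuse.
		for k := range uidMap {
			delete(uidMap, k)
		}
		memPool.Put(uidMap)
	}
}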

Read

uid, ok := uidMap[xid] // check if the xid exists in current map
if !ok && exists_in_bloom_filter {
	// Check cache for xid
	// Acquire lock
	// Wait for writes to finish. 
	// Seek the key from badger
}
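Fleshed out slightly, the lookup path could look like the following sketch. The bloom filter library, parameter names, and value encoding are assumptions; the point is only the order of checks (map → bloom filter → cache → Badger):

package xidmap

import (
	"encoding/binary"

	"github.com/bits-and-blooms/bloom/v3"
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/ristretto"
)

// lookup returns the uid for xid, checking the in-memory map, then the bloom
// filter, then the ristretto cache, and finally badger.
func lookup(xid string, uidMap map[string]uint64, filter *bloom.BloomFilter,
	cache *ristretto.Cache, db *badger.DB) (uid uint64, found bool, err error) {

	if uid, ok := uidMap[xid]; ok {
		return uid, true, nil
	}
	// A negative bloom filter answer means the xid was never flushed,
	// so neither the cache nor badger can have it.
	if !filter.Test([]byte(xid)) {
		return 0, false, nil
	}
	if v, ok := cache.Get(xid); ok {
		return v.(uint64), true, nil
	}
	// Cache miss: seek the key in badger. In the real design we would first
	// wait for any pending flush of this shard to finish.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte(xid))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			uid = binary.BigEndian.Uint64(val)
			found = true
			return nil
		})
	})
	if err == badger.ErrKeyNotFound { // possible bloom filter false positive
		return 0, false, nil
	}
	return uid, found, err
}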

badgerWriteCh is a buffered channel through which full uidMaps are sent to be written to Badger. The buffer size of this channel lets us bound the memory used by the xidMap. With this solution, the bulk loader's memory usage stays around 40 GB regardless of the input size, and the xidmap itself is limited to about 4 GB.
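As a rough illustration of how these two knobs bound memory, assuming the 1-million-entry flush threshold from the trigger above and a small channel buffer (the buffer size here is an assumption):

package xidmap

import "sync"

const (
	maxMapLength    = 1_000_000 // entries per in-memory map before it is flushed
	maxBufferedMaps = 4         // flushed maps allowed to wait for the badger writer
)

// newWritePipeline shows how the pieces bound memory: once the channel buffer
// is full, the next flush blocks, so roughly maxBufferedMaps+1 maps of
// maxMapLength entries exist per shard at any time, regardless of input size.
// The pool recycles flushed maps instead of allocating new ones.
func newWritePipeline() (chan map[string]uint64, *sync.Pool) {
	badgerWriteCh := make(chan map[string]uint64, maxBufferedMaps)
	memPool := &sync.Pool{
		New: func() interface{} {
			return make(map[string]uint64, maxMapLength)
		},
	}
	return badgerWriteCh, memPool
}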

Performance impact

The performance of the solution depends on the ratio of unique xids to total xids in the dataset. When that ratio is high, we expect the solution to perform relatively better. To test this hypothesis, we created 5 datasets that represent the common cases for the bulk loader: small or large datasets, with a low or high number of unique xids.

  1. 100 Million RDFs (10 million xid)
  2. 1 Billion RDFs (100 million xid)
  3. 100 Million RDFs (100 million xid)
  4. 300 Million RDFs (300 million xid)
  5. 1 Billion RDFs (1 billion xid)

Across these datasets, memory usage was typically around 40-45 GB, except for the 1 Billion / 1 Billion and 300 Million / 300 Million cases on master. In those two cases memory spikes to around 50 GB, and at around 350 million xids master OOMs.

| Dataset (RDFs / unique xids) | 1 Billion / 1 Billion | 1 Billion / 100 Million | 100 Million / 100 Million | 100 Million / 10 Million | 300 Million / 300 Million |
|---|---|---|---|---|---|
| Master | OOM (18m 45s, ~357M RDFs) | 36m 57s | 3m 25s | 3m 04s | 15m 41s |
| BigMap | 1h 19m 40s | 50m | 4m 26s | 3m 14s | 15m 04s |

Apart from the 1 Billion / 1 Billion case, master outperforms the solution. However, as the load on the xidMap grows, master slows down and uses too much memory. We therefore suggest using this solution only for datasets with a high number of unique xids.

Alternate solutions tried

  1. Skip list in badger instead of a BatchWriter: skip lists have a predefined size, requiring us to either guess how much memory to use or keep growing it. If we predefine the memory, we could just as well allocate it in the xidMap and get a similar performance gain. It also wouldn't help with large datasets where we need to dump data to disk.

  2. The number of map shards/badgers: keeping only 1 badger for 32 map shards led to many threads waiting on reads, as badger became the bottleneck. We tried 1 badger per map shard, but that led to all the CPU time being hogged by badger. That setup performs well with a high unique-to-total xid ratio, but in other cases it cancels out the benefit of the cache. Using 16 badgers for 32 map shards seems to be the most optimal.

  3. Smaller write batches: one concern is that dumping too much data onto disk at once could cause bottlenecks. A simple fix is to decrease the write batch size. However, issuing smaller writes slows down overall progress, which offsets the gain.

  4. Higher threshold: the threshold for triggering a disk write is currently hardcoded to 1 million. Increasing it doesn't change performance but does increase memory usage: even though we keep more data in memory, we still spend the same total time writing to disk, so we see no gain.

  5. Hash: the current xidMap is a map from string to uint64. We could change it to a map from uint64 to uint64 by keying on a hash of the xid (see the sketch after this list). This has no performance benefit, but the bulk loader uses less space. It also introduces a chance of error in the case of hash collisions.

  6. DiskV: an existing GitHub library that solves the same problem. It is too slow compared to our solution.
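A small sketch of the hashing idea from (5), using the standard library's FNV-1a hash; the helper names are hypothetical:

package xidmap

import "hash/fnv"

// hashXid maps an xid to a 64-bit key so the xidMap can be a
// map[uint64]uint64 instead of a map[string]uint64.
func hashXid(xid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(xid))
	return h.Sum64()
}

// assignHashed looks up or assigns a uid using the hashed key. Two distinct
// xids that collide on the hash would silently share a uid, which is the
// error risk mentioned in (5).
func assignHashed(m map[uint64]uint64, xid string, newUid func() uint64) uint64 {
	key := hashXid(xid)
	if uid, ok := m[key]; ok {
		return uid
	}
	uid := newUid()
	m[key] = uid
	return uid
}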

BigMap Library

Interfaces

  • Get([]byte) interface{}
  • Set([]byte, interface{}) bool
  • Upsert([]byte, interface{}) bool

Things that are configurable:

  • The number of map shards
  • The number of bloom filter, badger, and cache shards (can be set to 0 to behave as a normal map)
  • Pre-reserve map space
  • Maximum capacity of the map
  • Maximum number of buffered maps to keep in memory
  • The hash function for the key
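Putting the interface and the knobs above together, the library's shape might look something like this; the type and field names are guesses for illustration, not the published API:

package bigmap

// BigMap mirrors the operations listed above.
type BigMap interface {
	Get(key []byte) interface{}
	Set(key []byte, value interface{}) bool
	Upsert(key []byte, value interface{}) bool
}

// Options gathers the configurable knobs listed above.
type Options struct {
	NumMapShards    int // number of in-memory map shards
	NumBackedShards int // bloom filter / badger / cache shards; 0 behaves like a plain map
	PreReserve      int // pre-reserved space per map shard
	MaxMapLength    int // maximum capacity of a map before it is flushed
	MaxBufferedMaps int // flushed maps allowed to stay buffered in memory
	KeyHash         func(key []byte) uint64 // hash function for the key
}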

Set benchmarks

I inserted 500 million entries using different configurations: 1 badger, 8 badgers, and 32 badgers.

[Plots: image_out, image_log]

Increasing the number of badgers decreases the time taken to insert all 500 million entries, but it also increases memory usage. The 32-badger configuration crashes before 500 million entries are inserted, as does Go's native map. 8 badgers seem to give the best performance per unit of memory.

[Plots: image_out1, image_log1]

As one can see, changing the maximum length of the map doesn't have much effect on performance, but memory usage is noticeably lower for 100k/10k than for 1 million.

Hence, 100k with 8 badgers seems to be the best configuration for BigMap.

1 billion entries for 100k/8 badgers

[Plots: image_out2, image_log2]

Get benchmarks

I performed 50% set and 50% read operations, with the following results.

[Plots: image_out4, image_log4]


Build a storage-backed caching library first: involve Ristretto, skip lists, bloom filters, Badger, etc. Once that's built, tested thoroughly, and pushed hard, it can be introduced into the bulk and live loaders.


Very well written post. @harshil_goel

Some suggestions

1. --dry-run option

While you convert this into a library, I'd also suggest adding a --dry-run option to the bulk loader or the library. The bulk loader would do a dry run without allocating any UIDs and produce a report and/or dump it to a file. During the dry run, it simply reads the input and keeps a count of unique xids, total xids, and total RDFs. You would need a map from xid → bool, but its memory needs would be lower than the actual map by a factor of ~8 (because the value is a bool and not a uint64).

The report could include how many xids exist in the dataset, total RDFs, etc. This would allow the admin to tune the bulk loader, or the bulk loader could read the report file and intelligently decide whether to enable --limitMemory mode.

We could go a step further with --dry-run <time>, which would scan the input data for that duration and report the xids seen. This limits the cost of a dry run on very large datasets while still giving a guesstimate of the xid/RDF ratio, though it would not be 100% foolproof.

2. Adaptive limitMemory

Instead of keeping a hard-coded threshold of 100K, we could take into account the system's total RAM, the data file size, etc., to figure out a threshold.

3. CLI Option for ratio of unique xid/xid

In addition to, or instead of, (1), we could just ask the user for their best guess of the number of unique xids and total xids, or the ratio. The presumption is that users generally know their datasets and may be able to estimate this. That would let us decide whether to enable --limitMemory mode.

Finally, why is the ratio of unique xids to total xids important? I think the absolute number of unique xids will drive most of the decision making, not the ratio, right?

Edit: I think HyperLogLog would be a better approach (HyperLogLog - Wikipedia).

Bloom filters with atomic count would be cheaper in terms of memory (but slower) than storing a map just to keep track of unique xids. Something like the following should work

increment uniq if the xid is not in the bloom filter, and add the xid to the bloom filter

We could also use a count-min sketch to get an approximate count for each xid → id pair.
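A minimal sketch of the bloom-filter-with-atomic-counter idea for the dry run; the bloom filter library, sizing, and locking here are assumptions:

package dryrun

import (
	"sync"
	"sync/atomic"

	"github.com/bits-and-blooms/bloom/v3"
)

// uniqCounter approximates the number of unique xids seen during a dry run:
// the counter is incremented only the first time the bloom filter sees an xid.
type uniqCounter struct {
	mu     sync.Mutex // the bloom filter itself is not safe for concurrent use
	filter *bloom.BloomFilter
	uniq   uint64
}

func newUniqCounter(expectedXids uint, falsePositiveRate float64) *uniqCounter {
	return &uniqCounter{filter: bloom.NewWithEstimates(expectedXids, falsePositiveRate)}
}

// See records one occurrence of xid, incrementing the unique count if the
// filter has (probably) not seen it before.
func (c *uniqCounter) See(xid string) {
	c.mu.Lock()
	seen := c.filter.TestAndAdd([]byte(xid))
	c.mu.Unlock()
	if !seen {
		atomic.AddUint64(&c.uniq, 1)
	}
}

// Unique returns the approximate number of unique xids seen so far.
func (c *uniqCounter) Unique() uint64 {
	return atomic.LoadUint64(&c.uniq)
}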

I like the --dry-run option compared to the other two because it might be the simplest for the end-user. The other two options would require some prior knowledge.


For (3), yes. But for (2), code can figure out the needed values.

Interesting. Need to read more about this.

Was this fix implemented in the bulk loader? I seem to be running into similar issues with Dgraph v20.11.1 right now and was wondering what a workaround could be if not.

We didn’t implement this. We switched over to a B+ tree

This B+ tree was earlier mmaped to conserve memory but now we keep it completely in memory

Thank you Ibrahim.

I am trying to load ~20B triples and want to keep the xid_map for future loading efforts, but it seems like that causes the machine to run out of memory rather quickly. Are there any proposed workarounds or flags that might help solve this issue?

Curious what kind of memory usage you are seeing. Can you give us the output from /jemalloc and from the Go memory profiler?

Better still, if you are OK sharing the data with us privately, we can run it and fix up any issues. Or perhaps use Dgraph Cloud and select to bulk load data for a new cluster. That would work too.