While bulk loading a dataset with a high number of unique xids, the bulk loader runs out of memory in the mapper phase. We keep a sharded map (XidMap) of xids to their corresponding uids so that multiple instances of the same blank node get the same uid. The combined memory required by this sharded map and all the xids it holds grows significantly.
We propose introducing a command-line option to the bulk loader, `--limitMemory`. This option would allow the user to limit the memory required by the `xidMap` of the bulk loader. It works by dumping the `xidMap` onto disk using badger. We would prompt users to use this mode if the input file exceeds a certain size threshold. This option is only intended for the case where you are running a big dataset with a high number of unique xids.
Implementation Details
The current `xidMap` exists only inside one function, `AssignUid`. This function takes an `xid` as input and returns the uid for the corresponding `xid`. It first reads the map under a read lock to check whether the value exists and, if so, returns it. If it doesn't exist, it acquires a write lock, gets a new uid, and assigns it to the map.
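For reference, a minimal sketch of this read-lock/write-lock pattern (the shard type, field names, and uid allocation below are illustrative, not the actual Dgraph code):

```go
package xidmap

import "sync"

// shard is a simplified, illustrative version of a single XidMap shard.
type shard struct {
	sync.RWMutex
	uidMap  map[string]uint64
	nextUid uint64 // in the real code, new uids come from a uid lease/allocator
}

// AssignUid returns the uid for xid, allocating a new one if it has not been seen.
func (s *shard) AssignUid(xid string) uint64 {
	// Fast path: read under a read lock.
	s.RLock()
	uid, ok := s.uidMap[xid]
	s.RUnlock()
	if ok {
		return uid
	}

	// Slow path: take the write lock, re-check, then allocate a new uid.
	s.Lock()
	defer s.Unlock()
	if uid, ok := s.uidMap[xid]; ok {
		return uid
	}
	s.nextUid++
	s.uidMap[xid] = s.nextUid
	return s.nextUid
}
```

The re-check after taking the write lock avoids assigning two different uids to the same xid when two goroutines race on a new key.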
We propose adding a bloom filter, a ristretto cache, and a badger instance to this `xidMap`. Badger allows us to dump excess xids onto disk, while the bloom filter and the cache reduce the number of read calls we need to make to badger. Once the length of `uidMap` (the in-memory Go map for xids) exceeds a certain threshold, we dump it to the bloom filter, the ristretto cache, and finally badger.
When we need to read, we first check `uidMap`, then the bloom filter, then the cache, and finally badger.
Write
```go
// Writer goroutine: consumes evicted maps and persists them.
for uidMap := range badgerWriteCh {
	// acquire lock
	// store uidMap in cache
	// store uidMap in badger
	memPool.Put(uidMap)
}
```

```go
// Mapper side: evict the current map once it grows past the trigger point.
if len(uidMap) > 1000000 { // trigger point
	badgerWriteCh <- uidMap
	for key := range uidMap {
		// Add the key to the bloom filter
	}
	uidMap = memPool.Get()
}
```
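As an illustration of what the badger write could look like, here is a sketch assuming badger v2's `WriteBatch` API; the helper name `flushToBadger` and the value encoding are made up for this example, not taken from the actual implementation:

```go
package bigmap

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v2"
)

// flushToBadger persists one evicted uidMap in a single write batch.
func flushToBadger(db *badger.DB, uidMap map[string]uint64) error {
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for xid, uid := range uidMap {
		// Encode the uid as 8 big-endian bytes; the real encoding may differ.
		var buf [8]byte
		binary.BigEndian.PutUint64(buf[:], uid)
		if err := wb.Set([]byte(xid), buf[:]); err != nil {
			return err
		}
	}
	return wb.Flush()
}
```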
Read
```go
uid, ok := uidMap[xid] // check if the xid exists in the current map
if !ok && existsInBloomFilter(xid) {
	// Check cache for xid
	// Acquire lock
	// Wait for writes to finish
	// Seek the key from badger
}
```
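Fleshed out a little, the disk fallback could look roughly like the sketch below. It assumes badger's transaction API; `bloomMayContain` and `cacheGet` are illustrative stand-ins for the bloom filter and ristretto cache lookups, not real functions:

```go
package bigmap

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v2"
)

// Stand-ins for the real shard state; the names and shapes are illustrative.
var (
	uidMap          map[string]uint64               // current in-memory shard
	bloomMayContain func(xid string) bool           // bloom filter membership test
	cacheGet        func(xid string) (uint64, bool) // ristretto cache lookup
)

// lookupUid resolves an xid, only touching badger when the bloom filter says
// the key may have been evicted to disk and the cache misses.
func lookupUid(db *badger.DB, xid string) (uid uint64, found bool, err error) {
	if uid, ok := uidMap[xid]; ok {
		return uid, true, nil
	}
	if !bloomMayContain(xid) {
		// The key was never dumped to disk, so it is genuinely new.
		return 0, false, nil
	}
	if uid, ok := cacheGet(xid); ok {
		return uid, true, nil
	}
	// In practice, pending writes on badgerWriteCh must be flushed or waited
	// on first, as in the pseudocode above.
	err = db.View(func(txn *badger.Txn) error {
		item, e := txn.Get([]byte(xid))
		if e != nil {
			return e
		}
		return item.Value(func(val []byte) error {
			uid = binary.BigEndian.Uint64(val)
			return nil
		})
	})
	if err == badger.ErrKeyNotFound {
		return 0, false, nil
	}
	return uid, err == nil, err
}
```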
`badgerWriteCh` is a buffered channel through which the data is dumped onto badger. The buffer size of this channel allows us to limit the memory used by the `xidMap`. This solution limits the memory used by the bulk loader to around 40GB regardless of the input size, with the `xidMap` itself limited to 4GB of space.
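A short sketch of how the bounded channel and map pool could interact (the buffer size, map capacity, and the use of `sync.Pool` here are illustrative, not the actual values or implementation):

```go
package bigmap

import "sync"

// badgerWriteCh's buffer bounds how many evicted uidMaps can be queued for
// disk writes at once: once it is full, the mapper blocks instead of
// accumulating more maps in memory.
var badgerWriteCh = make(chan map[string]uint64, 16)

// memPool recycles evicted maps so a fresh map does not have to be
// reallocated and re-grown after every dump to badger (in practice the
// writer clears a map before putting it back).
var memPool = sync.Pool{
	New: func() interface{} {
		return make(map[string]uint64, 1<<20) // roughly the 1M trigger point
	},
}
```

With a buffer of 16 and roughly a million entries per map, the number of evicted-but-unwritten maps, and hence their memory, stays bounded even when badger falls behind.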
Performance impact
The performance of the solution depends upon the ratio of unique to total xids in the dataset. For a high ratio of unique xids, we expect the solution to perform better. To check this hypothesis, we created 5 different datasets that represent various cases for the bulk loader: the dataset can be small or big, with a high or low number of unique xids.
- 100 Million RDFs (10 million unique xids)
- 1 Billion RDFs (100 million unique xids)
- 100 Million RDFs (100 million unique xids)
- 300 Million RDFs (300 million unique xids)
- 1 Billion RDFs (1 billion unique xids)
Across these datasets, memory usage was typically around 40-45 GB, except for the 1 Billion/1 Billion and 300 Million/300 Million cases on master. In those two cases, memory spikes to around 50 GB, and in the 1 Billion/1 Billion case master goes out of memory at around 350 million RDFs.
| RDFs / unique xids | 1 Billion / 1 Billion | 1 Billion / 100 Million | 100 Million / 100 Million | 100 Million / 10 Million | 300 Million / 300 Million |
|---|---|---|---|---|---|
| Master | OOM (18m 45s, ~357M RDFs) | 36m 57s | 3m 25s | 3m 04s | 15m 41s |
| BigMap | 1h 19m 40s | 50m | 4m 26s | 3m 14s | 15m 04s |
Apart from the 1 Billion / 1 Billion case, master outperforms the solution. However, as the load on the `xidMap` increases, master becomes slow and takes up too much memory. Thus we suggest users enable the solution only when they have a high number of unique xids.
Alternate solutions tried
- Skip list in badger instead of a `BatchWriter`: Skip lists have a predefined size, requiring us to either guess how much memory to use or to continually grow it. If we predefine the memory, we could just allocate that memory in the `xidMap` and get a similar performance gain. Moreover, it wouldn't help with big datasets where we need to dump the data off to disk.
- The number of map shards/badgers: Keeping only 1 badger per 32 map shards led to a lot of threads waiting on reads, as badger was the bottleneck. We tried using 1 badger per map shard, but that led to all the CPU time being hogged by badger. This performs well for a heavy unique/total xid ratio, but in other cases it cancels out the effect of the cache. Using 16 badgers for 32 map shards seems to be the most optimal.
- Smaller write batches: One concern is that we dump too much data at once onto disk, which could cause bottlenecks. One simple solution is to decrease the write batch size. However, running smaller writes slows down the progress, which offsets the gain it provides.
- Higher threshold: The current threshold for triggering a disk write is hardcoded to 1 million. Increasing this threshold doesn't change performance while increasing the memory used: even though we keep more data in memory, we still spend the same amount of time writing to disk, so we see no gain.
- Hash: The current `xidMap` is a map from string to uint64. We could change that to a map from uint64 to uint64 by replacing the xid with a hash of the xid (see the sketch after this list). This doesn't have any performance benefit, but the bulk loader takes less space. It also introduces a chance of error in the case of hash collisions.
- DiskV: A GitHub-hosted solution to the same problem. It's too slow compared to our solution.
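For the hashing idea above, a minimal sketch of the key transformation (using the standard library's FNV-1a purely as an example; the actual choice of hash function is open):

```go
package bigmap

import "hash/fnv"

// hashXid maps an xid string to a uint64 key. Using the hash instead of the
// string shrinks the map's key size, at the cost of a small but non-zero
// chance of two different xids colliding on the same uint64.
func hashXid(xid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(xid))
	return h.Sum64()
}
```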
BigMap Library
Interfaces
- Get([]byte) interface{}
- Set([]byte, interface{}) bool
- Upsert([]byte, interface{}) bool
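Interpreting the signatures above, the interface could be declared roughly as follows, assuming the keys are `[]byte`; the doc comments spell out one plausible reading of the semantics rather than the final API:

```go
package bigmap

// BigMap is a sketch of the interface described above: a map-like structure
// keyed by byte slices that can spill to disk when it grows too large.
type BigMap interface {
	// Get returns the value stored for key, or nil if it is absent.
	Get(key []byte) interface{}
	// Set stores value under key and reports whether the write was accepted.
	Set(key []byte, value interface{}) bool
	// Upsert stores value only if key is absent; the bool is assumed here to
	// report whether an insert actually happened.
	Upsert(key []byte, value interface{}) bool
}
```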
Things that are configurable:
- The number of map shards
- The number of bloom filter, badger, and cache shards (can be set to 0 to work as a normal map)
- Pre-reserve map space
- Maximum capacity of the map
- Maximum number of buffered maps to keep in memory
- The hash function for the key
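One plausible way to surface these knobs is an options struct along these lines (field names, groupings, and comments are illustrative, not the library's actual API):

```go
package bigmap

// Options collects the configurable knobs listed above.
type Options struct {
	NumMapShards    int // number of in-memory map shards
	NumBadgerShards int // bloom filter / badger / cache shards; 0 makes it behave as a normal map
	PreReserve      int // pre-reserved capacity per map shard
	MaxMapLen       int // maximum capacity of a map before it is dumped to disk
	MaxBufferedMaps int // maximum number of evicted maps kept buffered in memory
	KeyHash         func(key []byte) uint64 // hash function used for the key
}
```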
Set benchmarks
I inserted 500 million entries into different configurations with 1, 8, and 32 badgers.
Increasing the number of badgers decreases the time taken to complete the insertion of 500 million entries, but it also increases the memory used. The 32-badger configuration crashes before 500 million entries are added, as does Go's native map. 8 badgers seem to be the most optimal for performance per unit of memory.
Changing the maximum length of the map doesn't have much effect on performance; however, memory usage is noticeably lower for 100k/10k than for 1 million.
Hence 100k/8 badgers seems the best configuration for BigMap.
1 Billion entries for 100k/8badgers
Get benchmarks
I performed 50% set and 50% read operations, with the following results.
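For context on how such a 50/50 workload can be driven, here is a rough sketch using Go's testing harness against the hypothetical `BigMap` interface and `Options` sketched above; `newBigMap` stands in for whatever constructor the library exposes, and this shows only the workload shape, not the measured results:

```go
package bigmap

import (
	"encoding/binary"
	"testing"
)

// newBigMap is a stand-in for the library's constructor.
var newBigMap func(opts Options) BigMap

// BenchmarkMixed drives a 50% Set / 50% Get workload against a BigMap.
func BenchmarkMixed(b *testing.B) {
	m := newBigMap(Options{NumMapShards: 32, NumBadgerShards: 8, MaxMapLen: 100_000})
	for i := 0; i < b.N; i++ {
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, uint64(i))
		if i%2 == 0 {
			m.Set(key, uint64(i))
		} else {
			m.Get(key)
		}
	}
}
```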