Speeding up insert-or-update

I’m currently trying to insert ~1.5 billion key/value pairs into a badger database on an SSD. The keys aren’t unique. What I need is, roughly speaking, to increment a counter in the serialized struct whenever a key is encountered, so every operation is a Get followed by a Set. There’s other data besides the counter in the struct, but it remains unchanged after the first Set. What should I know or do to get the best performance out of badger?
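
To make the workload concrete, each operation looks roughly like this (a stripped-down sketch, not my actual code; the record layout, helper functions and DB path are placeholders):

```go
package main

import (
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

// record stands in for the serialized struct: an 8-byte counter plus
// 10 bytes of data that never change after the first Set (~18-byte values).
type record struct {
	Count uint64
	Extra [10]byte
}

func encode(r record) []byte {
	buf := make([]byte, 18)
	binary.LittleEndian.PutUint64(buf[:8], r.Count)
	copy(buf[8:], r.Extra[:])
	return buf
}

func decode(val []byte) record {
	var r record
	r.Count = binary.LittleEndian.Uint64(val[:8])
	copy(r.Extra[:], val[8:])
	return r
}

// upsert is one Get followed by one Set: create the record the first
// time the key is seen, otherwise just bump its counter.
func upsert(db *badger.DB, key []byte, fresh record) error {
	return db.Update(func(txn *badger.Txn) error {
		r := fresh
		item, err := txn.Get(key)
		switch {
		case err == badger.ErrKeyNotFound:
			// first occurrence of this key: start from the fresh record
		case err != nil:
			return err
		default:
			if err := item.Value(func(val []byte) error {
				r = decode(val)
				return nil
			}); err != nil {
				return err
			}
		}
		r.Count++
		return txn.Set(key, encode(r))
	})
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo")) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	key := make([]byte, 8) // keys are 8 bytes
	binary.LittleEndian.PutUint64(key, 42)
	if err := upsert(db, key, record{}); err != nil {
		log.Fatal(err)
	}
}
```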

After the first Set, cache the key + counter in memory and only increment the cached counter. Once you are done ingesting, do a final Set per cached pair. If you run out of memory, cache only part of the key space.
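
Roughly something like this, as an untested sketch (the map-backed cache and the WriteBatch flush are just one way to do it; merging the counter back into the full struct is glossed over here):

```go
package ingest

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v4"
)

// counterCache keeps per-key counters in memory so the hot path never
// touches disk; everything is written out once, at the end.
type counterCache struct {
	counts map[string]uint64
}

func newCounterCache() *counterCache {
	return &counterCache{counts: make(map[string]uint64)}
}

// Add only bumps the in-memory counter.
func (c *counterCache) Add(key []byte) {
	c.counts[string(key)]++
}

// Flush does the final Set per cached pair in one batched pass. For
// brevity it writes just the counter bytes; in your case the final Set
// would have to fold the counter back into the full serialized struct.
func (c *counterCache) Flush(db *badger.DB) error {
	wb := db.NewWriteBatch()
	defer wb.Cancel()
	for k, n := range c.counts {
		val := make([]byte, 8)
		binary.LittleEndian.PutUint64(val, n)
		if err := wb.Set([]byte(k), val); err != nil {
			return err
		}
	}
	return wb.Flush()
}
```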

Badger already has block and index caches. What’s the point in reinventing the wheel while trying to avoid OOM?

[Summary]
Level 0 size:       57 MiB
Level 1 size:          0 B
Level 2 size:          0 B
Level 3 size:          0 B
Level 4 size:          0 B
Level 5 size:      1.2 GiB
Level 6 size:       12 GiB
Total SST size:     13 GiB
Value log size:    2.0 GiB

Is it OK that the intermediate levels are empty? The keys are still being inserted.

By the way, it’s crazy IMO that you can’t set some keys (the ones with the “!badger!” prefix).

I suspect the lookup via the indexes is not the slow part. You can profile your application to find out.
Disk I/O is slow. Every update to the badger db is persisted on disk. Caching the counters in memory and only writing them to the db once you are finished will be faster.
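
If you want to check where the time actually goes before changing anything, Go’s built-in pprof is the cheapest way; a minimal sketch (standard import path and the usual port, nothing badger-specific):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Expose the profiling endpoints while the ingestion runs, then e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... ingestion loop goes here ...
	select {} // placeholder to keep this sketch alive
}
```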

So the whole process took 1 day 6 hours 12 minutes 29 seconds. That’s 72 μs/op (op = Get+Set). Not bad. Maybe with additional caching I could shave those 6 hours off.
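(For the record: 1 d 6 h 12 m 29 s = 108,749 s, and 108,749 s / 1.5 × 10⁹ ops ≈ 72.5 µs/op, assuming all 1.5 billion pairs went through the Get+Set path.)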

I disabled swap, and badger started to get OOM-killed on me after eating 85+% of 24 GB of RAM. And that’s after inserting only about 70 million k/v pairs. The key size is 8 bytes, the value size is 18 bytes.
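
For reference, these are the memory-related knobs I’d expect to matter most here; the sizes below are placeholder values for illustration, not settings I have verified against this workload:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Placeholder sizes only; tune against a real memory profile.
	opts := badger.DefaultOptions("/tmp/badger-demo"). // placeholder path
		WithBlockCacheSize(256 << 20). // cap the block cache
		WithIndexCacheSize(128 << 20). // cap the index cache
		WithNumMemtables(2).           // fewer memtables held in RAM at once
		WithMemTableSize(32 << 20).    // smaller memtables flush sooner
		WithNumCompactors(2)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```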

Well, with the in-memory 1 GB cache suggested by vnium, the data ingestion took a bit less than 24 hours, just as I had predicted.