Hey, by the way, if you guys don’t mind me getting a little off-topic, I have some questions about applying Badger to my use case.
I want to make sure Badger has all the features and performance requirements I need before I invest more time into testing it and creating the C API for my needs.
I will explain what I’m trying to do with it in more detail.
I’m trying to store time-series financial data in an efficient manner. This data doesn’t need to be stored forever, but it does come in high volumes during certain periods of time.
So, for example, every five minutes I receive a very large number of values to store (around 150k~300k) divided into multiple keys. This data normally comes in batches of around 80 operations each, so 300k / 80 = 3750 batches.
Normally, each value changes one single key. In RocksDB, I organize this as follows:
I have around 800 RocksDB instances (they all share the same cache memory so I can control memory usage easily), each instance has around 20 column families, and each column family has around 160 keys (actually, it is way more keys; I will explain why below).
To store the data as a time series, I use the RocksDB prefix extractor and a custom comparator. For example:
Let’s say I want to add a new value to the key `rsi` in the column family `five_minutes` of a RocksDB instance. I do not want to replace the old value stored under `rsi`; I want to somehow append to it. To do that, I add an ISO 8601 timestamp to the end of the key, like this: `rsi.2020-09-20T02:07:51.353549173+00:00`. That way I can add new data to the `rsi` key without having to (de)serialize a list just to append to it: I simply add a new key with that name, and later I can retrieve a list of the latest `rsi` data using the RocksDB prefix iterator (note that I had to create a custom comparator, since by default RocksDB sorts keys lexicographically in ascending order).
Now, I do not want to keep this data lying around forever, so I use TTL to remove old data based on its column family. For example, the `five_minutes` column family removes data older than 4 days, `ten_minutes` older than 8 days, and so on.
It would be nicer to control the removal logic by number of values rather than by TTL, but I don’t think that is possible with the built-in features of RocksDB. For example: if the `rsi` key in the `five_minutes` column family exceeds a maximum of 1000 values, remove the oldest values beyond that limit.
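What I have in mind is essentially this kind of trim, which today I would have to run myself outside RocksDB (stand-alone sketch; `trimOldest` is a made-up helper, not an existing RocksDB or Badger API):

```go
package main

import (
	"fmt"
	"sort"
)

// trimOldest returns the keys that should be deleted so that at most
// max values remain for one series (e.g. all "rsi."-prefixed keys).
func trimOldest(keys []string, max int) []string {
	if len(keys) <= max {
		return nil
	}
	sorted := append([]string(nil), keys...)
	// Ascending lexicographic order == ascending chronological order
	// for fixed-width timestamp suffixes.
	sort.Strings(sorted)
	return sorted[:len(sorted)-max] // the oldest entries beyond the limit
}

func main() {
	keys := []string{
		"rsi.2020-09-20T02:09:51Z",
		"rsi.2020-09-20T02:07:51Z",
		"rsi.2020-09-20T02:08:51Z",
	}
	fmt.Println(trimOldest(keys, 2)) // only the oldest key is dropped
}
```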
So… that’s pretty much it. Now, do you guys think Badger would be a good fit for my use case?
One thing that caught my attention is key versioning, which I guess would give me the same behaviour as my timestamped-key workaround in RocksDB, but probably with better performance since keys are stored in memory. It would also let me limit the number of values (versions) per key instead of using TTL (or maybe I can even combine both).
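For reference, this is how I imagine those knobs would be used; it is an untested sketch based on my reading of the badger v3 godoc, so please correct me if I got the API wrong:

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Keep at most 1000 versions per key instead of relying only on TTL.
	opts := badger.DefaultOptions("/tmp/badger-demo").WithNumVersionsToKeep(1000)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A value can also carry a TTL, mirroring my per-column-family TTLs.
	err = db.Update(func(txn *badger.Txn) error {
		e := badger.NewEntry([]byte("rsi"), []byte("42.5")).WithTTL(4 * 24 * time.Hour)
		return txn.SetEntry(e)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reading all versions of a key requires AllVersions on the iterator.
	_ = db.View(func(txn *badger.Txn) error {
		io := badger.DefaultIteratorOptions
		io.AllVersions = true
		io.Prefix = []byte("rsi")
		it := txn.NewIterator(io)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			// it.Item().Version() gives the commit timestamp of each version.
		}
		return nil
	})
}
```

If I understand correctly, versions beyond `NumVersionsToKeep` are only physically dropped during compaction and value-log GC, which would still be fine for my retention needs.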
What I’m not sure about is whether opening multiple database instances is OK with Badger, and whether I can share cache memory between the instances to avoid using too much memory (normally all my RocksDB instances together peak at around 5 GB of RAM).
Note that I create one RocksDB instance per market (btc/usdt, eth/usdt, etc.) because otherwise I would need to create even more column families (e.g. `btc_usdt_five_minutes` instead of a `btc_usdt` instance with a `five_minutes` column family), and for some reason each column family I create in RocksDB takes longer to create than the previous one. I also tried creating just the 20 column families and adding the market to the key (e.g. `btc_usdt_rsi`), but for some reason that resulted in extremely slow column family compactions after a while.
Maybe Badger has a better solution for that?
I didn’t find any mention of anything similar to column families; if there is none, how would you recommend I organize my database?
Thanks for the help!