QUESTION: Is there a tailing iterator?

Moved from GitHub badger/1404

Posted by gowthamgutha:

What version of Go are you using (go version)?

$ go version
go version go1.14.4 windows/amd64

What operating system are you using?

Windows (but GOOS=linux)

What version of Badger are you using?

github.com/dgraph-io/badger/v2

We have a requirement where we only write the keys (key=time_in_millis + counter) and remove them (no updates).
We would like to have two goroutines, one that puts into the db and the other that iterates in insertion order and removes entries from the map when a certain condition is met.

We would like to have a tailing iterator that iterates over the values as they are put into the map (i.e. the iterator should not create a snapshot, and I should get all the new entries as they are added to the map, in insertion order, by another goroutine). If there are no entries in the map, then it should block.

Is this possible?

jarifibrahim commented :

the other that iterates in insertion order and removes entry from the map when a certain condition is met.

Do you mean badger when you say map?

We would like to have a tailing iterator that iterates over the values as they are put into the map (i.e. the iterator should not create a snapshot and that I should be getting all the new entries as they are added to the map in the insertion order by another go routine).

I’m assuming you mean badger when you say map.
No, Badger works on a snapshot and it won’t be able to see new writes until they are committed. You would need a new transaction (with a higher readTs than the previous transaction) to read the new values.

You could use the subscribe API to get all the new items. See badger/db.go at b5a5d66cabce7f3de8c2b55c2790a5b3964b9f90 · dgraph-io/badger · GitHub

Your application would look something like

// iterate badger and count number of items. If zero, set BLOCKED bool to true
// Subscribe to "*" prefix in badger so that you get all the updates.
// On every update/insertion do whatever operation you want to do and keep track of items you want to delete.
// On the subscription handler, set the BLOCKED bool to false (you'd need to run a goroutine to check for empty DB state which sets the BLOCKED bool to true)
// Later, delete whatever entries you want using the txn.Delete API
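
A rough, hedged sketch of that flow, assuming Badger v2's Subscribe API (a callback receiving a *badger.KVList and a list of key prefixes); shouldDelete and the "/tmp/badger" path are placeholders, and the BLOCKED bookkeeping is omitted for brevity:

package main

import (
	"context"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// shouldDelete is a placeholder for whatever condition decides an entry's removal.
func shouldDelete(key, value []byte) bool { return false }

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Subscribe with an empty prefix, which is assumed here to match every key
	// (the "*" prefix in the plan above). In a real program this subscriber
	// would be running before the writers start.
	go func() {
		err := db.Subscribe(ctx, func(kvs *badger.KVList) error {
			for _, kv := range kvs.Kv {
				if shouldDelete(kv.Key, kv.Value) {
					// Collect kv.Key here; delete later with txn.Delete.
				}
			}
			return nil
		}, []byte{})
		if err != nil {
			log.Println("subscription stopped:", err)
		}
	}()

	// Writer goroutine(s) commit new entries; each commit triggers the callback above.
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("key-1"), []byte("value-1"))
	}); err != nil {
		log.Fatal(err)
	}

	// Give the subscription a moment to deliver the update (demo only).
	time.Sleep(time.Second)
}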

@gowthamgutha would something like this work for your use case?

gowthamgutha commented :

@jarifibrahim

  1. Will the order of elements be the insertion order when the key is current_time_in_millis + counter?
  2. Also, is there any chance of getting duplicates in the subscription handler?
  3. Can I delete in the subscription handler itself?
  4. When we do txn.Delete can another goroutine simultaneously do txn.Set?

jarifibrahim commented :

@gowthamgutha

Will the order of elements be the insertion order when the key is current_time_in_millis + counter?

Do you mean to ask whether the subscription returns items based on insertion order? If so, the answer is yes, the subscribe API emits the items as they’re inserted into the DB.

Also, is there any chance of getting duplicates in the subscription handler?

Only if you insert duplicates

Can I delete in the subscription handler itself?

No, you’ll have to collect the keys and then separately call delete
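
For example, a minimal sketch of the "collect, then delete separately" step (flushDeletes is a placeholder name, not a Badger API):

package main

import (
	badger "github.com/dgraph-io/badger/v2"
)

// flushDeletes removes keys that were collected (e.g. in the subscription handler)
// in a single transaction. Very large batches would need to be split, since a
// single transaction returns badger.ErrTxnTooBig once it grows too large.
func flushDeletes(db *badger.DB, keys [][]byte) error {
	return db.Update(func(txn *badger.Txn) error {
		for _, k := range keys {
			if err := txn.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}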

When we do txn.Delete can another goroutine simultaneously do txn.Set?

Not on the same txn object. It is not thread-safe. You can do it on a different txn but your transactions can conflict.
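
A small sketch of what that conflict handling could look like with separate transactions in separate goroutines (deleteIfExpired and isExpired are placeholder names): a read-then-delete transaction that is retried when badger.ErrConflict is returned.

package main

import (
	badger "github.com/dgraph-io/badger/v2"
)

// isExpired is a placeholder for the condition that decides removal.
func isExpired(val []byte) bool { return false }

// deleteIfExpired reads a key's value and deletes it in one transaction,
// retrying if a concurrently committed transaction caused a conflict.
func deleteIfExpired(db *badger.DB, key []byte) error {
	for {
		err := db.Update(func(txn *badger.Txn) error {
			item, err := txn.Get(key)
			if err != nil {
				return err // e.g. badger.ErrKeyNotFound
			}
			return item.Value(func(val []byte) error {
				if isExpired(val) {
					return txn.Delete(key)
				}
				return nil
			})
		})
		if err != badger.ErrConflict {
			return err // nil on success, or a non-conflict error
		}
		// Another transaction modified this key after we read it; try again.
	}
}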

gowthamgutha commented :

@jarifibrahim Thanks for your reply, few more questions…

  1. Do I need to have a transaction for every put? (I want to just put and delete and don’t need any updates)
  2. When do we get notification in subscribe() handler, is it when the transaction is committed?
  3. Will there be any performance impact in running multiple transactions on different keys in different goroutines?

Also, can you provide any beginner’s tutorial for badger that has all the topics covered (at least to the extent we’ve discussed in this issue)?

jarifibrahim commented :

@gowthamgutha

Also, can you provide any beginner’s tutorial for badger that has all the topics covered (atleast to the extent we’ve discussed in this issue)?

I gave an internal tech talk about badger https://www.youtube.com/watch?v=VQAeO823W8M . It’s a good starting point. If you feel something is missing that might be helpful for others, please feel free to send documentation PRs (or documentation requests) :slight_smile:

Do I need to have a transaction for every put? (I want to just put and delete and don’t need any updates)

No, a transaction can store multiple updates (i.e. puts and deletes), but another transaction won’t be able to see these puts and deletes until the original transaction commits. Think of a transaction as a temporary store; only after commit is the data added to the permanent store.
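
For instance, a minimal sketch of one transaction carrying several puts and a delete (applyBatch and the keys are placeholders); nothing inside it is visible to other transactions until Update returns, i.e. until the commit succeeds:

package main

import (
	badger "github.com/dgraph-io/badger/v2"
)

func applyBatch(db *badger.DB) error {
	return db.Update(func(txn *badger.Txn) error {
		// Multiple puts and deletes share the same transaction...
		if err := txn.Set([]byte("k1"), []byte("v1")); err != nil {
			return err
		}
		if err := txn.Set([]byte("k2"), []byte("v2")); err != nil {
			return err
		}
		// ...and become visible to other transactions only after commit.
		return txn.Delete([]byte("old-key"))
	})
}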

When do we get notification in subscribe() handler, is it when the transaction is committed?

Yes, the subscribe handler will receive the key/value only after the transaction is committed.

Will there be any performance impact in running multiple transactions on different keys in different goroutines?

You would see higher CPU and memory usage because each transaction takes up some memory/CPU. I don’t think you should see any performance degradation when using multiple iterators.
Please be aware that badger follows snapshot isolation which means a transaction can see only a snapshot of data. If new data is added after a transaction starts running, the currently executing transaction won’t be able to see it.
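
A tiny sketch of that isolation behaviour (snapshotDemo is a placeholder name): a read-only transaction opened before a write does not see that write.

package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v2"
)

func snapshotDemo(db *badger.DB) error {
	// A read-only transaction pins a snapshot of the DB at this point in time.
	rtxn := db.NewTransaction(false)
	defer rtxn.Discard()

	// Commit a write after the read transaction was created.
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("late-key"), []byte("v"))
	}); err != nil {
		return err
	}

	// The older transaction still cannot see "late-key".
	if _, err := rtxn.Get([]byte("late-key")); err == badger.ErrKeyNotFound {
		fmt.Println("snapshot isolation: the older transaction cannot see the new key")
	}
	return nil
}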

gowthamgutha commented :

@jarifibrahim

I saw your video. You said that if the process crashes in the middle of a WriteBatch, whatever is uncommitted will be lost.

How do we decide when we have to flush the write batch? Is there any internal automatic periodic commit, for example based on time or number of records?

  1. We have a stream of data coming in that we want to put into badger. To increase the throughput we would need to use batching, and we would also need to commit the batches so that they don’t get lost when the process crashes.

  2. So, when the data is streaming (endless), calling Flush() will block the routine and we will not be able to keep up with our input stream… (I suppose for this we may need to use channels, and then write another goroutine that takes from the channel and puts it into badger, as in the sketch after this list.) Isn’t it? Or maybe we can omit calling Flush() altogether, because the commit happens asynchronously anyway?

  3. Does badger make use of memory-mapped files? If it does, I suppose that even if the process crashes it can recover, because the data will be flushed to disk? If not, do you see any point in storing the SSTs and value logs in memory-mapped files?
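
A rough sketch of the pattern described in (2), a channel feeding a writer goroutine that batches puts with WriteBatch; names like writeLoop and entry are placeholders, and the exact WriteBatch API may differ across Badger versions:

package main

import (
	badger "github.com/dgraph-io/badger/v2"
)

type entry struct {
	key, val []byte
}

// writeLoop drains the channel and queues entries on a WriteBatch. Flush blocks
// until everything queued so far has been committed, so here it is only called
// once the channel is closed; it could instead be called periodically.
func writeLoop(db *badger.DB, in <-chan entry) error {
	wb := db.NewWriteBatch()
	defer wb.Cancel()

	for e := range in {
		if err := wb.Set(e.key, e.val); err != nil {
			return err
		}
	}
	return wb.Flush()
}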