QUESTION: Is there a tailing iterator?

Moved from GitHub badger/1404

Posted by gowthamgutha:

What version of Go are you using (go version)?

$ go version
go version go1.14.4 windows/amd64

What operating system are you using?

Windows (but GOOS=linux)

What version of Badger are you using?

github.com/dgraph-io/badger/v2

We have a requirement where we only write the keys (key=time_in_millis + counter) and remove them (no updates).
We would like to have two goroutines: one that puts into the db, and another that iterates in insertion order and removes entries from the map when a certain condition is met.

We would like to have a tailing iterator that iterates over the values as they are put into the map (i.e. the iterator should not create a snapshot, and I should get all the new entries as they are added to the map, in insertion order, by another goroutine). If there are no entries in the map, it should block.

Is this possible?

jarifibrahim commented :

the other that iterates in insertion order and removes entry from the map when a certain condition is met.

Do you mean badger when you say map?

We would like to have a tailing iterator that iterates over the values as they are put into the map (i.e. the iterator should not create a snapshot and that I should be getting all the new entries as they are added to the map in the insertion order by another go routine).

I’m assuming you mean badger when you say map.
No, Badger works on a snapshot and it won’t be able to see new writes until they are committed. You would new transaction (with higher readTs than the previous transaction) to read the new values.

You could use the subscribe API to get all the new items. See https://github.com/dgraph-io/badger/blob/b5a5d66cabce7f3de8c2b55c2790a5b3964b9f90/db.go#L1669-L1675

Your application would look something like

// iterate badger and count number of items. If zero, set BLOCKED bool to true
// Subscribe to "*" prefix in badger so that you get all the updates.
// On every update/insertion do whatever operation you want to do and keep track of items you want to delete.
// On the subscription handler, set the BLOCKED bool to false (you'd need to run a goroutine to check for empty DB state which sets the BLOCKED bool to true)
// Later, delete whatever entries you want using the txn.Delete API

@gowthamgutha would something like this work for your use case?

gowthamgutha commented :

@jarifibrahim

  1. Will the order of elements be the insertion order when the key is current_time_in_millis + counter?
  2. Also, is there any chance of getting duplicates in the subscription handler?
  3. Can I delete in the subscription handler itself?
  4. When we do txn.Delete can another goroutine simultaneously do txn.Set ?

jarifibrahim commented :

@gowthamgutha

Will the order of elements be the insertion order when the key is current_time_in_millis + counter?

Do you mean to ask whether the subscription would return items based on the insertion order? If so, the answer is yes, the subscribe API will emit the items in the order they’re inserted into the DB.

Also, is there any chance of getting duplicates in the subscription handler?

Only if you insert duplicates

Can I delete in the subscription handler itself?

No, you’ll have to collect the keys and then separately call delete

When we do txn.Delete can another goroutine simultaneously do txn.Set ?

Not on the same txn object. It is not thread-safe. You can do it on a different txn but your transactions can conflict.

gowthamgutha commented :

@jarifibrahim Thanks for your reply, few more questions…

  1. Do I need to have a transaction for every put? (I want to just put and delete and don’t need any updates)
  2. When do we get notification in subscribe() handler, is it when the transaction is committed?
  3. Will there be any performance impact in running multiple transactions on different keys in different goroutines?

Also, can you provide any beginner’s tutorial for badger that has all the topics covered (at least to the extent we’ve discussed in this issue)?

jarifibrahim commented :

@gowthamgutha

Also, can you provide any beginner’s tutorial for badger that has all the topics covered (at least to the extent we’ve discussed in this issue)?

I gave an internal tech talk about badger https://www.youtube.com/watch?v=VQAeO823W8M . It’s a good starting point. If you feel something is missing it might be helpful for others, please feel free to send documentation PRs (or documentation requests) :slight_smile:

Do I need to have a transaction for every put? (I want to just put and delete and don’t need any updates)

No, a transaction can hold multiple updates (i.e., puts and deletes), but another transaction won’t be able to see those puts and deletes until the original transaction commits. Think of a transaction as a temporary store; only after commit is the data added to the permanent store.

When do we get notification in subscribe() handler, is it when the transaction is committed?

Yes, the subscribe handler will receive the key/value only after the transaction is committed.

Will there be any performance impact in running multiple transactions on different keys in different goroutines?

You would see higher CPU and memory usage because each transaction takes up some memory/CPU. I don’t think you should see any performance degradation when using multiple iterators.
Please be aware that badger follows snapshot isolation which means a transaction can see only a snapshot of data. If new data is added after a transaction starts running, the currently executing transaction won’t be able to see it.

gowthamgutha commented :

@jarifibrahim

I saw your video. You said that if the process crashes in the middle of WriteBatch whatever that is uncommitted will be lost.

How do we decide when we have to flush the write batch? Is there any internal automatic periodic commit, for example based on time or number of records?

  1. We have a stream of data that is coming in and we would want to put it to badger and to increase the throughput, we would need to use batching and we would also need to commit them so that they don’t get lost when the process crashes.

  2. So, when the data is streaming (endless), calling Flush() will block the routine, so we will not be able to keep up with our input stream… (I suppose for this, we may need to use a channel and another goroutine that takes from the channel and puts it into badger.) Isn’t it? Or maybe we can omit calling Flush() altogether, because commits happen asynchronously anyway?

  3. Does badger make use of memory mapped files? If it does, I suppose that even if the process crashes it can recover, because it will be flushed to disk? If not, do you see any point in storing the SSTs and Value logs in memory mapped files?