What's the best way to transfer a snapshot to a new member joining an existing cluster

Looking for a faster way to get a new member to sync with the snapshot from the leader.
I was using Badger 1.6.2 and just upgraded to 3.2103.2.
I currently iterate the keys with an iterator and pipe them to the new member.

  1. I came across Badger's Stream framework, which spins off multiple goroutines to transfer the data to the new member. Does it persist progress so it can survive restarts?
  2. Can I directly transfer the snapshot over the network and use it on the new member? (Rough sketch of what I mean after this list.)
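
For option 2, this is roughly what I have in mind, assuming Badger's built-in Backup/Load pair can be pointed straight at the network connection (conn here is just a placeholder for whatever transport the cluster uses):

// Rough sketch, not tested. db is the leader's *badger.DB, newDB is the
// freshly opened DB on the joining member, and conn is a placeholder for
// the network connection between them.

// Leader side: stream a full backup (since = 0) straight over the connection.
if _, err := db.Backup(conn, 0); err != nil {
	return err
}

// New member side: load the stream into the empty DB.
// 256 caps the number of pending writes; tune as needed.
if err := newDB.Load(conn, 256); err != nil {
	return err
}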

@mrjn @MichelDiz any suggestions here?

Sorry, I can’t help you with Badger. But maybe @amanmangal or @ibrahim can.

Hi @scr

Before we solve the problem, could you elaborate on the problem you are trying to solve? Did you try something that you found slow? How slow was it? Do you have any logs that you could share?

Thanks

Thanks @amanmangal.
I use Badger as persistent storage for my Raft protocol.
I currently take a snapshot and iterate through the keys, sending them over a pipe to the member that is trying to catch up with the existing database. This is slow because it's a single goroutine with a small buffer; it took around 60 minutes for a 5 GB database.

That all said, it's custom logic on top of the snapshot.

// sendSnapshot iterates over every key in the read transaction and writes
// length-prefixed, protobuf-encoded BackupEntry records to writer.
func sendSnapshot(txn *badger.Txn, writer io.Writer) error {
	it := txn.NewIterator(badger.DefaultIteratorOptions)
	defer it.Close()

	u32Buf := make([]byte, 4)
	smallBuf := make([]byte, 32*1024) // reused for entries that fit, to avoid per-key allocations
	bw := bufio.NewWriterSize(writer, 16384)

	for it.Rewind(); it.Valid(); it.Next() {
		item := it.Item()

		var entryBuf []byte
		err := item.Value(func(value []byte) error {
			if len(value) == 0 {
				return &kvdb.CorruptDbError{Message: "Empty ValueTuple", Key: item.KeyCopy(nil)}
			}

			entry := kvpb.BackupEntry{
				Key:        item.Key(),
				ValueTuple: value,
			}

			if sz := entry.Size(); sz <= len(smallBuf) {
				entryBuf = smallBuf[:sz]
			} else {
				entryBuf = make([]byte, sz)
			}
			n, err := entry.MarshalTo(entryBuf)
			if err != nil {
				return errors.Wrap(err, "Failed to marshal BackupEntry")
			}
			entryBuf = entryBuf[:n]

			return nil
		})
		if err != nil {
			return errors.Wrap(err, "item.Value failed")
		}

		// Frame each entry with a 4-byte big-endian length prefix.
		binary.BigEndian.PutUint32(u32Buf, uint32(len(entryBuf)))
		if _, err := bw.Write(u32Buf); err != nil {
			return err
		}
		if _, err := bw.Write(entryBuf); err != nil {
			return err
		}
	}

	return bw.Flush()
}
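
For completeness, the receiving side is roughly the mirror image (a sketch, not the exact code; it assumes BackupEntry has a generated Unmarshal method and uses Badger's WriteBatch for bulk writes):

// recvSnapshot is the mirror image of sendSnapshot above: it reads the
// length-prefixed BackupEntry records and applies them with a WriteBatch.
// Sketch only; assumes kvpb.BackupEntry has a generated Unmarshal method.
func recvSnapshot(db *badger.DB, reader io.Reader) error {
	br := bufio.NewReaderSize(reader, 16384)
	wb := db.NewWriteBatch()
	defer wb.Cancel()

	u32Buf := make([]byte, 4)
	for {
		// Read the 4-byte big-endian length prefix; EOF means we're done.
		if _, err := io.ReadFull(br, u32Buf); err != nil {
			if err == io.EOF {
				break
			}
			return err
		}

		entryBuf := make([]byte, binary.BigEndian.Uint32(u32Buf))
		if _, err := io.ReadFull(br, entryBuf); err != nil {
			return err
		}

		var entry kvpb.BackupEntry
		if err := entry.Unmarshal(entryBuf); err != nil {
			return errors.Wrap(err, "Failed to unmarshal BackupEntry")
		}
		if err := wb.Set(entry.Key, entry.ValueTuple); err != nil {
			return err
		}
	}

	return wb.Flush()
}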

I am happy to try out anything Badger supports internally and can test it. I am looking for a really quick way for the new member to catch up on the existing database.

@amanmangal can you suggest what would be the best way for member catch-up? Thanks!

Give me a few days to get back to you on this.


@amanmangal any update on this?
I tried the Badger Stream API, but it isn't much faster and it doesn't survive restarts.

Hi Sumith,

Sorry, I have been busy with the release. There are a few options that you could try. You could use the Stream framework to read the data instead; how long does that take for you? You could also take a look at how we do it in Dgraph here: dgraph/snapshot.go at main · dgraph-io/dgraph · GitHub.
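
Roughly something like this (an untested sketch; I'm assuming the v3 API where Send receives a *z.Buffer, and conn stands for your connection to the new member):

// Rough sketch, untested. Assumes Badger v3 where Send receives a
// *z.Buffer (github.com/dgraph-io/ristretto/z); conn is your connection
// to the new member.
stream := db.NewStream()
stream.NumGo = 16             // number of goroutines reading data in parallel
stream.LogPrefix = "Snapshot" // prefix used in Badger's progress logs
stream.Send = func(buf *z.Buffer) error {
	// buf holds a batch of serialized key-value pairs; forward it as-is.
	// You still need your own framing/decoding on the receiving side.
	_, err := conn.Write(buf.Bytes())
	return err
}
if err := stream.Orchestrate(ctx); err != nil {
	return err
}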

We also have a change coming up in Badger that'd speed things up: opt(stream): add option to directly copy over tables from lower levels (#1700) by mangalaman93 · Pull Request #1872 · dgraph-io/badger · GitHub. We plan to use it in Dgraph as shown in this PR: opt(snapshot): use full table copy when streaming the entire data (#7870) by mangalaman93 · Pull Request #8736 · dgraph-io/dgraph · GitHub. Note that this only works when you are transferring the full Badger data.

Could you also measure your network throughput using iperf or a similar tool? 1 hour for 5 GB of data (roughly 1.4 MB/s) seems like a lot to me; Badger should be able to read the data much faster than that.

Surviving restarts seems a bit more complex to me. When would you consider a key to be transferred to the other side: when you have sent it, or when the other side has acknowledged it? You will have to build a protocol that accounts for acknowledgements if needed. You could store the range of keys that has been sent/acknowledged to disk and modify the ChooseKey function accordingly, as sketched below. But in that case, you won't be able to use the optimization in the PR above.
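
Something along these lines (an untested sketch; lastAcked is a placeholder for whatever progress marker you persist, and since the Stream partitions key ranges across goroutines you would track progress per range in practice):

// Untested sketch: resume a partial transfer by skipping keys the other
// side has already acknowledged. lastAcked is a hypothetical marker
// loaded from disk on restart.
stream := db.NewStream()
stream.ChooseKey = func(item *badger.Item) bool {
	// Only send keys strictly greater than the last acknowledged key.
	return bytes.Compare(item.Key(), lastAcked) > 0
}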

Hope this helps.