Performance regression 1.6 to 2.0.2

jarifibrahim commented :

@dougdonohoe How are you inserting the data? Stream writer?

dougdonohoe commented :

I'm using badger.WriteBatch. This smells like a race-condition bug. The error occurs when calling Flush().

dougdonohoe commented :

In other news, I ran my test program to create 22,000,000 entries (with random lookups done for each one). In 1.6, it ran in 25m22s. In 2.0.2 with patch, it ran in 20m40s. So that is a positive sign.

OTOH, I was trying to compare the real-world performance of my Kafka extraction program, but it died due to the above filename bug. At the time of the crash, it had been running for 1h45m; the last run under 1.6 reached approximately the same point in 1h30m. So 2.x appears to be about 16% slower. Note that this code is predominantly write-focused.

dougdonohoe commented :

I don't know much about the code, but I noticed this:

1.6 value.go:1013

			newid := atomic.AddUint32(&vlog.maxFid, 1)

2.0 value.go:1405

			newid := vlog.maxFid + 1

Seems suspicious to have removed the atomic add.

dougdonohoe commented :

I spent a little more time trying to understand the Badger code base. It's a lot to come up to speed on. The maxFid logic obviously changed substantially from 1.6 to 2.0, so my point above may not be relevant.

However, one thing I don't understand in 2.0 is why the *.vlog file timestamps aren't monotonically increasing.

In 1.6, ls by timestamp shows timestamps increasing along with file id.

$ ls -ltrh *.vlog
-rw-rw-r-- 1 donohoe donohoe 250M Feb 17 15:08 000000.vlog
-rw-rw-r-- 1 donohoe donohoe 268M Feb 17 15:13 000001.vlog
-rw-rw-r-- 1 donohoe donohoe 247M Feb 17 15:17 000002.vlog
-rw-rw-r-- 1 donohoe donohoe 232M Feb 17 15:21 000003.vlog
-rw-rw-r-- 1 donohoe donohoe 209M Feb 17 15:23 000004.vlog
-rw-rw-r-- 1 donohoe donohoe 288M Feb 17 15:28 000005.vlog
-rw-rw-r-- 1 donohoe donohoe 310M Feb 17 15:33 000006.vlog
-rw-rw-r-- 1 donohoe donohoe 273M Feb 17 15:38 000007.vlog
-rw-rw-r-- 1 donohoe donohoe 280M Feb 17 15:43 000008.vlog
-rw-rw-r-- 1 donohoe donohoe 279M Feb 17 15:51 000009.vlog
-rw-rw-r-- 1 donohoe donohoe 299M Feb 17 16:04 000010.vlog
-rw-rw-r-- 1 donohoe donohoe 274M Feb 17 16:12 000011.vlog
-rw-rw-r-- 1 donohoe donohoe 220M Feb 17 16:14 000012.vlog
-rw-rw-r-- 1 donohoe donohoe 218M Feb 17 16:17 000013.vlog
-rw-rw-r-- 1 donohoe donohoe 412M Feb 17 16:24 000014.vlog
-rw-rw-r-- 1 donohoe donohoe 259M Feb 17 16:28 000015.vlog
-rw-rw-r-- 1 donohoe donohoe 253M Feb 17 16:30 000016.vlog
-rw-rw-r-- 1 donohoe donohoe 288M Feb 17 16:33 000017.vlog
-rw-rw-r-- 1 donohoe donohoe 246M Feb 17 16:36 000018.vlog
-rw-rw-r-- 1 donohoe donohoe 188M Feb 17 16:37 000019.vlog
-rw-rw-r-- 1 donohoe donohoe 187M Feb 17 16:38 000020.vlog
-rw-rw-r-- 1 donohoe donohoe 225M Feb 17 16:39 000021.vlog
-rw-rw-r-- 1 donohoe donohoe 245M Feb 17 16:41 000022.vlog
-rw-rw-r-- 1 donohoe donohoe 223M Feb 17 16:44 000023.vlog
-rw-rw-r-- 1 donohoe donohoe 220M Feb 17 16:47 000024.vlog
-rw-rw-r-- 1 donohoe donohoe 228M Feb 17 16:50 000025.vlog
-rw-rw-r-- 1 donohoe donohoe 228M Feb 17 16:53 000026.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 16:56 000027.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 16:59 000028.vlog
-rw-rw-r-- 1 donohoe donohoe 243M Feb 17 17:02 000029.vlog
-rw-rw-r-- 1 donohoe donohoe 244M Feb 17 17:05 000030.vlog
-rw-rw-r-- 1 donohoe donohoe 238M Feb 17 17:09 000031.vlog
-rw-rw-r-- 1 donohoe donohoe 244M Feb 17 17:12 000032.vlog
-rw-rw-r-- 1 donohoe donohoe 267M Feb 17 17:15 000033.vlog
-rw-rw-r-- 1 donohoe donohoe 240M Feb 17 17:18 000034.vlog
-rw-rw-r-- 1 donohoe donohoe 249M Feb 17 17:23 000035.vlog
-rw-rw-r-- 1 donohoe donohoe 275M Feb 17 17:27 000036.vlog
-rw-rw-r-- 1 donohoe donohoe 282M Feb 17 17:32 000037.vlog
-rw-rw-r-- 1 donohoe donohoe 303M Feb 17 17:35 000038.vlog
-rw-rw-r-- 1 donohoe donohoe 311M Feb 17 17:38 000039.vlog
-rw-rw-r-- 1 donohoe donohoe 276M Feb 17 17:43 000040.vlog
-rw-rw-r-- 1 donohoe donohoe 253M Feb 17 17:47 000041.vlog
-rw-rw-r-- 1 donohoe donohoe 226M Feb 17 17:50 000042.vlog
-rw-rw-r-- 1 donohoe donohoe 235M Feb 17 17:53 000043.vlog
-rw-rw-r-- 1 donohoe donohoe 229M Feb 17 17:56 000044.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 17:59 000045.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:03 000046.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:06 000047.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:09 000048.vlog
-rw-rw-r-- 1 donohoe donohoe 274M Feb 17 18:14 000049.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:17 000050.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:21 000051.vlog
-rw-rw-r-- 1 donohoe donohoe 231M Feb 17 18:24 000052.vlog
-rw-rw-r-- 1 donohoe donohoe 230M Feb 17 18:27 000053.vlog
-rw-rw-r-- 1 donohoe donohoe 273M Feb 17 18:32 000054.vlog
-rw-rw-r-- 1 donohoe donohoe 290M Feb 17 18:37 000055.vlog
-rw-rw-r-- 1 donohoe donohoe 292M Feb 17 18:44 000056.vlog
-rw-rw-r-- 1 donohoe donohoe 164M Feb 18 03:09 000057.vlog

However, in 2.0 it looks like .vlog files are accessed again after they have been written. In my use case, I'm only setting one value for a particular key, so there's no need to remove/touch previous .vlog files. Yet:

$ ls -ltrh *.vlog
-rw------- 1 donohoe donohoe 181M Mar 12 16:47 000018.vlog
-rw------- 1 donohoe donohoe 177M Mar 12 16:48 000019.vlog
-rw------- 1 donohoe donohoe 192M Mar 12 16:49 000020.vlog
-rw------- 1 donohoe donohoe 238M Mar 12 16:51 000000.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 16:51 000021.vlog
-rw------- 1 donohoe donohoe 229M Mar 12 16:55 000022.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 16:58 000001.vlog
-rw------- 1 donohoe donohoe 207M Mar 12 16:59 000023.vlog
-rw------- 1 donohoe donohoe 222M Mar 12 17:02 000024.vlog
-rw------- 1 donohoe donohoe 235M Mar 12 17:03 000002.vlog
-rw------- 1 donohoe donohoe 211M Mar 12 17:05 000025.vlog
-rw------- 1 donohoe donohoe 221M Mar 12 17:08 000003.vlog
-rw------- 1 donohoe donohoe 224M Mar 12 17:09 000026.vlog
-rw------- 1 donohoe donohoe 198M Mar 12 17:11 000004.vlog
-rw------- 1 donohoe donohoe 229M Mar 12 17:13 000027.vlog
-rw------- 1 donohoe donohoe 276M Mar 12 17:16 000005.vlog
-rw------- 1 donohoe donohoe 232M Mar 12 17:16 000028.vlog
-rw------- 1 donohoe donohoe 230M Mar 12 17:21 000029.vlog
-rw------- 1 donohoe donohoe 299M Mar 12 17:23 000006.vlog
-rw------- 1 donohoe donohoe 230M Mar 12 17:25 000030.vlog
-rw------- 1 donohoe donohoe 226M Mar 12 17:28 000031.vlog
-rw------- 1 donohoe donohoe 262M Mar 12 17:29 000007.vlog
-rw------- 1 donohoe donohoe 259M Mar 12 17:32 000032.vlog
-rw------- 1 donohoe donohoe 268M Mar 12 17:35 000008.vlog
-rw------- 1 donohoe donohoe 228M Mar 12 17:36 000033.vlog
-rw------- 1 donohoe donohoe 235M Mar 12 17:41 000034.vlog
-rw------- 1 donohoe donohoe 267M Mar 12 17:44 000009.vlog
-rw------- 1 donohoe donohoe 249M Mar 12 17:46 000035.vlog
-rw------- 1 donohoe donohoe 269M Mar 12 17:52 000036.vlog
-rw------- 1 donohoe donohoe 274M Mar 12 17:57 000037.vlog
-rw------- 1 donohoe donohoe 287M Mar 12 17:58 000010.vlog
-rw------- 1 donohoe donohoe 307M Mar 12 18:00 000038.vlog
-rw------- 1 donohoe donohoe 279M Mar 12 18:06 000039.vlog
-rw------- 1 donohoe donohoe 263M Mar 12 18:07 000011.vlog
-rw------- 1 donohoe donohoe 208M Mar 12 18:09 000012.vlog
-rw------- 1 donohoe donohoe 243M Mar 12 18:11 000040.vlog
-rw------- 1 donohoe donohoe 207M Mar 12 18:11 000013.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 18:17 000041.vlog
-rw------- 1 donohoe donohoe 401M Mar 12 18:19 000014.vlog
-rw------- 1 donohoe donohoe 204M Mar 12 18:20 000042.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:23 000043.vlog
-rw------- 1 donohoe donohoe 248M Mar 12 18:24 000015.vlog
-rw------- 1 donohoe donohoe 242M Mar 12 18:27 000016.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:27 000044.vlog
-rw------- 1 donohoe donohoe 277M Mar 12 18:30 000017.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:31 000045.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:34 000046.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:37 000047.vlog
-rw------- 1 donohoe donohoe 226M Mar 12 18:41 000048.vlog
-rw------- 1 donohoe donohoe 256M Mar 12 18:46 000049.vlog
-rw------- 1 donohoe donohoe 218M Mar 12 18:49 000050.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:53 000051.vlog
-rw------- 1 donohoe donohoe 219M Mar 12 18:56 000052.vlog
-rw------- 1 donohoe donohoe 237M Mar 12 19:00 000053.vlog
-rw------- 1 donohoe donohoe 273M Mar 12 19:05 000054.vlog
-rw------- 1 donohoe donohoe 280M Mar 12 19:11 000055.vlog
-rw------- 1 donohoe donohoe 275M Mar 12 19:18 000056.vlog
-rw------- 1 donohoe donohoe 110M Mar 12 19:21 000057.vlog

Also notice that file 000018.vlog is the least recently modified … and it is the one where the bug happened. This seems like a clue.

I'd like to dig in more, but I have to do some day job stuff :)

dougdonohoe commented :

Note that both occurrences of this bug happened well after startup. To summarize the previous info:

  • Using WithSyncWrites(false)
  • Using badger.WriteBatch with a batch size of 1000

I've been looking through the code a bit more, and it must be failing in the toDisk() anonymous function. In write:

			newid := vlog.maxFid + 1
			y.AssertTruef(newid > 0, "newid has overflown uint32: %v", newid)
			newlf, err := vlog.createVlogFile(newid)

AFAIK, aside from open, the only place vlog.maxFid is changed is when vlog.createVlogFile is called, and the only place that is called is in write. So the only way you'd get the error I'm seeing is if multiple toDisk() calls are running concurrently, right?

dougdonohoe commented :

FYI: I've moved on to other things and have put our upgrade to 2.0 on hold. This intermittent issue where the vlog files already exist is a huge blocker for us. I've seen it happen twice, but it takes a non-deterministic amount of time to occur. In 1.6, I never had any issues with long-running imports (some over 24 hours).

jarifibrahim commented :

Hey @dougdonohoe, do you see some specific time/pattern between the missing vlog file issues? I tried doing the same thing using the badger benchmark write -m 1000 command, which inserts 1000 million keys into badger, and didn't see any missing file/file already exists issue.

If there's any script/test program you can share that would help me reproduce the issue, that would be great. I haven't been able to reproduce the failure, and that's why we haven't fixed it yet.

As for the vlog.maxFid code here

It is a sequential call, and that's why it should be okay not to have any locks there. Earlier we were using atomics, but they weren't really needed.

So the only way you'd get the error I'm seeing is if multiple toDisk() calls are running concurrently, right?

Writes to the vlog file should always be serial. We cannot write concurrently to the vlog file, which is the write-ahead log file.

Issue 1187 appears to be where this functionality was changed. Linking it here just for future reference.

The file already exists issue mentioned in Performance regression 1.6 to 2.0.2 - #20 by diggy was fixed by Remove vlog file if bootstrap, syncDir or mmap fails by jarifibrahim Ā· Pull Request #1434 Ā· dgraph-io/badger Ā· GitHub (it's not yet released).

dougdonohoe commented :

I re-ran my entire application on v2.0.3 and things look great - it is even a bit faster than v1.6.1 in places. You can close this ticket.
