I’m working (along with @Ahsan and @Naman) on testing badger on 1 TB of data, and this post lists the outcome of the tests. (This is a work-in-progress document; more details will be added as we run the tests.)
Test 1
We ran the badger benchmark write tool to generate the data on commit add more flags to write benchmark (#1423) · dgraph-io/badger@5566371 · GitHub. The benchmark was run with the following options:
go run main.go benchmark write \
--compression \ # Use ZSTD Compression
--keep-bidx \ # Keep table index in cache
--keep-blocks \ # Keep block in cache
-m 1000 \ # Insert 1000 million key value pairs
--load-blooms=false \ # Do not load bloom filters on startup
--logs \ # Show logs
--dir ./data
The test also uses a 1 GB cache for blocks and the table index, and a 512 MB cache for bloom filters.
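For reference, these benchmark flags roughly correspond to the badger (v2) options sketched below. This is a hedged sketch rather than the benchmark tool's actual code, and option names can differ slightly between badger versions.

// Hedged sketch of the options the flags above roughly map to (badger v2).
package main

import (
    badger "github.com/dgraph-io/badger/v2"
    "github.com/dgraph-io/badger/v2/options"
)

func openDB(dir string) (*badger.DB, error) {
    opts := badger.DefaultOptions(dir).
        WithCompression(options.ZSTD). // --compression
        WithMaxCacheSize(1 << 30).     // 1 GB cache for blocks and table index
        WithMaxBfCacheSize(512 << 20). // 512 MB bloom-filter cache
        WithLoadBloomsOnOpen(false)    // --load-blooms=false
    // --keep-bidx / --keep-blocks control whether table indices and blocks are
    // kept in the cache after reads; the corresponding option names vary by
    // version, so they are omitted here.
    return badger.Open(opts)
}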
Notes/Observations
- Closing the DB took 8+ hours because of compaction. There was a huge compaction when closing the DB because badger's compactions couldn't keep up with the incoming writes.
badger 2020/07/14 21:06:16 INFO: LOG Compact 0->1, del 4961 tables, add 695 tables, took 8h9m12.175260469s
badger 2020/07/14 21:06:16 INFO: Compaction for level: 0 DONE
badger 2020/07/14 21:06:16 INFO: Force compaction on level 0 done
2020/07/14 21:06:16 DB.Close. Error: <nil>. Time taken to close: 8h37m23.600671481s
- Huge compactions. The changes to stalling on level 0 led to too many tables accumulating on level 0, which caused huge compactions. This is a known issue: Slow compaction progress and high memory usage · Issue #1370 · dgraph-io/badger · GitHub
log.txt:badger 2020/07/14 11:09:16 INFO: LOG Compact 0->1, del 5 tables, add 1 tables, took 1.556458854s
log.txt:badger 2020/07/14 11:09:20 INFO: LOG Compact 0->1, del 6 tables, add 1 tables, took 3.333484517s
log.txt:badger 2020/07/14 11:09:25 INFO: LOG Compact 0->1, del 7 tables, add 2 tables, took 5.162959096s
log.txt:badger 2020/07/14 11:09:34 INFO: LOG Compact 0->1, del 12 tables, add 3 tables, took 8.813571395s
log.txt:badger 2020/07/14 11:09:49 INFO: LOG Compact 0->1, del 20 tables, add 4 tables, took 14.770120697s
log.txt:badger 2020/07/14 11:10:15 INFO: LOG Compact 0->1, del 34 tables, add 6 tables, took 26.302152394s
log.txt:badger 2020/07/14 11:10:59 INFO: LOG Compact 0->1, del 53 tables, add 10 tables, took 44.057044467s
log.txt:badger 2020/07/14 11:12:18 INFO: LOG Compact 0->1, del 96 tables, add 17 tables, took 1m18.654535709s
log.txt:badger 2020/07/14 11:14:39 INFO: LOG Compact 0->1, del 168 tables, add 30 tables, took 2m21.568908177s
log.txt:badger 2020/07/14 11:19:03 INFO: LOG Compact 0->1, del 295 tables, add 51 tables, took 4m23.486988315s
log.txt:badger 2020/07/14 11:27:45 INFO: LOG Compact 0->1, del 536 tables, add 91 tables, took 8m41.791298658s
log.txt:badger 2020/07/14 11:44:51 INFO: LOG Compact 0->1, del 1033 tables, add 168 tables, took 17m5.981830833s
log.txt:badger 2020/07/14 12:57:04 INFO: LOG Compact 0->1, del 2021 tables, add 319 tables, took 1h12m13.305061489s
The fix for this issue involves reverting Improve write stalling on level 0 and 1 · dgraph-io/badger@3747be5 · GitHub. The aforementioned commit increases write speed but takes a toll on the size of compactions, and bigger compactions mean higher memory usage.
Also note that we have reverted background compression, because of which building a table takes longer, which in turn means compactions take longer.
Test 2
We ran the badger benchmark write tool on this commit. It was initially run with the following options:
go run main.go benchmark write \
-m 5000 \ # Insert 5000 million key value pairs
--dir ~/data \ # data directory
--gc-every 100h \ # Effectively without GC
-l # Show logs
Notes/Observations
- We ran out of memory after writing ~410 GB (valid data) in ~4 hours because of memtables and long-running iterators that were not being closed (Memtable memory usage causes OOM). This issue is fixed by this PR (see the sketch after this list).
- We found that vlogMaxEntries was set to its default value of 10000 in the benchmark tool. This kept each vlog file very small even though each vlog file is memory-mapped to 2 GB. The value was changed to 1000000 in this commit.
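To illustrate the two points above, here is a minimal, hedged sketch of the relevant badger (v2) API: iterators opened inside a transaction must be closed (otherwise they pin memtables and vlog files), and WithValueLogMaxEntries controls how many entries a single vlog file can hold. The directory and loop body are made up for the example.

// Hedged sketch: closing iterators and raising ValueLogMaxEntries (badger v2).
package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
)

func main() {
    // The benchmark tool's default of 10000 produced many tiny vlog files;
    // it was raised to 1000000.
    opts := badger.DefaultOptions("./data").WithValueLogMaxEntries(1000000)

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.View(func(txn *badger.Txn) error {
        it := txn.NewIterator(badger.DefaultIteratorOptions)
        // Forgetting this Close() keeps the memtables the iterator references
        // alive, which is what led to the OOM in this test.
        defer it.Close()
        for it.Rewind(); it.Valid(); it.Next() {
            _ = it.Item().Key()
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}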
Test 3 (after the fixes)
After the above-mentioned fixes, we ran the benchmark again on this commit with the following options.
go run main.go benchmark write \
--keep-bidx \ # Keep table index in cache
--max-cache=1073741824 \ # Maximum cache of 1 GB
--max-bf-cache=536870912 \ # 512 MB of Bloom Filter cache
-m 3000 \ # Insert 3000 million key value pairs
--gc-every=5s \ # Garbage Collection
-l \ # Show logs
--dir ./data
This test was run on a gcloud machine with 32 GB of memory and a 1.5 TB HDD (we observed similar behavior on an SSD machine as well).
Notes/Observations
- The logs show that it wrote 480 GB (~3 billion entries) in 5.5 hours with 31 successful vlogGC cycles.
badger 2020/08/14 03:07:15 INFO: Got compaction priority: {level:1 score:254.09063628688455 dropPrefixes:[]}
Time elapsed: 05h26m42s, bytes written: 480 GB, speed: 24 MB/sec, entries written: 2999900268, speed: 153040/sec, gcSuccess: 31
Even 30 hours after that, the DB still had not closed.
badger 2020/08/15 09:12:11 INFO: Got compaction priority: {level:1 score:403.8460370860994 dropPrefixes:[]}
The disk usage is 1.2 TB used of 1.5 TB, and the disk activity as shown by atop is as follows:
│ DSK | sda | busy 100% | read 51598e3 | write 8834e4 | avio 4.89 ms |
- The slow writes are a result of two things:
- High disk IO because of the periodic GC and heavy writes. The GCP machines we're using have limited IOPS; we hit the throttle limit of 15000 IOPS in Google Cloud, and at one point it took more than 15 minutes to write 4 MB of data. (A sketch of the GC loop behind this is shown after this list.)
- Huge compactions - We saw this in the first test as well. We disabled the L0/L1 stalling via Do not stall Level 0 and 1 by jarifibrahim · Pull Request #1186 · dgraph-io/badger · GitHub, and this caused huge compactions while closing the DB. This should be fixed by the compaction changes in the Improvements: Manual Memory allocation via Calloc by jarifibrahim · Pull Request #1459 · dgraph-io/badger · GitHub PR.
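For reference, the periodic value log GC driven by the --gc-every flag looks roughly like the loop below. This is a hedged sketch of the public badger (v2) API rather than the benchmark tool's exact code; the 0.5 discard ratio and the ticker interval are illustrative.

// Hedged sketch: running value log GC on a fixed interval (badger v2).
package main

import (
    "log"
    "time"

    badger "github.com/dgraph-io/badger/v2"
)

func runGC(db *badger.DB, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for range ticker.C {
        // Each successful RunValueLogGC call rewrites one vlog file and then
        // deletes it, which is where the extra disk IO during the test comes
        // from. Keep going until there is nothing left to rewrite.
        for {
            err := db.RunValueLogGC(0.5)
            if err == badger.ErrNoRewrite {
                break
            }
            if err != nil {
                log.Printf("value log GC: %v", err)
                break
            }
        }
    }
}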
Fixes/Changes
- Writes are now stalled on level 0 if compaction cannot keep the LSM structure healthy. This makes sure that the level sizes stay under control and we do not end up doing a huge compaction while closing the DB. Also, a separate thread is now used for L0-L1 compactions and another for the rest of the levels. This fix is done by this commit.
- Introduced manual memory allocation for the block buffers via calloc instead of using the Go runtime for memory allocation, keeping those buffers out of the Go heap. This change is brought in by this commit (a rough sketch of the general technique follows this list).
- After the change mentioned in point 2, we found some memory leaks originating in Ristretto. Ristretto had no callback for a Set rejected by the policy, nor for an entry update, which caused memory leaks in badger. This is now fixed and the changes are brought into badger by this commit.
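The general idea behind point 2 is sketched below: buffers are allocated with C's calloc through cgo, so they live outside the Go heap and must be freed explicitly. This is an illustration of the technique only, not badger's actual implementation (badger uses helpers in the ristretto/z package); callocBytes and freeBytes are made-up names.

// Hedged sketch of manual allocation via calloc; not badger's actual code.
package main

/*
#include <stdlib.h>
*/
import "C"

import (
    "fmt"
    "unsafe"
)

// callocBytes returns a zeroed byte slice backed by C-allocated memory that
// the Go garbage collector does not track. The caller must call freeBytes.
func callocBytes(n int) []byte {
    ptr := C.calloc(C.size_t(n), 1)
    if ptr == nil {
        panic("out of memory")
    }
    return unsafe.Slice((*byte)(ptr), n)
}

// freeBytes releases a slice previously returned by callocBytes.
func freeBytes(b []byte) {
    C.free(unsafe.Pointer(&b[0]))
}

func main() {
    buf := callocBytes(4 << 10) // e.g. one 4 KB block buffer
    defer freeBytes(buf)
    copy(buf, "some block data")
    fmt.Println(len(buf))
}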
Test 4 (after the fixes)
After the changes and fixes mentioned above, a test was run on this commit.
With compression
The test was run with the following parameters:
go run main.go benchmark write \
--compression \ # with compression
--dir=../../data \ # data directory
--keep-blocks=true \ # blocks in cache
--keep-bidx=true \ # block indices in cache
--max-cache=1000 \ # Block cache of 1GB
--max-bf-cache=500 \ # Bloom filter cache of 500MB
--gc-every=5m \ # Garbage collection every 5 minutes
-m=1000 \ # 1000 million entries
-l # Show logs
The test was run on a directory already containing ~600 GB of data (written by running the stream writer and the same test as above).
badger 2020/08/21 16:10:41 INFO: All 18225 tables opened in 3m10.095s
...
Time elapsed: 11h15m35s, bytes written: 160 GB, speed: 3.9 MB/sec, entries written: 1000000000, speed: 24670/sec, gcSuccess: 0
...
2020/08/22 03:31:23 DB.Close. Error: <nil>. Time taken to close: 4m49.848771197s
It wrote 1000 million random entries worth 160 GB; this took 11h15m, and the DB closed in 4m49s. The heap profile taken 2.5 hours into the run is as below:
Showing nodes accounting for 11553.50MB, 99.35% of 11629.62MB total
Dropped 83 nodes (cum <= 58.15MB)
Showing top 10 nodes out of 46
flat flat% sum% cum cum%
9551.74MB 82.13% 82.13% 9551.74MB 82.13% github.com/dgraph-io/ristretto/z.(*Bloom).Size
632.54MB 5.44% 87.57% 1025.56MB 8.82% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
582.42MB 5.01% 92.58% 582.42MB 5.01% github.com/dgraph-io/badger/v2/skl.newArena
393.02MB 3.38% 95.96% 393.02MB 3.38% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
384MB 3.30% 99.26% 384MB 3.30% github.com/dgraph-io/ristretto.newCmRow
7.50MB 0.065% 99.33% 9369.25MB 80.56% github.com/dgraph-io/badger/v2/table.OpenTable
1.27MB 0.011% 99.34% 577.27MB 4.96% github.com/dgraph-io/ristretto.NewCache
0.50MB 0.0043% 99.34% 1026.87MB 8.83% github.com/dgraph-io/badger/v2/table.(*Table).block
0.50MB 0.0043% 99.35% 9552.24MB 82.14% github.com/dgraph-io/ristretto/z.NewBloomFilter
0 0% 99.35% 582.42MB 5.01% github.com/dgraph-io/badger/v2.(*DB).doWrites.func1
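The heap profiles shown in this post are go tool pprof -top style output. A hedged sketch of how such a profile can be captured from a running benchmark is below; it assumes the process exposes the standard net/http/pprof endpoints (port 6060 is illustrative).

// Hedged sketch: exposing pprof endpoints from the benchmark process.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
    go func() {
        // While the benchmark runs, grab the heap profile with:
        //   go tool pprof -top http://localhost:6060/debug/pprof/heap
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    select {} // stand-in for the benchmark's actual work
}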
We repeated the above test multiple times and also used the stream writer to reach a total of 844 GB of on-disk data. The stream writer took 29m24s to write 5000 million entries worth 680 GB, as shown below.
Time elapsed: 29m24s, bytes written: 680 GB, speed: 385 MB/sec, entries written: 4998907417, speed: 2833847/sec, gcSuccess: 0
...
...
2020/08/19 13:28:11 DB.Close. Error: <nil>. Time taken to close: 338.671091ms
There are 21040 tables, and the levels after this process have the following sizes:
Level 0 size: 0 B
Level 1 size: 1.1 GB
Level 2 size: 28 GB
Level 3 size: 256 GB
Level 4 size: 36 GB
Total index size: 322 GB
Value log size: 584 GB
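The stream writer used above goes through badger's StreamWriter API, roughly as in the hedged sketch below; it only works when keys arrive in sorted order, which is why it is so much faster than the random-write runs. The key and value sizes here are illustrative, not the benchmark's.

// Hedged sketch: bulk-loading sorted entries with badger's StreamWriter (v2).
package main

import (
    "encoding/binary"
    "log"

    badger "github.com/dgraph-io/badger/v2"
    "github.com/dgraph-io/badger/v2/pb"
)

func main() {
    db, err := badger.Open(badger.DefaultOptions("./data"))
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    sw := db.NewStreamWriter()
    if err := sw.Prepare(); err != nil {
        log.Fatal(err)
    }

    list := &pb.KVList{}
    for i := uint64(0); i < 1000; i++ {
        key := make([]byte, 8)
        binary.BigEndian.PutUint64(key, i) // keys must be fed in sorted order
        list.Kv = append(list.Kv, &pb.KV{
            Key:     key,
            Value:   make([]byte, 128),
            Version: 1, // StreamWriter needs an explicit version
        })
    }
    if err := sw.Write(list); err != nil {
        log.Fatal(err)
    }
    // Flush finalizes the tables; the DB closes quickly afterwards because the
    // LSM tree is already in its final shape.
    if err := sw.Flush(); err != nil {
        log.Fatal(err)
    }
}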
Without compression
The test was run with the following parameters:
go run main.go benchmark write \
--dir=../../data \ # data directory
--keep-blocks=true \ # blocks in cache
--keep-bidx=true \ # block indices in cache
--max-cache=1000 \ # Block cache of 1GB
--max-bf-cache=500 \ # Bloom filter cache of 500MB
--gc-every=5m \ # Garbage collection every 5 minutes
-m=2000 \ # 2000 million entries
-s \ # Stream Writer (sorted entries)
-l # enable logs
Time elapsed: 11m52s, bytes written: 272 GB, speed: 382 MB/sec, entries written: 2000000000, speed: 2808988/sec, gcSuccess: 0
...
...
2020/08/24 12:44:15 DB.Close. Error: <nil>. Time taken to close: 1.69936773s
It took 11m52s to write 2000 million entries worth 272 GB using the stream writer, and the DB closed in 1.69s. After this, a test was run using random writes with the following parameters:
go run main.go benchmark write \
--dir=../../data \ # data directory
--keep-blocks=true \ # blocks in cache
--keep-bidx=true \ # block indices in cache
--max-cache=1000 \ # Block cache of 1GB
--max-bf-cache=500 \ # Bloom filter cache of 500MB
--gc-every=5m \ # Garbage collection every 5 minutes
-m=2500 \ # 2500 million entries
-l # enable logs
The last 100 lines of the log are as below:
Time elapsed: 18h13m21s, bytes written: 400 GB, speed: 6.1 MB/sec, entries written: 2499953900, speed: 38108/sec, gcSuccess: 0
...
...
2020/08/26 03:14:16 DB.Close. Error: <nil>. Time taken to close: 30.633640208s
It took 18h13m to write 2500 million entries worth 400 GB, and the DB closed in 30s. There were also 0 successful GC cycles during the run. The heap profile after 17h30m is as below:
Showing nodes accounting for 5848.40MB, 99.09% of 5902.13MB total
Dropped 76 nodes (cum <= 29.51MB)
Showing top 10 nodes out of 55
flat flat% sum% cum cum%
3849.87MB 65.23% 65.23% 3849.87MB 65.23% github.com/dgraph-io/ristretto/z.(*Bloom).Size
614.59MB 10.41% 75.64% 1028.61MB 17.43% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
582.42MB 9.87% 85.51% 582.42MB 9.87% github.com/dgraph-io/badger/v2/skl.newArena
414.02MB 7.01% 92.52% 414.02MB 7.01% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
384MB 6.51% 99.03% 384MB 6.51% github.com/dgraph-io/ristretto.newCmRow
1.50MB 0.025% 99.06% 3324.99MB 56.34% github.com/dgraph-io/badger/v2/table.OpenTable
1.50MB 0.025% 99.08% 3851.37MB 65.25% github.com/dgraph-io/ristretto/z.NewBloomFilter
0.50MB 0.0085% 99.09% 1028.03MB 17.42% github.com/dgraph-io/badger/v2/table.(*Table).block
0 0% 99.09% 349.86MB 5.93% github.com/dgraph-io/badger/v2.(*DB).RunValueLogGC
0 0% 99.09% 583.88MB 9.89% github.com/dgraph-io/badger/v2.(*DB).doWrites.func1
The level structure is as below:
[Summary]
Level 0 size: 0 B
Level 1 size: 2.2 GB
Level 2 size: 90 GB
Level 3 size: 361 GB
Total index size: 452 GB
Value log size: 445 GB