Badger write performance on 1 TB data

I’m working (along with @Ahsan and @Naman) on testing badger on 1 TB of data and this post lists the outcome of the tests. (This is a work in progress document. More details will be added as we do the tests)

Test 1

Running badger benchmark write tool to generate the data on commit https://github.com/dgraph-io/badger/commit/55663718db531c4fc5885f8b99dea8fcb7fc729c
We ran benchmark write with the following options

go run main.go benchmark write \
    --compression \           # Use ZSTD Compression
    --keep-bidx \             #  Keep table index in cache
    --keep-blocks \           # Keep block in cache
    -m 1000 \                 # Insert 1000 million key value pairs
    --load-blooms=false \     # Do not load bloom filters on startup
    --logs \                  # Show logs
    --dir ./data

The test is also using a 1 GB cache for blocks and table index. It also has a 512 mb cache for bloom filters.

Notes/Observations

  1. Closing DB took 8+ hours because of compaction. There are a huge compaction when closing the DB because badger compactions couldn’t keep up with the incoming writes.
badger 2020/07/14 21:06:16 INFO: LOG Compact 0->1, del 4961 tables, add 695 tables, took 8h9m12.175260469s
badger 2020/07/14 21:06:16 INFO: Compaction for level: 0 DONE
badger 2020/07/14 21:06:16 INFO: Force compaction on level 0 done
2020/07/14 21:06:16 DB.Close. Error: <nil>. Time taken to close: 8h37m23.600671481s
  1. Huge compactions. The changes to stalling on level 0 lead to too many tables getting accumulated on level 0 which causes huge compactions. This is a known issue https://github.com/dgraph-io/badger/issues/1370
log.txt:badger 2020/07/14 11:09:16 INFO: LOG Compact 0->1, del 5 tables, add 1 tables, took 1.556458854s
log.txt:badger 2020/07/14 11:09:20 INFO: LOG Compact 0->1, del 6 tables, add 1 tables, took 3.333484517s
log.txt:badger 2020/07/14 11:09:25 INFO: LOG Compact 0->1, del 7 tables, add 2 tables, took 5.162959096s
log.txt:badger 2020/07/14 11:09:34 INFO: LOG Compact 0->1, del 12 tables, add 3 tables, took 8.813571395s
log.txt:badger 2020/07/14 11:09:49 INFO: LOG Compact 0->1, del 20 tables, add 4 tables, took 14.770120697s
log.txt:badger 2020/07/14 11:10:15 INFO: LOG Compact 0->1, del 34 tables, add 6 tables, took 26.302152394s
log.txt:badger 2020/07/14 11:10:59 INFO: LOG Compact 0->1, del 53 tables, add 10 tables, took 44.057044467s
log.txt:badger 2020/07/14 11:12:18 INFO: LOG Compact 0->1, del 96 tables, add 17 tables, took 1m18.654535709s
log.txt:badger 2020/07/14 11:14:39 INFO: LOG Compact 0->1, del 168 tables, add 30 tables, took 2m21.568908177s
log.txt:badger 2020/07/14 11:19:03 INFO: LOG Compact 0->1, del 295 tables, add 51 tables, took 4m23.486988315s
log.txt:badger 2020/07/14 11:27:45 INFO: LOG Compact 0->1, del 536 tables, add 91 tables, took 8m41.791298658s
log.txt:badger 2020/07/14 11:44:51 INFO: LOG Compact 0->1, del 1033 tables, add 168 tables, took 17m5.981830833s
log.txt:badger 2020/07/14 12:57:04 INFO: LOG Compact 0->1, del 2021 tables, add 319 tables, took 1h12m13.305061489s

The fix for this issue involves reverting https://github.com/dgraph-io/badger/commit/3747be59ebbbf34382ab8d84207151c7f7014da2 . The aforementioned commit increases the speed of writes but takes a toll on the size of compaction. Bigger compaction means higher memory usage.
Also note, we’ve reverted the background compression because of which building table takes longer which means compaction takes longer.

Test 2

We ran badger benchmark write tool on this commit.

It was initially run with the following options

go run main.go benchmark write \
    -m 5000 \            # Insert 5000 million key value pairs
    --dir ~/data \       # data directory
    --gc-every 100h \    # Effectively without GC
    -l                   # Show logs

Notes/Observations

  1. We ran out of memory after writing ~410 GB (valid data) in ~4 hours because of memtables and long running iterators which were not being closed (Memtable memory usage causes OOM). This issue is fixed by this PR.
  2. We found that the vlogMaxEntries was set with default value of 10000 in the benchmark tool. This limited the size of each vlog file to be very low despite each vlog file being memory mapped to 2GB. This value was changed to 1000000 in this commit. Other than this,

Test 3 (after the fixes)

After the above mentioned fixes we ran it again on this commit with the following options.

go run main.go benchmark write \
    --keep-bidx \                # Keep table index in cache
    --max-cache=1073741824 \     # Maximum cache of 1 GB
    --max-bf-cache=536870912 \   # 500MB of Bloom Filter cache
    -m 3000 \                    # Insert 3000 million key value pairs
    --gc-every=5s \              # Garbage Collection 
    -l \                         # Show logs
    --dir ./data

This test was run on a gcloud machine with 32GB memory and 1.5TB HDD (similar observation in SSD machine as well).

Notes/Observations

  1. The logs show that it wrote 480 GB (~3 Billion entries) in 5.5 hours with 31 successful vlogGC cycles.
badger 2020/08/14 03:07:15 INFO: Got compaction priority: {level:1 score:254.09063628688455 dropPrefixes:[]}
Time elapsed: 05h26m42s, bytes written: 480 GB, speed: 24 MB/sec, entries written: 2999900268, speed: 153040/sec, gcSuccess: 31

and even after 30 hours from then the DB didn’t close.

badger 2020/08/15 09:12:11 INFO: Got compaction priority: {level:1 score:403.8460370860994 dropPrefixes:[]}

The disk usage is 1.2TB used of 1.5TB and the disk activity as shown by atop is as follows:

│ DSK |          sda | busy    100% | read 51598e3 | write 8834e4 | avio 4.89 ms |
  1. The slow writes are a result of two things
    • High disk IO because of the periodic GC and heavy writes. The GCP machines we’re using have limited IOPS. It took more than 15 minutes to write 4 MB of data. We hit the throttle limit of 15000 IOPS in Google Cloud.
    • Huge compactions - We saw this in the first test as well. We disabled the L0/L1 stalling via https://github.com/dgraph-io/badger/pull/1186 and this causes hugs compactions while closing the DB. This should be fixed by the compactions changes in the https://github.com/dgraph-io/badger/pull/1459 PR.

Fixes/Changes

  1. The writes are now stalled for level 0 if compaction cannot keep the LSM structure healthy. This makes sure that the size of levels are in control and we do not end up doing a huge compaction while closing the DB. Also, now a separate thread is used for the compaction of L0-L1, and a separate thread for the rest of the levels. This fix is done by this commit.
  2. Introduced manual memory allocation for the block buffers via calloc instead of using the Go runtime for memory allocation. This change is brought in by this commit.
  3. After the change mentioned in point 2, we found some memory leaks which roots out from Ristretto. There was no callback function in ristretto for a Set rejection by the policy and also for an entry update. This caused some memory leaks in badger. This is now fixed and the changes are brought into badger by this commit.

Test 4 (after the fixes)

After the above mentioned changes and fixes, a test was run on this commit.

With compression

The test was run with the following parameters:

go run main.go benchmark write \ 
  --compression \       # with compression
  --dir=../../data \    # data directory 
  --keep-blocks=true \  # blocks in cache
  --keep-bidx=true \    # block indices in cache
  --max-cache=1000 \    # Block cache of 1GB
  --max-bf-cache=500 \  # Bloom fiter cache of 500MB
  --gc-every=5m         # Garbage collection
  -m=1000 \             # 1000 million entries
  -l                    #logs

The test was run on a directory containing ~600GB of data (written using running stream writer and the same test as above)

badger 2020/08/21 16:10:41 INFO: All 18225 tables opened in 3m10.095s
...
Time elapsed: 11h15m35s, bytes written: 160 GB, speed: 3.9 MB/sec, entries written: 1000000000, speed: 24670/sec, gcSuccess: 0
...
2020/08/22 03:31:23 DB.Close. Error: <nil>. Time taken to close: 4m49.848771197s

It wrote 1000 million random entries worth 160GB, this process took 11h15m and the DB closed in 4m49s. The heap profile after 2.5hours of run from the start is as below:

Showing nodes accounting for 11553.50MB, 99.35% of 11629.62MB total
Dropped 83 nodes (cum <= 58.15MB)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
 9551.74MB 82.13% 82.13%  9551.74MB 82.13%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  632.54MB  5.44% 87.57%  1025.56MB  8.82%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
  582.42MB  5.01% 92.58%   582.42MB  5.01%  github.com/dgraph-io/badger/v2/skl.newArena
  393.02MB  3.38% 95.96%   393.02MB  3.38%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
     384MB  3.30% 99.26%      384MB  3.30%  github.com/dgraph-io/ristretto.newCmRow
    7.50MB 0.065% 99.33%  9369.25MB 80.56%  github.com/dgraph-io/badger/v2/table.OpenTable
    1.27MB 0.011% 99.34%   577.27MB  4.96%  github.com/dgraph-io/ristretto.NewCache
    0.50MB 0.0043% 99.34%  1026.87MB  8.83%  github.com/dgraph-io/badger/v2/table.(*Table).block
    0.50MB 0.0043% 99.35%  9552.24MB 82.14%  github.com/dgraph-io/ristretto/z.NewBloomFilter
         0     0% 99.35%   582.42MB  5.01%  github.com/dgraph-io/badger/v2.(*DB).doWrites.func1

We repeated the above test multiple times and used the stream writer as well to get a total of 844GB on-disk data. The stream writer took 29m24s to write 5000 million entries worth 680GB, as mentioned below.

Time elapsed: 29m24s, bytes written: 680 GB, speed: 385 MB/sec, entries written: 4998907417, speed: 2833847/sec, gcSuccess: 0
...
...
2020/08/19 13:28:11 DB.Close. Error: <nil>. Time taken to close: 338.671091ms

There are 21040 tables and the levels after this process have the following sizes:

Level 0 size:          0 B
Level 1 size:       1.1 GB
Level 2 size:        28 GB
Level 3 size:       256 GB
Level 4 size:        36 GB
Total index size:   322 GB
Value log size:     584 GB

Without compression

The test was run with the following parameters:

go run main.go benchmark write \
  --dir=../../data \    # data directory 
  --keep-blocks=true \  # blocks in cache
  --keep-bidx=true \    # block indices in cache
  --max-cache=1000 \    # Block cache of 1GB
  --max-bf-cache=500 \  # Bloom fiter cache of 500MB
  --gc-every=5m         # Garbage collection
  -m=2000 \             # 2000 million entries
  -s \                  # Stream Writer (sorted entries)
  -l \                  # enable logs
 
Time elapsed: 11m52s, bytes written: 272 GB, speed: 382 MB/sec, entries written: 2000000000, speed: 2808988/sec, gcSuccess: 0
...
...
2020/08/24 12:44:15 DB.Close. Error: <nil>. Time taken to close: 1.69936773s

It took 11m52s to write 2000 million entries worth 272GB using the stream writer. The DB closed in 1.69s. After this a test was run using random writes with the following parameters:

go run main.go benchmark write \
  --dir=../../data \    # data directory 
  --keep-blocks=true \  # blocks in cache
  --keep-bidx=true \    # block indices in cache
  --max-cache=1000 \    # Block cache of 1GB
  --max-bf-cache=500 \  # Bloom fiter cache of 500MB
  --gc-every=5m         # Garbage collection
  -m=2500 \             # 2500 million entries
  -l \                  # enable logs
 

The last 100 lines of log is as below:

Time elapsed: 18h13m21s, bytes written: 400 GB, speed: 6.1 MB/sec, entries written: 2499953900, speed: 38108/sec, gcSuccess: 0
...
...
2020/08/26 03:14:16 DB.Close. Error: <nil>. Time taken to close: 30.633640208s

It can be noticed that it took 18h13m to write 2500 million entries worth 400GB. The time taken to close the DB is 30s. It is also observed that there were 0 successful GC during the run. The heap profile after 17h30m is as below:

Showing nodes accounting for 5848.40MB, 99.09% of 5902.13MB total
Dropped 76 nodes (cum <= 29.51MB)
Showing top 10 nodes out of 55
      flat  flat%   sum%        cum   cum%
 3849.87MB 65.23% 65.23%  3849.87MB 65.23%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
  614.59MB 10.41% 75.64%  1028.61MB 17.43%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
  582.42MB  9.87% 85.51%   582.42MB  9.87%  github.com/dgraph-io/badger/v2/skl.newArena
  414.02MB  7.01% 92.52%   414.02MB  7.01%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
     384MB  6.51% 99.03%      384MB  6.51%  github.com/dgraph-io/ristretto.newCmRow
    1.50MB 0.025% 99.06%  3324.99MB 56.34%  github.com/dgraph-io/badger/v2/table.OpenTable
    1.50MB 0.025% 99.08%  3851.37MB 65.25%  github.com/dgraph-io/ristretto/z.NewBloomFilter
    0.50MB 0.0085% 99.09%  1028.03MB 17.42%  github.com/dgraph-io/badger/v2/table.(*Table).block
         0     0% 99.09%   349.86MB  5.93%  github.com/dgraph-io/badger/v2.(*DB).RunValueLogGC
         0     0% 99.09%   583.88MB  9.89%  github.com/dgraph-io/badger/v2.(*DB).doWrites.func1

The levels structure is as below:

[Summary]
Level 0 size:          0 B
Level 1 size:       2.2 GB
Level 2 size:        90 GB
Level 3 size:       361 GB
Total index size:   452 GB
Value log size:     445 GB
2 Likes

Regarding compaction, I have rewritten the compaction worker. The current “fight for the death” approach is highly inefficient in write-heavy scenarios.

6 Likes

hope badger can improve performance on huge dataset :grinning: