Badger panics after db reaches 1.1T

panic: Base level can’t be zero.

DB size:

$ du -h /data/badger/
1.1T	/data/badger/

Badger settings:

badgerOptions := badger.DefaultOptions("/data/badger")
badgerDB, err := badger.Open(badgerOptions)

Badger version: v3.2103.2

Stack trace:

[2021-11-18 09:00:24.588 UTC] info (logutil/zap_raft.go:77) 21650 tables out of 24770 opened in 3s
[2021-11-18 09:00:25.000 UTC] info (logutil/zap_raft.go:77) All 24770 tables opened in 3.412s
[2021-11-18 09:00:25.015 UTC] info (logutil/zap_raft.go:77) Discard stats nextEmptySlot: 0
[2021-11-18 09:00:25.037 UTC] info (logutil/zap_raft.go:77) Set nextTxnTs to 3887394350 
[2021-11-18 09:00:25.092 UTC] info (logutil/zap_raft.go:77) Deleting empty file: /data/badger/000696.vlog
panic: Base level can't be zero.
goroutine 24954 [running]:
github.com/dgraph-io/badger/v3.(*levelsController).fillTablesL0ToLbase(0x0, 0x0)
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:1182 +0x8f1
github.com/dgraph-io/badger/v3.(*levelsController).fillTablesL0(0xfa31c8, 0xc000034068)
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:1243 +0x25
github.com/dgraph-io/badger/v3.(*levelsController).doCompact(0xc000268000, 0x3, {0x0, 0x3ff6666666666666, 0x3ff05af864031d71, {0x0, 0x0, 0x0}, {0x0, {0xc00320a6c0, ...}, ...}})
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:1519 +0x2e5
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func2({0x0, 0x3ff6666666666666, 0x3ff05af864031d71, {0x0, 0x0, 0x0}, {0x0, {0xc00320a640, 0x7, 0x7}, ...}})
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:465 +0x78
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:488 +0x158
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc000268000, 0x3, 0xc001a90090)
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:517 +0x3a9
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/Users/cae/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:354 +0x53

Referring:

Changing the default to WithMaxLevels(8) fixes it, though I’m not sure whether the fix is only temporary.
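For reference, a minimal configuration sketch of that change in Go, assuming Badger v3 and the same /data/badger directory as in the original post. WithMaxLevels is an existing option builder on badger.Options; the value 8 is simply the one reported in this thread, not a general recommendation.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Start from the defaults used in the original post and raise the
	// number of LSM levels from the default of 7 to 8.
	opts := badger.DefaultOptions("/data/badger").WithMaxLevels(8)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```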

@badger
I think this is a permanent fix and should allow the db to grow to ~11.1 TiB. Some feedback from the Dgraph team on this would be nice, especially on the performance implications.
As described in the linked thread, the documentation for Badger and Dgraph should be extended to cover this behavior in more detail.

I’ll possibly outgrow 11.1T, how far can I go on Max Levels?

I think there is no technical limit, only a practical one. A db of 11 TiB is already hard to manage; I can only imagine what you would do at 100 TiB or more.

Instead of the max level you can also increase the level multiplier (defaults to 10), but I am not sure how that would impact performance.
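To compare the two knobs, here is a rough back-of-the-envelope sketch in Go. It assumes (this is a simplification, not taken from Badger’s source) that level 0 is not counted and that levels 1 to maxLevels-1 have size targets of 10MiB, 10MiB × multiplier, and so on. The function name cumulativeTargetMiB is made up for this example.

```go
package main

import "fmt"

// cumulativeTargetMiB sums the assumed per-level size targets, in MiB.
// Assumption: level 0 is not counted, and levels 1..maxLevels-1 have
// targets baseMiB, baseMiB*multiplier, baseMiB*multiplier^2, ...
func cumulativeTargetMiB(maxLevels int, baseMiB, multiplier int64) int64 {
	var total int64
	size := baseMiB
	for level := 1; level < maxLevels; level++ {
		total += size
		size *= multiplier
	}
	return total
}

func main() {
	const baseMiB = 10 // assumed base level size (the 10MiB figure from this thread)
	const mibPerTiB = 1024 * 1024

	cases := []struct {
		maxLevels  int
		multiplier int64
	}{{7, 10}, {8, 10}, {7, 11}, {7, 12}}

	for _, c := range cases {
		total := cumulativeTargetMiB(c.maxLevels, baseMiB, c.multiplier)
		fmt.Printf("maxLevels=%d multiplier=%d -> %d MiB (%.2f TiB)\n",
			c.maxLevels, c.multiplier, total, float64(total)/mibPerTiB)
	}
}
```

Under these assumptions the default of 7 levels comes to 1,111,110 MiB and 8 levels to 11,111,110 MiB, which is the same 1.1 to 11.1 ratio quoted above. Raising the multiplier to 12 while staying at 7 levels gives 2,714,530 MiB, a much smaller step than adding a level.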

Just ran into this… Any news on this effort?

Check out @caevv’s solution above: increase the number of levels. For extra info, also see: Panic: Base level can't be zero - #7 by vnium.

Note that for truly huge data sets, you’d move to a distributed configuration, so each node would hold a portion of the data, rather than all of it.

We resolved it by adding another level. Our install is shared across 3 servers. A single node hit the limit and killed the cluster. I suppose I’m curious whether there is a more elegant solution to hitting this limit?

The process we took was…

  1. Cluster crashed
  2. Google searched for error with “dgraph” in the search term. No hits.
  3. Google searched for error without “dgraph” in search term. BadgerDB error hits on this discuss forum.
  4. Address the issue.
  5. Restart the swarm.
  6. Working again.

This just doesn’t seem like a production level resolution to hitting 1.1TB on a node of a predicate.

Are there plans in place to address memory size limits or predicate sharding across nodes?

Addressing this: Splitting predicates into multiple groups - #13 by eugaia, seems like it could mitigate the issue substantially.

Thanks,
Ryan

Thanks Ryan. We just added some documentation about this based on your comments above, which is waiting in a PR for our next doc release. I appreciate your comments here, which help people find this error if they do encounter it.

For other readers I want to clarify that sharding individual predicates will only be needed if you have a single huge predicate taking up 1.1TB on disk. Dgraph already shards by moving various predicates around among node groups to keep things balanced. So you can scale vertically as above by increasing levels to handle 11.1TB per machine (though most machines don’t scale to that level) or scale horizontally by adding new alpha node groups which will result in the data being split among groups.

While most people will never see a single 1TB predicate, there is an existing roadmap item to shard individual predicates here: ticket: Single predicate sharded across groups.

Hey Damon,
Let me expand a little bit on our environment and correct my single predicate note from above…

Environment
We have 4 Groups, with each Group hosting 3 instances. We hit the max levels at 1.1TB due to the sum of 12 predicates/tablets in Group 3 exceeding the limit (not a single predicate hitting the limit). These Groups are hosted on servers with SSDs and local networking to each other. They run in a Docker swarm.

A look at our Groups today:

Tablet #   Group #1   Group #2   Group #3   Group #4
1          212.4      216.8      141.9      273.5
2          205.4      107.9      140.5      189.0
3          53.7       26.0       134.0      112.5
4          0.0        22.6       59.3       103.3
5          0.0        13.7       58.4       7.0
6          0.0        3.9        53.4       0
7          0.0        -          50.8       0
8          -          -          49.7       -
9          -          -          48.6       -
10         -          -          48.0       -
11         -          -          24.0       -
12         -          -          2.0        -
Total      471.5GB    390.9GB    796.4GB    685.3GB

LSM Compaction Levels
Our understanding is that WithMaxLevels defines the max number of compaction levels in the LSM (log-structured merge tree), defaulted to 7 levels with a 1.1TB limit.

Badger Compaction Levels, for reference:

Level   Size     Where the Compaction is Performed
1       10MiB    In Memory
2       100MiB   On Disk
3       1GiB     On Disk
4       10GiB    On Disk
5       100GiB   On Disk
6       1TiB     On Disk
7       10TiB    On Disk (Default)
8       100TiB   On Disk

I’m curious why setting --badger=maxlevels=8 actually opens up the space to 10TiB, i.e. level 7 in this table, and why, correspondingly, maxlevels needs to be 7 rather than 6 to get to 1TiB. Is this table off (it’s just our notes)? Is there a better description of the levels somewhere?

Description of problem and resolution
We hit 1.1TB in Group #3, Instance 1. Instance 1 could no longer compact, ingest data, or serve requests; it died. Instance 2 in Group #3 was auto-promoted, with the same result. Instance 3 in Group #3 remained up, but since there wasn’t a quorum for the RAFT vote, it never started doing anything. This left our entire cluster in a bad state. Users were unable to perform queries and systems were unable to write to the database - not just to Group #3, but the entire database.
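The quorum arithmetic behind this is simple: a Raft group of n voting members needs a majority, n/2 + 1, to elect a leader or commit writes. A small Go sketch, with the helper names quorum and hasQuorum chosen just for illustration:

```go
package main

import "fmt"

// quorum returns the smallest majority of n voting members.
func quorum(n int) int { return n/2 + 1 }

// hasQuorum reports whether the alive members form a majority of n.
func hasQuorum(n, alive int) bool { return alive >= quorum(n) }

func main() {
	const replicas = 3 // instances per Group, as described above

	for alive := replicas; alive >= 0; alive-- {
		fmt.Printf("alive=%d of %d -> quorum=%v\n",
			alive, replicas, hasQuorum(replicas, alive))
	}
}
```

With 3 instances the majority is 2, so once Instances 1 and 2 were down the single remaining instance could not make progress on its own.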

I want to underscore the magnitude of this issue - once the limit is reached and the Group cannot reach a quorum any longer, Dgraph is unusable until that Group is recovered.

After we applied --badger=maxlevels=8 to enable the next compaction level, dgraph/badger actually freed-up more than 200GB due to the ability to perform the compaction. Maybe the default should be 8?

Below is a comparison of Group #3 on Oct 31st (when we had the issue) vs today. We haven’t deleted any data; only additions have occurred. You can observe the positive effect compaction had after setting maxlevels to 8: it amounted to a savings of more than 200GB. What’s the overhead of running compaction?

Tablet #   Before (Oct 31)   After (Nov 9)
1          327.3             134.0
2          206.0             126.3
3          129.0             141.9
4          53.9              59.3
5          52.9              58.4
6          48.4              53.4
7          46.0              50.8
8          45.2              49.7
9          44.1              48.6
10         43.7              48.0
11         21.8              24.0
12         2.0               2.0
Total      1020.3GB          796.4GB
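For anyone who wants to check the totals, the arithmetic can be reproduced with a few lines of Go. The numbers are copied from the table above; the helper names sum and freedGB are made up for this example.

```go
package main

import "fmt"

// Tablet sizes in GB for Group #3, copied from the table above.
var (
	before = []float64{327.3, 206.0, 129.0, 53.9, 52.9, 48.4, 46.0, 45.2, 44.1, 43.7, 21.8, 2.0}
	after  = []float64{134.0, 126.3, 141.9, 59.3, 58.4, 53.4, 50.8, 49.7, 48.6, 48.0, 24.0, 2.0}
)

// sum adds up a list of tablet sizes.
func sum(sizes []float64) float64 {
	var total float64
	for _, s := range sizes {
		total += s
	}
	return total
}

// freedGB returns how much smaller the second snapshot is than the first.
func freedGB(first, second []float64) float64 {
	return sum(first) - sum(second)
}

func main() {
	b, a := sum(before), sum(after)
	saved := freedGB(before, after)
	fmt.Printf("before: %.1f GB, after: %.1f GB, freed: %.1f GB (%.1f%%)\n",
		b, a, saved, 100*saved/b)
}
```

The difference works out to 223.9GB, a bit over a fifth of the pre-compaction size, even though data was only added in between.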

A couple recommendations:

  1. Move to a read-only state where users could query data, but no new data could be inserted (possibly through user-defined watermarks).
  2. Automatically split predicates/tablets across Groups. (REF: Splitting predicates into multiple groups)
    2.a Tablets located within Groups can move from one Group to another Group, but the Keys/Values within the Tablet cannot be split/shared across Groups - which ultimately necessitates vertical scaling or clever predicate design until this is resolved.
  3. Improve rebalancing tablets so it doesn’t take as long and it doesn’t timeout (REF: After bulk load, dgraph times out during rebalance)
    3.a Moving tablets/predicates to different groups takes a large amount of time and resources. When we rebalance, we often see dgraph service degradation followed by a timeout. After the timeout, the tablet hasn’t moved so the service degradation was for nothing and successive tries must be attempted.
    3.b We’ve seen this often enough, we’ve tuned the auto-rebalance to happen every 90 days. We’d ideally like to see user-configurable throttling and timeout implemented, as well as a retry that picks up from where dgraph left-off if the timeout does occur.

We’ve really enjoyed using DGraph. We’re looking forward to future updates/releases. Most definitely glad to see the momentum is picking up again!

Best,
Ryan

Thanks for all this! It looks to me like the levels are numbered 0 to maxLevels-1, with Level 0 consisting of the in-memory SST structures. That’s from a quick scan of some code, and could be wrong.
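If the levels are indeed numbered 0 to maxLevels-1, one way to picture the panic is the simplified model below. It sketches how a base level could be derived from the size of the last level, assuming a 10MiB base level size and a multiplier of 10. The function name baseLevel and the exact rule are illustrative and are not copied from Badger’s levels.go, so treat it as a guess at the mechanism rather than a description of it.

```go
package main

import "fmt"

// baseLevel is an illustrative model, not Badger's actual code.
// It walks from the last level (maxLevels-1) down to level 1, dividing the
// last level's size by the multiplier at each step, and returns the first
// level whose target is at or below baseSize. It returns 0 when no level
// qualifies, which is the situation the panic message complains about.
func baseLevel(maxLevels int, lastLevelSize, baseSize, multiplier int64) int {
	size := lastLevelSize
	for level := maxLevels - 1; level > 0; level-- {
		if size <= baseSize {
			return level
		}
		size /= multiplier
	}
	return 0
}

func main() {
	const (
		baseMiB    = 10      // assumed base level size in MiB
		multiplier = 10      // assumed level size multiplier
		lastMiB    = 1100000 // hypothetical last-level size, a little over 1TiB
	)

	for _, maxLevels := range []int{7, 8} {
		fmt.Printf("maxLevels=%d -> base level %d\n",
			maxLevels, baseLevel(maxLevels, lastMiB, baseMiB, multiplier))
	}
}
```

In this model, a last level of about 1.1 million MiB leaves no room for a base level with 7 levels (the result is 0), while 8 levels gives a base level of 1 again. That would also explain why the usable top level is maxLevels-1 rather than maxLevels.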

I agree that it would be friendlier if the server rejected writes when there is no more room in the levels but continued to serve reads. In any case, “Base level can’t be zero” is not that informative without googling and finding this or a similar online forum message. If you are willing, perhaps you can add a feature request for that on GitHub.

Is your overall DB 1.1TB (before re-compaction) or is each group holding 1.1TB, so 4+TB across 4 groups? I would expect the latter. Or at least close to 1TB per group since balancing predicates among the groups will never be perfect.

We could alter default maxLevels but that would allow 10TB on a single machine, which probably creates a different set of problems. Perhaps changing the levelMultiplier to 11 or 12 would be better, but even that depends on the machines and use case for a database, so it may be good for people to really think about data size per machine up around 1TB each, and get an earlier or less severe warning/error.