Badger panics after db reaches 1.1T

Hey Damon,
Let me expand a bit on our environment and correct my single-predicate note from above…

Environment
We have 4 Groups, each hosting 3 instances. We hit the max levels at 1.1TB because the sum of 12 predicates/tablets in Group #3 exceeded the limit (not a single predicate hitting the limit on its own). These Groups are hosted on servers with SSDs and local networking between them, and they run in a Docker Swarm.

A look at our Groups today:

| Tablet # | Group #1 (GB) | Group #2 (GB) | Group #3 (GB) | Group #4 (GB) |
|---|---|---|---|---|
| 1 | 212.4 | 216.8 | 141.9 | 273.5 |
| 2 | 205.4 | 107.9 | 140.5 | 189.0 |
| 3 | 53.7 | 26.0 | 134.0 | 112.5 |
| 4 | 0.0 | 22.6 | 59.3 | 103.3 |
| 5 | 0.0 | 13.7 | 58.4 | 7.0 |
| 6 | 0.0 | 3.9 | 53.4 | 0.0 |
| 7 | 0.0 | - | 50.8 | 0.0 |
| 8 | - | - | 49.7 | - |
| 9 | - | - | 48.6 | - |
| 10 | - | - | 48.0 | - |
| 11 | - | - | 24.0 | - |
| 12 | - | - | 2.0 | - |
| Total | 471.5 | 390.9 | 796.4 | 685.3 |

LSM Compaction Levels
Our understanding is that WithMaxLevels defines the maximum number of compaction levels in the LSM (log-structured merge tree), which defaults to 7 levels and works out to a 1.1TB limit.

Badger Compaction Levels, for reference:

| Level | Size | Where the Compaction is Performed |
|---|---|---|
| 1 | 10MiB | In Memory |
| 2 | 100MiB | On Disk |
| 3 | 1GiB | On Disk |
| 4 | 10GiB | On Disk |
| 5 | 100GiB | On Disk |
| 6 | 1TiB | On Disk |
| 7 | 10TiB | On Disk (Default) |
| 8 | 100TiB | On Disk |

I’m curious why, when we set --badger=maxlevels=8, it actually opens up the space to 10TiB, i.e. level 7 in the table above, and why, conversely, getting to 1TiB requires maxlevels=7, not 6. Is this table off by one (it’s just our notes)? Is there a better description of the levels somewhere?
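To sanity-check our own reading, here is a minimal Go sketch (our arithmetic, not Badger’s actual code) assuming the default level-size multiplier of 10 and a ~10MiB first on-disk level. It also hints at why the numbering looks off by one: Badger’s level 0 lives in memory and is capped by table count rather than size, while our table above starts counting at 1.

```go
package main

import "fmt"

// Illustrative only: per-level size caps assuming Badger-style defaults.
// These constants are our assumptions, not values read from Badger itself.
const (
	maxLevels  = 7        // default MaxLevels
	baseSize   = 10 << 20 // ~10MiB cap for the first on-disk level (L1)
	multiplier = 10       // each level is 10x the previous one
)

func main() {
	var total int64
	size := int64(baseSize)
	// L0 is memory-resident and capped by table count, so start at L1.
	for lvl := 1; lvl < maxLevels; lvl++ {
		fmt.Printf("L%d cap: %d bytes\n", lvl, size)
		total += size
		size *= multiplier
	}
	// With maxLevels=7 this sums to roughly 1.1-1.2TB, matching the limit we
	// hit; maxlevels=8 would add one more level of ~10TiB.
	fmt.Printf("total on-disk cap: ~%.2f TB\n", float64(total)/1e12)
}
```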

Description of problem and resolution
We hit 1.1TB in Group #3, Instance 1. Instance 1 could no longer compact, ingest data, or serve requests; it died. Instance 2 in Group #3 was auto-promoted, with the same result. Instance 3 in Group #3 remained up, but without a quorum for the Raft vote it never started doing anything. This left our entire cluster in a bad state: users were unable to perform queries and systems were unable to write to the database, not just to Group #3 but to the entire database.

I want to underscore the magnitude of this issue: once the limit is reached and the Group can no longer reach a quorum, Dgraph is unusable until that Group is recovered.

After we applied --badger=maxlevels=8 to enable the next compaction level, dgraph/badger actually freed up more than 200GB because it could finally perform the compaction. Maybe the default should be 8?
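For what it’s worth, the same knob appears to be exposed directly on Badger’s options for anyone embedding Badger rather than running it under Dgraph. A hedged sketch (the data path is a placeholder, and we’re assuming the v3 module path):

```go
package main

import (
	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Raise the maximum number of LSM levels from the default of 7 to 8,
	// adding one more (much larger) compaction level.
	opts := badger.DefaultOptions("/data/badger").WithMaxLevels(8)

	db, err := badger.Open(opts)
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```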

Below is a comparison of Group #3 on Oct 31st (when we had the issue) versus today. We haven’t deleted any data; only additions have occurred. You can see the positive effect compaction had after setting maxlevels to 8: roughly 200GB of savings. What’s the overhead of running compaction?

| Tablet # | Before, Oct 31 (GB) | After, Nov 9 (GB) |
|---|---|---|
| 1 | 327.3 | 134.0 |
| 2 | 206.0 | 126.3 |
| 3 | 129.0 | 141.9 |
| 4 | 53.9 | 59.3 |
| 5 | 52.9 | 58.4 |
| 6 | 48.4 | 53.4 |
| 7 | 46.0 | 50.8 |
| 8 | 45.2 | 49.7 |
| 9 | 44.1 | 48.6 |
| 10 | 43.7 | 48.0 |
| 11 | 21.8 | 24.0 |
| 12 | 2.0 | 2.0 |
| Total | 1020.3 | 796.4 |

A couple recommendations:

  1. Move to a read-only state once the limit is approached, where users can still query data but no new data can be inserted (possibly driven by user-defined watermarks).
  2. Automatically split predicates/tablets across Groups. (REF: Splitting predicates into multiple groups)
    2.a Tablets can move from one Group to another, but the keys/values within a tablet cannot be split or shared across Groups, which ultimately forces vertical scaling or clever predicate design until this is resolved. (A sketch of how we trigger a manual tablet move today follows this list.)
  3. Improve tablet rebalancing so it doesn’t take as long and doesn’t time out. (REF: After bulk load, dgraph times out during rebalance)
    3.a Moving tablets/predicates to a different Group takes a large amount of time and resources. When we rebalance, we often see Dgraph service degradation followed by a timeout; after the timeout the tablet still hasn’t moved, so the degradation was for nothing and successive attempts must be made.
    3.b We’ve seen this often enough that we’ve tuned the auto-rebalance to happen only every 90 days. We’d ideally like user-configurable throttling and timeouts, plus a retry that picks up where Dgraph left off if a timeout does occur.
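As mentioned in 2.a, here is a sketch of how we trigger a manual tablet move today, assuming Zero’s admin HTTP port is the default 6080 and that the /moveTablet endpoint is available in your version; the predicate name and target group are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	zero := "http://localhost:6080"         // Dgraph Zero admin HTTP address (default port)
	pred := url.QueryEscape("my.predicate") // hypothetical predicate/tablet name
	target := 2                             // hypothetical destination group

	// Ask Zero to schedule moving the tablet to the target group.
	resp, err := http.Get(fmt.Sprintf("%s/moveTablet?tablet=%s&group=%d", zero, pred, target))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // Zero replies with whether the move was scheduled
}
```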

We’ve really enjoyed using Dgraph and are looking forward to future updates and releases. Most definitely glad to see the momentum picking up again!

Best,
Ryan
