Hey Damon,
Let me expand a little bit on our environment and correct my single predicate note from above…
Environment
We have 4 Groups, with each Group hosting 3 instances. We hit the max-levels limit at 1.1TB because the sum of the 12 predicates/tablets in Group #3 exceeded the limit (not a single predicate hitting the limit). These Groups are hosted on servers with SSDs and local networking between them, and they run in a Docker swarm.
A look at our Groups today:
| Tablet # | Group #1 (GB) | Group #2 (GB) | Group #3 (GB) | Group #4 (GB) |
|---|---|---|---|---|
| 1 | 212.4 | 216.8 | 141.9 | 273.5 |
| 2 | 205.4 | 107.9 | 140.5 | 189.0 |
| 3 | 53.7 | 26.0 | 134.0 | 112.5 |
| 4 | 0.0 | 22.6 | 59.3 | 103.3 |
| 5 | 0.0 | 13.7 | 58.4 | 7.0 |
| 6 | 0.0 | 3.9 | 53.4 | 0.0 |
| 7 | 0.0 | - | 50.8 | 0.0 |
| 8 | - | - | 49.7 | - |
| 9 | - | - | 48.6 | - |
| 10 | - | - | 48.0 | - |
| 11 | - | - | 24.0 | - |
| 12 | - | - | 2.0 | - |
| Total | 471.5 | 390.9 | 796.4 | 685.3 |
LSM Compaction Levels
Our understanding is that `WithMaxLevels` defines the maximum number of compaction levels in the LSM (log-structured merge) tree; it defaults to 7 levels, which works out to a 1.1TB limit.
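For context, here's where that knob sits in Badger's Go options API as we understand it - a minimal sketch only (placeholder path and value; in Dgraph we set it through the `--badger` superflag rather than code like this):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Placeholder path and value; this just shows where the max-levels
	// knob lives in Badger's options API, as we understand it.
	opts := badger.DefaultOptions("/tmp/badger-test").WithMaxLevels(8)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```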
Badger Compaction Levels, for reference:
| Level | Size | Where the Compaction Is Performed |
|---|---|---|
| 1 | 10 MiB | In memory |
| 2 | 100 MiB | On disk |
| 3 | 1 GiB | On disk |
| 4 | 10 GiB | On disk |
| 5 | 100 GiB | On disk |
| 6 | 1 TiB | On disk |
| 7 | 10 TiB | On disk (default) |
| 8 | 100 TiB | On disk |
I’m curious why, when we set `--badger=maxlevels=8`, it actually opens up the space to 10TiB (i.e., level 7 in our table), and why, to get to 1TiB, maxlevels needs to be 7 rather than 6. Is this table off (it's just our notes)? Is there a better description of the levels somewhere?
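For what it's worth, here's the back-of-the-envelope arithmetic behind our confusion - a rough sketch that assumes the first level is the in-memory one with no size cap and that the on-disk levels grow 10x starting from 10MiB; we don't know whether this matches Badger's actual accounting:

```go
package main

import "fmt"

// Back-of-the-envelope arithmetic only: assume the first level (in-memory)
// has no size cap, the first on-disk level is 10 MiB, and each deeper level
// is 10x larger. Whether this matches Badger's real accounting is exactly
// the question above.
func main() {
	const mib = float64(1 << 20)
	const gib = float64(1 << 30)

	for maxLevels := 6; maxLevels <= 8; maxLevels++ {
		size := 10 * mib // assumed size of the first on-disk level
		total := 0.0
		for i := 1; i < maxLevels; i++ { // maxLevels-1 on-disk levels
			total += size
			size *= 10 // assumed level size multiplier
		}
		fmt.Printf("maxlevels=%d -> deepest level %.1f GiB, total %.1f GiB\n",
			maxLevels, size/10/gib, total/gib)
	}
}
```

Under those assumptions, maxlevels=7 tops out around 1.1TB and maxlevels=8 opens up roughly 10TiB, which matches what we observed - but that mapping is exactly what we'd like confirmed.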
Description of problem and resolution
We hit 1.1TB in Group #3, Instance 1. Instance 1 could no longer compact, ingest data, or serve requests; it died. Instance 2 in Group #3 was auto-promoted, with the same result. Instance 3 in Group #3 remained up, but since there wasn't a quorum for the Raft vote, it never started doing anything. This left our entire cluster in a bad state: users were unable to perform queries and systems were unable to write to the database - not just to Group #3, but the entire database.
I want to underscore the magnitude of this issue - once the limit is reached and the Group can no longer reach quorum, Dgraph is unusable until that Group is recovered.
After we applied `--badger=maxlevels=8` to enable the next compaction level, dgraph/badger actually freed up more than 200GB because it was finally able to perform the compaction. Maybe the default should be 8?
Below is a comparison of Group #3 on Oct 31st (when we had the issue) vs. today. We haven't deleted any data; only additions have occurred. You can see the positive effect compaction had after setting maxlevels to 8 - it amounted to a 200GB savings. What's the overhead of running compaction?
| Tablet # | Before: Oct 31 (GB) | After: Nov 9 (GB) |
|---|---|---|
| 1 | 327.3 | 134.0 |
| 2 | 206.0 | 126.3 |
| 3 | 129.0 | 141.9 |
| 4 | 53.9 | 59.3 |
| 5 | 52.9 | 58.4 |
| 6 | 48.4 | 53.4 |
| 7 | 46.0 | 50.8 |
| 8 | 45.2 | 49.7 |
| 9 | 44.1 | 48.6 |
| 10 | 43.7 | 48.0 |
| 11 | 21.8 | 24.0 |
| 12 | 2.0 | 2.0 |
| Total | 1020.3 | 796.4 |
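On the compaction-overhead question: for a plain Badger database (not Dgraph's p directory specifically), something like the sketch below is how we'd try to measure it, using Badger's `Flatten` to force compaction and `Size` to compare the LSM/value-log footprint before and after. Path, max levels, and worker count are placeholders.

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Placeholder path; plain Badger usage, not Dgraph's own managed setup.
	opts := badger.DefaultOptions("/tmp/badger-test").WithMaxLevels(8)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Size reports the on-disk footprint of the LSM tree and value log.
	lsm, vlog := db.Size()
	fmt.Printf("before flatten: lsm=%d vlog=%d\n", lsm, vlog)

	// Flatten forces compactions so tables end up on as few levels as
	// possible; the compaction cost (CPU + disk I/O) shows up here.
	if err := db.Flatten(4); err != nil {
		log.Fatal(err)
	}

	lsm, vlog = db.Size()
	fmt.Printf("after flatten:  lsm=%d vlog=%d\n", lsm, vlog)
}
```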
A few recommendations:
1. Move to a read-only state where users can still query data but no new data can be inserted (possibly through user-defined watermarks).
2. Automatically split predicates/tablets across Groups. (REF: Splitting predicates into multiple groups)
   a. Tablets located within Groups can move from one Group to another, but the keys/values within a tablet cannot be split/shared across Groups - which ultimately necessitates vertical scaling or clever predicate design until this is resolved.
3. Improve tablet rebalancing so it doesn't take as long and doesn't time out. (REF: After bulk load, dgraph times out during rebalance)
   a. Moving tablets/predicates to a different group takes a large amount of time and resources. When we rebalance, we often see dgraph service degradation followed by a timeout. After the timeout, the tablet hasn't moved, so the degradation was for nothing and successive tries must be attempted.
   b. We've seen this often enough that we've tuned auto-rebalance to happen only every 90 days. We'd ideally like to see user-configurable throttling and timeouts, as well as a retry that picks up where dgraph left off if a timeout does occur.
We’ve really enjoyed using Dgraph, and we're looking forward to future updates and releases. Most definitely glad to see the momentum picking up again!
Best,
Ryan