Unable to reach leader in group 1 - dir structures help

We ran badger info on a bad node:

[EXTRA]
1046234.mem 128MiB
DISCARD 1.0MiB
KEYREGISTRY 28B
LOCK 3B

[Summary]
Level 0 size :    14 KiB
Level 1 size :   9.3 MiB
Level 2 size :   9.7 MiB
Level 3 size :    54 MiB
Level 4 size :   559 MiB
Level 5 size :   5.5 GiB
Level 6 size :    55 GiB
Level 7 size :   547 GiB
Total SST size:  608 GiB
Value log size:  2.0 GiB

Abnormalities:
4 extra files.
0 missing files.
0 empty files.
0 truncated manifests.
Error: failed to open database err: while opening memtables error: while opening fid: 1046234 error: while updating skiplist error: end offset: 16978 < size: 134217728 error: Log truncate required to run DB. This might result in data loss

Looks like we have to run a log truncate? Is that the stream command you mentioned?

Thanks

I backed up the files and ran:

badger flatten --dir ./p
...
panic: runtime error: index out of range [7] with length 7

I’m guessing this is because our MaxLevels is set to 8, but I’m not sure how to override this with the badger flatten command.

I also ran:

badger stream -dir ./p ./new_p
...
Error: cannot open DB at ./p error: while opening memtables error: while opening fid: 1046240 error: while updating skiplist error: end offset 20 < 134217728 error: Log truncate required to run DB.  This might result in data loss

I suppose we need to truncate the file then?

Running:

badger info --read-only=false --truncate=true --dir ./p
...
panic: runtime error: index out of range [7] with length 7
...
/ext-go/1/src/github.com/dgraph-io/badger/levels.go:163 +0x805
/ext-go/1/src/github.com/dgraph-io/badger/levels.go:129 +0x645

Looks like I need to specify the maxLevels=8 for our specific scenario. I’m not sure how to do this when running badger info.

It doesn’t look like the badger info command allows setting maxLevels to match what is configured in the Badger database.
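
If that’s the case, I’m guessing the only workaround for now is a small Go program that opens the directory through the Badger library and sets MaxLevels there, since the stack traces point at badger/v3. Something like this rough, untested sketch (run against a backup copy of ./p; the worker count and paths are just placeholders):

package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open the backed-up p directory with the same MaxLevels that Dgraph used.
	// Opening in write mode replays the memtable WAL, so this may still hit the
	// "Log truncate required" condition the CLI warned about.
	opt := badger.DefaultOptions("./p").WithMaxLevels(8)
	db, err := badger.Open(opt)
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	// Roughly what `badger flatten` does, but with MaxLevels honoured.
	if err := db.Flatten(2); err != nil {
		log.Fatalf("flatten: %v", err)
	}
}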

Testing the copy/paste method… I’d love some thoughts on whether this works.

Steps:

  1. Tar up the p, w, and t directories on a healthy alpha in the group.
  2. Scp the tarball to the unhealthy alpha in the group and untar it.
  3. Start up the unhealthy alpha pointed at the new p, w, and t directories.

Will this recover an alpha?

@MichelDiz, please advise… How would we prevent this from happening during a hardware power outage in the future?

I have asked other engineers to take a look at your case. Give us a day or so.

Looks like you have corrupted data.


@rebamun

To prevent data corruption during a hardware power outage, there are a few steps you can take. First, using a storage system such as ZFS can help, as it has options for disk replication and the ability to create iSCSI disks (not sure if that would work with Docker, but it works for K8s). ZFS also allows snapshots to be taken, which can aid in data recovery if corruption occurs.

Additionally, using ECC RAM in servers can greatly reduce the likelihood of data corruption due to memory errors. In the context of Dgraph, enabling the HA (High Availability) feature can also help mitigate data corruption by performing full replication of the group.

Another technique in Dgraph is to use Learner Nodes (EE), which retain data and can serve as a backup in the event of data corruption. You could use the Learner’s p directory as a restore point. And if you are using EE, just use the backup option.

It’s worth noting that there are several other issues that can cause data corruption, such as transient I/O errors due to a bad disk or controller, on-disk data corruption due to cosmic rays, driver bugs resulting in data being transferred to or from the wrong location, or a user accidentally overwriting portions of the physical device.

A few other methods include:

  1. Uninterruptible Power Supply: A UPS can provide temporary power to the hardware in case of a power outage, allowing the system to shut down properly or to switch to an alternate power source.
  2. Redundant Power Supply: RPS provides an additional power source in case one of the power sources fails. This ensures that the hardware can continue to operate without interruption.
  3. RAID: a hardware/software RAID array or a ZFS mirror/RAID setup.
  4. Backups: Regularly backing up the data to an offsite location or a cloud storage service can ensure that data is not lost in case of a hardware power outage.

Cheers.

Hello @rahst12,

In addition to Michel’s suggestions: typically, we do not expect the log truncate error with badger info --dir p unless the copied ‘p’ directory still has a lock, due to db.Close() not being called properly. In such cases, opening it with db.Open() would result in the error. This is all by design, as we do not want multiple processes to access Badger at the same time, but it can cause issues when the Dgraph Alpha process terminates unexpectedly.

The LOCK file in the ‘p’ dir ensures other processes cannot obtain a new lock, but you can delete the LOCK file if the original process that held it is no longer running, and then retry with badger info --dir p. The --read-only=false and --truncate flags should also not be required in this case. Similarly, badger stream and badger flatten should also work fine.

In scenarios where you have 2 Alphas failing out of 3 in an HA cluster, resulting in a complete loss of quorum, the following steps can be used to rebuild the 2 failed Alphas from scratch using snapshot streaming. The steps below assume you’re running Dgraph on bare-metal hosts, but the process is similar for other types of deployments:

  1. Stop the alpha process on both the affected Alpha nodes. Ensure that the leader Alpha is still online and responding to best-effort queries.
  2. Rename the ‘p’, ‘w’, and ‘t’ directories on both the bad Alphas.
  3. As done previously, run the /removeNode operation on the Zero leader to remove both Alphas based on their RAFT IDs.
  4. Run the following command on the Zero leader to ensure that both Alphas have been removed from the RAFT state. Check the “alphas” and “removed” sections of the output to confirm (requires the jq tool).

curl -s localhost:6080/state | jq '{alphas: .groups."1".members, removed: .removed, zeros: .zeros}'

  5. Once we confirm that both the Alphas have been removed, start up one of the Alphas first. At this point, a snapshot transfer should be initiated and you should see the following messages in the leader logs:
17 snapshot.go:294] Got StreamSnapshot request: context:<id:4 group:1 addr:"alpha-internal-hostname" > index:29912278 read_ts:31082312
17 snapshot.go:203] Waiting to reach timestamp: 31082312
17 log.go:34] Sending Snapshot Streaming about 41 GiB of uncompressed data (13 GiB on disk)
17 log.go:34] Sending Snapshot [05s] Scan (8): ~188.4 MiB/41 GiB at 0 B/sec. Sent: 224.0 MiB at 0 B/sec. jemalloc: 1004 MiB
17 log.go:34] Sending Snapshot [10s] Scan (8): ~319.3 MiB/41 GiB at 13 MiB/sec. Sent: 384.0 MiB at 16 MiB/sec. jemalloc: 1000 MiB
17 log.go:34] Sending Snapshot [15s] Scan (8): ~395.9 MiB/41 GiB at 14 MiB/sec. Sent: 480.0 MiB at 17 MiB/sec. jemalloc: 1003 MiB
..
  6. Once the snapshot stream completes (as per the leader logs below), you should have the ‘p’ directory recreated on the bad Alpha using snapshot streaming, which uses Badger’s Stream API (see the rough sketch after these steps).
17 snapshot.go:259] Received ACK with done: true
17 snapshot.go:300] Stream snapshot: OK
17 draft.go:137] Operation completed with id: opSnapshot
  7. At this point, this Alpha should be back online along with the leader Alpha, and hence quorum should be restored. The backend should be available to serve requests.

  8. Repeat this process for the second bad Alpha. Ensure that /removeNode is called and the ‘p’, ‘w’, and ‘t’ directories are renamed before you start the Alpha process back up, to force snapshot streaming. If we miss cleaning out any of the ‘p’ or ‘w’ directories, the bad Alpha will fail to rejoin the existing RAFT cluster.
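
For background, the snapshot transfer above rebuilds the ‘p’ directory through Badger’s Stream framework: the leader streams key ranges and the follower bulk-loads them with a StreamWriter. Purely as an illustration (a rough sketch against badger v3’s public API with placeholder paths, not the actual Dgraph code), a stream-based copy of one Badger directory into another looks roughly like this:

package main

import (
	"context"
	"log"

	badger "github.com/dgraph-io/badger/v3"
	"github.com/dgraph-io/ristretto/z"
)

func main() {
	src, err := badger.Open(badger.DefaultOptions("./p").WithReadOnly(true))
	if err != nil {
		log.Fatalf("open source: %v", err)
	}
	defer src.Close()

	dst, err := badger.Open(badger.DefaultOptions("./new_p"))
	if err != nil {
		log.Fatalf("open destination: %v", err)
	}
	defer dst.Close()

	// StreamWriter bulk-loads sorted batches into the (empty) destination DB.
	sw := dst.NewStreamWriter()
	if err := sw.Prepare(); err != nil {
		log.Fatalf("prepare: %v", err)
	}

	// Stream iterates the source in parallel and hands each batch to Send.
	stream := src.NewStream()
	stream.LogPrefix = "snapshot-copy"
	stream.Send = func(buf *z.Buffer) error {
		return sw.Write(buf)
	}
	if err := stream.Orchestrate(context.Background()); err != nil {
		log.Fatalf("stream: %v", err)
	}
	if err := sw.Flush(); err != nil {
		log.Fatalf("flush: %v", err)
	}
}

In the Dgraph case, the same machinery runs over gRPC between the leader and the recovering follower, so you never need to drive it by hand - the logs above are what you should watch for.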

Can you try the above steps and let us know if they help?

Best,


@MichelDiz
We’re running a RAID 5 (striped with 1 parity) and a hot spare, with SSD SAS Mix Use 12Gbps drives. The RAM has ECC data-integrity checks. These are on 6x Dell R940s with an RPS. They’re supplied power through 2x UPSs.

We have 4 Groups. Each Group has 3 Alphas. There are 3 Zeros.

We’re running Dgraph in Docker Swarm.

The issue
We experienced an unscheduled power outage where 3 of the 6 servers came online after the outage and 3 had to be manually restarted a few hours later.

With all 6 servers up, the Dgraph state became:
1 of the 3 Zeros was alive, with no leader.
Group #1 had 1 Alpha alive, but it was not the leader.
Group #2 had 2x Alphas alive, with a leader.
Group #3 had 2x Alphas alive, with a leader.
Group #4 had 2x Alphas alive, with a leader.

Repairs
Zero #2 came back online just fine.
Zero #1 didn’t have the --peer flag in its startup command and created its own cluster. This resulted in a split-brain, with some of the Alphas now attached to it. We eventually fixed this by adding the --peer flag.

The dead Alphas in Groups #2, #3, and #4 couldn’t move past a DirectedEdge: illegal tag error. To correct this, we called removeNode for each one and created a new Alpha. Replication occurred.

Group #1 - The alive Alpha had all the data needed; it just needed to replicate it to the 2x dead ones.
The 2x dead Alphas also couldn’t move past the DirectedEdge: illegal tag error. We called removeNode on both of these and created 2x new Alphas, expecting a leader to be elected and replication to occur. They just continued to throw the error: Error while calling hasPeer: Unable to reach leader in group 1. Retrying... (described here)

Group #1’s Alpha that was up appeared to believe that the original Alphas in its group were never removed. We confirmed this by running the dgraph debug -o command on the w directory. The Zeros also thought the original Group #1 Alphas were still there. They correctly appeared in the /state “removed” nodes list, but when running dgraph debug -o on the w directory, the Snapshot Metadata: {ConfState:{Nodes:[]... output showed the removed nodes still present. That appeared to line up with the error messages we were seeing from the Alpha that was up the whole time… Unable to send message to peer: 0x1. Error: Do not have address of peer 0x1

Once we decided we couldn’t fix the group’s ability to elect a leader, we put the original node’s data (everything was backed up) back into the newly created Alphas. Back to the DirectedEdge: illegal tag error.

We attempted to run/repair the data with badger info --read-only=false --truncate=true, badger flatten, and badger stream; however, they all failed because we have maxlevels configured to 8 instead of the default 7, and there’s currently no option to override that in the CLI. We have maxlevels set to 8 because Badger panics once it reaches 1.1TB, which roughly lines up with the capacity of the default 7-level LSM tree with a 10x level-size multiplier.

Once we realized we couldn’t repair the data because of the Badger errors, we decided we needed to take more drastic action to get replication working. We removed all nodes from Dgraph. We started up 3 completely clean Zeros (brand-new zw folders) with no knowledge of the previous cluster state. We started up Group #1 Alpha 1 (the one with the working copy of the data), and then started up an empty 2nd and 3rd node. Dgraph elected a leader from the 2nd and 3rd empty nodes, which replicated the empty database onto Alpha 1.

We tore it all down again… repeated the steps to get the Zeros up clean, copied Alpha 1’s data to Alpha 2’s p directory, and started up an empty Alpha 3. They elected a leader - this time the one with the data. Ratel showed data for the group, though all sizes were 0 bytes. We brought Groups 2, 3, and 4 online, and they came up with data. After waiting about 10 minutes for each group to come up, the tablets changed from 0 bytes to the appropriate gigabytes, except that roughly half the tablets from each group didn’t load. They’re now erroring with: While retrieving snapshot, error: cannot retrieve snapshot from peer: rpc error: code = Unknown desc = operation opSnapshot is already running. Retrying...

At this point, we’ve invested 3 days of 4 people’s effort into trying to recover the data, and there isn’t a clear known path to recovery. I do think that when we called removeNode on the erroring nodes in Group #1, we went down a path we were unable to revert from.

We’re shifting gears to rebuilding the database from scratch.

From this experience, we do have a couple of recommendations…

  1. Update the badger CLI to work with maxlevels of 8 (vs. the default of 7) so that info, stream, and flatten work.
  2. removeNode needs to be easy to revert in some cases - maybe an addNode operation.
  3. All of this could have been avoided if we could have just declared a leader in Group #1, or manually manipulated the state back to the “correct” version instead of letting Dgraph try to figure it out.

Thanks for all the support on this.

Please add those as feature requests in their respective repos so we can track them.

Thanks for the clear and detailed summary!

I believe we could’ve tried a few other things with Group #1 - the problem area - but I can’t complain looking at the overall effort that went into the recovery process.

Regarding the part where Zero’s RAFT state was not updated despite calling /removeNode, we’ve seen some instances where /removeNode had to be called twice, so that the ID is incremented twice.
Unfortunately, we haven’t been able to reproduce it yet, but this is worth adding to your notes for the future.

Regarding your recommendation about badger info, stream, or flatten not working with 7 levels, I’m unable to reproduce the problem, as everything works fine with 7 levels.
In your case, I believe the original problem was that, with the unexpected restart, a lock was left on the original ‘p’ directory, which then got replicated when a copy was made. None of the options like flatten or stream, which open the DB in write mode, would work due to the lock - neither on the original ‘p’ dir with the lock nor on the copy.

Here, if you run badger info in write mode, i.e. with --read-only=false, then on replay we check how much of the .mem file is valid. In your case, we saw the error suggesting that ~16 KB of the .mem file is data that was not flushed to disk and would be lost upon truncation; the remaining ~134 MB of the .mem file is actually empty.

Error: failed to open database err: while opening memtables error: while opening fid: 1046234 error: while updating skiplist error: end offset: 16978 < size: 134217728 error:

To open in write mode, you may want to first delete the LOCK file from the ‘p’ directory. Once it is deleted, the other commands like stream and flatten should also work fine.


@rarvikar Hi Rahul, thanks for your reply! I’m looking into your suggestions, and I’ll get back to you with my comments tomorrow.

@rarvikar @MichelDiz Good morning! Still looking into your suggestions. I do have two other questions in the meantime.

Could these issues we had be related to the critical bug with data corruption in Badger that was fixed in the latest version of Dgraph v22.0.2? I’m not sure if we ever mentioned that we were running v21.03.1.

Also, I’m recently having issues getting our Ratel Docker container to run properly. I run the following:

docker stack deploy -c ratel-stack.yml ratel --with-registry-auth

where the ratel-stack.yml looks like this:

version: "3.2"
services:
  ratel:
    image: dgraph/ratel:v21.12.0
    ports:
      - 8000:8000
    deploy:
      placement:
        constraints:
          - node.labels.worker == dgraph4

The container appears on server dgraph4 when running:

docker ps -a

but the STATUS remains in the “Created” state and never changes to “Up.”

When running this on the Docker Swarm manager server:

docker service ls

Ratel is listed but the replicas show as 0/1. Printing the logs also produces no output:

docker service logs ratel_ratel

And navigating to the Ratel URL that we have routed through NGINX (which has not changed since it was running previously) now gives a “502 Bad Gateway.”

Do you have any suggestions on how to fix this?

Thank you!

I don’t think so. It was related to ARM64. Maybe @joshua could clarify.

Try to downgrade the Ratel version.


UPDATE: The Ratel issue has been fixed. I found that the Docker Swarm overlay network for Ratel was not being replicated to the worker nodes. We were able to fix it by creating a new network.

@rarvikar Hi Rahul. We were having this issue when trying to run the badger commands with 8 levels, not 7. I was unable to replicate the errors we were having last week regarding the maxLevels, but I know that @rahst12 wanted to look into it further. He may have more feedback when he’s in tomorrow. Thanks for your help.


We upgraded to the latest version of Dgraph - v22.0.2 - but badger is no longer in the container image. Is this correct?

@rebamun Dgraph v21.03.1 uses Badger v3.2103.0, so a lot of fixes have gone into Badger since then. The full list is here.

@rahst12 The Badger CLI tool was not included in the Dgraph container image in v22.0.2, correct. This was an oversight when we were streamlining the release pipeline. It will be included in the image in the next release.

However, the Badger CLI tool was still released as an artifact during the Dgraph v22.0.2 release. Check out the release page here. You can wget the Badger CLI tool when needed.

Thanks, I found the binary on the release page.

@joshua, I pulled the v23.0.0-beta1 image and it doesn’t have the badger CLI in it (yet) either.

Quick update on testing the following command on a clean dgraph:

badger info --read-only=false --truncate=true --dir ./p

The old database (with maxlevels=8) prints:

...
[Summary]
Level 0 size :    14 KiB
Level 1 size :   9.3 MiB
Level 2 size :   9.7 MiB
Level 3 size :    54 MiB
Level 4 size :   559 MiB
Level 5 size :   5.5 GiB
Level 6 size :    55 GiB
Level 7 size :   547 GiB
Total SST size:  608 GiB
Value log size:  2.0 GiB

Abnormalities:
4 extra files.
0 missing files.
0 empty files.
0 truncated files.
panic: runtime error: index out of range [7] with length 7
goroutine 226 [running]:
github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc000907920, 0xf}, {0x0?, 0x0?, 0x0?})
   /home/runner/work/badger/badger/levels.go:163 +0x64c
created by github.com/dgraph-io/badger/v3.newLevelsController
   /home/runner/work/badger/badger/levels.go:129 +0x585

A clean Dgraph database, this one set with maxlevels=7, outputs:

[Summary]
Level 0 size :   43 B
Total SST size:  43 B
Value log size:  40 B

Abnormalities:
2 extra files.
0 missing files.
0 empty files.
0 truncated manifests.
badger 2023/04/11 17:51:01 INFO: All 1 tables opened in 2ms
badger 2023/04/11 17:51:01 INFO: Discard stats nextEmptySlot: 0
badger 2023/04/11 17:51:01 INFO: Set nextTxnTs to 1
badger 2023/04/11 17:51:01 INFO: Deleting empty file: ../alpha1/p/000004.vlog
badger 2023/04/11 17:51:01 INFO: Lifetime L0 stalled for 0s
badger 2023/04/11 17:51:01 INFO: 
Level 0 [ ]: NumTables: 01. Size: 439 B of 0 B. score: 0.00->0.00 StaleData: 0 B Target FileSize 64 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. score: 0.00->0.00 StaleData: 0 B Target FileSize 2.0 MiB
Level Done
Num Allocated Bytes at program end: 0 B

Maybe to reproduce the error the database not only needs to be set to 7 levels, but also needs data in its 7th level? I’ll try to reproduce it again tomorrow with a lot more data.
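
For the reproduction, my plan is to bulk-load throwaway data until the deeper levels fill up - roughly along these lines (an untested sketch using badger v3’s WriteBatch API; the key count and value size are arbitrary and would need tuning), possibly followed by a badger flatten to force the tables down to the bottom level:

package main

import (
	"fmt"
	"log"
	"math/rand"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	db, err := badger.Open(badger.DefaultOptions("./test_p"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// WriteBatch splits the writes into transactions internally.
	wb := db.NewWriteBatch()
	defer wb.Cancel()

	val := make([]byte, 4096)  // 4 KiB values; small enough to stay in the LSM tree
	const numKeys = 10_000_000 // scale up until the lower levels have data
	for i := 0; i < numKeys; i++ {
		rand.Read(val)
		key := []byte(fmt.Sprintf("key-%012d", i))
		if err := wb.Set(key, val); err != nil {
			log.Fatal(err)
		}
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}
}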