Proposal: Data integrity check (Silent data corruption detection)

diggy · January 1, 2020, 6:57pm

Moved from GitHub badger/1180

Abstract

This proposal describes an approach to implementing data integrity checks on top of NVMe drives, with low overhead and no breaking changes.

Background

Silent data corruption does damage data safety

We usually assume hardware is reliable because there are checksums everywhere,
but in rare situations these protections fail to detect errors, which can cause serious problems, e.g.:

Amazon S3 Availability Event: July 20, 2008

Facebook temporarily loses more than 10% of photos in hard drive failure

Netflix Outage Blamed on Hardware

Drive vendors do care about silent data corruption

Enterprise SAS drives have a technology called “T10 Data Integrity Field”, which makes a “fat sector” by appending integrity metadata to each sector.

NVMe End-to-end Data Protection (see the NVMe spec) offers almost the same thing.

Bit flips are actually rare in an SSD’s flash media and its DRAM, but an SSD’s controller contains a lot of SRAM,
where bit flips may occur much more frequently.

Here is a good article about what Intel has done
to deal with SDC (Silent Data Corruption).

(PS: I could not enable E2E protection on an AWS bare metal instance. Maybe the drives just don’t follow the spec but still have protection; I don’t know.)

PCIe has CRC

PCIe has LCRC (Link CRC) and ECRC (End-to-end CRC). ECRC is optional; if there is no switch between the endpoints, there is no need to enable it.

ECC everywhere

Memory has ECC, disk sectors have ECC, and LDPC, which has better error-correcting ability, is now widely used for SSD sectors.

The weakest part is outside of the host

The TCP checksum is very weak (see “When The CRC and TCP Checksum Disagree”, Jonathan Stone and Craig Partridge),
and each switch recalculates the link-level CRC, which means an error introduced inside the switch can only be caught by that weak end-to-end checksum.
What’s worse, the data provider may be a personal computer, which is not as reliable as the server side.

If the client cannot provide a checksum, E2E protection loses 90% of its meaning.
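To illustrate why the TCP checksum counts as weak, here is a minimal Go sketch. It is my own illustration, not taken from the cited paper. The 16-bit one's-complement sum (RFC 1071) is order-independent, so exchanging two 16-bit words leaves it unchanged, while a CRC notices the change. The helper names (internetChecksum, swapWords, crc32c) are hypothetical.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
)

// internetChecksum computes the 16-bit one's-complement checksum
// described in RFC 1071 (the algorithm TCP uses). An odd trailing
// byte is treated as the high byte of a zero-padded word.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(data[i:]))
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// swapWords returns a copy of data with the 16-bit words at word
// indexes i and j exchanged, simulating reordered data.
func swapWords(data []byte, i, j int) []byte {
	out := append([]byte(nil), data...)
	out[2*i], out[2*j] = out[2*j], out[2*i]
	out[2*i+1], out[2*j+1] = out[2*j+1], out[2*i+1]
	return out
}

// crc32c computes a CRC-32 with the Castagnoli polynomial.
func crc32c(data []byte) uint32 {
	return crc32.Checksum(data, crc32.MakeTable(crc32.Castagnoli))
}
```

Calling swapWords on a payload and comparing internetChecksum before and after gives the same value, while crc32c differs. That is the kind of corruption a client-side checksum stronger than TCP's would catch.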

Veteran database vendors have provided SDC protections

e.g. Oracle:

Hardware solution
Data integrity webcast
Lost Writes, a DBA’s Nightmare? (A practice in CERN)

Most open source databases are “weak”

They do not tolerate silent faults particularly well (see Impact of Disk Corruption on Open-Source DBMS)

No way to provide 100% protection

Any digest has some probability of failing, and any bit can flip at any time.

Proposal

Document

First, it’s important to document how silent data corruption happens, and what users should do depending on whether they care about it.
Three things must be mentioned in this doc:

End-to-end protection needs client side actions.

Times have changed: ordinary x86 servers have reliable hardware too. In the past, only customized solutions could offer that.

The theory behind our solutions:

Logically, I/O is about addressing & read/write, so logically we can get “100%” protection (in practice we can’t).
Types of data corruption: besides corrupted reads, misdirected writes and lost writes, which we already know about,
on SSDs data corruption may also happen in FTL metadata, erase operations, etc.
The principle: see below.

Principles

Built on NVMe drives and deployed in a highly available distributed system
(e.g. one based on a consensus algorithm). This provides strong hardware protection and reliable copies from which to repair data.

If we can raise an error, that is already good protection. Ordinary data corruption is easy to find
(e.g. can’t init, not found). Silent data corruption protection is about detecting a “mismatch”.
Even if we can’t find a key that is supposed to be there (because of a data integrity issue),
that’s okay: we can repair it from copies or by replaying the log. So this design simply ignores many types of SDC in the key LSM tree.

Protection options

There will be three levels: None, Typical, Full

None:

Nothing to do; best performance. In most cases it’s safe enough under the protection of hardware & protocols.

Typical:

Compare the key in the vlog on read (online check).

Scheduled scrub, see details below (offline check).

Sync write is optional.

Full:

Compare the key in the vlog on read

Verify the checksum on every read

Sync write will be enabled automatically

Scheduled scrub.
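As a rough sketch of how the three levels could map onto concrete switches, something like the following Go code might work. None of these names (ProtectionLevel, Policy, PolicyFor) exist in Badger; they are hypothetical. Leaving sync writes up to the user at the None level is my assumption, since the proposal only says sync is optional for Typical and forced for Full.

```go
package main

// ProtectionLevel is a hypothetical option selecting how much
// integrity checking is done.
type ProtectionLevel int

const (
	ProtectionNone ProtectionLevel = iota
	ProtectionTypical
	ProtectionFull
)

// Policy lists the individual checks a level turns on.
type Policy struct {
	CompareKeyOnRead     bool // compare the LSM key with the key stored in the vlog
	VerifyChecksumOnRead bool // verify the entry checksum on every read
	ScheduledScrub       bool // offline scrub of stored data
	SyncWrites           bool // sync every write to the device
}

// PolicyFor expands a level into its checks. userSyncWrites is the
// user's own sync setting; it is honored for None and Typical and
// overridden (forced on) for Full.
func PolicyFor(level ProtectionLevel, userSyncWrites bool) Policy {
	switch level {
	case ProtectionTypical:
		return Policy{
			CompareKeyOnRead: true,
			ScheduledScrub:   true,
			SyncWrites:       userSyncWrites,
		}
	case ProtectionFull:
		return Policy{
			CompareKeyOnRead:     true,
			VerifyChecksumOnRead: true,
			ScheduledScrub:       true,
			SyncWrites:           true,
		}
	default: // ProtectionNone: no extra checks
		return Policy{SyncWrites: userSyncWrites}
	}
}
```

For example, PolicyFor(ProtectionFull, false) still yields SyncWrites set to true, matching "Sync write will be enabled automatically".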

Effect of different levels:

None:

NVMe has tags (application & reference). As far as we know, misdirected writes usually happen inside the drive’s firmware,
so with the help of tags they shouldn’t happen. And with the protection provided by ECC & CRC, SDC is rare. It’s safe enough for most cases.

(A team at Alibaba Group said they found an ext4 bug that caused misdirected writes
and made their MySQL lose data, but they didn’t provide a link or details, and I haven’t found it.)

Typical:

The design of WiscKey has a good side effect: it gives us an extra piece of check information, the key.
There are two I/Os (key & value) at different positions, which helps detect write issues, because we can treat the key as a reference
(it’s almost impossible to find old data at the same position in the vlog with the same key, because the key includes a ts).
This helps detect misdirected writes in the application or filesystem layer.
Only if the writes to both the LSM tree and the vlog are lost simultaneously will such a scheme fail,
an unlikely (but unfortunately, possible!) situation (see Operating Systems: Three Easy Pieces).
But an entry may be big, so we may have the right key but the wrong data.
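A minimal sketch of the read-path check, assuming a simplified in-memory entry layout rather than Badger's actual vlog format. All names here (vlogEntry, readValue, the error values) are hypothetical. The key compare catches an entry that belongs to a different key or version. Only the optional checksum verification (the Full level) catches the "right key, wrong data" case.

```go
package main

import (
	"bytes"
	"errors"
	"hash/crc32"
)

var (
	ErrKeyMismatch      = errors.New("vlog entry key does not match LSM key")
	ErrChecksumMismatch = errors.New("vlog entry checksum mismatch")
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// vlogEntry is a simplified stand-in for a value-log record.
type vlogEntry struct {
	Key   []byte // user key including its version timestamp
	Value []byte
	CRC   uint32 // CRC-32C over Key followed by Value
}

// entryCRC computes the checksum stored alongside an entry.
func entryCRC(key, value []byte) uint32 {
	h := crc32.New(castagnoli)
	h.Write(key)
	h.Write(value)
	return h.Sum32()
}

// newEntry builds an entry with a matching checksum.
func newEntry(key, value []byte) vlogEntry {
	return vlogEntry{Key: key, Value: value, CRC: entryCRC(key, value)}
}

// readValue validates the entry found at the vlog position that the
// LSM tree pointed to. The key compare always runs; the checksum is
// verified only when verifyChecksum is set.
func readValue(lsmKey []byte, e vlogEntry, verifyChecksum bool) ([]byte, error) {
	if !bytes.Equal(lsmKey, e.Key) {
		return nil, ErrKeyMismatch
	}
	if verifyChecksum && entryCRC(e.Key, e.Value) != e.CRC {
		return nil, ErrChecksumMismatch
	}
	return e.Value, nil
}
```

With verifyChecksum off, a flipped bit inside a large value passes the key compare unnoticed, which is exactly the gap that the scheduled scrub has to cover at the Typical level.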

SyncWrite is optional because users may want to balance safety against performance. Without it data may be lost, but if that happens, we will notice it.

Full:

Sync is not as heavy as we might think on NVMe SSDs, because the write returns as soon as the data reaches the persistent cache,
and the cache is fast. It would be ridiculous to want the highest protection and unsafe writes at the same time.

Recovery

All reconstruction work is done under checksum protection: verify the data first, then write it down.
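A small sketch of the "verify first, then write" rule, assuming each replica copy arrives with its own CRC-32C. The proposal does not specify where the expected checksum comes from, so that is my assumption, and replicaCopy, newReplicaCopy and repairEntry are hypothetical names.

```go
package main

import (
	"errors"
	"hash/crc32"
)

// ErrCorruptCopy is returned when no replica copy passes verification.
var ErrCorruptCopy = errors.New("no replica copy passed checksum verification")

var crcTable = crc32.MakeTable(crc32.Castagnoli)

// replicaCopy is an entry fetched from another replica together with
// the checksum that replica recorded for it.
type replicaCopy struct {
	Data []byte
	CRC  uint32
}

// newReplicaCopy builds a copy with a matching checksum.
func newReplicaCopy(data []byte) replicaCopy {
	return replicaCopy{Data: data, CRC: crc32.Checksum(data, crcTable)}
}

// repairEntry writes down the first copy whose checksum verifies.
// Copies that fail verification are skipped and never written.
func repairEntry(copies []replicaCopy, write func([]byte) error) error {
	for _, c := range copies {
		if crc32.Checksum(c.Data, crcTable) != c.CRC {
			continue // corrupt copy: try the next replica
		}
		return write(c.Data)
	}
	return ErrCorruptCopy
}
```

The same function would serve both fast recovery (one entry) and slow recovery (called once per broken entry).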

Fast Recovery

Recover the single broken entry. It saves time.

Slow Recovery

Recover more than one entry.

Monitor

Errors in switches, errors in ECC memory, errors in S.M.A.R.T., etc.

Replace a device with a new one if it reports many errors, even if it could still work.

This job should be part of normal operations; if Badger did it too, it would be redundant.

Scrub

It could be done during the GC or init process.

Test/Try

There is a PR that shows how it works.

diggy · January 6, 2020, 9:13am

jarifibrahim commented :

Hey @templexxx Thank you so much for writing such a detailed proposal. I don’t have a lot of knowledge about NVMe driver or other hardware in general. I’ll try to read more about this and get back to you.

Thank you once again

diggy · June 11, 2020, 3:58pm

eloff commented :

It’s 6 months later, what’s the status on this? It seems like an important proposal that shouldn’t get lost in the shuffle.

diggy · June 11, 2020, 4:06pm

jarifibrahim commented :

Hey @eloff, this is definitely an important issue but we do not have the capacity to take this up right now. The issue has been marked as accepted, which means we will work on it.
