Moved from GitHub badger/1180
Posted by templexxx:
Proposal: Data integrity check (Silent data corruption detection)
This proposal describes an approach to implementing data integrity checks on top of NVMe drives, with low overhead and no breaking changes.
Silent data corruption does damage data safety
We usually think the hardware is reliable because there are checksums everywhere,
but in rare situations these protections fail to detect errors, which may cause serious problems, e.g.:
Drive vendors do care about silent data corruption
In enterprise SAS drives, there is a technique called “T10 Data Integrity Field” which makes a fat sector like this:
NVMe End-to-end Data Protection (see the NVMe spec) has almost the same thing:
Actually, it’s hard to see bit flips in an SSD’s flash media and its DRAM, but an SSD’s controller contains a lot of SRAM,
where bit flips may occur much more frequently.
Here is a good article about what Intel has done
to deal with SDC (Silent Data Corruption)
(PS: I could not enable E2E protection on an AWS bare metal instance. Maybe it just doesn’t follow the spec but still has protection; I don’t know…)
PCIe has CRC
PCIe has LCRC (Link CRC) and ECRC (End-to-end CRC). ECRC is optional; if there is no switch between the endpoints, there is no need to enable it.
Memory has ECC, disk sectors have ECC, and LDPC is now widely used in SSDs, where it provides better error-correction capability.
The weakest part is outside of the host
The TCP checksum is very weak (see When The CRC and TCP Checksum Disagree, Jonathan Stone and Craig Partridge),
and each switch recalculates the checksum, which means we can’t detect errors introduced inside the switch.
What’s worse, the data provider may be a personal computer, which is not as reliable as server-side hardware.
If the client cannot provide a checksum, E2E protection loses most of its meaning.
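To make the “very weak” claim concrete, here is a small sketch that compares a TCP-style 16-bit ones’ complement checksum with CRC-32 on a payload whose first two 16-bit words were swapped. The helper names (`internetChecksum`, `swapWords`, `crcOf`) and the payload are made up for this illustration and are not taken from Badger or the cited paper.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// internetChecksum computes a 16-bit ones' complement checksum in the style
// used by TCP. Odd-length input is padded with a trailing zero byte.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// swapWords returns a copy of data with the 16-bit words at word indexes
// i and j exchanged (hypothetical helper that simulates reordering).
func swapWords(data []byte, i, j int) []byte {
	out := append([]byte(nil), data...)
	out[2*i], out[2*j] = out[2*j], out[2*i]
	out[2*i+1], out[2*j+1] = out[2*j+1], out[2*i+1]
	return out
}

// crcOf returns the CRC-32 (IEEE polynomial) of data.
func crcOf(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

func main() {
	original := []byte("key1val1key2val2")
	corrupted := swapWords(original, 0, 1) // "ke" and "y1" change places

	fmt.Println("payload changed:       ", string(original) != string(corrupted))
	fmt.Println("TCP-style sum unchanged:", internetChecksum(original) == internetChecksum(corrupted))
	fmt.Println("CRC-32 unchanged:       ", crcOf(original) == crcOf(corrupted))
}
```

Because the ones’ complement sum is order-independent, reordering 16-bit words goes unnoticed, while CRC-32 catches this 32-bit-wide change. That is one reason a client-supplied CRC (or stronger digest) matters for real end-to-end protection.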
Veteran database vendors have provided SDC protections
Most open source databases are “weak”
They do not tolerate silent faults particularly well (see Impact of Disk Corruption on Open-Source DBMS)
No way to provide 100% protection
Any digest has a nonzero failure rate, and any bit can flip at any time.
First, it’s important to show how silent data corruption happens, and what we should do whether we care about it or not.
There are three things that must be mentioned in this doc:
End-to-end protection needs client-side action.
Times have changed: ordinary x86 servers now have reliable hardware too. In the past, only customized solutions could offer that.
The theory behind our solutions:
- Logically, I/O is about direction and read/write, so in theory we can get “100%” protection (in practice we can’t).
- Types of data corruption: besides corrupted reads, misdirected writes, and lost writes, which we already know,
on SSDs data corruption may also happen in FTL metadata, erase operations, etc.
- The principle: See below.
Based on NVMe drives, deployed on a high-availability distributed system
(e.g. a system built on a consensus algorithm). This provides strong hardware protection and reliable copies to repair data from.
If we can raise an error, that is good protection. Ordinary data corruption is easy to find
(e.g. can’t init, not found). Silent data corruption protection is about “mismatch”:
even if we can’t find a key which is supposed to be there (because of a data integrity issue),
it’s okay, since we can repair it from copies or by replaying the log. So in this design, I simply ignore many types of SDC in the key’s LSM tree.
There will be three levels: None, Typical, Full
None
Nothing to do, best performance. It’s safe enough under the protection of hardware & protocols in most cases.
Typical
- Compare the key in the vlog on read; this is the online check.
- Scheduled scrub (details below); this is the offline check.
- Sync write is optional.
Full
- Compare the key in the vlog on read.
- Verify the checksum on every read.
- Sync write is enabled automatically.
- Scheduled scrub.
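One way to picture the three levels is as a small policy table. This is only a sketch: `Level`, `Policy`, and `PolicyFor` are hypothetical names and not part of Badger’s options, and treating sync writes under None as the user’s own choice is my assumption.

```go
package main

import "fmt"

// Level is a hypothetical integrity setting; Badger has no such option today.
type Level int

const (
	LevelNone Level = iota
	LevelTypical
	LevelFull
)

// Policy lists which checks a level switches on, following the lists above.
type Policy struct {
	CompareKeyOnRead     bool // online check: compare the vlog key with the LSM key
	VerifyChecksumOnRead bool // verify the entry checksum on every read
	ScheduledScrub       bool // offline check
	SyncWrites           bool
}

// PolicyFor maps a level to its checks. userSync is the caller's own sync
// preference; it is honored for None and Typical and overridden by Full.
func PolicyFor(l Level, userSync bool) Policy {
	switch l {
	case LevelTypical:
		return Policy{CompareKeyOnRead: true, ScheduledScrub: true, SyncWrites: userSync}
	case LevelFull:
		return Policy{
			CompareKeyOnRead:     true,
			VerifyChecksumOnRead: true,
			ScheduledScrub:       true,
			SyncWrites:           true, // forced on at the highest level
		}
	default:
		// None: rely on hardware and protocol protection only.
		return Policy{SyncWrites: userSync}
	}
}

func main() {
	for _, l := range []Level{LevelNone, LevelTypical, LevelFull} {
		fmt.Printf("level=%d policy=%+v\n", l, PolicyFor(l, false))
	}
}
```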
Effect of different levels:
NVMe has tags (application & reference). As we know, misdirected writes usually happen inside the drive’s firmware,
so with the help of tags they shouldn’t happen. And with the protection provided by ECC & CRC, SDC is rare. That is safe enough for most cases.
(A team at Alibaba Group said they found an ext4 bug which would cause misdirected writes
and made their MySQL lose data, but they didn’t provide a link or details, and I haven’t found it.)
The WiscKey design has a good side effect: it gives us an extra piece of check information, the key.
There are two I/Os (key & value) at different positions, which helps to detect write issues, because we can treat the key as a reference
(it’s almost impossible to find old data at the same position in the vlog with the same key, because the key carries a ts).
It helps to catch misdirected writes in the application or filesystem layer.
Only if the writes to both the LSM tree and the vlog are lost simultaneously will such a scheme fail,
an unlikely (but unfortunately, possible!) situation. (see Operating Systems: Three Easy Pieces)
But an entry may be big, so we may have the right key but wrong data.
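The read path described here could look roughly like the sketch below: compare the key stored next to the value with the key that came from the LSM tree, and at the Full level also verify a checksum over the whole entry to cover the “right key, wrong data” case. The `vlogEntry` layout, the error names, and the use of CRC-32C are assumptions made for illustration; they are not Badger’s actual on-disk format or API.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"hash/crc32"
)

var (
	ErrKeyMismatch      = errors.New("vlog entry key does not match LSM key")
	ErrChecksumMismatch = errors.New("vlog entry checksum mismatch")
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// vlogEntry is a simplified, hypothetical value-log record.
type vlogEntry struct {
	Key   []byte // includes the version/timestamp suffix
	Value []byte
	Sum   uint32 // CRC-32C over Key followed by Value
}

// entrySum computes the checksum stored alongside an entry.
func entrySum(key, value []byte) uint32 {
	h := crc32.New(castagnoli)
	h.Write(key)
	h.Write(value)
	return h.Sum32()
}

// newEntry builds an entry with a matching checksum.
func newEntry(key, value []byte) vlogEntry {
	return vlogEntry{Key: key, Value: value, Sum: entrySum(key, value)}
}

// verifyRead checks an entry fetched from the vlog against the key found in
// the LSM tree. The checksum is verified only when full is true.
func verifyRead(lsmKey []byte, e vlogEntry, full bool) error {
	if !bytes.Equal(lsmKey, e.Key) {
		return ErrKeyMismatch // e.g. misdirected or lost write
	}
	if full && entrySum(e.Key, e.Value) != e.Sum {
		return ErrChecksumMismatch // right key, wrong data
	}
	return nil
}

func main() {
	key := []byte("user42@ts=7")
	e := newEntry(key, []byte("some large value"))
	fmt.Println("clean read:        ", verifyRead(key, e, true))

	e.Value[3] ^= 0x10 // simulate a silent bit flip in the value
	fmt.Println("typical, bit flip: ", verifyRead(key, e, false))
	fmt.Println("full, bit flip:    ", verifyRead(key, e, true))

	fmt.Println("stale entry:       ", verifyRead([]byte("user42@ts=8"), e, false))
}
```

The key comparison is cheap because the key is read together with the value anyway; the checksum pass touches the whole entry, which is why it is reserved for the Full level in this sketch.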
SyncWrite is optional because users may want to balance safety and performance. Without it data may be lost, but if that happens, we will notice it.
Sync is not as heavy as we thought on NVMe SSDs, because the call returns as soon as the data reaches the persistent cache,
and that cache is fast. It would be ridiculous to want the highest protection and unsafe writes at the same time.
All reconstruction work is done under checksum protection: verify the data first, then write it down.
Recover the single broken entry. It saves time.
Recover more than one entry.
Errors in switches, errors in ECC memory, errors reported by S.M.A.R.T., etc.
If there are many errors, replace the device with a new one, even if it could still work.
This job should be done as part of normal operations; if Badger does it too, it may be redundant.
It could be done in GC/init process.
There is a PR that shows how it works.