What version of Go are you using (`go version`)?
go version go1.15.14 linux/amd64
What operating system are you using?
linux
What version of Badger are you using?
7677fcb1e0c2 (the version dgraph v21.03.2 uses) → master
Does this issue reproduce with the latest master?
yes
Steps to Reproduce the issue
The reproduction program is simple; it just writes a bunch of data to Badger:

```go
package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	kv, _ := badger.Open(badger.DefaultOptions("/some/directory"))
	wb := kv.NewWriteBatch()
	for i := 0; i < 1000; i++ {
		for j := 0; j < 10000; j++ {
			wb.Set([]byte(fmt.Sprintf("wakkawakka%d%d", i, j)), []byte(fmt.Sprintf("value%d", j)))
		}
	}
	wb.Flush()
}
```
However, reproducing the error requires a syscall failure in the `os.File.Sync()` call on the last line of `manifestFile.addChanges()`. A single error from that `Sync()` call puts the server in a permanent error state that it cannot recover from. To test this I patched a temporary error into `manifest.go`:
```diff
diff --git a/manifest.go b/manifest.go
index 7520456..98fa63a 100644
--- a/manifest.go
+++ b/manifest.go
@@ -239,9 +240,20 @@ func (mf *manifestFile) addChanges(changesParam []*pb.ManifestChange) error {
 	}
 	mf.appendLock.Unlock()
+	{
+		// Inject a one-shot failure, but only when called from the compactor.
+		c := make([]uintptr, 10)
+		runtime.Callers(2, c)
+		f, _ := runtime.CallersFrames(c).Next()
+		if f.Function == "github.com/dgraph-io/badger/v3.(*levelsController).runCompactDef" && !oopsed {
+			oopsed = true
+			return errors.New("WONKY")
+		}
+	}
 	return mf.fp.Sync()
 }
+var oopsed = false
```
Yeah, it's… very specific. But you get the idea.
I do not understand why that file is being closed in the first place, but I have plenty of evidence from many dgraph v21.03.2 corruptions that this really does happen.
My theory: `manifestFile.rewrite()` closes the manifest file, rewrites it, and opens the new file, and that is what causes the error to surface temporarily. Note in the Go source that `ErrFileClosing` is rewritten to `ErrClosed`; this could be what is bubbling up (this `wrapErr` gets called in the POSIX implementation of `Sync`/`Fsync`).
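For what it's worth, the `file already closed` text is easy to reproduce in isolation. A minimal standalone sketch (my own illustration, not Badger code) of what `Sync()` returns once the `*os.File` has been closed:

```go
package main

import (
	"errors"
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	// ioutil.TempFile, since the report is on go1.15 (os.CreateTemp is 1.16+).
	f, err := ioutil.TempFile("", "MANIFEST")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	f.Close() // stand-in for rewrite() closing the file under a concurrent Sync
	err = f.Sync()
	fmt.Println(err)                          // sync /tmp/MANIFEST...: file already closed
	fmt.Println(errors.Is(err, os.ErrClosed)) // true
}
```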
The answer is probably to run the in-memory alteration (`applyChangeSet()`) only after a successful `Sync()`. Obviously you guys want these two things to happen atomically… but I don't think we can do that.
What Badger options were set?
`badger.DefaultOptions("/some/disk/location")`
What did you do?
Patched `manifest.go` to run `Sync()` before applying the in-memory changes:
```diff
diff --git a/manifest.go b/manifest.go
index 7520456..3fbc30d 100644
--- a/manifest.go
+++ b/manifest.go
@@ -212,10 +213,6 @@ func (mf *manifestFile) addChanges(changesParam []*pb.ManifestChange) error {
 	}
 	// Maybe we could use O_APPEND instead (on certain file systems)
 	mf.appendLock.Lock()
-	if err := applyChangeSet(&mf.manifest, &changes); err != nil {
-		mf.appendLock.Unlock()
-		return err
-	}
 	if mf.inMemory {
 		mf.appendLock.Unlock()
 		return nil
@@ -239,9 +236,14 @@ func (mf *manifestFile) addChanges(changesParam []*pb.ManifestChange) error {
 	}
 	mf.appendLock.Unlock()
-	return mf.fp.Sync()
+	if err := mf.fp.Sync(); err != nil {
+		return err
+	}
+	return applyChangeSet(&mf.manifest, &changes)
 }
```
What did you expect to see?
During other calls to `addChanges()`, the error is retried:

```
ERROR: Failure while flushing memtable to disk: sync p/MANIFEST: file already closed. Retrying...
```
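For illustration only, a minimal sketch of that retry pattern (the `withRetry` helper is hypothetical, not Badger's actual flush code): a transient sync error gets logged with `Retrying...` and re-attempted, instead of becoming a permanent failure:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// withRetry logs a transient error and tries the operation again,
// rather than treating the first failure as fatal.
func withRetry(op func() error) {
	for {
		if err := op(); err != nil {
			log.Printf("ERROR: Failure while flushing memtable to disk: %v. Retrying...", err)
			time.Sleep(time.Second)
			continue
		}
		return
	}
}

func main() {
	failures := 1 // fail the first attempt, succeed on the second
	withRetry(func() error {
		if failures > 0 {
			failures--
			return errors.New("sync p/MANIFEST: file already closed")
		}
		return nil
	})
}
```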
What did you see instead?
In the `doCompact()` call, it does not retry; the permanent failure corrupts the manifest for all time:

```
WARNING: [Compactor: 2] LOG Compact FAILED with error: sync p/MANIFEST: file already closed
WARNING: While running doCompact: sync p/MANIFEST: file already closed
WARNING: [Compactor: 2] LOG Compact FAILED with error: MANIFEST removes non-existing table 1....
WARNING: While running doCompact: MANIFEST removes non-existing table 1....
```