Upgrade from 1.0.4 to 1.0.6 crash and data loss

Hello gents,

Writing to report a major issue we’ve encountered in production when upgrading to 1.0.6. Firstly, we shouldn’t have pushed the fresh release so quickly (lesson learned). Second, when we did upgrade we refreshed the db with a brand new install and imported our old records following instructions accordingly.

Yesterday we save a major system failure and when we attempted reverting to a previous backup, we noticed that all our backup files since Wednesday (the cutover date) were corrupted or missing sets of triples.

Here’s our internal record of the event:

What went wrong?

dgraph database crashed permanently around July 8 at 1am EST.

When was it discovered?

July 9 around 10am.

What was done to resolve the issue?

Reinstall dgraph v1.0.4 from backups

What was the root cause?

dgraph v1.0.6 seems to have a fatal crash where it corrupts data and refuses to restart.
We installed v1.0.6 during the AWS transfer believing that because it’s a minor patch update (1.0.4 to 1.0.6) it will as stable but with important bug fixes and stability improvements, but found that that is not the case.

What is a long-term resolution for issues like these?

As dgraph has proved to be unstable, we need to create an early warning notification to admins if the database crashes again.

Ensure that new versions (even patch) of dgraph db are tested thoroughly before making it into prod.

Side effects

Because the backups were corrupted, the earliest db backup that we could use was one from July 4. Any data added to the database between July 4 and July 9 has been lost.

I’ll post the raw logs shortly.

David

This is what we’re seeing currently when we reboot the instance:

Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: panic: runtime error: slice bounds out of range
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: goroutine 2748140 [running]:
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Arena).getKey(...)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/arena.go:118
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*node).key(...)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/skl.go:143
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Skiplist).findNear(0xc4db93f2a0, 0xc4c02bb710, 0x30, 0x30, 0xc4dc810100, 0xc42ac53601, 0xc42ac53650)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/skl.go:209 +0x311
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Skiplist).Get(0xc4db93f2a0, 0xc4c02bb710, 0x30, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/skl.go:376 +0x89
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*DB).get(0xc420375880, 0xc4c02bb710, 0x30, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/db.go:501 +0x111
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Item).yieldItemValue(0xc4a13a2000, 0x0, 0xc470576400, 0xc420172138, 0x7fd5c68f0458, 0x0, 0x40)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/iterator.go:178 +0x3c4
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Item).Value(0xc4a13a2000, 0xc4c02cd640, 0x19, 0xc4c02cd601, 0x0, 0x0)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/iterator.go:103 +0x63
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/worker.export(0x12a177d, 0x6, 0xf81f, 0x0, 0x0)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/worker/export.go:322 +0xeb4
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/worker.handleExportForGroup(0x137e960, 0xc420038028, 0xc4963054e0, 0xc4963054e0)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/worker/export.go:407 +0x208
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/worker.handleExportForGroupOverNetwork(0x137e960, 0xc420038028, 0xc4963054e0, 0xc43ea04a00)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/worker/export.go:368 +0x356
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: github.com/dgraph-io/dgraph/worker.ExportOverNetwork.func1(0xf81f, 0xc42c81cea0, 0x137e960, 0xc420038028, 0xc400000001)
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/worker/export.go:476 +0x79
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: created by github.com/dgraph-io/dgraph/worker.ExportOverNetwork
Jul  8 05:00:06 prod-gametv-db dgraphserver[30173]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/worker/export.go:470 +0x2ba
Jul  8 05:00:17 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:17 server.go:116: Setting Badger option: default
Jul  8 05:00:17 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:17 server.go:131: Setting Badger table load option: mmap
Jul  8 05:00:17 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:17 server.go:143: Setting Badger value log load option: mmap
Jul  8 05:00:17 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:17 server.go:153: Opening postings Badger DB with options: {Dir:p ValueDir:p SyncWrites:true TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:32 NumMemtables:5 NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741824 ValueLogMaxEntries:1000000 NumCompactors:3 managedTxns:false DoNotCompact:false maxBatchCount:0 maxBatchSize:0 ReadOnly:false Truncate:false}
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 gRPC server started.  Listening on port 9080
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 HTTP server started.  Listening on port 8080
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 groups.go:77: Current Raft Id: 1
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 worker.go:89: Worker listening at address: [::]:7080
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 pool.go:108: == CONNECT ==> Setting localhost:5080
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 groups.go:104: Connected to group zero. Assigned group: 0
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 draft.go:150: Node ID: 1 with GroupID: 1
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 node.go:224: Found Snapshot, Metadata: {ConfState:{Nodes:[] Learners:[] XXX_unrecognized:[]} Index:8661 Term:5 XXX_unrecognized:[]}
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 node.go:235: Found hardstate: {Term:5 Vote:1 Commit:8671 XXX_unrecognized:[]}
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 node.go:244: Group 1 found 11 entries
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 draft.go:1018: Restarting node for group: 1
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 raft.go:639: INFO: 1 became follower at term 5
Jul  8 05:00:19 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:19 raft.go:354: INFO: newRaft 1 [peers: [], term: 5, commit: 8671, applied: 8661, lastindex: 8671, lastterm: 5]
Jul  8 05:00:21 prod-gametv-db dgraphserver[13934]: 2018/07/08 05:00:21 groups.go:494: Got address of a Zero server: localhost:5080
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: panic: runtime error: slice bounds out of range
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: goroutine 22707 [running]:
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Arena).getKey(...)
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/arena.go:118
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*node).key(...)
Jul  8 05:00:23 prod-gametv-db dgraphserver[13934]: #011/ext-go/1/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/skl.go:143

Hey @ddibiase,

What’s the crash you’re seeing? And how was the data lost? Can you explain, and create some issues for us to follow through?

Can you add more logs and details? What sort of data corruption is happening?

I can add a little more context. I’m on @ddibiase’s team @mrjn.

The data corruption was misreported by us in that we didn’t get to the data corruption part :slight_smile: Apologies for that.

When we saw the crash, we tried to do what we could to recover the database. Restarting zero, restarting server. Nothing worked. We went backwards to v1.0.4 and tried to restart with the same data directory. That clearly didn’t work. We got dgraphserver[12244]: Unable to read from value log: {Fid:0 Len:90 Offset:2650496}

So we decided to start from scratch by going back to our last working version which was v1.0.4 and recovering from backups. Unfortunately our backups didn’t run before v1.0.6 crashed and so we lost some data as well.

Did you try to re-run the export? Or, did you give up immediately after seeing the crash?

I tried to talk to the server but it was unreachable. And the logs were showing the crash loop above.

Did you try to restart and run export again?

Yes. Every time it restarted it went into the crashloop immediately.

Would you have the store directory still around by any chance?

We have it still. Is there a way for me to send it to you outside the forum?

Yes. Can you can tar, zip and upload it somewhere and DM me with the link, please?

Thanks for sharing the directory. Due to this, I was able to reproduce and fix a bug in Badger, which had been evasive until now.

I’ve imported this bug fix into Dgraph master as well. I plan to cut a v1.0.7-rc soon, so this fix would be part of that.

2 Likes

We encountered an instability with v1.0.6 again on our staging server though.

The database just stopped being responsive. With client’s getting “context exceeded” messages.

When investigating further we found nothing obviously wrong in the logs of either zero or server. When we issued a restart on server, it would come up seemingly OK but still “deadline exceeded” messages on client.

When restarting server it started spitting out this message in a loop:

Jul 12 13:00:17 dev-gametv-db dgraphserver[29455]: 2018/07/12 13:00:17 groups.go:721: WARNING: We don’t have address of any dgraphzero leader.

Jul 12 13:00:18 dev-gametv-db dgraphserver[29455]: 2018/07/12 13:00:18 groups.go:510: Unable to sync memberships. Error: rpc error: code = Unknown desc = context deadline exceeded

Jul 12 13:00:18 dev-gametv-db dgraphserver[29455]: 2018/07/12 13:00:18 groups.go:494: Got address of a Zero server: localhost:5080
After restarting zero, server didn’t change from above, while zero started spitting out this in a loop:

Jul 12 13:01:19 dev-gametv-db dgraphzero[29384]: 2018/07/12 13:01:19 raft.go:1070: INFO: 1 no leader at term 4; dropping index reading msg

Seems to be related to raft, which happens temporarily, depending upon the state of the servers. But, did it converge to a stable cluster later?

@mrjn it did not. It was running on a single dgraph server node.

Hard to say without having the full logs.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.