Corrupt database - unable to restart Dgraph

What version of Dgraph are you using?

standalone:v20.03.0

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

Two platforms - AWS large instance, Linux, 16GiB RAM; macOS, 32GiB RAM

Steps to reproduce the issue (command/config used to run Dgraph).

Attempt to start up Dgraph. It never fully comes up, so it is impossible to access anything. If I delete the data directories entirely, restart, then import the data via live loader, it works. This is the third time this has happened over the past couple of months. Luckily, I’m backing up regularly, but each time I have to completely kill the data directory and rebuild.

Expected behaviour and actual result.

Expected: Dgraph would start up and serve data.
Actual: Dead DB, and no apparent way to recover from it.

Hey @mikehawkes, what do you actually mean by corrupt DB? Can you share some logs? Is there a panic? Are the queries not working?

The logs would show why your DB cannot start.

Nothing works - I have a series of panics. At that point I can’t connect to the DB. As this was a production system, I concentrated on getting it working - luckily I had a DB export and that allowed me to recover. I kept a copy of the dead DB on the machine - if I zip the folder structure in its entirety and pull to a test server, can I attempt a restart locally, or can it only run on the production instance as that’s where it was created?

@mikehawkes I understand that you saw multiple panics but can you please share the stack trace and logs so that we can help you?

The data directory can be zipped and started locally. Be sure to set the same options you’ve set on your production cluster.
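
As a rough illustration of the zip-and-move step, here is a minimal Python sketch. The directory names `p` and `w` come from the PostingDir and WALDir values in the startup log further down; the `dgraph` and `restored` folder names, the placeholder files, and both helper functions are hypothetical. It assumes Dgraph is stopped while the copy is taken, so the files are not changing underneath the archive.

```python
import shutil
import tempfile
from pathlib import Path


def archive_data_dir(data_dir, archive_base):
    """Zip the whole data directory; returns the path of the created .zip file."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(data_dir))


def restore_data_dir(archive_path, target_dir):
    """Unpack a previously created archive into target_dir."""
    shutil.unpack_archive(str(archive_path), str(target_dir))


# Demonstration on throwaway files only; these are NOT real Badger files.
workspace = Path(tempfile.mkdtemp())
source = workspace / "dgraph"          # hypothetical data directory
for name in ("p", "w"):                # postings and write-ahead log directories
    (source / name).mkdir(parents=True)
    (source / name / "placeholder").write_text("not real Badger data")

archive = archive_data_dir(source, workspace / "dgraph-copy")
restored = workspace / "restored"      # hypothetical location on the test server
restore_data_dir(archive, restored)
print(sorted(path.name for path in restored.iterdir()))
```

After unpacking, Dgraph would be started against the restored directory with the same options as on the production machine, as noted above.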

Warning: This standalone version is meant for quickstart purposes only.
         It is NOT RECOMMENDED for production environments.
2020/08/10 12:48:57 Listening on :8000...
[Decoder]: Using assembly version of decoder
[Decoder]: Using assembly version of decoder
[Sentry] 2020/08/10 12:48:57 Integration installed: ContextifyFrames
[Sentry] 2020/08/10 12:48:57 Integration installed: Environment
[Sentry] 2020/08/10 12:48:57 Integration installed: Modules
[Sentry] 2020/08/10 12:48:57 Integration installed: ContextifyFrames
[Sentry] 2020/08/10 12:48:57 Integration installed: IgnoreErrors
[Sentry] 2020/08/10 12:48:57 Integration installed: Environment
[Sentry] 2020/08/10 12:48:57 Integration installed: Modules
[Sentry] 2020/08/10 12:48:57 Integration installed: IgnoreErrors
[Decoder]: Using assembly version of decoder
[Decoder]: Using assembly version of decoder
[Sentry] 2020/08/10 12:48:57 Integration installed: ContextifyFrames
[Sentry] 2020/08/10 12:48:57 Integration installed: Environment
[Sentry] 2020/08/10 12:48:57 Integration installed: Modules
[Sentry] 2020/08/10 12:48:57 Integration installed: IgnoreErrors
[Sentry] 2020/08/10 12:48:57 Integration installed: ContextifyFrames
[Sentry] 2020/08/10 12:48:57 Integration installed: Environment
[Sentry] 2020/08/10 12:48:57 Integration installed: Modules
[Sentry] 2020/08/10 12:48:57 Integration installed: IgnoreErrors
I0810 12:48:58.023709      37 init.go:99] 

Dgraph version   : v20.03.0
Dgraph SHA-256   : 07e63901be984bd20a3505a2ee5840bb8fc4f72cc7749c485f9f77db15b9b75a
Commit SHA-1     : 147c8df9
Commit timestamp : 2020-03-30 17:28:31 -0700
Branch           : HEAD
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0810 12:48:58.024116      37 run.go:606] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I0810 12:48:58.024189      37 run.go:607] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr: ZeroAddr:localhost:5080 RaftId:0 WhiteListedIPRanges:[{Lower:0.0.0.0 Upper:255.255.255.255}] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-08-10 12:48:57.553117616 +0000 UTC m=+0.038242720 LudicrousMode:false}
I0810 12:48:58.024254      37 run.go:608] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: WALDir:w MutationsMode:0 AuthToken: AllottedMemory:2655 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
I0810 12:48:58.027128      37 server_state.go:74] Setting Badger table load option: mmap
I0810 12:48:58.027176      37 server_state.go:86] Setting Badger value log load option: mmap
I0810 12:48:58.027188      37 server_state.go:131] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x260a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0810 12:48:58.030487      38 init.go:99] 

Dgraph version   : v20.03.0
Dgraph SHA-256   : 07e63901be984bd20a3505a2ee5840bb8fc4f72cc7749c485f9f77db15b9b75a
Commit SHA-1     : 147c8df9
Commit timestamp : 2020-03-30 17:28:31 -0700
Branch           : HEAD
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


I0810 12:48:58.031007      38 run.go:105] Setting up grpc listener at: 0.0.0.0:5080
I0810 12:48:58.031803      38 run.go:105] Setting up http listener at: 0.0.0.0:6080
badger 2020/08/10 12:48:58 INFO: All 9 tables opened in 415ms
badger 2020/08/10 12:48:58 INFO: Replaying file id: 92 at offset: 1650826
badger 2020/08/10 12:48:58 INFO: Replay took: 1.732093ms
badger 2020/08/10 12:48:58 DEBUG: Value log discard stats empty
I0810 12:48:58.618879      38 node.go:145] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00008e420 Applied:25527962 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x260a270 DisableProposalForwarding:false}
I0810 12:48:58.619667      38 node.go:303] Found Snapshot.Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:25527962 Term:71 XXX_unrecognized:[]}
I0810 12:48:58.619737      38 node.go:314] Found hardstate: {Term:90 Vote:1 Commit:25530577 XXX_unrecognized:[]}
I0810 12:48:59.659888      37 log.go:34] All 25 tables opened in 1.499s
I0810 12:48:59.770163      37 log.go:34] Replaying file id: 23217 at offset: 1533714
I0810 12:48:59.773631      37 log.go:34] Replay took: 3.416105ms
I0810 12:48:59.779510      37 log.go:34] Replaying file id: 23218 at offset: 0
I0810 12:48:59.815781      37 log.go:34] Replay took: 36.214153ms
I0810 12:48:59.818285      37 server_state.go:74] Setting Badger table load option: mmap
I0810 12:48:59.818323      37 server_state.go:86] Setting Badger value log load option: mmap
I0810 12:48:59.818331      37 server_state.go:154] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x260a270 Compression:0 EventLogging:true InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:true NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:1 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0810 12:49:02.773076      37 log.go:34] All 109 tables opened in 2.58s
I0810 12:49:02.970803      37 log.go:34] Replaying file id: 817 at offset: 39249648
I0810 12:49:02.974310      37 log.go:34] Replay took: 3.440061ms
I0810 12:49:02.977836      37 groups.go:104] Current Raft Id: 0x1
I0810 12:49:02.978021      37 worker.go:96] Worker listening at address: [::]:7080
I0810 12:49:02.978687      37 run.go:477] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0810 12:49:02.978700      37 run.go:478] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0810 12:49:02.978720      37 run.go:509] gRPC server started.  Listening on port 9080
I0810 12:49:02.978728      37 run.go:510] HTTP server started.  Listening on port 8080
I0810 12:49:03.079110      37 pool.go:160] CONNECTING to localhost:5080
I0810 12:49:03.082585      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:06.547078      38 node.go:323] Group 0 found 2616 entries
I0810 12:49:06.547149      38 raft.go:447] Restarting node for dgraphzero
I0810 12:49:06.547261      38 node.go:182] Setting conf state to nodes:1 
I0810 12:49:06.547673      38 pool.go:160] CONNECTING to server:7080
W0810 12:49:06.552291      38 pool.go:254] Connection lost with server:7080. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup server on 192.168.65.1:53: no such host"
I0810 12:49:07.979662      37 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0810 12:49:07.979724      37 admin.go:510] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.
I0810 12:49:11.659915      38 log.go:34] 1 became follower at term 90
I0810 12:49:11.660099      38 log.go:34] newRaft 1 [peers: [1], term: 90, commit: 25530577, applied: 25527962, lastindex: 25530577, lastterm: 90]
I0810 12:49:11.660563      38 run.go:296] Running Dgraph Zero...
I0810 12:49:11.662659      38 log.go:34] 1 no leader at term 90; dropping index reading msg
I0810 12:49:11.665007      38 oracle.go:107] Purged below ts:79063523, len(o.commits):455, len(o.rowCommit):0
I0810 12:49:12.980857      37 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0810 12:49:12.980914      37 admin.go:510] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.
I0810 12:49:13.080569      38 zero.go:426] Connected: cluster_info_only:true 
I0810 12:49:13.281021      38 zero.go:417] Got connection request: cluster_info_only:true 
W0810 12:49:13.661028      38 node.go:671] [0x1] Read index context timed out
I0810 12:49:13.661114      38 log.go:34] 1 no leader at term 90; dropping index reading msg
I0810 12:49:15.360951      38 log.go:34] 1 is starting a new election at term 90
I0810 12:49:15.360997      38 log.go:34] 1 became pre-candidate at term 90
I0810 12:49:15.361004      38 log.go:34] 1 received MsgPreVoteResp from 1 at term 90
I0810 12:49:15.361063      38 log.go:34] 1 became candidate at term 91
I0810 12:49:15.361070      38 log.go:34] 1 received MsgVoteResp from 1 at term 91
I0810 12:49:15.361155      38 log.go:34] 1 became leader at term 91
I0810 12:49:15.361198      38 log.go:34] raft.node: 1 elected leader 1 at term 91
I0810 12:49:15.361320      38 raft.go:667] I've become the leader, updating leases.
I0810 12:49:15.361422      38 assign.go:42] Updated Lease id: 1190194. Txn Ts: 79190001
W0810 12:49:15.618583      38 node.go:671] [0x1] Read index context timed out
I0810 12:49:15.620389      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:15.622306      37 pool.go:160] CONNECTING to zero:5080
W0810 12:49:15.626391      37 pool.go:254] Connection lost with zero:5080. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup zero on 192.168.65.1:53: no such host"
I0810 12:49:15.726947      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:15.727534      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:15.829692      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:15.830362      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:15.932837      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:15.933967      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.036862      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.038064      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.143195      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.144795      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.248905      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.249445      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.352351      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.353352      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.456087      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.456962      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.562051      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.563533      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.667325      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.668630      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.771758      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.773080      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.876603      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.877367      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:16.983784      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:16.984513      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:17.086896      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:17.087731      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:17.191126      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:17.192943      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:17.297615      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:17.298255      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:17.401068      38 zero.go:417] Got connection request: cluster_info_only:true 
I0810 12:49:17.401633      38 zero.go:435] Connected: cluster_info_only:true 
I0810 12:49:17.525199      38 zero.go:417] Got connection request: cluster_info_only:true

I see this in your logs. Can you double check this IP and confirm that it’s reachable from alpha?
If you’re running on AWS or gcloud, your IP will change if you shutdown and restart the machine.

I do not see any panics in the logs you’ve shared.

This is a copy pulled down from AWS and running locally. If I run the Dgraph standalone it works fine; stop; kill the dgraph directory; unzip to that directory; and start up … then the above.

If you’re using the wrong IP:Port for zero, alpha wouldn’t be able to connect to zero and so it wouldn’t start.
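
To see which names failed to resolve, the "no such host" messages in the log above can be pulled out mechanically. The following is a small sketch, not part of Dgraph: the function name and regular expression are my own, and the sample lines are abridged copies of the two `pool.go:254` warnings in the log.

```python
import re

# Matches the resolver error text seen in the log, for example
# "lookup zero on 192.168.65.1:53: no such host".
LOOKUP_FAILURE = re.compile(
    r"lookup (?P<host>\S+) on (?P<resolver>\S+): no such host"
)


def failed_lookups(log_text):
    """Return the sorted, de-duplicated (hostname, resolver) pairs that failed."""
    pairs = set()
    for match in LOOKUP_FAILURE.finditer(log_text):
        pairs.add((match.group("host"), match.group("resolver")))
    return sorted(pairs)


# Abridged from the log above (only the tail of each warning is kept).
sample = (
    'Connection lost with server:7080. Error: dial tcp: '
    'lookup server on 192.168.65.1:53: no such host"\n'
    'Connection lost with zero:5080. Error: dial tcp: '
    'lookup zero on 192.168.65.1:53: no such host"\n'
)

for host, resolver in failed_lookups(sample):
    print(f"{host} did not resolve via {resolver}")
```

In this log the names that fail are `server` and `zero`, which look like hostnames carried over from the original cluster setup rather than addresses that exist on the local test machine.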

This runs for days, then suddenly stops - after that, I have to zap the database and start it again.

It also seems data related - the same shell-script starts the DB with an empty structure and with this populated one. It usually starts without issue … it then runs for some days before dying. After that, I can’t restart it without having to delete the entire structure, import it, and start again.

Also, if I start with a new directory structure, then import from an exported data set, all works … for a few days. Then it collapses again - if I don’t export regularly, once it’s gone, it’s impossible to reopen.

(P.S. On the panic front - I’ll re-run when I can get a quiet moment - Panics did run through the cluster startup - not sure if I’ve seen those on the standalone - both variants fail to start, however, so I’m unable to recover the database).

Looks like I haven’t preserved the logs - the new instance has cleared the old! However, I do have a “core.13” file in the dgraph data directory - is this a core file from a crashed instance, or something else? Can I delete this safely?

Additional information here - I’m wondering if this is actually a memory leak somewhere? I’ve kept an eye on the standalone Docker image and watched the memory footprint creep up as the day progresses - also, on restart, it seems that the startup cleans logs etc.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6037 root      20   0   14.6g   5.7g 381224 S   1.0 72.9 105:02.83 dgraph
 6038 root      20   0 3243608 750744  41360 S   1.0  9.2  28:57.61 dgraph

After restarting the image:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13491 root      20   0 8288716   1.6g 100496 S   0.7 19.9   0:24.46 dgraph
13492 root      20   0 3191772 802640  58116 S   0.3  9.8   0:14.31 dgraph

So I’m wondering if it’s leaking memory somewhere and then crashing out. It still doesn’t allow me to open the DB when it’s died, however. So the original database remains corrupt and unusable.
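
To put numbers on the two `top` snapshots above, here is a small sketch that converts the RES column to KiB and compares before and after the restart. It assumes `top`'s default scaling, where a bare number is KiB and the suffixes `m`, `g` and `t` are binary multiples; the helper name is hypothetical.

```python
# Multipliers to KiB, assuming top's default task-area memory scaling.
UNIT_TO_KIB = {"": 1, "k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def top_field_to_kib(field):
    """Convert a top memory field such as '750744' or '5.7g' to KiB."""
    field = field.strip().lower()
    suffix = field[-1] if field[-1].isalpha() else ""
    number = field[:-1] if suffix else field
    return float(number) * UNIT_TO_KIB[suffix]


# RES of the larger dgraph process, taken from the two snapshots above.
before_kib = top_field_to_kib("5.7g")   # PID 6037, before the restart
after_kib = top_field_to_kib("1.6g")    # PID 13491, shortly after the restart

growth = before_kib / after_kib
print(f"resident memory before restart is about {growth:.1f}x the fresh value")
```

Note that the startup log shows Badger opening its tables and value log with mmap, and resident pages of mapped files also count toward RES, so growth in this column is a hint rather than proof of a leak.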

Please help me with some logs/error/panic so that I can help.

Dgraph doesn’t create a core.13 file. It’s not related to dgraph.

Do you mean to say that the memory usage kept increasing even when dgraph couldn’t start? If so, that sounds off.

It died again overnight - I attach a log file: log.txt (47.5 KB)

Thanks - I wondered if the core.13 file is a docker core image file. I’ll delete it in that case. This morning, I have another core file (core.12) with the same timestamp as when Dgraph died.

-rw------- 1 root root 203927552 Aug 11 23:03 core.12

@mikehawkes I see the following in your logs

I0811 23:02:09.694392      35 draft.go:523] Creating snapshot at index: 8145811. ReadTs: 8963338.
I0811 23:02:10.417434      36 oracle.go:107] Purged below ts:8963338, len(o.commits):6, len(o.rowCommit):154
runtime/cgo: pthread_create failed: Resource temporarily unavailable
W0811 23:04:05.176556      35 groups.go:835] No membership update for 10s. Closing connection to Zero.
E0811 23:04:06.804167      35 groups.go:796] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0811 23:04:06.869526      35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = transport is closing
E0811 23:04:06.869825      35 groups.go:896] Error in oracle delta stream. Error: rpc error: code = Unavailable desc = transport is closing
W0811 23:04:06.870025      35 pool.go:254] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = transport is closing
W0811 23:04:06.870105      35 draft.go:1211] While sending membership to Zero. Error: rpc error: code = Unavailable desc = transport is closing
E0811 23:04:06.889557      35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0811 23:04:07.290211      35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
I0811 23:04:08.057900      35 groups.go:856] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0811 23:04:08.057926      35 groups.go:865] Got Zero leader: localhost:5080
E0811 23:04:08.058262      35 groups.go:877] Error while calling Oracle rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0811 23:04:08.290335      35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
I0811 23:04:09.058456      35 groups.go:856] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0811 23:09:09.290847      35 draft.go:1269] Found 1 old transactions. Acting to abort them.
I0811 23:09:09.290872      35 draft.go:1272] Done abortOldTransactions for 1 txns. Error: No connection exists
github.com/dgraph-io/dgraph/worker.init
	/tmp/go/src/github.com/dgraph-io/dgraph/worker/draft.go:1218
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5414
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.main
	/usr/local/go/src/runtime/proc.go:190
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
I0811 23:10:09.290683      35 draft.go:1269] Found 1 old transactions. Acting to abort them.
I0811 23:10:09.290980      35 draft.go:1272] Done abortOldTransactions for 1 txns. Error: No connection exists
github.com/dgraph-io/dgraph/worker.init
	/tmp/go/src/github.com/dgraph-io/dgraph/worker/draft.go:1218
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5414
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
	/usr/local/go/src/runtime/proc.go:5409
runtime.main
	/usr/local/go/src/runtime/proc.go:190
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373

The message runtime/cgo: pthread_create failed: Resource temporarily unavailable might be the reason for your crashes. I’ve never seen this kind of error message before.
From the logs, it looks like there is a cgo crash, after which raft starts having issues and the node is no longer able to communicate with other nodes.
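
For context, "Resource temporarily unavailable" is the text for EAGAIN, which `pthread_create` returns when the system lacks the resources to create another thread or a limit on the number of threads has been reached. Whether that is what happened here is not established in this thread, but one way to check is to watch the thread count of the running dgraph processes against the configured limits. The sketch below is an assumption-laden example for Linux: the helper names are hypothetical, the sample status text is invented for illustration, and on a real host you would pass the PID of the dgraph process instead of the current one.

```python
import os
import resource


def thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: line found")


def read_thread_count(pid):
    """Read the live thread count of a process on Linux."""
    with open(f"/proc/{pid}/status") as handle:
        return thread_count(handle.read())


# Invented sample in the /proc/<pid>/status layout, for illustration only.
sample_status = "Name:\tdgraph\nState:\tS (sleeping)\nThreads:\t37\n"
print("sample thread count:", thread_count(sample_status))

# Live check, only where /proc is available (Linux).
if os.path.exists("/proc/self/status"):
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("threads in this process:", read_thread_count(os.getpid()))
    print("RLIMIT_NPROC soft/hard:", soft, hard)
```

If the count for a dgraph PID climbs steadily over days toward one of the limits (the per-user process limit, the kernel's thread maximum, or a container PID limit), that would be consistent with the error above.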

@mikehawkes Have you tried running dgraph in a different environment? I think it might be because of some environment issues. You can try running the dgraph binary and not the standalone docker image.

If you can help me with all the details about where you’re running dgraph and how you’re running dgraph, I can try to reproduce the crash and investigate it further.

@mikehawkes are you running dgraph on macOS?

It’s running in docker within an AWS large instance. I dropped to standalone as the standard images had failed. I also run it on my dev machines (Mac Pro and MacBook Pro) and haven’t encountered this on these machines. I suspect some resource isn’t getting released - hence the gradual memory creep and core files. I note that in another thread someone is also having issues with resources suddenly becoming unavailable … perhaps they’re related. That thread, however, deals with a docker image on Mac, if memory serves me correctly.

I believe you’re talking about “Dgraph v20.07.0 / v20.03.0 unreliability in Mac OS environment” (#7 by WolfgangFahl), but it seems unrelated to me.

@mikehawkes Do you have a script or something that you use to deploy dgraph on AWS, or do you just start the dgraph docker image on an EC2 machine? I want to run dgraph the same way you run it on AWS and see what happens.

It could be related to how dgraph works on AWS.