Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, OS)?
Two platforms: an AWS large instance running Linux with 16GiB RAM, and macOS with 32GiB RAM.
Steps to reproduce the issue (command/config used to run Dgraph).
Attempt to start up Dgraph. It never fully comes up, so it’s impossible to access anything. If I delete the data directories entirely, restart, then re-import the data via the live loader, it works. This is the third time this has happened over the course of the past couple of months. Luckily, I’m backing up regularly, but I have to completely kill the data directory and rebuild.
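For reference, my recovery cycle is roughly the following; the exact paths, ports, and export endpoint are from memory and vary between Dgraph versions, so treat them as approximate:

# Trigger an export while an alpha is still healthy (assumption: alpha's
# HTTP port is the default 8080; newer releases expose export through the
# GraphQL admin endpoint instead).
curl localhost:8080/admin/export

# When the database refuses to start: stop Dgraph and move the dead data
# directories aside (p and w for alpha, zw for zero).
mv p p.dead && mv w w.dead && mv zw zw.dead

# Re-import the last export with the live loader (assumption: default gRPC
# ports; flag names differ slightly between releases).
dgraph live -f path/to/export.rdf.gz -s path/to/export.schema.gz \
  --alpha localhost:9080 --zero localhost:5080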
Expected behaviour and actual result.
Expected: Dgraph would start up and serve data.
Actual: Dead DB, and no apparent way to recover from it.
Nothing works; I have a series of panics, and at that point I can’t connect to the DB. As this was a production system, I concentrated on getting it working again; luckily I had a DB export, and that allowed me to recover. I kept a copy of the dead DB on the machine. If I zip the folder structure in its entirety and pull it to a test server, can I attempt a restart locally, or can it only run on the production instance, as that’s where it was created?
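Concretely, what I have in mind is something like this (directory names are what I believe Dgraph’s defaults to be, and the test host name is just a placeholder):

# On the production instance, with dgraph stopped: archive the data
# directories in their entirety (p and w for alpha, zw for zero).
tar czf dead-dgraph.tgz p w zw

# Copy to the test server and unpack; the plan would then be to start the
# same dgraph version against these directories.
scp dead-dgraph.tgz test-server:~/
ssh test-server 'mkdir -p dead-dgraph && tar xzf dead-dgraph.tgz -C dead-dgraph'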
I see this in your logs. Can you double-check this IP and confirm that it’s reachable from the alpha?
If you’re running on AWS or gcloud, your IP will change if you shut down and restart the machine.
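For example, from the alpha machine (assuming zero is on its default gRPC port 5080, with <zero-ip> standing in for the address from your config):

# Check that zero's port answers from the alpha node.
nc -vz <zero-ip> 5080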
I do not see any panics in the logs you’ve shared.
This is a copy pulled down from AWS and run locally. If I run the Dgraph standalone image it works fine; if I then stop it, delete the dgraph directory, unzip the copied data into that directory, and start up again, I get the failure described above.
It also seems data-related: the same shell script starts the DB with an empty structure and with this populated one. It usually starts without issue and then runs for some days before dying. After that, I can’t restart it without having to delete the entire structure, import the data, and start again.
Also, if I start with a new directory structure and then import from an exported data set, all works … for a few days. Then it collapses again; if I don’t export regularly, once it’s gone, it’s impossible to reopen.
(P.S. On the panic front: I’ll re-run when I can get a quiet moment. Panics did appear during the cluster startup; I’m not sure I’ve seen them on the standalone. Both variants fail to start, however, so I’m unable to recover the database.)
Looks like I haven’t preserved the logs; the new instance has cleared the old ones! However, I do have a “core.13” file in the dgraph data directory. Is this a core file from a crashed instance, or something else? Can I delete it safely?
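In case it’s useful, I could check what produced it with the standard Linux tooling (assuming file and gdb are available where the dump landed; the dgraph binary path below is a guess):

# Identify the dump; for a crashed dgraph process this normally reports
# something like "ELF 64-bit ... core file ... from 'dgraph ...'".
file core.13

# Optionally open it against the dgraph binary to see the fatal signal
# (Go stack detail in gdb is limited, but the signal is usually visible).
gdb /usr/local/bin/dgraph core.13 -ex bt -ex quit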
Additional information: I’m wondering if this is actually a memory leak somewhere. I’ve kept an eye on the standalone Docker image and watched the memory footprint creep up as the day progresses. Also, on restart, it seems that the startup cleans up logs etc.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6037 root 20 0 14.6g 5.7g 381224 S 1.0 72.9 105:02.83 dgraph
6038 root 20 0 3243608 750744 41360 S 1.0 9.2 28:57.61 dgraph
After restarting the image:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13491 root 20 0 8288716 1.6g 100496 S 0.7 19.9 0:24.46 dgraph
13492 root 20 0 3191772 802640 58116 S 0.3 9.8 0:14.31 dgraph
So I’m wondering if it’s leaking memory somewhere and then crashing out. It still doesn’t let me open the DB once it has died, however, so the original database remains corrupt and unusable.
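If it helps, I can capture the creep over time with something along these lines (process name assumed to match “dgraph”, interval arbitrary):

# Log the memory use of the dgraph processes once a minute so the growth
# over a day is visible.
while true; do
  date '+%F %T'
  ps -C dgraph -o pid,rss,vsz,etime,comm
  sleep 60
done >> dgraph-mem.log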
Thanks - I wondered if the core.13 file was a core dump produced by Docker. I’ll delete it in that case. This morning I have another core file (core.12) with the same timestamp as when Dgraph died.
-rw------- 1 root root 203927552 Aug 11 23:03 core.12
I0811 23:02:09.694392 35 draft.go:523] Creating snapshot at index: 8145811. ReadTs: 8963338.
I0811 23:02:10.417434 36 oracle.go:107] Purged below ts:8963338, len(o.commits):6, len(o.rowCommit):154
runtime/cgo: pthread_create failed: Resource temporarily unavailable
W0811 23:04:05.176556 35 groups.go:835] No membership update for 10s. Closing connection to Zero.
E0811 23:04:06.804167 35 groups.go:796] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
E0811 23:04:06.869526 35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = transport is closing
E0811 23:04:06.869825 35 groups.go:896] Error in oracle delta stream. Error: rpc error: code = Unavailable desc = transport is closing
W0811 23:04:06.870025 35 pool.go:254] Connection lost with localhost:5080. Error: rpc error: code = Unavailable desc = transport is closing
W0811 23:04:06.870105 35 draft.go:1211] While sending membership to Zero. Error: rpc error: code = Unavailable desc = transport is closing
E0811 23:04:06.889557 35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0811 23:04:07.290211 35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
I0811 23:04:08.057900 35 groups.go:856] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0811 23:04:08.057926 35 groups.go:865] Got Zero leader: localhost:5080
E0811 23:04:08.058262 35 groups.go:877] Error while calling Oracle rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
E0811 23:04:08.290335 35 groups.go:744] While sending membership update: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:5080: connect: connection refused"
I0811 23:04:09.058456 35 groups.go:856] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0811 23:09:09.290847 35 draft.go:1269] Found 1 old transactions. Acting to abort them.
I0811 23:09:09.290872 35 draft.go:1272] Done abortOldTransactions for 1 txns. Error: No connection exists
github.com/dgraph-io/dgraph/worker.init
/tmp/go/src/github.com/dgraph-io/dgraph/worker/draft.go:1218
runtime.doInit
/usr/local/go/src/runtime/proc.go:5414
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.main
/usr/local/go/src/runtime/proc.go:190
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1373
I0811 23:10:09.290683 35 draft.go:1269] Found 1 old transactions. Acting to abort them.
I0811 23:10:09.290980 35 draft.go:1272] Done abortOldTransactions for 1 txns. Error: No connection exists
github.com/dgraph-io/dgraph/worker.init
/tmp/go/src/github.com/dgraph-io/dgraph/worker/draft.go:1218
runtime.doInit
/usr/local/go/src/runtime/proc.go:5414
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.doInit
/usr/local/go/src/runtime/proc.go:5409
runtime.main
/usr/local/go/src/runtime/proc.go:190
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1373
The message “runtime/cgo: pthread_create failed: Resource temporarily unavailable” might be the reason for your crashes. I’ve never seen this kind of error message before.
From the logs, it looks like there is a cgo crash, after which Raft starts having issues and the node is unable to communicate with the other nodes.
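pthread_create failing with “Resource temporarily unavailable” usually means the process hit a thread/process limit or could not get memory for a new thread stack, so it may be worth checking the limits in the environment where dgraph runs (the commands below are the usual Linux/Docker ones and are an assumption about your setup):

# Max processes/threads for the user running dgraph, checked inside the container.
ulimit -u

# System-wide ceilings.
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# If a pids cgroup limit applies to the container, it shows up here
# (cgroup v1 path; cgroup v2 uses a different layout).
cat /sys/fs/cgroup/pids/pids.max

# Or, from the host, check whether Docker caps the container's pids.
docker inspect --format '{{.HostConfig.PidsLimit}}' <container-name>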
@mikehawkes Have you tried running dgraph in a different environment? I think it might be caused by an environment issue. You could try running the dgraph binary directly rather than the standalone docker image.
If you can share the details of where and how you’re running dgraph, I can try to reproduce the crash and investigate it further.
It’s running in docker on an AWS large instance. I dropped to the standalone image as the standard images had failed. I also run it on my dev machines (Mac Pro and MacBook Pro) and haven’t encountered this on those machines. I suspect some resource isn’t getting released, hence the gradual memory creep and the core files. I note in another thread someone also having issues with a resource suddenly becoming unavailable … perhaps they’re related. That thread, however, deals with a docker image on Mac, if memory serves me correctly.
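I’ll also keep an eye on how many OS threads the dgraph processes hold, along these lines (assuming pgrep matches both dgraph processes shown in top above):

# A steadily climbing thread count would fit the idea of a resource not
# being released, and would eventually trip pthread_create.
for pid in $(pgrep -x dgraph); do
  echo "$pid: $(ls /proc/$pid/task | wc -l) threads"
done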
@mikehawkes Do you have a script or something that you use to deploy dgraph on AWS, or do you just start the dgraph docker image on an EC2 machine? I want to run dgraph the same way you run it on AWS and see what happens.
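For example, is it essentially something like the following? (The tag, volume path, and published ports here are my guesses, not taken from your setup.)

# Standalone image on a single EC2 box (all values illustrative).
docker run -d --name dgraph \
  -p 8080:8080 -p 9080:9080 -p 8000:8000 \
  -v ~/dgraph:/dgraph \
  dgraph/standalone:v20.03.4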