Moved from GitHub dgraph/3377
Posted by emhagman:
cluster0-node0_1 | I0506 18:45:32.456017 902 node.go:83] Rolling up Created batch of size: 574 B in 29.516µs.
cluster0-node0_1 | I0506 18:45:32.466456 902 node.go:83] Rolling up Created batch of size: 574 B in 31.077µs.
cluster0-node0_1 | I0506 18:45:32.490438 902 node.go:83] Rolling up Created batch of size: 214 B in 36.473µs.
cluster0-node0_1 | I0506 18:45:32.493054 902 node.go:83] Rolling up Created batch of size: 1.4 MB in 36.946µs.
cluster0-node0_1 | I0506 18:45:32.700170 902 node.go:83] Rolling up Created batch of size: 104 B in 32.176µs.
cluster0-node0_1 | I0506 18:45:32.783064 902 node.go:83] Rolling up Created batch of size: 196 B in 37.519µs.
cluster0-node0_1 | I0506 18:45:32.783080 902 node.go:83] Rolling up Sent 30 keys
cluster0-node0_1 | I0506 18:45:32.784633 902 draft.go:832] Rolled up 30 keys. Done
cluster0-node0_1 | I0506 18:45:32.784648 902 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node0_1 | E0506 19:23:42.063406 902 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
cluster0-node1_1 | I0506 18:45:25.241424 899 node.go:83] Rolling up Created batch of size: 90 B in 38.712µs.
cluster0-node1_1 | I0506 18:45:25.241944 899 node.go:83] Rolling up Created batch of size: 90 B in 37.024µs.
cluster0-node1_1 | I0506 18:45:25.311981 899 node.go:83] Rolling up Created batch of size: 104 B in 27.995µs.
cluster0-node1_1 | I0506 18:45:25.333585 899 node.go:83] Rolling up Created batch of size: 1.4 MB in 27.567µs.
cluster0-node1_1 | I0506 18:45:25.385264 899 node.go:83] Rolling up Created batch of size: 214 B in 36.223µs.
cluster0-node1_1 | I0506 18:45:25.387943 899 node.go:83] Rolling up Sent 30 keys
cluster0-node1_1 | I0506 18:45:25.388293 899 draft.go:832] Rolled up 30 keys. Done
cluster0-node1_1 | I0506 18:45:25.388303 899 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node1_1 | E0506 19:23:42.063405 899 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
cluster0-node2_1 | I0506 18:45:28.122878 900 node.go:83] Rolling up Time elapsed: 02s, bytes sent: 505 kB, speed: 253 kB/sec
cluster0-node2_1 | I0506 18:45:28.357085 900 node.go:83] Rolling up Created batch of size: 574 B in 32.376µs.
cluster0-node2_1 | I0506 18:45:28.402563 900 node.go:83] Rolling up Created batch of size: 90 B in 21.845µs.
cluster0-node2_1 | I0506 18:45:28.470928 900 node.go:83] Rolling up Created batch of size: 214 B in 36.455µs.
cluster0-node2_1 | I0506 18:45:28.497968 900 node.go:83] Rolling up Created batch of size: 1.4 MB in 24.755µs.
cluster0-node2_1 | I0506 18:45:28.500456 900 node.go:83] Rolling up Created batch of size: 196 B in 25.823µs.
cluster0-node2_1 | I0506 18:45:28.580287 900 node.go:83] Rolling up Created batch of size: 574 B in 20.943µs.
cluster0-node2_1 | I0506 18:45:28.790954 900 node.go:83] Rolling up Created batch of size: 104 B in 21.884µs.
cluster0-node2_1 | I0506 18:45:28.790970 900 node.go:83] Rolling up Sent 30 keys
cluster0-node2_1 | I0506 18:45:28.792696 900 draft.go:832] Rolled up 30 keys. Done
cluster0-node2_1 | I0506 18:45:28.792708 900 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node2_1 | E0506 19:23:42.063364 900 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
zero_1 | I0506 17:53:01.839982 1 oracle.go:106] Purged below ts:2338623, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 17:58:09.839775 1 oracle.go:106] Purged below ts:2338643, len(o.commits):0, len(o.rowCommit):7
zero_1 | I0506 18:05:29.840001 1 oracle.go:106] Purged below ts:2338660, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 18:07:30.853869 1 oracle.go:106] Purged below ts:2341497, len(o.commits):24, len(o.rowCommit):209
zero_1 | I0506 18:07:39.924921 1 raft.go:580] Writing snapshot at index: 817486, applied mark: 817670
zero_1 | I0506 18:08:04.167479 1 oracle.go:106] Purged below ts:2344523, len(o.commits):314, len(o.rowCommit):1588
zero_1 | I0506 18:10:32.891883 1 oracle.go:106] Purged below ts:2345730, len(o.commits):0, len(o.rowCommit):7
zero_1 | I0506 18:10:39.870465 1 raft.go:580] Writing snapshot at index: 819021, applied mark: 819023
zero_1 | I0506 18:13:00.839959 1 oracle.go:106] Purged below ts:2345748, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 18:18:08.839854 1 oracle.go:106] Purged below ts:2345755, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 18:40:30.839689 1 oracle.go:106] Purged below ts:2345776, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 18:43:04.840096 1 oracle.go:106] Purged below ts:2345794, len(o.commits):0, len(o.rowCommit):8
zero_1 | I0506 19:23:42.063033 1 run.go:259] --- Received terminated signal
zero_1 | I0506 19:23:42.063117 1 run.go:271] Shutting down...
The last line for each node is from me shutting the server down. I still have all the data directories around, but I needed to start a new instance right away.
They sit behind nginx, and nginx was giving me the following errors:

upstream timed out (110: Operation timed out) while reading response header from upstream

upstream server temporarily disabled while reading response header from upstream

Both of these mean nginx could not connect to the nodes and that it was not an nginx problem itself. Here is my nginx config:
# all servers
upstream grpcservers {
    zone grpc 64k;
    least_conn;
    keepalive 100;
    server cluster0-node0:9080 max_fails=3 fail_timeout=5s;
    server cluster0-node1:9081 max_fails=3 fail_timeout=5s;
    server cluster0-node2:9082 max_fails=3 fail_timeout=5s;
}

# dgraph grpc proxy
server {
    listen 9080 http2;
    access_log /var/log/nginx/access.log grpc;
    http2_max_concurrent_streams 500000; # 500k
    http2_max_requests 500000; # 500k

    location / {
        grpc_pass grpc://grpcservers;
        grpc_socket_keepalive on;
        error_page 502 = /error502grpc;
    }

    location = /error502grpc {
        internal;
        default_type application/grpc;
        add_header grpc-status 14;
        add_header grpc-message "unavailable";
        return 204;
    }
}
I tried killing and restarting nginx multiple times before finally giving up and creating a new server. That leads me to believe the problem was on the Dgraph side, because I'd assume all of the pooled connections would've been cleared once nginx shut down.
Here are the nginx logs:
19:20:06 [error] 7#7: *3669 no live upstreams while connecting to upstream, client: 10.1.1.192, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://grpcservers", host: "10.1.2.41:9602"
nginx_1 | 10.1.1.192 - - grpcservers [06/May/2019:19:20:06 +0000] "POST /api.Dgraph/Query HTTP/2.0" 204 0 "grpc-elixir/0.3.1" | 0.000 0.000 .
.3:9080", host: "10.1.2.41:9602"
nginx_1 | 2019/05/06 19:18:12 [error] 7#7: *8037 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.47, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.3:9080", host: "10.1.2.41:9602"
nginx_1 | 10.1.0.47 - - 172.28.0.3:9080 [06/May/2019:19:18:12 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.004 60.004 .
nginx_1 | 2019/05/06 19:18:14 [warn] 7#7: *8039 upstream server temporarily disabled while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.5:9081", host: "10.1.2.41:9602"
nginx_1 | 2019/05/06 19:18:14 [error] 7#7: *8039 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.5:9081", host: "10.1.2.41:9602"
nginx_1 | 10.1.0.22 - - 172.28.0.5:9081 [06/May/2019:19:18:14 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.000 59.999 .
nginx_1 | 2019/05/06 19:18:14 [warn] 7#7: *8041 upstream server temporarily disabled while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.4:9082", host: "10.1.2.41:9602"
nginx_1 | 2019/05/06 19:18:14 [error] 7#7: *8041 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.4:9082", host: "10.1.2.41:9602"
nginx_1 | 10.1.0.22 - - 172.28.0.4:9082 [06/May/2019:19:18:14 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.000 60.003 .
If you look at the two trailing numbers, e.g. 60.000 60.003, they mean it took 60s from Dgraph before timing out: the first is the upstream response time and the second is nginx's total response time.
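For reference, the "grpc" log format referenced in the access_log line above isn't shown; a minimal reconstruction that would produce those two trailing numbers, assuming they are $upstream_response_time and $request_time, looks like this:

# hypothetical reconstruction of the "grpc" log_format referenced in access_log;
# the two trailing numbers would be the upstream (Dgraph) response time and
# nginx's total request time
log_format grpc '$remote_addr - $remote_user $upstream_addr [$time_local] '
                '"$request" $status $body_bytes_sent "$http_user_agent" '
                '| $upstream_response_time $request_time .';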
Received this from dgraph-js:

1 CANCELLED: Received http2 header with status: 504
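That status lines up with the config above: only 502 is rewritten by error_page, so the 504 that nginx returns on an upstream timeout reaches the client as a plain http2 header, which the gRPC client surfaces as CANCELLED. A sketch of what also intercepting 504 might look like (not something I have configured, just for illustration):

# sketch only: the posted config maps 502 to a grpc-status response, but a 504
# (upstream timeout) passes through to the client unchanged; catching it the
# same way would be
error_page 502 504 = /error502grpc;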
The upstream docs say this about max_fails and fail_timeout (Module ngx_http_upstream_module):

max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.

fail_timeout=time
sets the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable, and the period of time the server will be considered unavailable. By default, the parameter is set to 10 seconds.

Meaning nginx had to fail to talk to a Dgraph alpha 3 times within 5 seconds for "upstream server temporarily disabled while reading response header from upstream" to show up.
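Applied to the config above, that works out roughly like this:

# with max_fails=3 fail_timeout=5s, three failed attempts against this alpha within
# a 5s window mark it unavailable, and nginx then skips it for the next 5s
server cluster0-node0:9080 max_fails=3 fail_timeout=5s;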
Here are the ulimits as well for the alphas and zero.
# ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
memory(kbytes) unlimited
locked memory(kbytes) 16384
process unlimited
nofiles 1048576
vmemory(kbytes) unlimited
locks unlimited
rtprio 0