Connections to all alphas started timing out around same moment

Moved from GitHub dgraph/3377

Posted by emhagman:

cluster0-node0_1  | I0506 18:45:32.456017     902 node.go:83] Rolling up Created batch of size: 574 B in 29.516µs.
cluster0-node0_1  | I0506 18:45:32.466456     902 node.go:83] Rolling up Created batch of size: 574 B in 31.077µs.
cluster0-node0_1  | I0506 18:45:32.490438     902 node.go:83] Rolling up Created batch of size: 214 B in 36.473µs.
cluster0-node0_1  | I0506 18:45:32.493054     902 node.go:83] Rolling up Created batch of size: 1.4 MB in 36.946µs.
cluster0-node0_1  | I0506 18:45:32.700170     902 node.go:83] Rolling up Created batch of size: 104 B in 32.176µs.
cluster0-node0_1  | I0506 18:45:32.783064     902 node.go:83] Rolling up Created batch of size: 196 B in 37.519µs.
cluster0-node0_1  | I0506 18:45:32.783080     902 node.go:83] Rolling up Sent 30 keys
cluster0-node0_1  | I0506 18:45:32.784633     902 draft.go:832] Rolled up 30 keys. Done
cluster0-node0_1  | I0506 18:45:32.784648     902 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node0_1  | E0506 19:23:42.063406     902 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
cluster0-node1_1  | I0506 18:45:25.241424     899 node.go:83] Rolling up Created batch of size: 90 B in 38.712µs.
cluster0-node1_1  | I0506 18:45:25.241944     899 node.go:83] Rolling up Created batch of size: 90 B in 37.024µs.
cluster0-node1_1  | I0506 18:45:25.311981     899 node.go:83] Rolling up Created batch of size: 104 B in 27.995µs.
cluster0-node1_1  | I0506 18:45:25.333585     899 node.go:83] Rolling up Created batch of size: 1.4 MB in 27.567µs.
cluster0-node1_1  | I0506 18:45:25.385264     899 node.go:83] Rolling up Created batch of size: 214 B in 36.223µs.
cluster0-node1_1  | I0506 18:45:25.387943     899 node.go:83] Rolling up Sent 30 keys
cluster0-node1_1  | I0506 18:45:25.388293     899 draft.go:832] Rolled up 30 keys. Done
cluster0-node1_1  | I0506 18:45:25.388303     899 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node1_1  | E0506 19:23:42.063405     899 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
cluster0-node2_1  | I0506 18:45:28.122878     900 node.go:83] Rolling up Time elapsed: 02s, bytes sent: 505 kB, speed: 253 kB/sec
cluster0-node2_1  | I0506 18:45:28.357085     900 node.go:83] Rolling up Created batch of size: 574 B in 32.376µs.
cluster0-node2_1  | I0506 18:45:28.402563     900 node.go:83] Rolling up Created batch of size: 90 B in 21.845µs.
cluster0-node2_1  | I0506 18:45:28.470928     900 node.go:83] Rolling up Created batch of size: 214 B in 36.455µs.
cluster0-node2_1  | I0506 18:45:28.497968     900 node.go:83] Rolling up Created batch of size: 1.4 MB in 24.755µs.
cluster0-node2_1  | I0506 18:45:28.500456     900 node.go:83] Rolling up Created batch of size: 196 B in 25.823µs.
cluster0-node2_1  | I0506 18:45:28.580287     900 node.go:83] Rolling up Created batch of size: 574 B in 20.943µs.
cluster0-node2_1  | I0506 18:45:28.790954     900 node.go:83] Rolling up Created batch of size: 104 B in 21.884µs.
cluster0-node2_1  | I0506 18:45:28.790970     900 node.go:83] Rolling up Sent 30 keys
cluster0-node2_1  | I0506 18:45:28.792696     900 draft.go:832] Rolled up 30 keys. Done
cluster0-node2_1  | I0506 18:45:28.792708     900 draft.go:353] List rollup at Ts 2345794: OK.
cluster0-node2_1  | E0506 19:23:42.063364     900 groups.go:693] Unable to sync memberships. Error: rpc error: code = Unknown desc = Server is being shut down. State: <nil>
zero_1            | I0506 17:53:01.839982       1 oracle.go:106] Purged below ts:2338623, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 17:58:09.839775       1 oracle.go:106] Purged below ts:2338643, len(o.commits):0, len(o.rowCommit):7
zero_1            | I0506 18:05:29.840001       1 oracle.go:106] Purged below ts:2338660, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 18:07:30.853869       1 oracle.go:106] Purged below ts:2341497, len(o.commits):24, len(o.rowCommit):209
zero_1            | I0506 18:07:39.924921       1 raft.go:580] Writing snapshot at index: 817486, applied mark: 817670
zero_1            | I0506 18:08:04.167479       1 oracle.go:106] Purged below ts:2344523, len(o.commits):314, len(o.rowCommit):1588
zero_1            | I0506 18:10:32.891883       1 oracle.go:106] Purged below ts:2345730, len(o.commits):0, len(o.rowCommit):7
zero_1            | I0506 18:10:39.870465       1 raft.go:580] Writing snapshot at index: 819021, applied mark: 819023
zero_1            | I0506 18:13:00.839959       1 oracle.go:106] Purged below ts:2345748, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 18:18:08.839854       1 oracle.go:106] Purged below ts:2345755, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 18:40:30.839689       1 oracle.go:106] Purged below ts:2345776, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 18:43:04.840096       1 oracle.go:106] Purged below ts:2345794, len(o.commits):0, len(o.rowCommit):8
zero_1            | I0506 19:23:42.063033       1 run.go:259] --- Received terminated signal
zero_1            | I0506 19:23:42.063117       1 run.go:271] Shutting down...

The last line from each node is from me shutting the server down. I still have all the data directories, but I needed to start a new instance right away.

They sit behind nginx and nginx was giving me the following errors:

upstream timed out (110: Operation timed out) while reading response header from upstream

upstream server temporarily disabled while reading response header from upstream

Both of these mean nginx could not connect to the nodes, so it was not an nginx problem.

Here is my nginx config:

# all servers
upstream grpcservers {
    zone grpc 64k;
    least_conn;
    keepalive 100;
    server cluster0-node0:9080 max_fails=3 fail_timeout=5s;
    server cluster0-node1:9081 max_fails=3 fail_timeout=5s;
    server cluster0-node2:9082 max_fails=3 fail_timeout=5s;
}

# dgraph grpc proxy
server {

  listen 9080 http2;
  access_log /var/log/nginx/access.log grpc;

  http2_max_concurrent_streams 500000; # 500k
  http2_max_requests 500000; # 500k

  location / {
    grpc_pass grpc://grpcservers;
    grpc_socket_keepalive on;
    error_page 502 = /error502grpc;
  }

  location = /error502grpc {
    internal;
    default_type application/grpc;
    add_header grpc-status 14;
    add_header grpc-message "unavailable";
    return 204;
  }

}
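
For what it's worth, the 60s values in the timeout logs line up with nginx's gRPC proxy defaults. Here is a sketch of making those timeouts explicit in the same location block (the values shown are just the nginx defaults, not something I actually set):

```nginx
  location / {
    grpc_pass grpc://grpcservers;
    grpc_socket_keepalive on;

    # nginx's gRPC proxy defaults; the 60.000/60.004 values in the
    # access log below match grpc_read_timeout expiring.
    grpc_connect_timeout 60s;  # time to establish the upstream connection
    grpc_send_timeout    60s;  # max time between two writes to upstream
    grpc_read_timeout    60s;  # max time between two reads from upstream

    error_page 502 = /error502grpc;
  }
```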

I tried killing and restarting nginx multiple times before finally giving up and creating a new server. That leads me to believe it was Dgraph, because I'd assume all of the pooled connections would have been cleared once nginx shut down.

Here are the nginx logs:

19:20:06 [error] 7#7: *3669 no live upstreams while connecting to upstream, client: 10.1.1.192, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://grpcservers", host: "10.1.2.41:9602"
nginx_1           | 10.1.1.192 - - grpcservers [06/May/2019:19:20:06 +0000] "POST /api.Dgraph/Query HTTP/2.0" 204 0 "grpc-elixir/0.3.1" | 0.000 0.000 .
nginx_1           | 2019/05/06 19:18:12 [error] 7#7: *8037 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.47, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.3:9080", host: "10.1.2.41:9602"
nginx_1           | 10.1.0.47 - - 172.28.0.3:9080 [06/May/2019:19:18:12 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.004 60.004 .
nginx_1           | 2019/05/06 19:18:14 [warn] 7#7: *8039 upstream server temporarily disabled while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.5:9081", host: "10.1.2.41:9602"
nginx_1           | 2019/05/06 19:18:14 [error] 7#7: *8039 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.5:9081", host: "10.1.2.41:9602"
nginx_1           | 10.1.0.22 - - 172.28.0.5:9081 [06/May/2019:19:18:14 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.000 59.999 .
nginx_1           | 2019/05/06 19:18:14 [warn] 7#7: *8041 upstream server temporarily disabled while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.4:9082", host: "10.1.2.41:9602"
nginx_1           | 2019/05/06 19:18:14 [error] 7#7: *8041 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 10.1.0.22, server: , request: "POST /api.Dgraph/Query HTTP/2.0", upstream: "grpc://172.28.0.4:9082", host: "10.1.2.41:9602"
nginx_1           | 10.1.0.22 - - 172.28.0.4:9082 [06/May/2019:19:18:14 +0000] "POST /api.Dgraph/Query HTTP/2.0" 504 167 "grpc-node/1.20.0 grpc-c/7.0.0 (linux; chttp2; godric)" | 60.000 60.003 .

If you look at the trailing numbers, `60.000 60.003` means the request spent 60s waiting on Dgraph before timing out: the first value is the upstream response time and the second is nginx's total request time.
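
The `access_log` line in my config references a `grpc` log format. I don't have the exact definition in front of me, but a format that would produce lines like the ones above (ending in `| 60.004 60.004 .`) might look like this hypothetical reconstruction:

```nginx
# Hypothetical reconstruction of the "grpc" log format used above.
# The two numbers after the pipe are $upstream_response_time and
# $request_time, which is how "| 60.004 60.004" reads as
# upstream time vs. total nginx time.
log_format grpc '$remote_addr - $remote_user $upstream_addr [$time_local] '
                '"$request" $status $body_bytes_sent "$http_user_agent" '
                '| $upstream_response_time $request_time .';
```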

Received this from dgraph-js:
1 CANCELLED: Received http2 header with status: 504

The nginx upstream docs describe max_fails: http://nginx.org/en/docs/http/ngx_http_upstream_module.html

max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.
fail_timeout=time
sets the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable, and the period of time the server will be considered unavailable. By default, the parameter is set to 10 seconds.

This means nginx failed to talk to a Dgraph Alpha 3 times within 5 seconds in order for "upstream server temporarily disabled while reading response header from upstream" to show up.
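
Concretely, with my settings the passive health check works out like this (a commented sketch of the same upstream block; note that nginx server parameters are space-separated):

```nginx
upstream grpcservers {
    zone grpc 64k;
    least_conn;
    keepalive 100;
    # max_fails=3 fail_timeout=5s: if 3 attempts to a server fail
    # within a 5s window, nginx marks that server unavailable for
    # the next 5s. With all three alphas timing out at once, every
    # server trips this and nginx reports "no live upstreams".
    server cluster0-node0:9080 max_fails=3 fail_timeout=5s;
    server cluster0-node1:9081 max_fails=3 fail_timeout=5s;
    server cluster0-node2:9082 max_fails=3 fail_timeout=5s;
}
```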

Here are the ulimits as well for the alphas and zero.

# ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
memory(kbytes)       unlimited
locked memory(kbytes) 16384
process              unlimited
nofiles              1048576
vmemory(kbytes)      unlimited
locks                unlimited
rtprio               0

emhagman commented :

Here is the data from Prometheus via Grafana today:

https://snapshot.raintank.io/dashboard/snapshot/Z5rn8esZ9Q4E3vydo53IkYu9Rx9TCZh6

Here is the data since the nginx config change on April 23rd:

https://snapshot.raintank.io/dashboard/snapshot/oENQbPNQqmDT7j75dUudKbr8iFJ8DS1s

danielmai commented :

@emhagman What Dgraph version did you see this issue with? And is it happening with the latest Dgraph release? It’s not obvious to me what the problem was or what caused it.