_Rendezvous of RPC that terminated with StatusCode.UNAVAILABLE, Error received from peer

AYEG · January 5, 2019, 5:21pm

Hello,
Currently running a test setup of dgraph in a swarm with 1/1 replicas on a single linux host.

Note: Dgraph is running together with the tiangolo/uwsgi-nginx-flask-docker stack (a Docker image with uWSGI and Nginx for Flask applications in Python, running in a single container), behind a Traefik reverse proxy.

Everything is up and running and I’m able to fetch, store and alter data correctly, but on the first request or after a while of inactivity, the client returns:

_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "OS Error"
debug_error_string = "{"created":"@1546640683.283755066","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}"

When I send the same request again afterwards, it returns the correct data without any errors.

What could be the problem here? Is this normal behavior? e.g. should I configure something on my host or perform regular “check-ins” from the backend to prevent this from happening?

MichelDiz · January 5, 2019, 10:57pm

Hi,

Is it safe to mix Traefik and Nginx? By the way, Traefik needs a specific config for gRPC. Did you set that up?
If you're only using HTTP, I don't think that mix would be an issue, though it is tricky.

Try to review your stack. Check traefik stats and add Grafana to check Dgraph stats.

The error code from gRPC is Code = 14 (Unavailable). This error is unusual with Dgraph.
Share your logs if you see something odd.

github.com: grpc/grpc-go/blob/c71aa62423b37215980f9c3141eef06f4c35e998/codes/codes.go#L133



	// Unimplemented indicates operation is not implemented or not
	// supported/enabled in this service.
	Unimplemented Code = 12

	// Internal errors. Means some invariants expected by underlying
	// system has been broken. If you see one of these errors,
	// something is very broken.
	Internal Code = 13

	// Unavailable indicates the service is currently unavailable.
	// This is a most likely a transient condition and may be corrected
	// by retrying with a backoff.
	//
	// See litmus test above for deciding between FailedPrecondition,
	// Aborted, and Unavailable.
	Unavailable Code = 14

	// DataLoss indicates unrecoverable data loss or corruption.
	DataLoss Code = 15

AYEG · January 5, 2019, 11:39pm

Yes, it worked well for a previous (albeit somewhat smaller) project, with Traefik doing the load balancing and routing requests to the right container, while Nginx is dedicated to the Flask uWSGI Python backend container.

Going from dev in docker-compose to a swarm deployment is where it looks like the problems started happening (I also had some connection issues with Postgres, but those seem to be resolved by using the dnsrr Docker setting).

I've used port labels for gRPC communication with Traefik on the default network, e.g.

zero:
  deploy:
    labels:
      - traefik.enable=true
      - traefik.gRPC-external.port=5080
      - traefik.HTTP-external.port=5080
      - traefik.tags=${TRAEFIK_TAG}
alpha_1:
  deploy:
    labels:
      - traefik.enable=true
      - traefik.gRPC-internal.port=7080
      - traefik.gRPC-external.port=9080
      - traefik.HTTP-external.port=8080
      - traefik.tags=${TRAEFIK_TAG}

Thanks for pointing me to Grafana, will try it out to investigate further!

MichelDiz · January 5, 2019, 11:47pm

So, there's the key. A single Docker environment behaves very differently from swarm. Try comparing your setup with the "Get started with Dgraph" guide.

AYEG · January 6, 2019, 12:12am

I tried to follow that example as closely as possible, but ran into trouble with 3 Alphas sharing a volume, so I limited the cluster to one Alpha. Another thought I had is that perhaps I need to create more volumes (/dgraph, /dgraph1, /dgraph2) or consider using VMs so I can run those 3 Alphas on my Linux machine. Would that make sense?

MichelDiz · January 6, 2019, 12:18am

Since you're using Docker, the type of machine you're using does not matter that much. True, each Alpha instance needs its own volume. If all or some instances use the same volume and path, this will lead to many problems.

The volumes are defined in the docker-compose itself.

AYEG · January 6, 2019, 12:43am

Ok. Thinking about it some more, I remember I didn't see any issues in the Dgraph container logs when I ran only one Alpha in the cluster; everything was working as it should except for that gRPC error. So I'll focus on that first.

AYEG · January 9, 2019, 10:27am

How do I enable gRPC logging to the container log? In my env_file I've placed:

GRPC_VERBOSITY=DEBUG
GRPC_TRACE=api,channel,call_error,connectivity_state,http,server_channel

source: grpc/environment_variables.md at master · grpc/grpc · GitHub
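If the env_file values do not seem to reach the Python process, one thing that might be worth trying is setting the same two variables from Python itself. This is only a sketch: it assumes the gRPC library reads these variables when it is first imported, so the lines have to run before any `import grpc`, and the shorter tracer list is just an illustrative subset of the one above. gRPC writes this output to stderr, which Docker normally captures in the container log.

```python
import os

# Set the variables before gRPC is imported for the first time.
# setdefault keeps any value already supplied through the env_file.
os.environ.setdefault("GRPC_VERBOSITY", "DEBUG")
os.environ.setdefault("GRPC_TRACE", "connectivity_state,http")

# Only import gRPC (or a client that imports it, such as pydgraph) after this point:
# import grpc
```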

MichelDiz · January 9, 2019, 6:37pm

I believe you can do a trace of gRPC via Jaeger. Dgraph exposes nothing beyond the trivial about gRPC.
But with Jaeger you have more complete info about what is going on with Dgraph.

https://docs.dgraph.io/deploy/#examining-traces-with-jaeger

AYEG · January 11, 2019, 6:24pm

I got Jaeger up and running and tried to integrate the Jaeger Python Flask client, but unfortunately I'm getting all kinds of import errors, or it just seems to freeze/deadlock.

I did manage to output the gRPC communication to the container logs by setting the Flask log_level to debug, i.e.

if __name__ == "__main__":
    log_level = logging.DEBUG

I’ll start collecting some data when an error happens to see if it can provide some more context about what is happening

AYEG · January 16, 2019, 1:23pm

Ok, so after trying many things I couldn't find the underlying cause, other than that the error happens after the connection has been idle for some time.

Retrying immediately afterwards works, and with the help of the grpc library it is actually not so difficult to handle the exception (example: http://avi.im/grpc-errors/#python)

import grpc
…

        # e is the grpc.RpcError caught around the Dgraph call
        if e.code():  # http://avi.im/grpc-errors/#python
            # e.details() holds the error message, e.g. "OS Error"
            status_code = e.code()
            status = str(status_code)  # e.g. "StatusCode.UNAVAILABLE"
            print("Exception Status Code", status_code)
            return status

Then look for the error code in the response,
e.g. if status == "StatusCode.UNAVAILABLE":

and try again whenever necessary.
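
For anyone landing here later, the retry idea above could be wrapped in a small helper along these lines. It is a minimal sketch, not a tested fix for the idle-connection problem: `call_with_retry` and `is_unavailable` are hypothetical names, the attempt count and delay are arbitrary, and the status check relies only on the error exposing a `code()` method whose string form is "StatusCode.UNAVAILABLE", as in the snippet above.

```python
import time


def is_unavailable(exc):
    """Return True if exc looks like a gRPC error with status UNAVAILABLE (code 14)."""
    code = getattr(exc, "code", None)
    return callable(code) and str(code()) == "StatusCode.UNAVAILABLE"


def call_with_retry(fn, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call fn(); on UNAVAILABLE, wait with exponential backoff and call it again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise on the last attempt, or for any error that is not UNAVAILABLE.
            if attempt == attempts - 1 or not is_unavailable(exc):
                raise
            sleep(base_delay * 2 ** attempt)
```

Usage would look roughly like `call_with_retry(lambda: client.txn(read_only=True).query(q))`, where `client` is an existing pydgraph client and `q` a query string (both assumed here). Mutations need more care than read-only queries, since a blind retry could apply a change twice.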
