_Rendezvous of RPC that terminated with StatusCode.UNAVAILABLE, Error received from peer


(Ayeg) #1

Hello,
I’m currently running a test setup of Dgraph in a Docker swarm with 1/1 replicas on a single Linux host.

Note: Dgraph is running together with this stack: https://github.com/tiangolo/uwsgi-nginx-flask-docker behind a traefik reverse proxy.

Everything is up and running and I’m able to fetch, store and alter data correctly, but on the first request or after a while of inactivity, the client returns:

_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "OS Error"
debug_error_string = "{"created":"@1546640683.283755066","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}"

When I do the same request again afterwards it returns the correct data without any errors

What could be the problem here? Is this normal behavior? e.g. should I configure something on my host or perform regular “check-ins” from the backend to prevent this from happening?


(Michel Conrado (Support Engineer)) #2

Hi,

Is it safe to mix traefik and nginx? By the way, traefik needs a specific config for gRPC. Did you set that up?
If you’re only using HTTP, I think that mix shouldn’t be an issue, though it can be tricky.

Try to review your stack. Check traefik stats and add Grafana to check Dgraph stats.

The gRPC error code is 14 (UNAVAILABLE). This error is unusual with Dgraph.
Share your logs if you see something odd.


(Ayeg) #3

Yes, it worked well for a previous (albeit somewhat smaller) project, with traefik doing the load balancing and directing traffic to the right container, while Nginx is dedicated to the Flask uWSGI Python backend container.

Going from dev in docker-compose to a swarm deployment is where it looks like the problems started happening (I also had a connection issue with Postgres, but that seems to be resolved by using the dnsrr Docker setting).

I’ve used port labels for gRPC communication with traefik on the default network e.g.
zero:
  deploy:
    labels:
      - traefik.enable=true
      - traefik.gRPC-external.port=5080
      - traefik.HTTP-external.port=5080
      - traefik.tags=${TRAEFIK_TAG}
alpha_1:
  deploy:
    labels:
      - traefik.enable=true
      - traefik.gRPC-internal.port=7080
      - traefik.gRPC-external.port=9080
      - traefik.HTTP-external.port=8080
      - traefik.tags=${TRAEFIK_TAG}

Thanks for pointing me to Grafana, will try it out to investigate further!


(Michel Conrado (Support Engineer)) #4

So, that’s the key. A single Docker environment behaves very differently from swarm. Try comparing your setup with https://docs.dgraph.io/deploy/#using-docker-swarm


(Ayeg) #5

I tried to follow that example as closely as possible, but ran into trouble with 3 Alphas sharing a volume, so I tried limiting the cluster to one Alpha. Another thought I had is that perhaps I need to create more volumes (/dgraph, /dgraph1, /dgraph2) or consider using VMs so I can run those 3 Alphas on my Linux machine. Would that make sense?


(Michel Conrado (Support Engineer)) #6

Since you’re using Docker, the type of machine you’re using does not matter that much. True, each Alpha instance needs its own volume. If all or some instances use the same volume and path, this will lead to many problems.

The volumes are defined in the docker-compose file itself.


(Ayeg) #7

Ok :+1:. Thinking about it some more, I remember I didn’t see any issues in the Dgraph container logs when I ran only one Alpha in the cluster. Everything was working as it should except for that gRPC error, so I’ll focus on that first.


(Ayeg) #8

How do I enable gRPC logging to the container log? In my env_file I’ve placed:

GRPC_VERBOSITY=DEBUG
GRPC_TRACE=api,channel,call_error,connectivity_state,http,server_channel

source: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
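
For a Python client, the same variables can also be set from code instead of the env_file. The following is only a sketch under assumptions: to my understanding gRPC reads these variables when the library initializes, so they are set before `import grpc`, and the trace output goes to stderr, which Docker normally captures in the container log.

```python
import logging
import os
import sys

# Assumption: these must be in the environment before the gRPC library
# is first imported in this process. setdefault keeps any value that the
# env_file already provided.
os.environ.setdefault("GRPC_VERBOSITY", "DEBUG")
os.environ.setdefault(
    "GRPC_TRACE",
    "api,channel,call_error,connectivity_state,http,server_channel",
)

# Send Python-level log records to stderr as well.
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

# import grpc  # import only after the variables above are set
```

If the application runs under uWSGI, its own logging configuration decides where stderr ends up, so the output may need to be redirected there.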


(Michel Conrado (Support Engineer)) #9

I believe you can trace gRPC via Jaeger. Dgraph itself exposes only basic information about gRPC.
But with Jaeger you get more complete information about what is going on inside Dgraph.

https://docs.dgraph.io/deploy/#examining-traces-with-jaeger


(Ayeg) #10

I got Jaeger up and running and tried to integrate the Jaeger Python Flask client, but unfortunately I’m getting all kinds of import errors, or it just seems to freeze/deadlock.

I did manage to output the gRPC communication to the container logs by setting the Flask log_level to debug, i.e.

if __name__ == "__main__":
    log_level = logging.DEBUG

I’ll start collecting some data when an error happens to see if it can provide some more context about what is happening


(Ayeg) #11

Ok, so after trying many things I couldn’t find the underlying cause other than that the error happens after the connection goes idle after some time.
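
Since the error shows up after the connection has been idle, one thing that might help (an assumption, not something confirmed in this thread) is enabling gRPC keepalive pings on the client channel so the connection does not sit idle. The sketch below only builds the channel arguments; the default interval and timeout values are illustrative, and the server or a proxy in between may reject pings that arrive too frequently.

```python
def keepalive_options(interval_s=300, timeout_s=10):
    """Build gRPC channel arguments that enable HTTP/2 keepalive pings."""
    return [
        # Send a keepalive ping after this much time without activity.
        ("grpc.keepalive_time_ms", interval_s * 1000),
        # Consider the connection dead if the ping is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", timeout_s * 1000),
        # Allow pings even when no RPC is in flight (the idle case).
        ("grpc.keepalive_permit_without_calls", 1),
        # Do not cap the number of pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


# Usage (requires grpcio; the address is an assumption based on the
# alpha_1 service and port 9080 mentioned earlier in this thread):
#   import grpc
#   channel = grpc.insecure_channel("alpha_1:9080", options=keepalive_options())
```

Whether this removes the error depends on where the idle connection is being dropped, so it is worth combining with a retry.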

Retrying immediately afterwards works, so with the help of the grpc library it is actually not so difficult to handle the exception (example: http://avi.im/grpc-errors/#python)

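import grpc

def call_and_get_status(rpc):
    # rpc is a placeholder for any callable that performs the gRPC request
    try:
        return rpc()
    except grpc.RpcError as e:  # http://avi.im/grpc-errors/#python
        # e.details() holds the error message text
        status_code = e.code()
        status = str(status_code)
        print("Exception Status Code", status_code)
        return status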

Then look for the error code in the returned value,
e.g. if status == "StatusCode.UNAVAILABLE":

and try again whenever necessary.
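
The retry idea can be wrapped in a small helper. This is a minimal sketch, not the code actually used in this thread: `call_with_retry` and `is_unavailable` are hypothetical names, and the check reuses the string comparison from above so the example runs without grpcio installed. With grpcio available, comparing `exc.code() == grpc.StatusCode.UNAVAILABLE` would be the more direct check.

```python
import time


def is_unavailable(exc):
    """Return True if exc looks like a gRPC error with status UNAVAILABLE."""
    code = getattr(exc, "code", None)
    return callable(code) and str(code()) == "StatusCode.UNAVAILABLE"


def call_with_retry(rpc, attempts=3, delay=0.1):
    """Call rpc(); retry on UNAVAILABLE, up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except Exception as exc:
            # Re-raise anything that is not UNAVAILABLE, and the last failure.
            if not is_unavailable(exc) or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff
```

It would be used as `call_with_retry(run_query)`, where `run_query` stands for whatever function performs the Dgraph request. Only idempotent requests such as read-only queries are safe to retry blindly this way.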