Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on and need not propose solutions.
What you wanted to do
Be able to disable mTLS on health check and metrics endpoints when mTLS is used on Alpha nodes.
What you actually did
My deployment is on Kubernetes. For livenessProbe when using mTLS I’m forced to use exec.command instead of the standard http/https probe type:
For prometheus metrics, /debug/prometheus_metrics, I had to first create a dgraph-tls secret with the needed Certs/Key and then tell the ServiceMonitor to auth with it:
I don’t think the health check and prometheus metrics endpoints need to be secured with mTLS and it adds a lot of overhead Kubernetes to make it all play nice. Zero is “open”; it would make sense to make Alpha the same. At the very least, make it a configurable option.
Also, Azure Load Balancers do not support health probe types other than TCP (See #76657 so you cannot front Alpha (w/mTLS) without getting spammed with http: TLS handshake error from <ip>:<port>: EOF messages in the logs.
For the health endpoint, Dgraph can’t have only health and metrics be served over HTTP while the rest of the endpoints are served over HTTPS with mTLS. That would require a separate port to be exposed for insecure traffic only. That would be incompatible with the existing ports setup.
For health checks, we can add to the Dgraph docs and K8s configs the way to set up health checks for liveness and readiness with exec command probes for client authentication via curl.
For metrics, Prometheus has tls_config options to collect metrics from an mTLS source.
@danielmai Yea, we have Prometheus metrics configured with mTLS. It doesn’t break any “standard pattern” and works just fine. I would also add that Prometheus Metrics could be used in a malicious way so should be configured with mTLS along with the primary endpoint.
As for the Health Endpoint, using exec/curl, it introduces the following issues:
Breaks integration with Cloud LBs
Exec generally considered a last resort for HTTP based apps as it puts a higher load on the container (albeit small) and doesn’t bubble up the HTTP error code.
Curl opens up other security issues as it allows an attacker to pull down arbitrary code/data if they gained access to the container
Ideally Dgraph would run as a non-root user, blocking anyone from being able to install packages (ie. curl), and only contain packages that are absolutely necessary to run Dgraph (Alpha).
I was using a tcp check because of this issue, but the TLS handshake error in the logs isn’t great.
So just switched to the curl option. I think if you tweak the timeout to 10 seconds and have it run every 20 instead of 10 seconds, it should be a reasonable overhead, and the risk of zombie processes pretty low.
I guess we’ll find out.
We have made /health available without tls because orchestrators like kubernetes doesn’t support mtls. It’s merged in master and it will be available with 20.11 release