Allow metrics endpoint to run insecure when using mTLS on Alpha

Moved from GitHub dgraph/4910

Posted by fl-max:

Experience Report

Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on and need not propose solutions.

What you wanted to do

Be able to disable mTLS on health check and metrics endpoints when mTLS is used on Alpha nodes.

What you actually did

My deployment is on Kubernetes. For livenessProbe when using mTLS I’m forced to use exec.command instead of the standard http/https probe type:

                - /bin/sh
                - '-c'
                - "curl https://$MY_POD_NAME:8080/health?live=1 --http1.1 --cacert {{ .Values.alpha.config.tls_dir }}/ca.crt --cert {{ .Values.alpha.config.tls_dir }}/client.dgraphadmin.crt --key {{ .Values.alpha.config.tls_dir }}/client.dgraphadmin.key"

For prometheus metrics, /debug/prometheus_metrics, I had to first create a dgraph-tls secret with the needed Certs/Key and then tell the ServiceMonitor to auth with it:

      {{- if .Values.alpha.config.tls_dir }}
      scheme: https
        insecureSkipVerify: true
            name: dgraph-tls
            key: "ca.crt"
            name: dgraph-tls
            key: "client.dgraphadmin.crt"
          name: dgraph-tls
          key: "client.dgraphadmin.key"
      {{- end }}

Why that wasn’t great, with examples

I don’t think the health check and prometheus metrics endpoints need to be secured with mTLS and it adds a lot of overhead Kubernetes to make it all play nice. Zero is “open”; it would make sense to make Alpha the same. At the very least, make it a configurable option.

Any external references to support your case

manishrjain commented :

Is that a common practice? Do you have examples from other DBs to back up the proposal?

fl-max commented :

darkn3rd commented :

For the livenessProbe, could you just add scheme field set to HTTPS, so that the kubelet sends HTTPS request skipping the certificate verification?

fl-max commented :

Also, Azure Load Balancers do not support health probe types other than TCP (See #76657 so you cannot front Alpha (w/mTLS) without getting spammed with http: TLS handshake error from <ip>:<port>: EOF messages in the logs.

danielmai commented :

For the health endpoint, Dgraph can’t have only health and metrics be served over HTTP while the rest of the endpoints are served over HTTPS with mTLS. That would require a separate port to be exposed for insecure traffic only. That would be incompatible with the existing ports setup.

For health checks, we can add to the Dgraph docs and K8s configs the way to set up health checks for liveness and readiness with exec command probes for client authentication via curl.

For metrics, Prometheus has tls_config options to collect metrics from an mTLS source.

fl-max commented :

@danielmai Yea, we have Prometheus metrics configured with mTLS. It doesn’t break any “standard pattern” and works just fine. I would also add that Prometheus Metrics could be used in a malicious way so should be configured with mTLS along with the primary endpoint.

As for the Health Endpoint, using exec/curl, it introduces the following issues:

  1. Breaks integration with Cloud LBs
  2. Exec generally considered a last resort for HTTP based apps as it puts a higher load on the container (albeit small) and doesn’t bubble up the HTTP error code.
  3. Exec has a propensity to create Zombie processes (See 10 Ways to Shoot Yourself in the Foot with Kubernetes [23:28] )
  4. Creates a hard dependency on the curl package
  5. Curl opens up other security issues as it allows an attacker to pull down arbitrary code/data if they gained access to the container

Ideally Dgraph would run as a non-root user, blocking anyone from being able to install packages (ie. curl), and only contain packages that are absolutely necessary to run Dgraph (Alpha).

I was using a tcp check because of this issue, but the TLS handshake error in the logs isn’t great.

So just switched to the curl option. I think if you tweak the timeout to 10 seconds and have it run every 20 instead of 10 seconds, it should be a reasonable overhead, and the risk of zombie processes pretty low.
I guess we’ll find out.

We have made /health available without tls because orchestrators like kubernetes doesn’t support mtls. It’s merged in master and it will be available with 20.11 release