Moving Zero HTTP endpoints to Alpha GraphQL admin

At present, Zero has two kinds of open ports:

  1. HTTP (default 6080)
  2. gRPC (default 5080)

Over HTTP, zero serves the following endpoints:

| # | Endpoint | Read/Write (R/W) | Is security-critical? (Y/N) | Currently available at Alpha /admin | Info |
|---|---|---|---|---|---|
| 1 | /health | R | N | Y (with actual health status) | Just acts as a ping response provider. No information is emitted from here. Used by Kubernetes liveness probes. |
| 2 | /state | R | Y | Y | Exposes membership information for all the Zeros and Alpha groups. Also tells which predicates are being served by which group. |
| 3 | /removeNode | W | Y | N | Used to remove a node from an Alpha group. Emits a success message or error. |
| 4 | /moveTablet | W | Y | N | Used to move a tablet from one group to another. Emits a success message or error. |
| 5 | /assign | W | Y | N | Used to lease UIDs and timestamps. Responds with the start and end IDs of the lease. |
| 6 | /enterpriseLicense | W | Y | N | Used to apply an enterprise license. Emits a success message or error. |
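
For orientation, here is a minimal sketch of how these endpoints are addressed today over Zero's HTTP port. The helper `zero_url` is hypothetical, and the query parameter names (`id`, `group`, `tablet`, `what`, `num`) are assumptions based on common usage, so check them against your Dgraph version.

```python
from urllib.parse import urlencode

ZERO_HTTP = "http://localhost:6080"  # default Zero HTTP port


def zero_url(endpoint: str, **params) -> str:
    """Build the URL for one of Zero's HTTP endpoints (hypothetical helper)."""
    query = urlencode(params)
    return f"{ZERO_HTTP}{endpoint}" + (f"?{query}" if query else "")


# Read-only endpoints take no parameters.
print(zero_url("/health"))
print(zero_url("/state"))

# Write endpoints are driven by query parameters.
# The parameter names below are assumptions, not taken from this thread.
print(zero_url("/removeNode", id=3, group=2))
print(zero_url("/moveTablet", tablet="name", group=2))
print(zero_url("/assign", what="uids", num=100))
```

Note that nothing in such a request identifies the caller, which is why anyone who can reach the port can issue these operations.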

Other than /health, every endpoint either changes something in the system or emits information that may be critical from a security point of view. Only the system administrator is supposed to have access to these endpoints, so ideally Zero’s HTTP port should not be exposed to the public domain. Even so, an open HTTP port can still turn out to be a security risk. To overcome this problem, we are considering the following approaches:

  1. Completely move security-critical HTTP endpoints from Zero to Alpha’s GraphQL Admin endpoint (/admin). There they would be served in GraphQL over HTTP with proper access control mechanisms in place, as the /admin endpoint takes care of applying checks like IP Whitelisting, Poorman’s Auth and enterprise ACL. They would no longer be served by Zero over the HTTP port; instead, they would be served over Zero’s internal gRPC port, with Alpha’s GraphQL admin acting as a proxy to Zero’s gRPC calls. Zero’s internal gRPC port is not intended to be exposed to the public domain and should have mTLS in place.

    • Pros
      • All the access control mechanisms will be in place including ACL.
      • All the admin operations will be at one place.
    • Cons
      • Moving them to Alpha would imply removing them from Zero’s HTTP server, which would be a breaking change if we are to make this change as part of a patch release for v20.07.
  2. Introduce one flag in Zero to enable/disable all security-critical HTTP endpoints at once.

    • Pros
      • Not a breaking change.
      • Gives users an option to disable these HTTP endpoints, in case they can’t control port-level access in their environment.
    • Cons
      • If the endpoints are disabled, one would have to restart Zero to re-enable them before making any change to the configuration. That will cause downtime.
      • If one doesn’t want to restart, then we will have to support the functionality of these HTTP endpoints over Zero’s internal gRPC. At present, not all the functionality provided by these HTTP endpoints is available over Zero’s internal gRPC. gRPC would be inconvenient to use from an operations perspective.
      • Some users also want ACL checks to be enabled for these endpoints, which doesn’t seem possible in this approach.
  3. Allow authenticating access to HTTP port via Mutual TLS.

    • Pros
      • Not a breaking change, an enhancement instead.
      • Provides a trusted layer of authentication.
    • Cons
      • Some users also want ACL checks to be enabled for these endpoints, which doesn’t seem possible in this approach.
      • Users would want to add/remove clients to be trusted for mTLS, so we would need to add an extra endpoint for that. That endpoint would have to have a single trusted root-like client, which becomes a single point of failure in cases like the client’s private key getting leaked. Also, Zero would have to be taken down if the root client’s public verification key is to be changed.
  4. A combination of approaches 1 and 2. Do not remove the security-critical HTTP endpoints from Zero, but do introduce a flag in Zero to enable/disable them. Also, have them served by Alpha’s GraphQL admin.

    • Pros
      • Not a breaking change.
      • Users get a choice to expose these endpoints via Alpha with enterprise ACL or via Zero, or both.
    • Cons
      • Port hardening with mTLS is still required for the Zero gRPC port.
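
To make the flag in approaches 2 and 4 concrete, here is a rough sketch of the routing decision Zero's HTTP server would make. It assumes /health always stays reachable and that disabled endpoints answer 404; `critical_enabled` stands in for the proposed flag, which this thread does not name.

```python
from http import HTTPStatus

# Endpoints classified as security-critical earlier in this post.
SECURITY_CRITICAL = {
    "/state",
    "/removeNode",
    "/moveTablet",
    "/assign",
    "/enterpriseLicense",
}


def route(path: str, critical_enabled: bool) -> HTTPStatus:
    """Return the status Zero's HTTP server would answer with for `path`.

    `critical_enabled` is a hypothetical stand-in for the on/off flag.
    """
    endpoint = path.split("?", 1)[0]  # ignore query parameters
    if endpoint == "/health":
        # Always served, so per-instance liveness probes keep working.
        return HTTPStatus.OK
    if endpoint in SECURITY_CRITICAL:
        # With the flag off, the endpoint behaves as if it did not exist;
        # admins would use Alpha's /admin instead.
        return HTTPStatus.OK if critical_enabled else HTTPStatus.NOT_FOUND
    return HTTPStatus.NOT_FOUND


print(route("/moveTablet?tablet=name&group=2", critical_enabled=False))
print(route("/health", critical_enabled=False))
```

Keeping the default at "enabled" is what makes the change non-breaking: existing deployments see no difference until they opt out.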

Looking for any suggestions and ideas regarding this.

cc: @dmai @vvbalaji @pawan @mrjn

UPDATE (5 Oct 2020)

We are proceeding with the 4th approach at present. We still need to decide whether to fully close Zero’s HTTP port or just close the security-critical HTTP endpoints and not the HTTP port itself.
PR: feat(GraphQL): Zero HTTP endpoints are now available at GraphQL admin (GRAPHQL-1118) by abhimanyusinghgaur · Pull Request #6649 · dgraph-io/dgraph · GitHub

The following points should be noted if we are to close the HTTP port:

  • Zero /health would no longer work if the port is disabled by the flag. Meaning, if Alphas are down, then there’s no way to know Zero health. Kubernetes liveness probes rely on /health for per-instance health-checking. We will need to find another way to do that.
  • Any debugging or profiling data collection on Zero requires HTTP port to be open, which won’t work if the flag disables them.
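
The dependency on /health can be illustrated with a small probe, sketched here under the assumption that a liveness check is simply an HTTP GET that expects a 200. The function name `zero_is_alive` is made up for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen


def zero_is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Probe Zero's /health the way a per-instance liveness check would."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # A closed port, refused connection, timeout, or non-2xx answer
        # all look the same to the probe: the instance is reported as dead.
        return False
```

If the HTTP port is closed, this probe fails even when Zero itself is running fine, so the orchestrator would keep restarting a healthy instance.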

On the other hand, if we don’t close the port, but choose to close the security-critical HTTP endpoints, the following should be noted:

UPDATE (12 Oct 2020)

We are not going to pursue this at present.

https://github.com/dgraph-io/dgraph/pull/6867

UPDATE (31 March 2021)

This change is required now for the Dgraph Cloud architecture. So, we have gone ahead with the 4th approach and merged this change to master. This would be available in the v21.03 release.

https://github.com/dgraph-io/dgraph/pull/6649
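
As a rough sketch of what calling these operations through Alpha's /admin could look like from a client: the GraphQL operation names mirror the Zero endpoints, but the argument and response field names below are assumptions and should be checked against the admin schema of your release. `admin_request` is a hypothetical helper.

```python
import json


def admin_request(query: str) -> str:
    """Wrap a GraphQL document in the JSON body used for GraphQL over HTTP."""
    return json.dumps({"query": query})


# Field and argument names are assumptions, not taken from this thread.
MOVE_TABLET = """
mutation {
  moveTablet(input: {tablet: "name", groupId: 2}) {
    response { code message }
  }
}
"""

STATE = """
query {
  state {
    counter
  }
}
"""

body = admin_request(MOVE_TABLET)
# POST `body` to http://<alpha-host>:8080/admin with
# Content-Type: application/json, plus whatever auth headers /admin requires.
print(body)
```

Because the request goes through /admin, the existing checks there (IP whitelisting, Poorman’s Auth, ACL) apply before Alpha forwards the call to Zero.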

which scenarios? can you elaborate?

If we are doing this for 20.11, breaking changes are allowed.

Right. But @aman-bansal is working on hardening the internal gRPC ports with mutual TLS. So, any external attempts to get in will be futile.

what would be the default? And if we allow HTTP-only then we should put a disclaimer / release note somewhere that this is not a security best practice. Alternatively, we can also still keep HTTP but require mTLS i.e. https. But that is more work and throwaway work if we are eventually going to remove all HTTP zero endpoints

This can be avoided by just bouncing the GRPC server without restarting the zero, no?

This will be throwaway work if the end goal is to eventually just remove the Zero’s http port.
cc @aman-bansal @ahsan

I prefer doing Option 1 and 2 with the caveat/disclaimer that Zero’s HTTP end point will eventually be removed and that enabling HTTP only Zero endpoint is a security risk.

I vote for this. That would make all the operations safer and easier. The only problem is that there are users who rely on legacy things. They might have done some context pipeline of work that takes those endpoints into consideration.

I have a plan to create a small script in Deno (JS) that gives the user a legacy endpoint tho.

Also, I’m going to change “Add a Bulk Move Tablet and/or A deterministic scheme for Tablets (Also Geo-Sharding Support)” to use GraphQL. I think it would be way better to use the GraphQL layer for what I proposed there. I had mentioned the GraphQL Admin there, but with no examples.

@abhimanyusinghgaur BTW, I think this RFC should include Subscriptions. If we leave it as just Queries, the user or the app (e.g. Ratel) will need a cron job to pull that information, whereas via HTTP it is naturally updated on the fly (Ratel uses /health and other endpoints to gather information about the cluster). So subscriptions can do the trick.

I am not entirely aware of the exact scenario, but a customer had submitted a security assessment for their deployment, which requested this movement. @dmai would know the exact scenario.

The customer requested it to be a part of a patch release for 20.07. So, not removing the Zero endpoints for now.

Default is enabled. Anyone who wants to disable them will have to do it manually, which makes this a non-breaking change.

Previously we didn’t have gRPC methods for all the HTTP endpoints in Zero. Only if we make the change of having the Zero HTTP endpoints in Alpha GraphQL admin will we have the gRPC calls available.

I agree.

Going with this for now, as also suggested by @vvbalaji.

Nice Idea! All of the endpoints in this RFC except /state are mutations, so only /state will have a subscription, out of the endpoints in this RFC. But, I agree that the queries in /admin should have subscriptions as well. Subscriptions should work because of the way we have implemented them in GraphQL. Will have a separate RFC for having subscriptions in GraphQL /admin.

What about /health?

/health in Zero is just a ping response, which is different than /health in Alpha and GraphQL Admin.

  1. If we remove health from the zero endpoint, then what would be our preferred recommendation for checking Zero health?
  2. If we move status providing endpoints to alphas (health, and state) then it may happen that we have >1 versions of truth. This may happen when some alphas can see Zero while some others may not. From a monitoring point of view, an external client may be polling alphas in a round robin manner, and may see different parts of the whole truth. How do we handle this situation?

cc @dmai @joaquin

Changing /health or /state without providing an easy way to elicit health and readiness would make Zeros difficult to operationalize in Kubernetes or other orchestrators like Swarm, Nomad, Marathon, etc. This affects the reliability and availability of the Zero service.

Adding further clarity.

  • health = single service is available
  • readiness (state) = the zero group is available to service requests, that is 2/3 nodes, 3/5 nodes, etc.
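
The readiness rule above is plain majority arithmetic. A minimal sketch, assuming "ready" means a strict majority of the Raft group's members is healthy (`has_quorum` is an illustrative name):

```python
def has_quorum(healthy: int, total: int) -> bool:
    """True when a strict majority of the group's members is healthy."""
    if total <= 0 or not 0 <= healthy <= total:
        raise ValueError("need 0 <= healthy <= total and total > 0")
    return healthy >= total // 2 + 1


print(has_quorum(2, 3))  # 2 of 3 nodes up
print(has_quorum(3, 5))  # 3 of 5 nodes up
print(has_quorum(2, 5))  # 2 of 5 nodes up
```

A single node answering its own health check says nothing about this group-level condition, which is why the two signals need separate endpoints.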

Dgraph isn’t a single monolithic database service. There are several nodes in a Raft group acting as a single service, and a scheduling/orchestration platform needs to communicate with each of them properly. Otherwise, the platform’s requirements can be disruptive to the operation of a healthy cluster.

Thus please have a /state (or new alternative, such as /ready) that shows if zero group is ready to service.

I think that /health, before some changes, just printed “OK” with a 200 HTTP code. Now it has a lot of information. We can just leave it as it was before and move the other information to GraphQL Admin.

For now, let’s table this RFC. The requirement at the moment is to have TLS, and not necessarily that these endpoints are checked via ACL. Zero doesn’t know about ACL, so for now, let’s keep it that way.