Can I use multiple paths for data storage?

@korjavin For Dgraph HA on the Zero side, having 3 x Zero is recommended. If you use the Dgraph helm chart (link), the default is 3 x Alpha + 3 x Zero.
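
If you want to set those counts explicitly rather than rely on the defaults, the chart exposes replica counts for both components. The value names below are from memory and may differ between chart versions, so double-check with helm show values dgraph/dgraph:

helm install "my-release" dgraph/dgraph \
  --set zero.replicaCount=3 \
  --set alpha.replicaCount=3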

TL;DR If a GKE node goes down (which is a GCE VM instance in a node pool), the data will persist and should not be lost if Dgraph is deployed using the Dgraph helm chart.

Details: The background is that the Dgraph helm chart deploys a StatefulSet controller, which includes a Persistent Volume Claim to allocate external storage outside of the GKE node’s boot disk.

With this, there could still be an issue regarding availability. With Raft consensus (https://raft.github.io/), which is what Dgraph uses under the hood, you need to have (1) an odd number of Alpha members deployed, e.g. 3, 5, 7, etc., and (2) for availability, a majority of them online (more than half), so for 3, 5, or 7 members you'd need at least 2, 3, or 4 members online.
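
If it helps to see the arithmetic, availability needs a majority, i.e. floor(n/2) + 1 members online. A throwaway bash one-liner, purely for illustration:

for n in 3 5 7; do echo "$n members -> need $(( n / 2 + 1 )) online"; done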

Should a GKE node go down and that event push the membership below this quorum, Dgraph will not be available, at least until the StatefulSet controller kicks in and redeploys the pods on other available GKE nodes. Should you not have enough resources on the available GKE nodes, Dgraph will not be available until adequate Kubernetes resources are made available.

Another thing about the number of nodes: if there is only 1 shard (where 1 shard equals a Raft group), then 6 Alpha members (member = Dgraph node deployed as a K8S pod) would not be appropriate, as it is not an odd number. If you had 2 shards, then 6 is appropriate, as you'd have 2 Raft groups consisting of 3 Alpha members each.

For production environments, I would recommend having at least 3 GKE nodes, with 1 GKE node per Availability Zone. This increases HA in terms of resource availability and makes the cluster more resilient to potential outages, rare though they may be. This works well with Dgraph HA.


@korjavin Just to make sure we are on the same page.

  • Kubernetes: replicas represent the number of pods in a StatefulSet for either Zero or Alpha.
  • Dgraph Zero: --replicas is the number of Dgraph Alpha members to run per data shard group (each data shard group = Dgraph Alpha group = Raft group, so an odd number of replicas is required, e.g. 3, 5, 7, etc.). In the helm chart, this is configured with shardReplicaCount to avoid confusion (see the example after this list).
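
For example, setting 3 Alpha replicas per shard group would look like this (I am assuming the full value path is zero.shardReplicaCount; check helm show values dgraph/dgraph for your chart version):

helm install "my-release" dgraph/dgraph \
  --set zero.shardReplicaCount=3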

From my observation, regarding shards in general, sharding is not really needed until you have a high number of predicates and a large amount of data (~750 GB). Rebalancing, where predicates are divided between the shards, could then prove beneficial for performance. For starters, 3 x Dgraph Alpha members in a single shard should be sufficient.
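
If you ever want to check which predicates (tablets) ended up in which group, Dgraph Zero exposes a /state endpoint on its HTTP port (6080) that lists groups, members, and tablets. For example (the pod name assumes a release called my-release):

kubectl port-forward my-release-dgraph-zero-0 6080:6080
curl localhost:6080/state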

@korjavin @MichelDiz This is all taken care of through the automation configured in the helm chart with the default helm chart values. Specifically, the K8S StatefulSet controller uses a headless service that allows the pods to be named (and referenced by internal DNS) uniquely, so that you have name-alpha-0, name-alpha-1, and name-alpha-2.

For deployment, the default is replicas=3; you just run:

helm repo add dgraph https://charts.dgraph.io
helm install "my-release" dgraph/dgraph

Obviously, you need to have a GKE cluster available and a local KUBECONFIG that points to it before running helm.
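
On GKE you'd typically fetch the kubeconfig credentials with something like this (cluster name and zone are placeholders):

gcloud container clusters get-credentials my-cluster --zone us-central1-a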

Once you get going with this, you can access the pods using kubectl port-forward pod-name port, or from a client within the same GKE cluster.
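
For example, assuming the chart's default naming for a release called my-release, the first Alpha pod can be reached locally on the usual Dgraph ports (8080 for HTTP, 9080 for gRPC):

kubectl port-forward my-release-dgraph-alpha-0 8080:8080 9080:9080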

If you need to access the Dgraph Alpha from outside of the GKE cluster, then you can use an Ingress (L7 LB) or a Service of type LoadBalancer (L4 LB). If you need help with this part, let us know.
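
As a rough sketch of the L4 option (illustrative only, not the chart's own service; the selector labels below are placeholders, so match them to the labels you actually see with kubectl get pods --show-labels):

apiVersion: v1
kind: Service
metadata:
  name: dgraph-alpha-public
spec:
  type: LoadBalancer
  selector:
    app: dgraph        # placeholder label, adjust to your Alpha pods' labels
    component: alpha   # placeholder label, adjust to your Alpha pods' labels
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  - name: grpc
    port: 9080
    targetPort: 9080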

@korjavin Regarding SSD with GKE: This is allocated through a storage class definition that enables SSD to be used on GKE. Once we deploy it, we can refer to it when we deploy Dgraph using the Dgraph Helm Chart.

For example, if you used the example from the article, you would now have the default standard storage class along with the new faster one, like this:

$ kubectl get storageclass
NAME                 PROVISIONER            AGE
faster               kubernetes.io/gce-pd   14s
standard (default)   kubernetes.io/gce-pd   7d18h

As an example, you could deploy the Dgraph Helm chart using the command below to get SSD:

helm install "my-release" dgraph/dgraph \
  --set alpha.persistence.storageClass=faster

@joaquin Thank you very much!

I think I understood what you wrote above.

What I am struggling with is slightly different, though.

From my experiments with dgraph and other services a PVC
(as you described there:)

a Persistent Volume Claim to allocate external storage outside of the GKE node’s boot disk.

has poor performance.

I have had a positive experience with changing these disks to this type: About Local SSD disks  |  Compute Engine Documentation  |  Google Cloud

If an application can handle losing data by having some kind of replication, I definitely prefer these disks.

As you can see at the link I provided, there are a few scenarios in which data can be lost.

That’s the reason for my questions.

  1. I’d like to find a way to deploy Dgraph in a way that will handle recreation of one GKE node automatically without losing data.

  2. I’d like to attach different hostPaths to different Alpha nodes, because of restrictions in Google Cloud (they attach local-ssd disks as /mnt/disk1, /mnt/disk2, etc.).

(sorry for new acc, some restrictions in action)

I run many nodes on GKE and use the SSD storage class with large disks (1 TB per Alpha) to get the IOPS required. This runs well and did not seem to underperform the mega-fast scratch local SSDs available in GCP by enough to give up the operational ease of PVCs.

Note that Google disk speed is tied directly to the size of the disk, and you should be using an SSD storage class (it does not come standard on GKE installs).

@iluminae I understand what you are talking about; for some services I use provisioned disks as well.

I think my experience is a bit different.
So far, for services that depend on latency, I like local-ssd disks in Google Cloud more than provisioned disks.

I’d like to find a solution for my case.

But, moreover, I would like to understand all of those replica/shard patterns exactly, to be sure.
I am still not sure that I control these placements (i.e., on which node I will have my shard).

@korjavin2 For these specific questions (with further discussion below):

  1. This is not handled at the app layer (Dgraph), but rather at the provider layers (the K8S platform and the cloud provider). When provisioning a GKE cluster, you create a node pool, which uses a MIG (managed instance group) template. This MIG will recreate lost GCE instances that are the GKE nodes. Kubernetes will redeploy lost Dgraph pods through the StatefulSet to other available GKE nodes. If the persistence is decoupled, the data will persist with the newly allocated pod. If the persistence is coupled to the GKE node, then the data is lost, and you have to rely on replicating the data back to a new Alpha pod. More serious is the state data used to track membership and timestamps; see the explanation below.
  2. hostPath will only work on single-node clusters (ref), so as an alternative, local storage (local volumes) can be used instead, or external storage with SSD, which essentially mounts disks as well.

Recommendation

For HA in general, I recommend not using local disks, because you tether the pods’ persistence to the node, so if the node goes down, you lose the data and state.

For the use case with external SSD or local-storage, a new storage class needs to be deployed, which is then referenced when deploying the StatefulSet through the helm chart.

Disaster Scenario: Lost Dgraph Alpha member

Should a Dgraph Alpha member become completely lost, which is the case when persistence is lost with local-storage, data can be replicated back to the new Alpha node. This can take some time, depending on the size of the data.

If the persistence is preserved with external storage, then only a small subset of the data will need to be replicated, covering the few seconds it takes to redeploy the Dgraph Alpha pod and reattach the storage.

Disaster Scenario: Lost Dgraph Zero member

Dgraph Zero contains the state of the cluster, including things like timestamps and membership. This is not too dissimilar to how some clusters use service discovery with services like Zookeeper or Consul. Given this, if a Zero member along with its persistence is lost, then you will need to do a removal operation through the /removeNode endpoint described here.

This process involves removing the Dgraph Zero member of a given idx number and adding a new Dgraph Zero member with an incremented idx number. For example, with 3 Dgraph Zero members (zero-0 (idx=1), zero-1 (idx=2), zero-2 (idx=3)), if zero-1 (idx=2) were lost, then that one would need to be removed and a new Dgraph Zero added with an incremented index, e.g. zero-1 (idx=4). Afterward, you’ll have a cluster of: zero-0 (idx=1), zero-1 (idx=4), zero-2 (idx=3).
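
As a concrete (illustrative) example for the scenario above, you call /removeNode on any healthy Zero's HTTP port (6080); group=0 targets the Zero group, id=2 is the lost member's idx, and the pod name here assumes a release called my-release:

kubectl port-forward my-release-dgraph-zero-0 6080:6080
curl "localhost:6080/removeNode?group=0&id=2"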

If the persistence is preserved with external storage, then this process would not be needed on a GKE node failure.

Storage Method: External SSD

The default standard storage class on GKE is not SSD, which is why it will be slow.

If you would like to use faster SSD for external storage, then you’d need to deploy a new storage class using the same driver, but specifying SSD explicitly. If we call this deployed storage class faster, then when deploying Dgraph using the helm chart, we would specify the faster storage class; otherwise the standard slow default will be used.

Example Snippet:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: faster
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
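
Once that storage class is applied (the filename below is just whatever you saved the snippet as), you reference it when installing the chart. The alpha.persistence.storageClass value is the one shown earlier; I am assuming the Zero side mirrors it as zero.persistence.storageClass, so verify with helm show values dgraph/dgraph:

kubectl apply -f faster-storageclass.yaml
helm install "my-release" dgraph/dgraph \
  --set alpha.persistence.storageClass=faster \
  --set zero.persistence.storageClass=faster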

Storage Method: Local Storage

For local storage, this is how one could go about it:

  1. Create a nodepool that uses SSD specifically when provisioning a GKE cluster (per link mentioned earlier or similarly with another tool like Terraform)
  2. Create a storageclass that uses local storage, e.g. snippet.
  3. Deploy Dgraph, specifying persistence using that local-storage storage class.

Example Snippet:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
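
Note that with the kubernetes.io/no-provisioner provisioner, you also need to create the PersistentVolumes by hand, one per local SSD per node, which is also how different mount paths (e.g. /mnt/disks/ssd0) end up mapped to different Alpha pods without hardcoding hostPath. A minimal sketch, where the capacity, path, and node name are placeholders:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1-ssd0
spec:
  capacity:
    storage: 375Gi              # placeholder: size of the local SSD
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0       # placeholder: wherever the local SSD is mounted
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gke-my-cluster-pool-1-abcd   # placeholder: the node that owns this disk

Each Alpha's PVC then binds to one of these PVs, so each pod lands on the node that owns its disk.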

Storage Method: Others

Mentioned earlier was hostPath, which could be used for single-node clusters.

I saw this one, but I have not yet touched it, and I mention it here for completeness:


Thank you very much @joaquin, especially for “disaster cases”.

I will think more about it.