Hi @micky_mics, in our Dgraph high availability cluster we have been hitting this problem for several months: pushing bulk-loaded data to a single (leader) alpha does not make the follower alphas sync with it. I think I have an idea of why it happens.
I will explain the reason as I understand it at the end. @MichelDiz, maybe you or another dev can take a look and possibly file a bug, unless I am missing something. But first, the workaround that works for us.
Workaround:
For our project we only need enough high availability that at least 2 out of 3 alphas are serving. We don't really care whether there is 1 group or multiple groups, and we want all data to be available on all alphas as soon as possible after a bulk load. So what we tried is assigning a separate group id to each of the 3 alphas, forming 3 single-alpha groups:
```bash
# Derive idx/group from the StatefulSet pod ordinal (dgraph-alpha-0 -> 1, ...).
[[ $(hostname) =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
idx=$((ordinal + 1))
# Each alpha gets its own Raft group, so each one is the leader of its group.
dgraph alpha --my=$(hostname -f):7080 --raft="idx=$idx; group=$idx" --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-1.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-2.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080
```
As you can see, the only change is that we added `--raft="idx=$idx; group=$idx"`.
Now with this, if you copy the bulk-loaded data to any alpha and scale your dgraph-alpha StatefulSet down to 0 and back up to 3 (basically restarting the alpha pods after ensuring that dgraph-alpha-0's /dgraph/p directory contains your newly bulk-loaded data; see the sketch below the attached files), all alphas immediately get the data. Each alpha is now the leader of its own group and communicates with Zero directly to sync the schema. However, I haven't tried scaling up to 6 alpha servers to see whether followers would get the data too. I am attaching the Kubernetes YAML files for the 1-zero/3-alpha and 3-zero/3-alpha cases:
dgraph-ha-z1-a3-working.yaml (11.0 KB)
dgraph-ha-z3-a3-working.yaml (11.1 KB)
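For reference, the restart step is just scaling the StatefulSet down and back up. The StatefulSet name and label selector here are the ones from my manifests; adjust to yours:

```bash
# Restart all alpha pods by scaling the StatefulSet to 0 and back to 3.
kubectl scale statefulset dgraph-alpha --replicas=0
kubectl wait --for=delete pod -l app=dgraph-alpha --timeout=120s
kubectl scale statefulset dgraph-alpha --replicas=3
```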
If you follow the YAML files, you will see that I am creating a hostPath-based PVC (/tmp/dgraph-alpha-data) to which I send the bulk loader's "out/0/p" output directory, and then copy/replace the leader alpha's "/dgraph/p" with it before restarting the alpha.
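Roughly, the copy step amounts to the following. This is a one-off equivalent using `kubectl cp` rather than the hostPath mount from the YAML files; pod names and paths are the ones from my setup:

```bash
# Stage the bulk loader output, then swap it in as the alpha's posting directory.
kubectl cp out/0/p dgraph-alpha-0:/dgraph/p.new
kubectl exec dgraph-alpha-0 -- sh -c 'rm -rf /dgraph/p && mv /dgraph/p.new /dgraph/p'
```

After the swap, restart the pods as above so the alpha picks up the new p directory.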
Another benefit of this 3-group approach is that you no longer have to wait for 2 alpha followers to be available before your StatefulSet can serve data. As long as a single leader alpha is available, you can serve the data. We verified that through Ratel.
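You can also check this without Ratel, since each alpha serves a /health endpoint on its HTTP port (8080 in my setup):

```bash
# Ask every alpha pod for its health; a lone surviving leader still answers.
for i in 0 1 2; do
  kubectl exec dgraph-alpha-$i -- curl -s localhost:8080/health; echo
done
```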
However, for large datasets there may be predicate sharding/rebalancing inefficiencies with this layout that I haven't investigated yet. This solution works for our purposes for now.
The reason, as I understand it:
You provided the "how to bulk load" link above. When you create 3 zeros and 3 alphas this way, all 3 alphas end up in the same group (running `curl localhost:8080/state | jq .groups` inside any alpha pod shows the group membership; on a zero pod the state endpoint is on port 6080). The "For small datasets" section (Initial import (Bulk Loader) - Howto) says at Step 4, "After confirming that the snapshot has been taken", before going to Step 5. But if you follow the leader zero's log, you will most likely see this message instead: "Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0"
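To spot that message quickly, you can grep the leader zero's logs (the pod name below is from my manifests):

```bash
# Check whether Zero is skipping snapshots because the checkpoint map is empty.
kubectl logs dgraph-zero-0 | grep -i "skipping creating a snapshot"
```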
I looked into the code. This log line is printed from raft.go (dgraph/dgraph/cmd/zero/raft.go at 595b72da4ba55539967b24114c33a71c07c093b8 · dgraph-io/dgraph · GitHub). Looking at the condition there: because the number of groups and the number of checkpoints are not equal, Zero decides not to take a snapshot. And because no snapshot is taken for the group, the follower alphas never sync with the leader alpha.
This is my understanding. Now, why is the checkpoint map empty (length 0)? That is where I sense something is wrong when we try to run 3 alphas in a single group. Tracking down the functions, I believe the membership update never happens after the connections among the zeros and alphas are successfully established. I also tried changing alpha options such as "snapshot-after", but even then no snapshots were taken and I got the same errors.
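For completeness, this is roughly how we set it. `snapshot-after` is the key we tried; depending on your Dgraph version the `--raft` superflag may spell it differently (e.g. `snapshot-after-entries`), so double-check against your version's docs:

```bash
# Lower the Raft snapshot threshold on each alpha (this did not help in our case).
dgraph alpha --my=$(hostname -f):7080 \
  --raft="idx=$idx; group=$idx; snapshot-after=100" \
  --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080
```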
@MichelDiz, can you see why the checkpoints (based on min timestamps, I believe) are not being created and why snapshots are not being taken? Or please clarify what I misunderstood and any potential problems with my workaround. Thank you.