Dgraph bulk loader with smaller dataset: sync not working from leader to the followers

What I want to do

I am running Dgraph with 3 Zeros and 3 Alphas (one alpha group). I did a bulk load of the goldendata.rdf file, then manually copied the p directory to one of the alphas and started that alpha process. After waiting a few minutes, I started the remaining 2 alphas in that group.

It looks like the followers' p directories are not getting synced with the alpha leader. I am following
https://dgraph.io/docs/howto/importdata/bulk-loader/#how-to-properly-bulk-load

I saw neither the snapshot-created message in the alpha leader logs nor any snapshot-related logs in the follower alpha logs.

@MichelDiz, can you please provide some input?

Please list the high level idea of what you want to do
Trying to use bulk load with a smaller dataset.

What I did

Please list the things you have tried.

Dgraph metadata

dgraph version

PASTE THE RESULTS OF dgraph version HERE.

please paste the flags used to start the cluster.

I started the three Zeros using:

dgraph zero --my=zero1:5080 --replicas 3 --raft="idx=1"
dgraph zero -o 1 --my=zero2:5081 --replicas 3 --peer zero1:5080 --raft="idx=2"
dgraph zero -o 2 --my=zero3:5082 --replicas 3 --peer zero1:5080 --raft="idx=3"

Then I executed the bulk load using the 1million dataset taken from benchmarks/data at master · dgraph-io/benchmarks · GitHub:

dgraph bulk -f /bulk/1million.rdf.gz -s /bulk/1million.schema --map_shards=1 --reduce_shards=1 --http localhost:8000 --zero=localhost:5080

Then I copied the generated p directory to one of the alpha nodes in the alpha replica set and started the first alpha using
dgraph alpha --my=alpha1:7080 --zero=zero1:5080,zero2:5081,zero3:5082 --security whitelist=0.0.0.0/0

I didn't see the "Creating snapshot" message in the Dgraph Alpha log,
but in Ratel I can query the dataset.
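For reference, a minimal sketch of that copy step, assuming the bulk loader wrote its output to ./out/0 and alpha1 keeps its data under /data/alpha1 (both paths are only placeholders for my layout):

    # Replace alpha1's posting directory with the bulk loader output before starting it
    rm -rf /data/alpha1/p
    cp -r ./out/0/p /data/alpha1/p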

After 5 minutes I started the remaining two alphas using

dgraph alpha --my=alpha2:7081 --zero=zero1:5080,zero2:5081,zero3:5082 -o 1 --security whitelist=0.0.0.0/0
dgraph alpha --my=alpha3:7082 --zero=zero1:5080,zero2:5081,zero3:5082 -o 2 --security whitelist=0.0.0.0/0
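To see whether the leader ever streams a snapshot to these followers, I watch the logs like this (service names assume my Compose file uses alpha1/alpha2/alpha3):

    # On the leader: look for a snapshot being created or proposed
    docker compose logs alpha1 | grep -i snapshot
    # On a follower: look for a snapshot being received and applied
    docker compose logs -f alpha2 | grep -i snapshot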

As per the documentation, for smaller datasets the p directory should get synced from the leader to the follower alpha nodes, but that is not happening.

When I try to query the dataset through the second alpha, no data is present.
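For example, a query like the one below (assuming -o 1 shifts the second alpha's HTTP port to 8081, and using the name predicate from the 1million schema) returns data on alpha1 at port 8080 but comes back empty on alpha2:

    curl -s -H 'Content-Type: application/dql' localhost:8081/query -d '{
      q(func: has(name), first: 3) {
        uid
        name
      }
    }'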

As per the documentation (Initial import (Bulk Loader) - Howto), for smaller datasets the sync should work.

I am running a Docker Compose environment with 3 alphas and 3 zeros on a single host, using dgraph/docker-compose-ha.yml at main · dgraph-io/dgraph · GitHub.

Hi @MichelDiz, can you please help with this? We're blocked because of it.

Hi @micky_mics, in our Dgraph high-availability cluster we have been experiencing this problem for several months. Basically, passing data to a single (leader) alpha does not make the follower alphas sync with it. I think I have an idea why it may be happening.

I will explain the reason as I understand it at the end. @MichelDiz, maybe you or another dev can confirm it and possibly file a bug, unless I am missing something. But first, the workaround that works for us.

Workaround:

For our project, we just need high availability such that at least 2 out of 3 alphas are serving. We don't really care whether there is 1 group or multiple groups, and we also want all data to be loaded into all alphas as soon as possible after the bulk load. So what we tried is assigning a separate group ID to each of the 3 alphas, forming 3 alpha groups:

        # Extract the pod ordinal from the hostname (e.g. dgraph-alpha-2 -> 2)
        [[ $(hostname) =~ -([0-9]+)$ ]] || exit 1
        ordinal=${BASH_REMATCH[1]}
        # Raft IDs start at 1, so shift the zero-based ordinal
        idx=$(($ordinal + 1))
        # Give each alpha its own Raft index AND its own group
        dgraph alpha --my=$(hostname -f):7080 --raft="idx=$idx; group=$idx" --zero dgraph-zero-0.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-1.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-zero-2.dgraph-zero.${POD_NAMESPACE}.svc.cluster.local:5080

As you can see, we added --raft="idx=$idx; group=$idx".

Now with this, if you place the bulk-loaded data on any alpha and scale your dgraph-alpha StatefulSet down and back up to 3 (basically restarting the alpha pods after ensuring that dgraph-alpha-0's /dgraph/p directory contains your newly bulk-loaded data), all alphas immediately get the data loaded, because each alpha is now the leader of its own group and communicates directly with Zero to sync the schema. However, I haven't tried scaling to 6 alpha servers to see whether followers get the data too.

I am attaching the Kubernetes YAML files for the 1-zero/3-alpha and 3-zero/3-alpha cases: dgraph-ha-z1-a3-working.yaml (11.0 KB)
dgraph-ha-z3-a3-working.yaml (11.1 KB)
If you follow the YAML files, you will see that I create a hostPath-based PVC (/tmp/dgraph-alpha-data) into which I put the bulk-loaded "out/0/p" directory, and then copy/replace "/dgraph/p" with it for the leader alpha before restarting the alphas.
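A rough sketch of that restart sequence, using the StatefulSet name (dgraph-alpha) and hostPath (/tmp/dgraph-alpha-data) from the attached manifests; the label selector is an assumption on my part, so adjust it to your cluster:

    # 1. Put the bulk-loaded posting directory into the hostPath volume used by dgraph-alpha-0
    cp -r out/0/p /tmp/dgraph-alpha-data/p

    # 2. Restart the alpha pods so they start against the new p directory
    kubectl scale statefulset dgraph-alpha --replicas=0
    kubectl wait --for=delete pod -l app=dgraph-alpha --timeout=120s   # label is an assumption
    kubectl scale statefulset dgraph-alpha --replicas=3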

Another benefit of this 3-alpha-group approach is that you no longer have to wait for 2 alpha followers to be available before your StatefulSet can serve data. As long as a single leader alpha is available, you can serve the data. We verified that through Ratel.

However, for large datasets there may be some predicate sharding/rebalancing inefficiency that I haven't investigated yet. This solution works for our purposes for now.

The reason, as I understand it:

You provided the "how to bulk load" link above. When you create 3 zeros and 3 alphas, you will see that all 3 alphas end up in the same group (running 'curl localhost:8080/state | jq .groups.members' inside any zero/alpha server will show you the group state for members). If you look at the "For small datasets" section (Initial import (Bulk Loader) - Howto), Step 4 says "After confirming that the snapshot has been taken" before going to Step 5. But if you follow the leader Zero's log, you will most likely see this message instead: "Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0".
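Concretely, these are the two checks I keep coming back to (the pod name dgraph-zero-0 matches my Kubernetes setup; adjust for Docker Compose):

    # Which group each alpha ended up in (run inside any zero/alpha container)
    curl -s localhost:8080/state | jq .groups.members

    # The Zero leader's snapshot decision
    kubectl logs dgraph-zero-0 | grep -i snapshot
    # -> Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0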
I looked into the code. This log is printed from raft.go (dgraph/dgraph/cmd/zero/raft.go at 595b72da4ba55539967b24114c33a71c07c093b8 · dgraph-io/dgraph · GitHub). If you look at the condition, because the number of groups and the number of checkpoints are not equal, Zero decides not to take a snapshot. And because no snapshot is taken for the group, the follower alphas don't sync with the leader alpha.
This is my understanding. Now, why is the checkpoint map empty (length 0)? That is where I sense something is wrong when we try to run 3 alphas in a single group. Tracking down the functions, I believe the membership update is not happening after the connection is successfully established between the zeros and alphas. I tried changing alpha options such as "snapshot-after", but even then no snapshots were taken and I got the same errors.

@MichelDiz, can you see why the checkpoints (based on min timestamps, I believe) are not being created and why snapshots are not being taken? Or please clarify if I misunderstood, and point out any potential problems with my workaround. Thank you.

@brishtiteveja, I think you misunderstood the concept of Dgraph and how it works. Please go through their documentation about Dgraph replica groups and sharding.

If you don't maintain the replica set group in your alpha group, then you're actually deviating from the core concept of how Dgraph works.

I'd suggest not trying this in production. LOL

Wow! Interesting word choice and many assumptions! I have read their documentation many, many times… I am not claiming that I understand every single detail.

As I said, it works for the dataset we have; we do not require sharding at this moment. And I don't think you read what I wrote. First of all, you can just send the data to all alphas (leaders and followers) after the bulk load for a single group, as the docs mention for large datasets. The problem we are tackling here is: why do the follower alphas not sync, and how can you have 3 alphas instantly serving the bulk-loaded data? I also specifically said "there may be potential predicate sharding/rebalancing related inefficiency, that I haven't investigated yet".

And the workaround I mentioned is to have 3 alphas that are all synced quickly. I could easily increase the number of alphas to 9 while keeping 3 groups, giving 3 alphas per group, if I did need sharding in that case.
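For example, the startup snippet from my earlier post could be extended along these lines to map 9 alphas onto 3 groups; the group assignment below is only an illustration, not something I have actually run:

    # Ordinals 0..8 -> raft idx 1..9, groups 1..3 (three alphas per group)
    [[ $(hostname) =~ -([0-9]+)$ ]] || exit 1
    ordinal=${BASH_REMATCH[1]}
    idx=$((ordinal + 1))
    group=$((ordinal % 3 + 1))
    # ZERO_ADDRS stands for the same comma-separated zero list used earlier
    dgraph alpha --my=$(hostname -f):7080 --raft="idx=$idx; group=$group" --zero "$ZERO_ADDRS"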