Is there a command to explicitly force a reindex of ALL indices?

Mona_Fawzy · November 16, 2022, 5:40pm

Since the first version of Dgraph we have been experiencing our indices getting dropped and major instabilities around this.

We have worked around this in various ways (manually dropping indices, exporting and live loading, forcing mutations), but these are no longer tenable. Our database keeps growing, and we will need to find a different solution if we cannot stop Dgraph from randomly dropping indices.

Is there a command we can run to force a rebuild of indices?

Thanks for the help

MichelDiz · November 16, 2022, 6:13pm

Not as far as I know. @rarvikar do you know if there is?
Maybe there is a way via Badger, not sure.

Back in the day, about 2 and a half years ago, we re-indexed all predicates with every new load of the same schema. This caused a lot of lag issues. Maybe giving this an option would be a good idea.

You can try a bulk upsert to rename a predicate. This copies the data into a new predicate; then you drop the original predicate (be careful there) and rename it back via another bulk upsert.

See https://dgraph.io/docs/mutations/upsert-block/#example-of-val-function
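
For anyone following along, the rename round trip could be sketched roughly as below. This is a minimal Python sketch that only builds the upsert text, following the val-function pattern from the linked docs page. The helper name `copy_predicate_upsert` and the temporary predicate name `hidden_tmp` are made up for illustration, the sketch assumes a scalar (non-list) predicate, and it has not been run against any particular Dgraph version.

```python
def copy_predicate_upsert(src: str, dst: str) -> str:
    """Build an upsert block that copies every value of `src` onto `dst`.

    Assumption: `src` is a scalar predicate. List predicates and facets
    are not handled by this sketch.
    """
    return (
        "upsert {\n"
        "  query {\n"
        f"    v as var(func: has({src})) {{\n"
        f"      val_src as {src}\n"
        "    }\n"
        "  }\n"
        "  mutation {\n"
        "    set {\n"
        f"      uid(v) <{dst}> val(val_src) .\n"
        "    }\n"
        "  }\n"
        "}\n"
    )


# Step 1: copy hidden -> hidden_tmp (hypothetical temporary name).
step_1 = copy_predicate_upsert("hidden", "hidden_tmp")
# Step 2 (not shown): drop the original predicate, then check that its
# schema entry, including the index directive, is in place again.
# Step 3: copy hidden_tmp -> hidden; afterwards hidden_tmp can be dropped.
step_3 = copy_predicate_upsert("hidden_tmp", "hidden")

print(step_1)
```

Each generated block would be sent to an Alpha's `/mutate?commitNow=true` endpoint with `Content-Type: application/rdf`. On a large dataset it is probably worth trying this on a copy first, since step 2 is destructive.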

matthewmcneely · November 16, 2022, 6:20pm

@Mona_Fawzy Can you share more details? Dgraph version, deployment scenario, and how do you know the index was dropped: through Ratel? Slow query performance?

This seems like a pretty outrageous bug and I’d like to try to replicate if possible.

Mona_Fawzy · November 16, 2022, 6:44pm

Hi Matthew, thanks so much for the reply.

Dgraph Version: v21.12.0

We have millions of nodes and growing. When this happens, we first notice that filters return no results or queries are missing objects. For example, for nodes of type Experiment, we start to see index issues when we run queries like:


{
  experiments(func: type(Experiment)) @filter(eq(hidden, true)) {
    count(uid)
  }
}

We get a total of 0 when this index is dropped.
We can try to drop and re-add the index via the schema, which sometimes works and returns all objects, but a while later the index disappears again.
In this case, “hidden” is an indexed predicate.
We are also seeing this issue with the “dgraph.type” index and many others (it is unclear which are dropped), where a bunch of objects are missing from the index and we cannot force a reindex.
When we have the id of an object and upsert the correct dgraph.type value, the object then seems to appear in the filter.

We have been working around this issue by exporting and live loading, but we are no longer able to do this with the size of our data. The alphas need more than 300 GB of memory to live load.

A few more notes: We have been restoring binary backups and wonder if a timestamp issue or something related is causing this. It would be very helpful to be able to force a reindex via GraphQL or the CLI.
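
One way to check whether an index has drifted, rather than waiting for a filter to return nothing, is to compare a count that goes through the index with one that does not. The Python sketch below assumes a boolean predicate such as `hidden`, so every node found by `has(hidden)` should also be found by either `eq(hidden, true)` or `eq(hidden, false)`. The function names and the block aliases (`scanned`, `idx_true`, `idx_false`) are made up for illustration, and the Alpha URL is whatever your deployment exposes.

```python
import json
import urllib.request


def build_check_query(predicate: str) -> str:
    """DQL with one scan-based count and two index-based counts."""
    return (
        "{\n"
        f"  scanned(func: has({predicate})) {{ count(uid) }}\n"
        f"  idx_true(func: eq({predicate}, true)) {{ count(uid) }}\n"
        f"  idx_false(func: eq({predicate}, false)) {{ count(uid) }}\n"
        "}\n"
    )


def index_looks_stale(data: dict) -> bool:
    """True when the index-based counts do not add up to the scan count."""
    scanned = data["scanned"][0]["count"]
    indexed = data["idx_true"][0]["count"] + data["idx_false"][0]["count"]
    return scanned != indexed


def run_query(alpha_url: str, query: str) -> dict:
    """POST the query to an Alpha's /query endpoint and return the data part.

    Assumption: the Alpha accepts Content-Type application/dql; releases
    before v21.03 used application/graphql+- instead.
    """
    req = urllib.request.Request(
        alpha_url.rstrip("/") + "/query",
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/dql"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["data"]
```

A call such as `index_looks_stale(run_query(alpha_url, build_check_query("hidden")))` could run on a schedule, which might also help narrow down when the index goes missing relative to restores or other operations.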

matthewmcneely · November 16, 2022, 7:06pm

@Mona_Fawzy

Unfortunately you are seeing some of the issues that prompted the (new) Dgraph team to revert to v21.03. The @latest supported release is v22.0.1 which is based on v21.03. v21.12 is discontinued and unsupported. There were performance improvements introduced in v21.12 that were impressive, but also introduced some instability.

I recommend that you move to v22. This will require an export/import. Also, be aware that some features introduced in v21.12 are not in v22. I’m happy to help where I can, but it seems like you’re already quite adept at export/import!

Victor_Shih · November 16, 2022, 7:07pm

One of the predicate indexes being dropped is the one on dgraph.type, and this solution doesn’t work for list types. Is there another workaround we can use?

MichelDiz · November 16, 2022, 7:12pm

This is not a solution. We need a way to reproduce this and send it to the engineers to fix. Nothing like this should be happening.

We can try on our end, but it would be a huge help to get a reproduction from someone who is actually experiencing the issue.

This shows that there is a problem in the index system.

cc @Raphael

Mona_Fawzy · November 16, 2022, 7:18pm

I don’t think we will be able to live load; we keep running out of memory on 500+ GB servers. Any other suggestions on how we can live load? Or do we think this issue is resolved? (We can try it out.)

Also, on v21.03 we had these indexing issues too and were prompted to upgrade. We will try to live load now on v22.0.1 and let you know if we still run out of memory or have issues.

Additionally, one of the alphas just got into a crash loop:

Unable to find txn with start ts: 647

matthewmcneely · November 16, 2022, 7:28pm

I think you’ll need to bulk load

You might be interested in this thread: Critical bug in v21.12 permanently crashloops whole groups - #9 by gkocur

Mona_Fawzy · November 16, 2022, 8:42pm

Ok thanks, I forgot about bulk load since we normally restore. We can give it a try

Mona_Fawzy · November 16, 2022, 9:12pm

Ok, so we exported and are loading onto v22, but now we are seeing this when trying to reference the schema

We can update the schema after the fact, just FYI

Mona_Fawzy · November 16, 2022, 10:24pm

We are getting this log trying to bulk load, only one zero. What are we missing? We have never bulk loaded before

matthewmcneely · November 17, 2022, 12:21am

Not knowing much about your environment, it’s hard to say. I’m guessing you’re following the bulk loader instructions in the doc. Can you ping 172.20.26.155? Seems like a straightforward connection issue.

Mona_Fawzy · November 17, 2022, 12:40am

Ok yup, will spend more time on this. Live loading has always worked for us, and with this version we are not seeing OOM, but we do get this “connection reset by peer” after about 10 minutes of loading. We can tweak some live loading settings; if you have any ideas, that would be very helpful. We are loading into a cluster with only one alpha and one zero. Neither the alpha nor the zero is crashing, and memory is stable. I can send you our logs if that helps.

matthewmcneely · November 17, 2022, 12:58am

Right, but I think the amount of data you have is most likely too much for the live loader to handle. The bulk loader was designed for the large data sets that you described. https://dgraph.io/docs/deploy/fast-data-loading/overview/#bulk-loader

Mona_Fawzy · November 17, 2022, 1:12am

ok thanks will try to get this running then and ping tomorrow, thanks for the help today!

MichelDiz · November 17, 2022, 3:14am

Make sure you have a clean setup, with no Alpha running at the same time. Bulk just needs the Zero group. If you are using K8s, you have to find a way to delay the startup of the Alphas.

This error could be one of two things: either Zero is having a problem (config, or some files from old instances in the path that weren’t correctly cleaned), or the cluster has an Alpha running.

A possible problem could be in the dataset. If you use the --new_uids flag it might solve it. Pay attention. Even if it does, you need to make sure you started a cluster from scratch.

Share the command used to run the bulkloader.

Mona_Fawzy · November 17, 2022, 4:05am

I think it’s that the volume had some existing data; will try in the morning on a brand new volume. The command is:

dgraph bulk -f s3:///.../export/dgraph.r19.u1116.1709/g01.rdf.g -s s3:///.../export/dgraph.r19.u1116.1709/g01.schema.gz --zero 172.20.26.155:5080

In other news, I tweaked a few live load params to reduce batch and concurrency (from the defaults) and it finished in a couple of hours.

Mona_Fawzy · November 17, 2022, 5:58pm

Ok so we are seeing on this latest version, the alphas are available and responding fine, then randomly they get busy and we start seeing connection errors:

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Do you know why this is happening?

MichelDiz · November 17, 2022, 6:35pm

There are several background tasks that Dgraph runs on the fly. It looks “random” because some tasks are prioritized over others, so a current task gets paused so that another one with higher priority can run.

Depending on the intensity of the load, the configuration of your setup, how many resources you have available, how many Alphas are available, how many shards there are, and where the tablets are located, this can cause resource drain. If you have a single Alpha where the vast majority of tablets are located and you are loading while pointing only at that Alpha, it will choke.

Try to balance the load.

You can also remove compression with the flag --badger compression=none;
Change these values too --raft snapshot-after-entries=10000; snapshot-after-duration=30m; pending-proposals=256;

But don’t leave your cluster running with those config changes for a long period. Dgraph needs log compaction, snapshots and so on.
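
On the client side, "balance the load" could look something like the sketch below: rotate requests across several Alpha addresses and retry on a connection error with a short backoff. This is only an illustration under assumptions. The `send_balanced` helper and the addresses in the usage note are hypothetical, `send` is whatever function your application already uses to talk to one Alpha, and spreading client requests does not by itself move tablets between groups.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, List


def send_balanced(
    payloads: Iterable[str],
    alphas: List[str],
    send: Callable[[str, str], None],
    retries: int = 3,
    backoff_s: float = 1.0,
) -> Dict[str, int]:
    """Send each payload to the next Alpha in turn; retry on OSError.

    OSError is caught because both the built-in ConnectionError and
    requests.exceptions.ConnectionError derive from it.
    Returns how many payloads each Alpha accepted.
    """
    ring = itertools.cycle(alphas)
    sent = {alpha: 0 for alpha in alphas}
    for payload in payloads:
        for attempt in range(retries):
            alpha = next(ring)  # a retry moves on to the next Alpha
            try:
                send(alpha, payload)
                sent[alpha] += 1
                break
            except OSError:
                # Exponential backoff before trying the next Alpha.
                time.sleep(backoff_s * (2 ** attempt))
        else:
            raise RuntimeError("all retries failed for a payload")
    return sent
```

Usage would be along the lines of `send_balanced(batches, ["alpha-1:8080", "alpha-2:8080"], my_send)`, where `my_send(alpha, payload)` wraps your existing HTTP or gRPC call.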
