Is there a command to explicitly force a reindex of ALL indices?

We have been running 3 Alphas and 3 Zeros, distributed, without issues in the past. We have a Kubernetes setup and really powerful nodes, and we haven't had an issue like this on v21.12.0.

Could there be something wrong with the other two Alphas? The logs seem OK and are not reporting much. I tried scaling down and back up to see if that resolves it, but the Alpha chokes about 30 minutes later (and is fine for the first 30).

Any ideas what to look for? Is there some pending live load process maybe?

I think I figured out how to get live loading working on the older cluster that is on v21.12, so I will let you know if I see the same issues there (also a 3 Alpha, 3 Zero setup, seemingly well distributed).

Thank you so much for all of the help and time!

Is it under high load 24/7?

I can’t tell without some time looking into it.

That should not be the case. Liveload is just a small program that uses Dgo. It would only be the case if the operations, protobufs, or RDF handling changed, i.e. some incompatible function or API.

Is that 3 replicas, or 3 groups?

  1. Not under high load 24/7. The load seems to be manifesting internally right now, and that is why it is choking, not because of external load. The individual requests are large, though there are not a ton of them. Could the large size of the requested data be choking the Alphas?
  2. I only mention live loading on v21.12 because we were running out of memory, which is why we upgraded, but it seems I have found parameter tweaks that are helping with this.
  3. I am not totally clear on the difference between replicas and groups, but we only have one group, and it looks like 3 replicas.

How big? And what are the specs of your machines? Does each Alpha have its own environment, or do they share the same host?

Did you OOM locally with the live load, or did the cluster get OOMed?

Yes. If you have one group, you have a replicas=3 config.

See, everything that you do on one of the Alphas is replicated immediately to the other Alphas, because that's how replication works. If they are on the same host, this means you will have triple the resource usage. If you separate them onto their own machines, this will reduce the choking.
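
To illustrate, a one-group, replicas=3 cluster looks roughly like this when each process runs on its own host. This is only a sketch: the hostnames and ports are placeholders, not your actual setup.

```bash
# Zeros, one per host, forming a single Zero group with 3-way replication.
dgraph zero --my=zero-0:5080 --replicas=3
dgraph zero --my=zero-1:5080 --peer=zero-0:5080 --replicas=3
dgraph zero --my=zero-2:5080 --peer=zero-0:5080 --replicas=3

# Alphas, one per host. With a single group, all three Alphas hold
# the same data, so every mutation is replicated to all of them.
dgraph alpha --my=alpha-0:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
dgraph alpha --my=alpha-1:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
dgraph alpha --my=alpha-2:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
```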

Thanks for the detailed description.

They are all on separate hosts: we upgraded to m5.16xlarge instances for loading, which are pretty large.
We OOMed locally with the live load process, but we also saw all three Alphas hit 256 GB on the default settings of 10 concurrent requests and 1000 quads per batch; I dropped those down to 2 and 100 and there was no OOM.
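
Concretely, the tweaked live load invocation looks roughly like this (file names and endpoints are placeholders for our actual ones):

```bash
# Defaults are -c 10 (concurrent requests) and -b 1000 (N-Quads per batch);
# lowering both reduces memory pressure on the loader and the Alphas.
dgraph live \
  -f data.rdf.gz \
  -s schema.txt \
  -a alpha-0:9080 \
  -z zero-0:5080 \
  -c 2 \
  -b 100
```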

So, this solves the local live load OOM, right? What happens to the Alphas themselves? In the previous load you got this OOM and log compaction.

I think all the live loads we ran that got disconnected from the job are still queued and being applied. Can we tell if a live load is running?

Here are the default values for all clusters: txn-abort-after=5m; max-retries=-1; max-pending-queries=10000. Based on this, any transaction that fails will have at least one retry (it is the live loader that retries). If that retry fails again, the transaction will be aborted if it stays pending after 5 minutes.
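
Those values are set via the Alpha's limit superflag. As a sketch, assuming the v21.x superflag syntax and placeholder hostnames, they would be passed like this:

```bash
# The same default values quoted above, passed through the --limit superflag.
dgraph alpha \
  --my=alpha-0:7080 \
  --zero=zero-0:5080 \
  --limit "txn-abort-after=5m; max-retries=-1; max-pending-queries=10000"
```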

The live loader is a program. If it breaks or panics, it stops completely. But the transactions that already went through will go through the steps I mentioned above.

Well, we rebuilt both clusters, current and new version, from scratch. With the tweaked live load settings we were able to get our indices back, and the clusters are stable and responsive.

It's not OK that we had to rebuild our entire 200 GB DB, but it seems that was the only way to get Dgraph acting normally again. Hoping we can find ways to avoid this in the future.

Thanks for all of the help and support