Is there a command to explicitly force a reindex of ALL indices?

We have been running 3 Alphas and 3 Zeros, distributed, without issues in the past. We have a Kubernetes setup and really powerful nodes, and we haven't had an issue like this on v21.12.0.

Could there be something wrong with the other two Alphas? The logs seem OK and are not reporting much. I tried scaling down and back up to see if that resolves it, but the Alpha chokes about 30 minutes later (and is fine for the first 30).

Any ideas what to look for? Is there some pending live load process maybe?

I think I figured out how to get live loading working on the older cluster that is on v21.12, so I will let you know if I see the same issues there (also a 3 Alpha, 3 Zero setup, seemingly well distributed).

Thank you so much for all of the help and time!

Is it under high load 24/7?

I can’t tell without some time looking into it.

That should not be the case. Liveload is just a small program that uses Dgo. It would only be the case if the operations, protobufs, or RDF handling changed, i.e. some incompatible function or API.

Is that 3 replicas, or 3 groups?

  1. Not under high load 24/7. The load seems to be manifesting internally right now, and that is why it is choking, not because of external load. The individual requests are large, though there are not a ton of them. Could the large size of the requested data be choking the Alphas?
  2. I only mention live loading on v21.12 because we were running out of memory, which is why we upgraded, but it seems I have found parameter tweaks that are helping with this.
  3. I am not totally clear on the difference between replicas and groups, but we only have one group, and it looks like 3 replicas.

How big? And what are the specs of your machines? Does each Alpha have its own environment, or do they share the same host?

Did you OOM locally with the live load, or did the cluster get OOMed?

Yes. If you have one group, you have a replicas=3 config.

See, everything that you do on one of the Alphas is replicated immediately to the other Alphas, because that's how replication works. If they are on the same host, this means you will have triple the resource usage. If you separate them onto their own machines, this will reduce the choking.
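
To illustrate, a one-group, replicas=3 cluster looks roughly like this when each process runs on its own host. This is only a sketch: the hostnames and ports are placeholders, not your actual setup.

```bash
# Zeros, one per host, forming a single Zero group with 3-way replication.
dgraph zero --my=zero-0:5080 --replicas=3
dgraph zero --my=zero-1:5080 --peer=zero-0:5080 --replicas=3
dgraph zero --my=zero-2:5080 --peer=zero-0:5080 --replicas=3

# Alphas, one per host. With a single group, all three Alphas hold
# the same data, so every mutation is replicated to all of them.
dgraph alpha --my=alpha-0:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
dgraph alpha --my=alpha-1:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
dgraph alpha --my=alpha-2:7080 --zero=zero-0:5080,zero-1:5080,zero-2:5080
```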

Thanks for the detailed description.

They are all on separate hosts: we upgraded to m5.16xlarge instances for loading, which are pretty large.
We OOMed locally with the live load process, but we also saw all three Alphas hit 256 GB on the default settings of 10 concurrent requests and 1000 quads per batch; I dropped those down to 2 and 100 and there was no OOM.
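
Concretely, the tweaked live load invocation looks roughly like this (file names and endpoints are placeholders for our actual ones):

```bash
# Defaults are -c 10 (concurrent requests) and -b 1000 (N-Quads per batch);
# lowering both reduces memory pressure on the loader and the Alphas.
dgraph live \
  -f data.rdf.gz \
  -s schema.txt \
  -a alpha-0:9080 \
  -z zero-0:5080 \
  -c 2 \
  -b 100
```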

So, this solves the local live load OOM, right? What happens to the Alphas themselves? In the previous load you got this OOM and log compaction.

I think all the live loads we ran that got disconnected from the job are still queued and being applied. Can we tell if a live load is running?

Here are the default values for all clusters: txn-abort-after=5m; max-retries=-1; max-pending-queries=10000. Based on this, any transaction that fails will have at least one retry (it is the live loader that retries). If that retry fails again, the transaction will be aborted if it stays pending after 5 minutes.
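
Those values are set via the Alpha's limit superflag. As a sketch, assuming the v21.x superflag syntax and placeholder hostnames, they would be passed like this:

```bash
# The same default values quoted above, passed through the --limit superflag.
dgraph alpha \
  --my=alpha-0:7080 \
  --zero=zero-0:5080 \
  --limit "txn-abort-after=5m; max-retries=-1; max-pending-queries=10000"
```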

The live loader is a program. If it breaks or panics, it stops completely. But the transactions that already went through will go through the steps I mentioned above.

Well, we rebuilt both clusters, current and new version, from scratch. With the tweaked live load settings we were able to get our indices back, and the clusters are stable and responsive.

It's not OK that we had to rebuild our entire 200 GB DB, but it seems that was the only way to get Dgraph acting normally again. Hoping we can find ways to avoid this in the future.

Thanks for all of the help and support