Context Deadline Exceeded recurring again, unusable DB... (Shared Instance)

The problem is recurring again on ap-south-1.
My apps are no longer usable and customers have been complaining since this morning.

This is the 4th or 5th time this issue has happened, regardless of which region I choose.
Please put together an action plan to solve this issue once and for all.

We are paying for this service; we expect a minimum level of availability/reliability from Dgraph Cloud.
Please take our requests into consideration.

Thanks

2 Likes

The team is investigating the case.

Cheers.

Issue resolved around 2pm PST.

Is there any other information on how these specific issues are being resolved and how they can be prevented from happening again later?

2 Likes

I don’t think so, but I’ll check if it’s possible.

Thanks for resolving the issue.

However, if you say there are no plans to prevent this in the future, then we have no choice but to move out.

It seems that Dedicated instances are not impacted, so that is what I will be using temporarily for my existing projects.

Good luck

Can you not even explain what steps were taken to fix the problem? Was it adding more servers to the cluster, offloading namespaces to other non-busy clusters, stopping some kind of concurrent task, etc…

3 Likes

Plans do exist; there is no way to maintain a service without worrying about this - an issue for you is an issue for us all. There is now a team dedicated to digging into and solving these problems.

No, because this involves internal context. Due to security, I don’t think it’s a good idea to share these details - even more so when it comes to shared instances. But I’ll ask.

Given that this problem seems to be occurring regularly, would it make sense to release a statement about it? Not compromising information, just an acknowledgement that it is an ongoing issue and a confirmation that steps are being taken to address it? Maybe even addressing whether or not the source of the problem has been identified? Has it been identified?

Thanks.

EDIT: In fact, I’m still facing the issue on my shared instance right now - us-west-2

2 Likes

Hi Jackl, and thanks for the update on us-west-2. We dug into the “context deadline exceeded” messages in the logs and checked our monitoring, but we don’t see anything unusual there. Can you post the query and tell us whether the behavior is repeated or transient? If it is serious, please open a support ticket as well.

Note that “context deadline exceeded” is a very large umbrella error that is thrown by golang when an HTTP call is initiated and the response does not come back within a timeout period. So almost any slow response due to an overloaded system, software bug, or network issue can cause it.
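For anyone who wants to see where the message comes from, here is a minimal, self-contained Go sketch. The URL and the 50ms timeout are placeholders, nothing to do with Dgraph Cloud; the point is simply that any HTTP call whose context deadline expires before the response arrives fails with exactly this error.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Deliberately short deadline so the call is almost guaranteed to time out.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Placeholder URL: any endpoint that responds slower than the deadline
	// produces the same error.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/slow-endpoint", nil)
	if err != nil {
		fmt.Println("building request:", err)
		return
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Prints something like:
		//   Get "https://example.com/slow-endpoint": context deadline exceeded
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("unexpectedly fast response:", resp.Status)
}
```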

So in this case the logs did not look like the ap-south-1 logs, and we are pretty sure us-west-2 is not seeing the ap-south-1 error. I’ll post another note in a minute on what we did to resolve the (more serious) error in ap-south-1 and what we know now.

2 Likes

Thanks, I appreciate the response. There are tons of queries which result in this error periodically. I feel like one commonality may be queries/mutations involving edges which are either Interfaces or Unions.

Is there any guidance on what sort of queries are more likely to result in the error? For example, is it as simple as “the more complicated the query, the more likely it is to result in context deadline exceeded”?

1 Like

This happens on locally deployed databases of the open source version as well. (Just search this forum.)

If it cannot be explained how to avoid/handle this problem, then no one in their right mind would still consider Dgraph to be production ready.

Prove me wrong.

2 Likes

This what? Context deadline exceeded? Well, as Damon mentioned, it is a very broad thing. Everything related to it needs detailed context to determine the source of the problem. Personally, I only see this log when I heavily abuse the database on my machine (I’m not saying that’s the case here, but it usually is).

That’s normal. Every application reaches an execution limit based on the resources you have - Big-O notation*. But this error can also indicate things like a deadlock. This log alone doesn’t tell us that. You need to delve into the error and debug.

Okay. How exactly?

“Extraordinary claims require extraordinary evidence.” Carl Sagan.

Google “context deadline exceeded” and you will see that hundreds of thousands of people are in the same boat.

What I can say is that some issues on the Cloud cannot be shared, as they are confidential matters. But when it’s something related to a common problem, I think it’s okay to share.

I hope @Damon can help with that.

Nabil and Anthony - sorry we did not have more info sooner on the timeouts (context deadline exceeded) in ap-south-1. As I just wrote above, that’s a very broad message that bubbles up whenever a golang HTTP call times out, so it can mean many things - often just a slow query or overloaded system.

But the issue in ap-south-1 was worse and also unique. The context deadline exceeded errors correlated with a drop-off in mutations (as seen by monitoring the rate of increase of the max timestamp for the cluster). We also saw “num pending txns: 1” many times in the logs. We are pretty sure a problematic mutation transaction was submitted at 9:15 and 9:30 UTC, respectively, on two successive days, causing two partial outages during which only (read) queries were still working.

To fix this and get the cluster healthy, we shut down the alpha, cleared the write-ahead log (w directory) of queued updates, and brought it back up to work around the issue both times. Note there was no data loss because these updates were queued in a submitted state and never committed at all. Queries were still processing during this partial outage. We have saved the WAL that we suspect has a root cause mutation in it, and are working to clarify that root cause. Because this has only happened twice in ap-south-1 (nowhere else and no other time) it will probably be worked as a normal priority bug. While we don’t have a root cause yet, I hope this helps clarify what we did, the workaround, and what we know so far.
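For anyone running their own cluster who hits the same symptom, here is a rough sketch of that workaround. The “w” directory path and the manual stop/restart steps are assumptions based on a default single-alpha setup, not an official recovery tool, and moving the WAL aside (rather than deleting it) keeps a copy for root-cause analysis, as we did.

```go
// Rough sketch of the workaround described above, assuming a self-hosted
// alpha whose Raft write-ahead log lives in the default "w" directory.
// Not an official recovery procedure: stop the alpha and keep a backup
// before touching the WAL.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	const walDir = "w" // assumption: default WAL location in the alpha's working directory

	// 1. Stop the alpha process (outside this program) so nothing is writing to the WAL.
	// 2. Move the WAL aside; this both clears the queued-but-uncommitted updates
	//    and preserves a copy for later root-cause analysis.
	backup := walDir + ".bak"
	if err := os.Rename(walDir, backup); err != nil {
		log.Fatalf("could not move %q aside: %v", walDir, err)
	}
	fmt.Printf("moved %q to %q; restart the alpha so it comes up with a clean WAL\n", walDir, backup)

	// 3. Restart the alpha; the stuck pending transaction is gone, so mutations
	//    can proceed again. Committed data is untouched.
}
```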

At this point, we have an alert set for a halt in the max timestamp (specifically we watch for the ApplyCh queue size rising too high) and we now know how to work around the issue by clearing the WAL. I hope this workaround helps anyone who encounters it, and we are also working to find the root cause - once we find that we will share the nature of the triggering query or other cause. But again, it seems very rare, so I would not say this is a critical process for everyone to document or monitor for based on what we know so far.

5 Likes

Here is a quote from the landing page:

“Scale seamlessly: Horizontally scale with ease to maintain high throughput and low latency even while serving terabytes of data. Dgraph is the next generation graph system designed for Google scale, built by ex-Googlers.”

So scaling should not face these kinds of limitations/problems.

But the problem occurs because some suspected mutation somewhere was bad. This leads me to believe there is a vulnerability somewhere (root cause unknown at this time) that can break the whole cluster. So even if you scale horizontally with namespacing, the risk is still there. And one of the common general workarounds is just to keep retrying until it eventually works.

This does not sound like a production ready system.

Prove to me Dgraph IS PRODUCTION READY

IMHO Dgraph is not production ready for “Google scale” or even production use scale.

1 Like

You can scale; the problem is that you can’t foresee when everyone will need to scale up at the same time. Scaling up/down is tricky. But you’re right, the K8s infra should scale under heavy usage.

I don’t think so; it sounded like someone ramping up usage during a rush period.

This is not about Dgraph; it’s about Dgraph along with K8s. K8s is wonderful, but with Dgraph, in my opinion, it is somewhat limiting. I don’t know, it’s just my opinion. I have a feeling, not proof. I always felt like I got the best out of Dgraph outside of containers.

That is at the petabyte scale. Dgraph’s design supports this just fine if you have the $$$ for the infra and the skill. The problem is orchestration + IO access, in my opinion. I’m a bare-metal purist; I think containers only serve to simplify one of the steps in the process. But if you want to get the most out of it, you should go bare-metal. But that is just me! I’m just one guy! And the cloud is complicated to manage by hand.

The burden of proof is usually on the accuser. I have no way of proving it, as my proof would be biased. If you want hard answers, you need to present hard evidence at the scale you need. Right?

Cheers.

1 Like

I don’t know of any other database that puts the burden of proving production readiness on the user.

I’m guessing that all of these context problems are not arising from petabyte-scale data systems. Just an educated guess on that part, though.

2 Likes

I think what @amaster507 is trying to say here is that this seems to be a continuing problem. Perhaps the root cause is different every time, but the problem persists regardless.

We want to get Dgraph to succeed, but having the same issue repeat itself is not giving users confidence.

I think @Damon being open about these problems gives us more confidence. You guys need to remember we have been left in the dark for far too long.

Keep us updated on this progress, on every iteration of this problem, and hopefully you guys can get it knocked out.

I think we couldn’t care less about features until we know the database will not have downtime and our data is secure. Let’s work on building confidence in both of these things first.

You guys are working with a slim team, and we appreciate all the hard work going on.

Keep us up to date, as this problem seems to be the highest priority.

J

5 Likes

@MichelDiz this is the same answer as on my last post about the server dropouts. As much as I appreciate your security concern, the least you can (and should) do is release a statement about what is causing the issue and how you are tackling the problem.

To be clear, you don’t need to give away source code insights, but you can say where the problem is and what is being done about it! I mean, AWS puts out an incident report every time a service goes down, explaining what happened.

Not being willing to do so does not inspire a lot of trust in the product! Again, I don’t want to disregard your work, but you need to understand that some of us (customers) have built products that rely on your service. We also have obligations towards our customers, so if you want Dgraph to continue as a product with lots of paying users, you should try a bit harder to involve active users a bit more.

@Damon thank you for clarifying! This is how it should be done!

4 Likes