Dgraph Cloud and Uptime

Hello Hivemind,

I am posting this to gather feedback and experience from the early adopters of Dgraph Cloud. We have recently started using Dgraph Cloud for a greenfield application but have been noticing downtimes and odd blips with the functioning of the service even when the status page says all is good. Unfortunately, this also coincided with a client demo today and these are not encouraging signs. I don’t want to self-manage Dgraph but we are not getting a good sense of reliability so far from the managed service offering. I wanted to poll the community on its Dgraph Cloud experience.

4 Likes

We have been seeing this recently as well. Not sure what exactly the deal is, other than it usually happens when we are hitting it pretty hard with a sync script. Haven’t yet been able to track down whether it is a bad query, too many requests per second, or something else.

1 Like

In our case, we have little data and no heavy usage/querying as we are still prototyping. Everyone and everything is blocked when the service is down as the database forms the core.

Hi @cjog,

Just wanted to make sure you were not having problems due to cold start times.

J

Thanks @jdgamble555. It does not seem to be a cold start issue. We currently have a shared instance and not a free version and have experienced issues dropping data and the schema not loading up. The schema not loading up has now occurred on 2 occasions in 2 weeks. Attaching a screenshot for reference. Has anyone else seen a similar problem? I imagine for a critical business application this unplanned downtime can turn out to be quite costly.

There is no mention of any downtime on the status page (https://status.dgraph.io/) for yesterday, 24 June 2021, even though we have actually experienced this 2 times in 2 weeks.

The schema not loading up is due to cold start times. I cannot speak for dropping data.

:frowning:

J

We are on a paid backend. The support person suggested that the issue had something to do with the bin-packing algorithm. Nevertheless, going by the terms of the Dgraph Cloud Service Level Agreement (SLA), the 99.9% uptime guarantee has been breached very easily for us within these 2-3 weeks, which does not give us a lot of confidence in the solidity of the offering at this point. This is also why I am polling the early adopters for their experiences.
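For a sense of scale, here is a rough sketch of what a 99.9% uptime figure allows. The 30-day measurement window is an assumption made for illustration; the actual SLA may define its period and exclusions differently, and both function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in the period at the given uptime percentage."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


def observed_uptime_percent(downtime_minutes: float, period_days: float) -> float:
    """Uptime percentage implied by a total amount of downtime in the period."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Assumed 30-day window: 99.9% leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9, 30), 1))
    # Hypothetical example: 60 minutes of total downtime over 14 days.
    print(round(observed_uptime_percent(60, 14), 3))
```

Under that assumption, a handful of multi-minute outages in a few weeks is enough to use up the whole budget.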

I don’t have any specific statistics; however, I work on a project using Dgraph Cloud ~3 times a week for about 3 hours each session. I have noticed that the service goes down for about 2-5 minutes a couple of times every odd session.

1 Like

@charklewis @cjog You guys are both referring to cold start times. Even paid tiers have this unless you pay for a dedicated instance.

This is normal. This is the same way Google Cloud Run works, for instance.

J

@jdgamble555 Based on the answer I got from the support person and the downtime we experienced I am pretty sure the issue was not related to a cold start.

I set a monitor on ours a few days ago that checks to see if 1) it is up and 2) if the instance has been restarted since the last check (via the uptime param on the /health endpoint), and it has been triggered 4x in the last 2 days with extremely light usage. Often it crashes when querying a single field in a single type.
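A minimal sketch of the restart check described above, for anyone who wants to set up something similar. It assumes the /health response has already been fetched and parsed into a dict with `status` and `uptime` (seconds) fields; that payload shape and the `evaluate_health` helper are assumptions for illustration, not the actual monitor.

```python
from typing import Optional


def evaluate_health(entry: Optional[dict], previous_uptime_s: Optional[float]) -> dict:
    """Compare one health sample against the uptime seen at the previous check.

    `entry` is None when the health request failed or timed out.
    A restart is inferred when the reported uptime has gone backwards.
    """
    if entry is None:
        # Instance unreachable: keep the last known uptime for the next comparison.
        return {"up": False, "restarted": False, "uptime": previous_uptime_s}

    uptime = float(entry["uptime"])
    restarted = previous_uptime_s is not None and uptime < previous_uptime_s
    return {
        "up": entry.get("status") == "healthy",
        "restarted": restarted,
        "uptime": uptime,
    }


if __name__ == "__main__":
    first = evaluate_health({"status": "healthy", "uptime": 5400}, None)
    # Uptime dropped from 5400 s to 120 s, so the instance restarted in between.
    second = evaluate_health({"status": "healthy", "uptime": 120}, first["uptime"])
    print(first, second)
```

Comparing uptime values catches restarts that happen entirely between two checks, which a plain up/down probe would miss.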

Recently it has only been down for about 3 minutes each time, but it has been as long as 7-8 minutes in recent history.

We were also hunting for bad queries, but it doesn’t sound like that is the issue if all of you are seeing the same things.

Makes me nervous for sure.

1 Like

We recently spun up two Dgraph Cloud environments, one for staging that is a single instance, and one for production that is a high-availability cluster (3 zeros and 3 alphas, costing about $2,400 per month).

Our API server runs on Node.js, and we’ve seen an alarming number of timeouts from the dgraph-js client in both staging and production. And this is with very light load—fewer than 30 users in our closed beta. The client seems to get into a zombie state where it either stops responding, or it’s not getting a response from Dgraph Cloud. When this happens, all requests to Dgraph hang for as long as 16-17 minutes, then Dgraph finally wakes up and dumps dozens and dozens of errors into our logs. We have an open ticket with Dgraph support about this.
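One general mitigation for requests that hang this long is a client-side deadline, so the caller fails fast instead of waiting on the connection. The sketch below shows the idea in Python with `asyncio.wait_for`; it is not dgraph-js code, `with_deadline` is a hypothetical helper, and it does not address whatever is causing the hang.

```python
import asyncio
from typing import Any, Awaitable


async def with_deadline(operation: Awaitable[Any], timeout_s: float) -> Any:
    """Await `operation`, cancelling it if it has not finished within `timeout_s` seconds."""
    try:
        return await asyncio.wait_for(operation, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface a clear error to the caller instead of hanging indefinitely.
        raise TimeoutError(f"database call exceeded {timeout_s}s deadline") from None


async def _demo() -> None:
    async def slow_query() -> str:
        # Stand-in for a database call that never comes back in time.
        await asyncio.sleep(1)
        return "result"

    try:
        await with_deadline(slow_query(), timeout_s=0.05)
    except TimeoutError as exc:
        print("gave up:", exc)


if __name__ == "__main__":
    asyncio.run(_demo())
```

With a deadline in place, the API server can return an error or retry within seconds rather than leaving every request blocked.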

Assuming it’s an issue with the gRPC client, we’ve just finished migrating to the dgraph-js-http HTTP client. We’re hoping it will perform better, but we’re definitely nervous about the instability we’ve seen with Dgraph Cloud thus far.

1 Like

Hey guys,

Just wanted to address some of the issues raised here.

Downtimes: we were using an out-of-the-box machine scheduler, which wasn’t working as well as we expected. We’re now switching to a homegrown scheduler, which does a much better job, and we’re rolling it out to various regions this week. US West will be the last one, since it’s our biggest region; we expect to finish it by the end of the week. So, starting next week, you guys should see much better uptimes.

Also, feel free to reach out to us if you experienced any downtimes – we’d be able to issue discounts if it went beyond the 99.9% SLA we guarantee.

For the dgraph-js client, we are seeing some issues, and we think we have figured out a probable cause. I’ll circle back with more information once we have confirmed it.

5 Likes

Thank you for addressing this major concern. No need for any refund on our account.

1 Like