Incident Management Reports: Slash GraphQL

Sankalan13 · July 29, 2020, 12:27pm

This thread holds incident management reports for production breakages that cause downtime and affect users. Each report uses the following format:

Incident
Why it happened?
Who was affected by it?
How do we ensure it does not happen again?

Sankalan13 · July 29, 2020, 12:49pm

Incident Report July 28, 2020

1. Incident
Midway through the release of the gRPC changes, we realised that the introspection query was broken. As a result, the Docs section in the API Explorer was not loading, and GraphQL clients such as Altair were throwing a "no schema added" error.

2. Why it happened?
During the release, we updated the Alpha image with the slash tag. We assumed that this tag pointed to the most recent master. That assumption was wrong, and an old bug (already fixed in the latest master) showed up in production.

3. Who was affected by it?
All users who were using the Slash GraphQL service between 3:30 pm and 4:40 pm IST were affected by this incident. The incident lasted a bit over an hour.

4. How do we ensure it does not happen again?
We will do two things to make sure this does not happen again.
Firstly, we will check the build date of the slash image before we update the Alpha images, deploy it in staging, and thoroughly test it for regressions.
Secondly, we will add the introspection query to our API test suite, which runs on our CI. If the query fails, we will know about it.
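
As a rough illustration of the two checks above, here is a minimal Python sketch. It is not the team's actual tooling: the function names, the maximum image age, and the endpoint passed to `run_introspection` are all assumptions made for this example. It also assumes the image's build timestamp is an RFC 3339 string in UTC, such as the `Created` field reported by `docker inspect`.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Minimal introspection query; a spec-compliant GraphQL server should answer it.
INTROSPECTION_QUERY = "{ __schema { queryType { name } } }"


def parse_created(created: str) -> datetime:
    """Parse an RFC 3339 timestamp to second precision (assumed to be UTC)."""
    # Keep only "YYYY-MM-DDTHH:MM:SS"; drop fractional seconds and the "Z".
    parsed = datetime.strptime(created[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def image_is_fresh(created: str, now: datetime, max_age: timedelta) -> bool:
    """Return True if the image was built at most max_age before now."""
    age = now - parse_created(created)
    # A negative age (build date in the future) is treated as not fresh.
    return timedelta(0) <= age <= max_age


def introspection_ok(response: dict) -> bool:
    """Return True if an introspection response carries a usable schema."""
    if response.get("errors"):
        return False
    schema = (response.get("data") or {}).get("__schema")
    return isinstance(schema, dict) and schema.get("queryType") is not None


def run_introspection(endpoint: str, timeout: float = 10.0) -> dict:
    """POST the introspection query to a GraphQL endpoint and decode the reply."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

In a CI job, the build timestamp could be read with `docker inspect --format '{{.Created}}' <image>` and passed to `image_is_fresh`, and the test could assert `introspection_ok(run_introspection(url))` against a staging endpoint, so that a broken schema fails the build instead of reaching production.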

Sankalan13 · August 5, 2020, 8:19am

Incident Report August 5, 2020

1. Incident
While releasing OAuth logins, we introduced an error: users signing up with an invite were unable to view or use their credits (the balance was set to zero).

2. Why it happened?
We missed a use case when implementing and testing the OAuth changes, and our regression suite did not catch the error. We did not run this scenario manually either: the case was assumed to be covered by our automation suite, so no further manual tests were done.

3. Who was affected by it?
All users who signed up on Slash with an invite between 30th July and 5th August were affected by this incident. We were able to verify that a single user faced this problem. The reported problem has now been fixed, and the user has been provided with 10K credits.

4. How do we ensure it does not happen again?
We need to add more tests at both the unit-test and end-to-end levels so that such issues are caught automatically. We also need to be more vigilant about regression test cases in all manual test runs.

mrjn · August 5, 2020, 1:27pm

Thanks for the write-up. I’m concerned that this was not tested manually by @gja and/or @Sankalan13 .

Sankalan13 · August 5, 2020, 1:35pm

Hey Manish,
It was a miss on my end. I did not include invite signup in the test run for the OAuth changes. I will make sure that these critical parts are properly tested before release.
Again, apologies to the users for the negligence.

gja · August 5, 2020, 5:33pm

Thanks @Sankalan13 for owning up to that, but it’s the entire team that needs to take responsibility here. This is a slip-up on multiple points, including:

Automated coverage didn’t pick this up
Ensuring everyone in the team understands all the points of integration so that they can test
More thorough code review
Following the post deploy sanity more thoroughly

So there are lessons for everyone here.
