Incident Management Reports: Slash GraphQL

This post will hold Incident Management notes for production breakages that cause downtime and affect users. Each report will use the following format:

  1. Incident
  2. Why it happened?
  3. Who was affected by it?
  4. How do we ensure it does not happen again?

Incident Report July 28, 2020

1. Incident
During the release of the gRPC changes, we realised midway that the introspection query was broken. As a result, the Docs section in the API Explorer was not loading, and GraphQL clients like Altair were throwing a "no schema added" error.

2. Why it happened?
During the release, we updated the Alpha image using the slash tag. We assumed that this tag pointed to a build of the most recent master. That assumption was wrong, and it caused an old bug (already fixed in the latest master) to show up in production.

3. Who was affected by it?
All users who were using the Slash GraphQL service between 3:30 pm and 4:40 pm IST were affected by this incident. The incident lasted a bit over an hour (about 70 minutes).

4. How do we ensure it does not happen again?
There are two things that we will do to make sure this does not happen again.
First, we will check the slash image's build date before we update the Alpha images, then deploy the image to staging and test it thoroughly for regressions.
Second, we will add the introspection query to our API test suite that runs in CI, so that we know as soon as the query fails.
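
For anyone wanting to add a similar check to their own CI, here is a minimal sketch of what an introspection test could look like. It is not our actual test suite: the function names, the endpoint argument, and the choice of Python's standard library are all assumptions for illustration. The only parts taken from the GraphQL spec are the `__schema` introspection field and the `data`/`errors` response shape.

```python
import json
import urllib.request

# Small introspection query: enough to prove the server exposes a schema.
INTROSPECTION_QUERY = "query { __schema { queryType { name } types { name } } }"


def build_request(endpoint: str) -> urllib.request.Request:
    """Build the POST request that carries the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def schema_is_present(response: dict) -> bool:
    """Return True only if the response has no errors and lists schema types."""
    if response.get("errors"):
        return False
    schema = (response.get("data") or {}).get("__schema") or {}
    return bool(schema.get("types"))


def check_endpoint(endpoint: str, timeout: float = 10.0) -> bool:
    """Send the introspection query to a live endpoint (hypothetical URL)."""
    with urllib.request.urlopen(build_request(endpoint), timeout=timeout) as resp:
        return schema_is_present(json.load(resp))
```

A CI job could call `check_endpoint` against the staging backend after each deploy and fail the build when it returns False. Keeping `schema_is_present` separate from the network call means the pass/fail logic can be unit tested without a running server.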


Incident Report August 5, 2020

1. Incident
When releasing OAuth logins, we introduced an error: users signing up with an invite were unable to view or use their credits, because their credit balance was set to zero.

2. Why it happened?
We missed a use case when implementing and testing the OAuth changes, and our regression suite did not catch the error. We did not run this scenario manually either: the case was assumed to be covered by our automation suite, so no further manual tests were done.

3. Who was affected by it?
All users who signed up on Slash with an invite between 30th July and 5th August were affected by this incident. We were able to verify that a single user faced this problem. The reported problem has now been fixed and the user has been provided with 10K credits.

4. How do we ensure it does not happen again?
We need to add more tests at both the unit test and end-to-end levels so that such issues are caught automatically. We also need to be more vigilant about regression test cases in all manual test runs.


Thanks for the write-up. I’m concerned that this was not tested manually by @gja and/or @Sankalan13.

Hey Manish,
It was a miss on my end. I did not add invite signup as part of the test run for the OAuth changes. I will make sure that these critical parts are properly tested before release.
Again, apologies to the users for the negligence.


Thanks @Sankalan13 for owning up to that, but it’s the entire team that needs to take responsibility here. This was a slip-up on multiple points, including:

  • Automated coverage didn’t pick this up
  • Not everyone on the team understands all the points of integration well enough to test them
  • Code review was not thorough enough
  • The post-deploy sanity check was not followed thoroughly enough

So there are lessons for everyone here.
