What is Sentry?
Sentry is a service that lets applications send arbitrary events, messages, exceptions, and breadcrumbs (logs) to your Sentry account. In simplest terms, it is a dial-home service, but it also has a rich feature set including event filtering, data scrubbing, several SDKs, custom tagging, environment and release tagging, and integration with Slack, GitHub, etc.
Events are the smallest unit of transaction between an application and Sentry servers.
Sentry at Dgraph Requirements
At Dgraph, we wanted a way to capture panics that are seen in the field. The chief requirement was to capture not just manual panics in the code but also runtime panics (such as index out of bounds) across the entire application. This gives us the following benefits:
- Panics are reported to us in near real-time, even before a customer reaches out to support.
- The event contains the panic stack trace, which allows us to pinpoint the exact location of the panic.
- Engineers can quickly triage and potentially fix the panic/bug, and do a patch release based on the severity.
- Support can proactively reach out to affected customers to let them know about the issue even before they report it.
To that end, the basic requirements are as follows:
- Simple dial-home framework
- Easy to integrate with Dgraph with minimal performance overhead
- Rich event data sent back
- Capture manual and runtime panics
- Notifications, preferably over Slack, for each issue
Basic Integration
We chose Sentry for its rich feature set and simple-to-use SDK.
Panics (runtime and manual)
For manual panics anywhere in the code, the sentry.CaptureException() API is called.
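The manual path can be sketched as follows. This is a minimal illustration, not Dgraph's actual code: `captureFunc` is a hypothetical stand-in for sentry.CaptureException (so the example builds without the Sentry SDK), and `panicWithReport` and `errExample` are names invented for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// captureFunc is a hypothetical stand-in for sentry.CaptureException so the
// sketch compiles without the Sentry SDK.
type captureFunc func(err error)

// errExample is an illustrative error value, not one taken from Dgraph.
var errExample = errors.New("example: unexpected nil posting list")

// panicWithReport reports err through capture and then panics with it.
// A nil err is ignored so callers can pass results straight through.
func panicWithReport(capture captureFunc, err error) {
	if err == nil {
		return
	}
	capture(err) // report first, because the panic below unwinds the stack
	panic(err)
}

func main() {
	var reported []error
	capture := func(err error) { reported = append(reported, err) }

	// Recover only so this demo can print what happened; a real server
	// would normally let the panic terminate the process.
	defer func() {
		r := recover()
		fmt.Printf("recovered: %v; events reported: %d\n", r, len(reported))
	}()
	panicWithReport(capture, errExample)
}
```

Reporting before panicking matters here: once the panic starts unwinding, there is no guarantee that later code in the same function will run.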
For runtime panics, Sentry does not have a native method. After researching this, and on the recommendation of Sentry’s developers, we chose to capture these panics with a wrapper process. The basic idea is that whenever a Dgraph instance is started, a second, monitoring process is also started whose only job is to watch the stderr of the monitored process for panics. When a panic is seen, it is reported back to Sentry via the CaptureException API.
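The detection step of such a wrapper can be sketched as below. This is an illustration under stated assumptions, not Dgraph's implementation: it treats the first stderr line beginning with "panic: " or "fatal error: " (the prefixes the Go runtime prints) as the start of the report and keeps everything after it. The names `scanForPanic`, `panicIn`, and `sampleStderr` are invented for the sketch. In a real wrapper the reader would come from the child process's stderr pipe, and the returned text would be handed to the CaptureException API.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// sampleStderr imitates what a crashing Go process writes to stderr.
const sampleStderr = "I0401 starting alpha\n" +
	"panic: runtime error: index out of range [5] with length 3\n" +
	"\n" +
	"goroutine 1 [running]:\n" +
	"main.main()\n"

// scanForPanic reads a stderr stream to its end. It returns every line from
// the first panic marker onward, plus a flag saying whether one was found.
// Treating all remaining output as the stack trace is a simplification.
func scanForPanic(stderr io.Reader) (string, bool) {
	var report strings.Builder
	found := false
	sc := bufio.NewScanner(stderr)
	// Allow lines up to 1 MiB; the default scanner limit is 64 KiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if !found && (strings.HasPrefix(line, "panic: ") ||
			strings.HasPrefix(line, "fatal error: ")) {
			found = true
		}
		if found {
			report.WriteString(line)
			report.WriteByte('\n')
		}
	}
	return report.String(), found
}

// panicIn is a convenience wrapper for output that is already in memory.
func panicIn(output string) (string, bool) {
	return scanForPanic(strings.NewReader(output))
}

func main() {
	report, ok := panicIn(sampleStderr)
	fmt.Printf("panic found: %v\n%s", ok, report)
}
```

Because the monitor only reads text, it works for panics raised by the Go runtime itself, which the monitored process has no chance to report on its own.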
Each event is tagged with the release version, environment, timestamp, and tags, as explained below.
Release: This is the release version string of the Dgraph instance.
Environments: We have defined 4 environments.
- dev-oss / dev-enterprise: These are events seen on non-released / local developer builds.
- prod-oss / prod-enterprise: These are events on released versions such as v20.03.0. Events in this category are also sent to the sentry-events Slack channel.
Tags:
- dgraph: This tag can have values “zero” or “alpha” depending on which sub-command saw the panic/exception.
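One way to derive these values is sketched below. The rule for telling a released build from a developer build is an assumption made for this example (a version string of the exact form vNN.NN.N counts as released), as is passing the enterprise status in as a boolean; `environment`, `eventTags`, and `releasePattern` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// releasePattern is an assumption for this sketch: a released build carries
// a plain version such as v20.03.0, with no suffix.
var releasePattern = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// environment returns one of the four environment names:
// dev-oss, dev-enterprise, prod-oss, or prod-enterprise.
func environment(version string, enterprise bool) string {
	stage := "dev"
	if releasePattern.MatchString(version) {
		stage = "prod"
	}
	edition := "oss"
	if enterprise {
		edition = "enterprise"
	}
	return stage + "-" + edition
}

// eventTags builds the tag set for an event. The "dgraph" tag records which
// sub-command ("zero" or "alpha") saw the panic.
func eventTags(subcommand string) map[string]string {
	return map[string]string{"dgraph": subcommand}
}

func main() {
	fmt.Println(environment("v20.03.0", false))
	fmt.Println(environment("v20.03.0-beta-g1234abc", true))
	fmt.Println(eventTags("alpha"))
}
```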
A new flag, “enable_sentry”, is introduced for zero and alpha. It turns the sending of events to Sentry on or off entirely. The default is on.
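A flag with this behavior could be declared as in the sketch below. It uses Go's standard library flag package as a stand-in, since the text does not say which flag library Dgraph uses; `parseEnableSentry` is a hypothetical helper, and the help string is illustrative.

```go
package main

import (
	"flag"
	"fmt"
)

// parseEnableSentry parses command-line arguments and reports whether
// Sentry reporting is enabled. The flag defaults to true (on).
func parseEnableSentry(args []string) (bool, error) {
	fs := flag.NewFlagSet("alpha", flag.ContinueOnError)
	enabled := fs.Bool("enable_sentry", true,
		"Turn on/off sending events to Sentry.")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	// A real program would pass os.Args[1:]; a literal keeps the demo fixed.
	enabled, err := parseEnableSentry([]string{"--enable_sentry=false"})
	fmt.Println(enabled, err)
}
```

With the flag set to false, the program would skip both SDK initialization and the start of the monitoring process, so no events leave the machine.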
Known Issues / TODO
The reported panics are all grouped into the same Sentry issue. This is because Sentry’s grouping algorithm is based on the backtrace leading up to the call to the CaptureException API. However, as explained above, for runtime panics this backtrace is always that of the wrapper process. We still capture the panic and its stack trace as a “message” on the event.
Data scrubbing is largely kept at its defaults. This should be fine in most cases. See the “Data Handling” section below.
Data Handling
As Dgraph starts reporting panics and events to Sentry, there will inevitably be questions from our customers about exactly what data is sent and how it is protected. To address those concerns, here are the safeguards that we have:
Event Selection: As of now, only panic events are sent to Sentry from Dgraph.
Data in Transit: Events sent from the SDK to the Sentry server are encrypted on the wire with the industry-standard TLS protocol, using a 256-bit AES cipher.
Data at rest: Events on the Sentry server are also encrypted with a 256-bit AES cipher. Sentry is hosted on GCP, and as such physical access is tightly controlled. Logical access is available only to Sentry-approved officials.
Data Retention: Sentry stores events for only 90 days, after which they are removed permanently.
SDK Scrubbing: Currently, we don’t do any scrubbing on the SDK side before sending an event.
Server Side Scrubbing: The Data Scrubber option (default: on) in Sentry’s settings ensures that PII doesn’t get stored on Sentry’s servers. It automatically removes any values that look like they contain sensitive information, including fields that contain the following strings:
- password
- secret
- passwd
- api_key
- apikey
- access_token
- auth_token
- credentials
- mysql_pwd
- stripetoken
- card[number]
- ip addresses
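To make the idea concrete, here is a small sketch of string-based scrubbing. It is not Sentry's implementation. It assumes the match is made case-insensitively on field names, which the text does not spell out, and it leaves out IP address handling. The `scrub` function, the `sensitiveKeys` list variable, and the `filtered` placeholder are names chosen for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys holds the substrings from the list above (IP address
// handling is a separate kind of check and is not modeled here).
var sensitiveKeys = []string{
	"password", "secret", "passwd", "api_key", "apikey", "access_token",
	"auth_token", "credentials", "mysql_pwd", "stripetoken", "card[number]",
}

// filtered is the placeholder written in place of a scrubbed value.
const filtered = "[Filtered]"

// scrub returns a copy of data in which the value of every field whose
// name contains a sensitive substring is replaced by the placeholder.
// Matching on field names is an assumption of this sketch.
func scrub(data map[string]string) map[string]string {
	out := make(map[string]string, len(data))
	for key, value := range data {
		out[key] = value
		lower := strings.ToLower(key)
		for _, s := range sensitiveKeys {
			if strings.Contains(lower, s) {
				out[key] = filtered
				break
			}
		}
	}
	return out
}

func main() {
	event := map[string]string{
		"db_password": "hunter2",
		"host":        "localhost",
	}
	fmt.Println(scrub(event))
}
```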
Configuration Control: Sentry reporting is on by default. However, starting from v20.03.1 and v20.07.0, there is a configuration flag, “enable_sentry”, which can be used to turn off Sentry event reporting completely.