Sentry Integration With Dgraph

What is Sentry?

Sentry is a service that lets applications send arbitrary events, messages, exceptions, and breadcrumbs (logs) to your Sentry account. In the simplest terms it is a dial-home service, but it also has a rich feature set, including event filtering, data scrubbing, several SDKs, custom tags, environment and release tagging, and integrations with Slack, GitHub, and others.

Events are the smallest unit of transaction between an application and Sentry servers.

Sentry at Dgraph Requirements

At Dgraph, we wanted a way to capture panics that are seen in the field. The chief goal was to capture not just manual panics in the code but also runtime panics (such as index out of bounds) anywhere in the application. This gives us the following benefits:

  • Panics are reported to us in near real time, even before a customer reaches out to support.
  • The event carries the panic stack trace, which lets us pinpoint the exact location of the panic. Engineers can then quickly triage and potentially fix the panic/bug, and do a patch release depending on the severity.
  • Support can proactively reach out to affected customers and let them know about the issue even before they report it.

To that end, the basic requirements are as follows:

  • Simple dial-home framework
  • Easy to integrate with Dgraph with minimal performance overhead
  • Rich event data sent back
  • Capture manual and runtime panics
  • Notifications for each issue, preferably over Slack
  • Basic Integration

We chose Sentry for its rich feature set and simple-to-use SDK.

Panics (runtime and manual)

For manual panics anywhere in the code, the sentry.CaptureException() API is called before the panic is raised.

For runtime panics, Sentry does not provide a native capture method. After researching this, and on the recommendation of Sentry's developers, we chose to use a wrapper process to capture these panics. Whenever a Dgraph instance is started, a second monitoring process is started whose only job is to watch the stderr of the monitored process for panics. When a panic is seen, it is reported to Sentry via the CaptureException API.

Reporting

Each event carries a timestamp and is tagged with the release version, the environment, and additional tags, as explained below.

Release:

    This is the release version string of the Dgraph instance.

Environments:

    We have defined 4 environments, in two groups:

        dev-oss / dev-enterprise: These are events seen on non-released / local developer builds.

        prod-oss / prod-enterprise: These are events seen on released versions such as v20.03.0. Events in this category are also sent to the sentry-events Slack channel.

Tags:

    dgraph: This tag can have values “zero” or “alpha” depending on which sub-command saw the panic/exception.
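The release, environment, and tag values described above could be derived along the following lines. This is a sketch under stated assumptions: the `eventMeta` type and `newEventMeta` function are hypothetical, and treating only plain `vX.Y.Z` version strings as released builds is a guess at how "released" might be detected, not the rule Dgraph uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// releaseRe matches plain release versions such as v20.03.0. Treating every
// other version string as a developer build is an assumption of this sketch.
var releaseRe = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// eventMeta holds the values attached to every reported event.
type eventMeta struct {
	Release     string
	Environment string
	Tags        map[string]string
}

// newEventMeta picks one of the four environments from the version string and
// the build flavor, and records the sub-command ("zero" or "alpha") as a tag.
func newEventMeta(version string, enterprise bool, subcommand string) eventMeta {
	env := "dev-"
	if releaseRe.MatchString(version) {
		env = "prod-"
	}
	if enterprise {
		env += "enterprise"
	} else {
		env += "oss"
	}
	return eventMeta{
		Release:     version,
		Environment: env,
		Tags:        map[string]string{"dgraph": subcommand},
	}
}

func main() {
	m := newEventMeta("v20.03.0", false, "alpha")
	fmt.Println(m.Release, m.Environment, m.Tags["dgraph"])
}
```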

Configuration

A new flag, “enable_sentry”, is introduced for zero and alpha. It turns the sending of events to Sentry on or off entirely. The default is on.
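A boolean flag that defaults to on can be sketched with Go's standard flag package as shown below. Only the flag name and its default come from this post; the `parseEnableSentry` helper, the help text, and the use of the stdlib flag package are assumptions for illustration and may differ from Dgraph's actual flag handling.

```go
package main

import (
	"flag"
	"fmt"
)

// parseEnableSentry parses command-line arguments and returns whether Sentry
// reporting is enabled. The flag defaults to true, so reporting is on unless
// the operator opts out explicitly.
func parseEnableSentry(args []string) (bool, error) {
	fs := flag.NewFlagSet("alpha", flag.ContinueOnError)
	enabled := fs.Bool("enable_sentry", true, "send panic events to Sentry")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	on, _ := parseEnableSentry(nil)
	off, _ := parseEnableSentry([]string{"--enable_sentry=false"})
	fmt.Println(on, off)
}
```

With this style of flag, opting out looks like `--enable_sentry=false` on the command line.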

Known Issues / TODO

The reported panics are all grouped into the same Sentry issue. This is because Sentry's grouping algorithm is based on the backtrace leading up to the call to the CaptureException API. However, as explained above, for runtime panics this backtrace is always that of the wrapper process. We still capture the panic and its stack trace as a “message” on the event.

Data scrubbing is largely left at its defaults. In most cases this should be fine. See the “Data Handling” section below.

Data Handling

As Dgraph starts reporting panics and events to Sentry, there will inevitably be questions from our customers about exactly what data is sent and how it is protected. To address those concerns, here are the safeguards we have in place:

  • Event Selection: As of now, only panic events are sent to Sentry from Dgraph.

  • Data in Transit: Events sent from the SDK to the Sentry server are encrypted on the wire with industry-standard TLS using a 256-bit AES cipher.

  • Data at Rest: Events on the Sentry server are also encrypted with a 256-bit AES cipher. Sentry is hosted on GCP, so physical access is tightly controlled. Logical access is available only to Sentry-approved personnel.

  • Data Retention: Sentry stores events for only 90 days, after which they are removed permanently.

  • Data Scrubbing

    • SDK Scrubbing: Currently, we don't do any scrubbing on the SDK side before sending an event.

    • Server Side Scrubbing: The Data Scrubber option (default: on) in Sentry's settings keeps PII from being stored on Sentry's servers by automatically removing any values that look like they contain sensitive information. It matches on the following strings:

        password
      
        secret
      
        passwd
      
        api_key
      
        apikey
      
        access_token
      
        auth_token
      
        credentials
      
        mysql_pwd
      
        stripetoken
      
        card[number]
      
        IP addresses
      
  • Configuration Control: Sentry reporting is on by default. However, starting from v20.03.1 and v20.07.0, there is a configuration flag “enable_sentry” which can be used to turn off Sentry event reporting completely.
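
To make the string-based matching in the Server Side Scrubbing item more concrete, here is a minimal sketch of filtering by key name. It is not Sentry's implementation: the `scrub` function, the case-insensitive match against key names only, and the "[Filtered]" placeholder are all simplifying assumptions, and IP address handling is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys mirrors the strings listed in this post.
var sensitiveKeys = []string{
	"password", "secret", "passwd", "api_key", "apikey", "access_token",
	"auth_token", "credentials", "mysql_pwd", "stripetoken", "card[number]",
}

// scrub returns a copy of fields in which the value of every key that
// contains one of the sensitive strings is replaced with a placeholder.
// Matching is case-insensitive and looks at key names only.
func scrub(fields map[string]string) map[string]string {
	out := make(map[string]string, len(fields))
	for k, v := range fields {
		out[k] = v
		lk := strings.ToLower(k)
		for _, s := range sensitiveKeys {
			if strings.Contains(lk, s) {
				out[k] = "[Filtered]"
				break
			}
		}
	}
	return out
}

func main() {
	clean := scrub(map[string]string{
		"db_password": "hunter2",
		"host":        "localhost",
	})
	fmt.Println(clean["db_password"], clean["host"])
}
```

The same shape of function could be run on the SDK side before an event is sent, which would address the SDK Scrubbing gap noted above.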

References:

https://sentry.io/security/

https://docs.sentry.io/data-management/sensitive-data/

Why don’t we convert this into a blog post? CC: @katharine

Created a Jira ticket so won’t forget: https://dgraph.atlassian.net/jira/software/projects/DEVRELTASK/boards/29/backlog?selectedIssue=DEVRELTASK-229

I have this as one of the 3 blogs that I plan to write.
