[RFC] GraphQL API for long-running tasks

Motivation

Dgraph’s current backup/export API makes the user wait until the whole operation is complete. While this synchronous behaviour is great for smaller databases, the request times out if the operation takes longer than 10 minutes; after that, the only way to learn its status is to check the logs. As @dmai pointed out, this starts occurring when the p directory is ~50 GB (see here).

It’s generally not considered good API design to make the user wait for long-running operations. Not only is it an inconvenience for a user who doesn’t want to wait around for the operation to complete, it also ties up an HTTP connection for no good reason. In fact, the timeout is enforced server-side, to prevent slow clients from leaking HTTP connections.

Architecture

Long-running tasks such as backups and exports will be added to a task queue when created. Creating a task immediately returns an ID to the user, who can then use this ID to get the current status of the task. Since none of these are blocking operations, they will not experience the problems mentioned above.
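To make the flow concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how Dgraph would implement it: the `TaskQueue` class and its `enqueue`, `status` and `shutdown` methods are hypothetical names, and the status values simply mirror the ones proposed in this RFC.

```python
import enum
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class Status(enum.Enum):
    # Mirrors the status values proposed in this RFC.
    QUEUED = "Queued"
    RUNNING = "Running"
    ERROR = "Error"
    SUCCESS = "Success"


class TaskQueue:
    """Hypothetical in-memory task queue (illustration only)."""

    def __init__(self, workers=1):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._lock = threading.Lock()
        self._status = {}

    def enqueue(self, fn):
        """Schedule fn and return a task ID without waiting for it to run."""
        task_id = uuid.uuid4().hex
        with self._lock:
            self._status[task_id] = Status.QUEUED
        self._pool.submit(self._run, task_id, fn)
        return task_id

    def _run(self, task_id, fn):
        # Executed on a worker thread: record each state change as it happens.
        self._set(task_id, Status.RUNNING)
        try:
            fn()
        except Exception:
            self._set(task_id, Status.ERROR)
        else:
            self._set(task_id, Status.SUCCESS)

    def _set(self, task_id, status):
        with self._lock:
            self._status[task_id] = status

    def status(self, task_id):
        """Return the current status of a task; never blocks on the task."""
        with self._lock:
            return self._status[task_id]

    def shutdown(self):
        """Wait for all queued tasks to finish, then stop the workers."""
        self._pool.shutdown(wait=True)
```

A caller would do something like `task_id = queue.enqueue(run_backup)` and then poll `queue.status(task_id)` whenever it wants an update, so no HTTP connection has to stay open while the backup runs.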

When querying a Task from the queue, the schema would be as follows:

interface Task @withSubscription {
    id: ID!
    status: Status!
}

enum Status {
    Queued
    Running
    Error
    Success
}

type BackupTask implements Task {
    # etc.
}

type ExportTask implements Task {
    # etc.
}

Successfully completed tasks will be automatically deleted after two weeks. If a task has failed, it will not be deleted unless the user manually deletes it.
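The retention rule could look roughly like the following Python sketch. The `TaskRecord` type, its `finished_at` field and the `prune` function are assumptions made up for this example; the RFC only specifies the two-week window and that failed tasks are kept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RETENTION = timedelta(weeks=2)  # the two-week window from this RFC


@dataclass
class TaskRecord:
    # Hypothetical record shape; only id and status appear in the RFC schema.
    id: str
    status: str  # "Queued", "Running", "Error" or "Success"
    finished_at: Optional[datetime] = None


def prune(tasks, now):
    """Return the tasks that should still be kept at time `now`."""
    kept = []
    for t in tasks:
        expired = (
            t.status == "Success"
            and t.finished_at is not None
            and now - t.finished_at >= RETENTION
        )
        # Failed, queued and running tasks are never expired automatically.
        if not expired:
            kept.append(t)
    return kept
```

Making `RETENTION` a configuration value rather than a constant would be a small change to this sketch.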

User Impact

In order to keep it user-friendly:

  • The old synchronous API will remain as is - it works very well for smaller databases.
  • A window can be added to Ratel to monitor the progress of currently running tasks, and inspect failed tasks.

Further Reading

cc: @dmai @Paras

Here is a related issue for a similar use case: Show query progress, specifically for long-running queries.

So, will there be two different backup APIs, synchronous and asynchronous?
Users may be confused about which one to use. Can we just have a single asynchronous backup API?
Moreover, can we also add an API to stop/kill an ongoing task?

Is the 2-week window there to allow a user to query the status of a task? We could make it configurable. Also, why keep failed tasks forever? I think we can just use 2 weeks (or whatever is configured) for those as well.

We could reuse the API and add an optional flag (async = true/false, default false, for example). This will not break existing users and still allows for the async behavior.
On another note, we are allowed to make breaking changes in major releases as well.

Linking a related feature request from a user: Export duration.

enum Status {
    Queued
    Running
    Error
    Success
}

I have implemented a state machine solution for orchestrating a sequence of tasks in the MDM demo. It’s fairly easy to configure and is reusable. Here is the post.