Remote Export in Dgraph

Background

Currently, dgraph export works by saving the exported RDF and JSON files onto the disk of the Alpha leader. This makes it difficult to retrieve the export if you do not have access to the Alpha (as is the case for Slash). Thus, it makes sense to store the export in some sort of cloud storage, such as S3 or Google Cloud Storage.

Proposed API

I propose that we accept the destination to export to via the /admin GraphQL endpoint. The remote location uses a format similar to the one backup accepts:

mutation {
  export(input: {format: "json", destination: "s3:///<bucketname>/path"}) {
    response {
      message
      code
      files
    }
  }
}

The files output field will contain a list of files, relative to the destination path (e.g. export/dgraph.r20007.u0709.0416/g01.json.gz).
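
For illustration, a successful response might look something like this (the message text and exact nesting are assumptions based on the proposed API above, not a final shape):

{
  "data": {
    "export": {
      "response": {
        "message": "Export completed.",
        "code": "Success",
        "files": [
          "export/dgraph.r20007.u0709.0416/g01.json.gz",
          "export/dgraph.r20007.u0709.0416/g01.schema.gz"
        ]
      }
    }
  }
}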

Implementation

An export request is accepted by an Alpha, and each group leader exports the data present in its group. These files are written into the dgraph/export directory, with a structure that looks like this:

export/dgraph.r20007.u0709.0416/g01.json.gz
export/dgraph.r20007.u0709.0416/g01.schema.gz

Here,

  • 20007 is the timestamp from the oracle
  • 0709.0416 is the current time
  • g01 indicates that this is for group 1

This means that generated files will never conflict: the directory name differs between exports, and the group name differs within a single export.

After exporting these files to disk (at the end of export()), we will copy them to their final destination and then delete them from the dgraph/export directory to reclaim space. We will also need to return the names of the files written.

It may be possible to stream directly to S3 via MinIO, but this should be considered a nice-to-have.

Notes for Slash

If you are exporting data from a Slash endpoint, Slash will return pre-signed S3 URLs which allow downloading the files in the export for 24 hours. Slash will likely have a different export endpoint than Alpha.

Maybe also allow exporting to a Google Cloud Storage bucket?

Good point, our MinIO client already supports Google Cloud Storage. I’ll update the RFC accordingly.
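
As a rough sketch of what that could look like, assuming export reuses backup's minio:// destination scheme and that GCS is reached through its S3-compatible endpoint (both are assumptions on my part):

mutation {
  export(input: {
    format: "json",
    # Assumed form: GCS addressed via its S3-compatible endpoint through the MinIO client.
    destination: "minio://storage.googleapis.com/<bucketname>/path"
  }) {
    response {
      message
      code
      files
    }
  }
}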

This is great - infinitely more useful for those of us who run Dgraph in Kubernetes, where getting to the disk is a huge chore. +1 for native Google Cloud Storage support. Possibly encrypt using local encryption keys before the data reaches object storage? I would probably rely on Google’s encryption, but I am sure some would rather hold the keys themselves.


BackupInput has 5 more fields apart from destination.
Should export also support those settings?
(forceFull probably doesn’t make sense, but the others can be useful.)

That way we could even separate out a DestinationDefinition (better name needed) to guarantee consistency with backup settings. I think backup and export are quite similar, so it’s reasonable for API consumers to expect those two to accept almost identical arguments.

Makes sense. The relevant fields are accessKey, secretKey, sessionToken, anonymous and destination.
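
For example (a sketch only; the final ExportInput shape may differ, and the field semantics are assumed here to mirror BackupInput):

mutation {
  export(input: {
    format: "json",
    destination: "s3:///<bucketname>/path",
    accessKey: "<access-key>",
    secretKey: "<secret-key>",
    sessionToken: "<session-token>",  # assumed: only needed for temporary credentials
    anonymous: false                  # assumed: true would skip credentials for a public bucket
  }) {
    response {
      message
      code
      files
    }
  }
}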

UPDATE: Just noticed that this might be a bit off topic. I think this post is about Slash GraphQL. I wasn’t clear, but in my opinion the idea is still valid for other cases.

I had this idea a long time ago (Temp http server for "export" feature - Making it downloadable. · Issue #2515 · dgraph-io/dgraph · GitHub), but it wasn’t related to buckets and so on. The idea was just to “make the export downloadable”, e.g.:

curl https://mysite.io/admin/dgraph.r20007.u0709.0416.gz -sSf

Maybe this logic is simpler, since it doesn’t rely on third-party tools or services.

PS. For safety, we can combine this with Poor man’s ACL.

A logical extension would then be to also allow remote (S3 or GCP bucket) imports into Dgraph via the bulk or live loader.


Has anything moved forward on this feature idea?

We’ve added support for remote exports. I will need to check the version, but it’s either 20.07 or the upcoming 20.11 release.

This is, however, available in Slash GraphQL (and the v20.07-slash branch).

Will we ever see functionality to send the exports back over the connection that requests them?

For example, if I’m using Postgres, I can run pg_dump and initiate a server data dump that’s wired back to my computer and stored there, and I can process it and tag it to my needs. Or I can send the dump to a network share. And so on. The same goes for MongoDB + mongodump.

Both the behavior of the original export, where I have to dig around on an Alpha to find where it is and then extract it (somehow!), and this, where I have to go rooting around in cloud storage, are really hard to work with.