Background
Currently, dgraph export works by saving the exported RDF and JSON files onto the disk of the Alpha leader. This makes it difficult to retrieve the export if you do not have access to the Alpha node (as is the case for Slash). Thus, it makes sense to store the export in cloud storage such as Amazon S3 or Google Cloud Storage.
Proposed API
I propose that we accept the export destination via the /admin GraphQL endpoint. The remote location is specified in a format similar to the one backup already uses for its remote location:
mutation {
  export(input: {format: "json", destination: "s3:///<bucketname>/path"}) {
    response {
      message
      code
      files
    }
  }
}
The files output field will contain a list of files, relative to the destination path (e.g. export/dgraph.r20007.u0709.0416/g01.json.gz).
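For illustration, here is a minimal sketch of how a client could invoke this mutation against an Alpha's /admin endpoint over HTTP. The host is a placeholder, and the destination is the same placeholder used above, not a real bucket.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// The proposed export mutation, using the placeholder destination from above.
	query := `mutation {
	  export(input: {format: "json", destination: "s3:///<bucketname>/path"}) {
	    response { message code files }
	  }
	}`

	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		panic(err)
	}

	// Placeholder Alpha address; adjust to the cluster being used.
	resp, err := http.Post("http://localhost:8080/admin", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```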
Implementation
An export request is accepted by an Alpha, and each group leader exports the data present in its group. These files are written into the dgraph/export directory, with a structure that looks like this:
export/dgraph.r20007.u0709.0416/g01.json.gz
export/dgraph.r20007.u0709.0416/g01.schema.gz
Here,
- 20007 is the timestamp from the oracle
- 0709.0416 is the current time
- g01 indicates that this is for group 1
This means that the generated files will never conflict: the directory name differs between exports, and the group ID differs between files within a single export. A sketch of this naming scheme follows.
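As a rough illustration of the naming scheme (assuming the current-time component is MMDD.HHMM; the exact format strings used by Dgraph may differ), the path could be composed like this:

```go
package main

import (
	"fmt"
	"time"
)

// exportFilePath illustrates how the export path could be built from the
// oracle timestamp, the wall-clock time, and the group ID.
func exportFilePath(readTs uint64, groupId uint32, now time.Time) string {
	dir := fmt.Sprintf("dgraph.r%d.u%s", readTs, now.UTC().Format("0102.1504"))
	return fmt.Sprintf("export/%s/g%02d.json.gz", dir, groupId)
}

func main() {
	// For readTs=20007, group 1, exported on July 9 at 04:16 UTC, this prints:
	// export/dgraph.r20007.u0709.0416/g01.json.gz
	ts := time.Date(2020, time.July, 9, 4, 16, 0, 0, time.UTC)
	fmt.Println(exportFilePath(20007, 1, ts))
}
```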
After exporting these files to disk (at the end of export()), we will copy them to their final destination and then delete them from the dgraph/export directory to reclaim space. We will also need to return the names of the files written.
It may be possible to stream directly to S3 via the MinIO client, but this should be considered a nice-to-have.
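A possible shape for the copy-and-cleanup step, assuming the MinIO Go client is used for the upload (as it already is for backups). The endpoint, the environment-based credentials, and the bucket/prefix arguments are placeholders for whatever gets parsed out of the destination "s3:///<bucketname>/path"; error handling is deliberately minimal.

```go
package export

import (
	"context"
	"os"
	"path/filepath"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// uploadAndCleanup copies the locally exported files to the remote bucket,
// deletes the local copies, and returns the object names relative to the
// destination path (to be surfaced in the "files" output field).
func uploadAndCleanup(ctx context.Context, localFiles []string, bucket, prefix string) ([]string, error) {
	// Placeholder endpoint and credentials source.
	client, err := minio.New("s3.amazonaws.com", &minio.Options{
		Creds:  credentials.NewEnvAWS(),
		Secure: true,
	})
	if err != nil {
		return nil, err
	}

	var written []string
	for _, f := range localFiles {
		// Assumes local files live under dgraph/export/...; keep that layout
		// relative to the destination path.
		rel, err := filepath.Rel("dgraph", f)
		if err != nil {
			return nil, err
		}
		object := filepath.ToSlash(filepath.Join(prefix, rel))
		if _, err := client.FPutObject(ctx, bucket, object, f, minio.PutObjectOptions{}); err != nil {
			return nil, err
		}
		// Reclaim local disk space only after the upload has succeeded.
		if err := os.Remove(f); err != nil {
			return nil, err
		}
		written = append(written, rel)
	}
	return written, nil
}
```

If we later pursue the streaming nice-to-have, the FPutObject call on a finished local file would be replaced by feeding the export writer into PutObject with an io.Reader instead.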
Notes for Slash
If you are exporting data from a Slash endpoint, Slash will return pre-signed S3 URLs that allow downloading the exported files for 24 hours. Slash will likely expose a different export endpoint than Alpha does.
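Purely as an illustration (Slash's actual implementation is not part of this proposal), 24-hour pre-signed download URLs for the exported objects could be generated with the MinIO client like so:

```go
package slash

import (
	"context"
	"net/url"
	"time"

	"github.com/minio/minio-go/v7"
)

// presignExportFiles returns pre-signed GET URLs, valid for 24 hours, for each
// exported object in the bucket.
func presignExportFiles(ctx context.Context, client *minio.Client, bucket string, objects []string) ([]string, error) {
	var urls []string
	for _, obj := range objects {
		u, err := client.PresignedGetObject(ctx, bucket, obj, 24*time.Hour, url.Values{})
		if err != nil {
			return nil, err
		}
		urls = append(urls, u.String())
	}
	return urls, nil
}
```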