Background
Currently, dgraph export works by saving the exported RDF and JSON files onto the disk of the Alpha leader. This makes it difficult to retrieve the export if you do not have access to the Alpha node (as is the case for Slash). Thus, it makes sense to store the export in cloud storage such as Amazon S3 or Google Cloud Storage.
Proposed API
I propose that we accept the export destination via the /admin GraphQL endpoint. The remote location is specified in a format similar to the one backup already uses for its remote location:
mutation {
  export(input: {format: "json", destination: "s3:///<bucketname>/path"}) {
    response {
      message
      code
      files
    }
  }
}
The files output field will contain a list of files, relative to the destination path (e.g. export/dgraph.r20007.u0709.0416/g01.json.gz).
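For illustration, here is a minimal sketch of how a client could invoke this mutation against an Alpha's /admin endpoint over HTTP. The host is a placeholder, and the destination is the same placeholder used above, not a real bucket.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// The proposed export mutation, using the placeholder destination from above.
	query := `mutation {
	  export(input: {format: "json", destination: "s3:///<bucketname>/path"}) {
	    response { message code files }
	  }
	}`

	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		panic(err)
	}

	// Placeholder Alpha address; adjust to the cluster being used.
	resp, err := http.Post("http://localhost:8080/admin", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```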
Implementation
An export request is accepted by an Alpha, and each group leader exports the data present in its group. These files are written into the dgraph/export directory, with a structure that looks like this:
export/dgraph.r20007.u0709.0416/g01.json.gz
export/dgraph.r20007.u0709.0416/g01.schema.gz
Here,
- 20007 is the timestamp from the oracle
- 0709.0416 is the current time
- g01 indicates that this is for group 1
This means that the generated files will never conflict: the directory name differs between exports, and the group ID differs between files within a single export. A sketch of this naming scheme follows.
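As a rough illustration of the naming scheme (assuming the current-time component is MMDD.HHMM; the exact format strings used by Dgraph may differ), the path could be composed like this:

```go
package main

import (
	"fmt"
	"time"
)

// exportFilePath illustrates how the export path could be built from the
// oracle timestamp, the wall-clock time, and the group ID.
func exportFilePath(readTs uint64, groupId uint32, now time.Time) string {
	dir := fmt.Sprintf("dgraph.r%d.u%s", readTs, now.UTC().Format("0102.1504"))
	return fmt.Sprintf("export/%s/g%02d.json.gz", dir, groupId)
}

func main() {
	// For readTs=20007, group 1, exported on July 9 at 04:16 UTC, this prints:
	// export/dgraph.r20007.u0709.0416/g01.json.gz
	ts := time.Date(2020, time.July, 9, 4, 16, 0, 0, time.UTC)
	fmt.Println(exportFilePath(20007, 1, ts))
}
```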
After exporting these files to disk (at the end of export()), we will copy them to their final destination and then delete them from the dgraph/export directory to reclaim space. We will also need to return the names of the files written.
It may be possible to stream directly to S3 via the MinIO client, but this should be considered a nice-to-have.
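A possible shape for the copy-and-cleanup step, assuming the MinIO Go client is used for the upload (as it already is for backups). The endpoint, the environment-based credentials, and the bucket/prefix arguments are placeholders for whatever gets parsed out of the destination "s3:///<bucketname>/path"; error handling is deliberately minimal.

```go
package export

import (
	"context"
	"os"
	"path/filepath"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// uploadAndCleanup copies the locally exported files to the remote bucket,
// deletes the local copies, and returns the object names relative to the
// destination path (to be surfaced in the "files" output field).
func uploadAndCleanup(ctx context.Context, localFiles []string, bucket, prefix string) ([]string, error) {
	// Placeholder endpoint and credentials source.
	client, err := minio.New("s3.amazonaws.com", &minio.Options{
		Creds:  credentials.NewEnvAWS(),
		Secure: true,
	})
	if err != nil {
		return nil, err
	}

	var written []string
	for _, f := range localFiles {
		// Assumes local files live under dgraph/export/...; keep that layout
		// relative to the destination path.
		rel, err := filepath.Rel("dgraph", f)
		if err != nil {
			return nil, err
		}
		object := filepath.ToSlash(filepath.Join(prefix, rel))
		if _, err := client.FPutObject(ctx, bucket, object, f, minio.PutObjectOptions{}); err != nil {
			return nil, err
		}
		// Reclaim local disk space only after the upload has succeeded.
		if err := os.Remove(f); err != nil {
			return nil, err
		}
		written = append(written, rel)
	}
	return written, nil
}
```

If we later pursue the streaming nice-to-have, the FPutObject call on a finished local file would be replaced by feeding the export writer into PutObject with an io.Reader instead.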
Notes for Slash
If you are exporting data from a Slash endpoint, Slash will return pre-signed S3 URLs that allow downloading the exported files for 24 hours. Slash will likely expose a different export endpoint than Alpha does.
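Purely as an illustration (Slash's actual implementation is not part of this proposal), 24-hour pre-signed download URLs for the exported objects could be generated with the MinIO client like so:

```go
package slash

import (
	"context"
	"net/url"
	"time"

	"github.com/minio/minio-go/v7"
)

// presignExportFiles returns pre-signed GET URLs, valid for 24 hours, for each
// exported object in the bucket.
func presignExportFiles(ctx context.Context, client *minio.Client, bucket string, objects []string) ([]string, error) {
	var urls []string
	for _, obj := range objects {
		u, err := client.PresignedGetObject(ctx, bucket, obj, 24*time.Hour, url.Values{})
		if err != nil {
			return nil, err
		}
		urls = append(urls, u.String())
	}
	return urls, nil
}
```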