Unable to Deploy Dgraph Live Loader to Load .rdf.gz files from AWS S3

Hi all,

I want to use the Dgraph live loader to load .rdf files into a server. It works fine when I manually download the .rdf.gz file, unzip it, and run the live loader locally with the following command:

dgraph live -f g01.rdf -a alpha:9080

However, when I try to load g01.rdf.gz directly from AWS S3, it always reports that there are no files in the folder. Here is the command I tried:

dgraph live -C -f s3:///bucket-name/directory-with-rdf -a alpha:9080

I have set my access key ID and secret access key as environment variables, so this part should be fine. Can anyone help me with this? Thanks!

@joaquin are you able to help with this? Thanks.

@MichelDiz Acknowledging this. I was out several days.

@mattZhang17 I am assuming this is with v21.03.0. I will try this out locally and follow up.

@mattZhang17 I think I found the problem and solution. The S3 URI has to be the long form (s3://s3.<region>.amazonaws.com/<bucket>), instead of the short form (s3:///<bucket>) used with aws cli.

For example:

  • FAILS: s3://happy-dgraph-data/path/data.rdf.gz
  • SUCCESS (workaround): s3://s3.us-west-2.amazonaws.com/happy-dgraph-data/path/data.rdf.gz

In testing this, I used -f to reference a filename, not the name of the directory containing the RDF files. I also used -s to reference the filename of the schema.

So for example (using my docker-compose env):

BUCKET_LONG_URI="s3://s3.us-east-2.amazonaws.com/happy-dgraph-data"

docker exec -t alpha \
  dgraph live -C \
   -s ${BUCKET_LONG_URI}/dataset/1million.schema \
   -f ${BUCKET_LONG_URI}/dataset/1million.rdf.gz \
   -z http://zero:5080 -a http://alpha:9080

@MichelDiz @mattZhang17 Following up, I filed an official bug related to this issue:

Following up further: I realized that my earlier test did not use the triple-slash short form of the S3 URL, e.g. s3:///<my-bucket>. After making this correction, I did not encounter problems, so I was not able to reproduce the issue.

I ran the command inside the container, as it has the dgraph binary in it. The container has AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as environment variables.
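Since the credentials have to be visible in the environment where dgraph live actually runs, a quick sanity check can help rule that part out. This is only a minimal sketch: check_aws_env is a hypothetical helper name, and it only confirms that the two variables are non-empty, not that the keys are valid or have access to the bucket.

```shell
# Hypothetical helper (not part of dgraph or the AWS CLI):
# returns 0 when both credential variables are non-empty, 1 otherwise.
check_aws_env() {
  status=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || {
    echo "missing: AWS_ACCESS_KEY_ID" >&2
    status=1
  }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || {
    echo "missing: AWS_SECRET_ACCESS_KEY" >&2
    status=1
  }
  return "$status"
}
```

The function checks the shell it is called from, so it would need to run inside the alpha container (for example through docker exec) rather than on the host to say anything about what the live loader sees.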

Both types of S3 URLs worked for me:

docker exec -t alpha \
  dgraph live \
    -s s3:///<bucket>/<path>/1million.schema \
    -f s3:///<bucket>/<path>/ \
    -z zero:5080 \
    -a alpha:9080

docker exec -t alpha \
  dgraph live \
    -s s3://s3.<region>.amazonaws.com/<bucket>/<path>/1million.schema \
    -f s3://s3.<region>.amazonaws.com/<bucket>/<path>/ \
    -z zero:5080 \
    -a alpha:9080
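To make the difference between the two forms easier to see, here is a small sketch that builds each URL from its parts. The function names s3_short_uri and s3_long_uri are made up for illustration; the URL layouts are the two shown in the commands above.

```shell
# Short form: the host part is empty, so the bucket follows three slashes.
# $1 = bucket, $2 = object key or prefix
s3_short_uri() {
  printf 's3:///%s/%s\n' "$1" "$2"
}

# Long form: the regional S3 endpoint is the host, followed by bucket and key.
# $1 = region, $2 = bucket, $3 = object key or prefix
s3_long_uri() {
  printf 's3://s3.%s.amazonaws.com/%s/%s\n' "$1" "$2" "$3"
}
```

For example, s3_long_uri us-west-2 happy-dgraph-data path/data.rdf.gz prints s3://s3.us-west-2.amazonaws.com/happy-dgraph-data/path/data.rdf.gz, and s3_short_uri happy-dgraph-data path/data.rdf.gz prints s3:///happy-dgraph-data/path/data.rdf.gz.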

In case this is useful, this is how I ran my test.

  1. Configure an S3 bucket and an IAM profile/role whose attached policy grants the appropriate permissions. I used something similar to dgraph/contrib/config/backups/s3/terraform on the master branch of the dgraph-io/dgraph GitHub repository.
  2. Configure testing environment with docker-compose
    version: "3.5"
    services:
      zero:
        image: dgraph/dgraph:v21.03.0
        command: dgraph zero --my=zero:5080 --replicas 1 --raft idx=1
        container_name: zero
    
      alpha:
        image: dgraph/dgraph:v21.03.0
        environment:
          AWS_ACCESS_KEY_ID: REDACTED
          AWS_SECRET_ACCESS_KEY: REDACTED
        command: dgraph alpha --my=alpha:7080 --zero=zero:5080
        container_name: alpha
    
  3. Download the example schema and data, and upload them to the bucket
    mkdir data && pushd data
    PREFIX="https://github.com/dgraph-io/benchmarks/raw/master/data"
    FILES=(1million.schema 1million.rdf.gz)
    export AWS_PROFILE="<profile-with-priv>"
    
    # upload data and schema
    for FILE in "${FILES[@]}"; do
      curl --silent --location --remote-name "$PREFIX/$FILE"
      aws s3 cp "$FILE" "s3://<bucket>/<path>/"
    done
    done
    
    # verify
    aws s3 ls "s3://<bucket>/<path>/"
    
    popd
    
  4. Perform Live Load
    BUCKET_LONG_URL="s3://s3.<region>.amazonaws.com/<bucket>"
    BUCKET_SHORT_URL="s3:///<bucket>"
    ALPHA_SERVER="alpha:9080"
    ZERO_SERVER="zero:5080"
    
    docker exec -t alpha \
      dgraph live \
        -s ${BUCKET_SHORT_URL}/<path>/1million.schema \
        -f ${BUCKET_SHORT_URL}/<path>/ \
        -z $ZERO_SERVER \
        -a $ALPHA_SERVER
    
    
    docker exec -t alpha \
      dgraph live \
        -s ${BUCKET_LONG_URL}/<path>/1million.schema \
        -f ${BUCKET_LONG_URL}/<path>/ \
        -z $ZERO_SERVER \
        -a $ALPHA_SERVER
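
Because the only form that failed in this thread was the two-slash one with the bucket directly after s3://, a pre-flight check on the URL can catch that before running the loader. This is just a sketch based on the three forms tested above: classify_s3_uri and its labels are hypothetical, and the pattern match is a rough shape check, not a validation of the region or bucket name.

```shell
# Hypothetical helper: label an S3 URL by the forms discussed in this thread.
#   short           s3:///<bucket>/...                        (worked)
#   long            s3://s3.<region>.amazonaws.com/<bucket>/... (worked)
#   bucket-as-host  s3://<bucket>/...                         (failed in the test above)
classify_s3_uri() {
  case "$1" in
    s3:///?*)                    echo "short" ;;
    s3://s3.?*.amazonaws.com/?*) echo "long" ;;
    s3://*)                      echo "bucket-as-host" ;;
    *)                           echo "not-s3" ;;
  esac
}
```

For example, classify_s3_uri s3://happy-dgraph-data/path/data.rdf.gz prints bucket-as-host, which is the form to rewrite as either s3:///happy-dgraph-data/... or the long regional form.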