Live load kills an alpha on a 6-node cluster on Docker for Mac


Report a Dgraph Bug

When running a live load (the 1 million RDF dataset) on a 6-node cluster using docker-compose (3 alphas / 3 zeros), one of the alphas dies with exit code 255, but the live load completes successfully. If the leader dies, the live load fails. This can only be reproduced on macOS.

What version of Dgraph are you using?

This is reproducible in all versions of Dgraph.

Have you tried reproducing the issue with the latest release?

Yes

What is the hardware spec (RAM, OS)?

16 GB RAM, 2-core processor, macOS Catalina version 10.15.6 (19G2021)

Steps to reproduce the issue (command/config used to run Dgraph).

  • Go to the dgraph/compose directory and generate the docker-compose file using ./compose -a 3 -w 3 -d ./data
  • For any version of Dgraph, run make install in the root directory to install the binary
  • Run docker-compose up -d
  • Apply the 1 million dataset schema found in the benchmarks repository using the command:
curl --location --request POST 'localhost:8080/alter' --header 'Content-Type: application/rdf' --data '@schema'
  • Run live load using the following command:
dgraph live -f 1million.rdf.gz --alpha localhost:9180 --zero localhost:5180
  • Let the live load run through; you will notice that one of the alpha Docker containers exits
  • The issue is also reproducible when the schema is passed to the dgraph live command with the -s flag. There were no memory fluctuations during the live load.
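For convenience, the steps above can be sketched as a single script. The ports, filenames, and compose flags are taken from this thread; by default the commands are only echoed (set RUN=1 to actually execute them), so the sketch can be reviewed without Docker running:

```shell
#!/bin/sh
# Repro sketch for this thread. With RUN unset, commands are only
# printed for review; set RUN=1 to actually execute them.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

run ./compose -a 3 -w 3 -d ./data   # generate docker-compose.yml (3 alphas / 3 zeros)
run docker-compose up -d            # bring up the 6-node cluster
run curl --location --request POST 'localhost:8080/alter' --header 'Content-Type: application/rdf' --data '@schema'
run dgraph live -f 1million.rdf.gz --alpha localhost:9180 --zero localhost:5180
```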

Expected behaviour and actual result.

None of the alphas should exit during the live load.
The alpha container exits without any error logs at verbosity 3. The live load completes if a follower exits; if the leader exits, the live load errors out.


If this is running in Docker, it isn’t related to macOS, and you should be able to reproduce it on other OSes. Maybe the issue is related to the default resources available to the VM running Docker. Please share this info so I can reproduce it.

I also use Catalina, and I have tested the 1 million RDF dataset several times, as well as the 21 million one, and this has never happened so far. I did hit OOMs when simulating low resources, though.

PS: This comment doesn’t invalidate the bug report. Just trying to find the culprit.


Thanks Michel!
Yes, I did have a feeling it might be related to resource usage. I’m not sure if 2 GB of memory and 1 GB of swap is enough for 6 nodes. I was able to reproduce it on a different Mac as well. Here is my Docker resource allocation:


I would say that 2 GB is enough for a single node. Have you tried with jemalloc enabled?
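If the VM has enough memory but individual containers are being constrained, one way to rule that out is an explicit per-service memory cap via a Compose override file. This is only a sketch: the service name alpha1 and the v2-format mem_limit key are assumptions, and you would adjust them to match the generated docker-compose.yml:

```yaml
# docker-compose.override.yml (hypothetical; the service names must
# match those in the generated docker-compose.yml)
version: "2.4"
services:
  alpha1:
    mem_limit: 3g   # give each alpha explicit headroom
```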

Also, are you using the multiple-address config? For example:

dgraph live -s 21.schema -f 21million.rdf.gz -a "localhost:9080,localhost:9081,localhost:9082" -z localhost:5080

You can also use multiple Zero addresses on the Alphas, so they can keep looking for a new leader if the current one goes down.

dgraph alpha -z "127.0.0.1:5080,127.0.0.1:5081,127.0.0.1:5082"