Better upgrade guide

I just finished upgrading from Dgraph 1.0.5 to 1.0.8, and I’m hoping the documentation on upgrading can be improved. Here’s what I did this time.

First I made sure I did an export by calling /admin/export. Then I updated the Docker image for zero to 1.0.6 and let it reboot, and then updated the Docker image for server to 1.0.6 and let it reboot.
The result was that zero would no longer start and crashed with “Assert failed”.

So I updated all the images straight to 1.0.8, deleted the data directories from each of the 4 pod volumes by hand (3 server and 1 zero), rebooted them all, and used the live loader to import the backup.
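
For anyone following the same path, here is a minimal sketch of the two steps above (trigger the export, then build the live loader command). The addresses, the file paths, and the `-r`/`-s`/`-d`/`-z` flag names are assumptions based on the 1.0.x CLI, so check them against `dgraph live --help` for your version before relying on them.

```python
def export_url(server_http_addr: str) -> str:
    """URL that triggers an export on a server (the /admin/export call above)."""
    return f"http://{server_http_addr}/admin/export"


def live_load_command(rdf_file, schema_file, server_grpc_addr, zero_grpc_addr):
    """Argument list for the live loader.

    Flag names are assumed from the 1.0.x CLI; the file paths passed in
    are hypothetical and depend on where your export was written.
    """
    return [
        "dgraph", "live",
        "-r", rdf_file,          # exported RDF data
        "-s", schema_file,       # exported schema
        "-d", server_grpc_addr,  # gRPC address of a server in the new cluster
        "-z", zero_grpc_addr,    # gRPC address of the new cluster's zero
    ]


if __name__ == "__main__":
    print(export_url("localhost:8080"))
    print(" ".join(live_load_command(
        "export/g01.rdf.gz", "export/g01.schema.gz",
        "localhost:9080", "localhost:5080",
    )))
```

The command is only built here, not executed, so it can be reviewed or handed to `subprocess.run` once the new cluster is up.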

This is not a small amount of downtime. How can upgrades be improved in the future so we have minimal downtime of the database?

The underlying data format in Dgraph can change between release versions, which is why the docs recommend exporting the data and importing it into a new cluster running the updated version.

A way to minimize downtime when upgrading is to do a blue-green deployment for upgrades. That is, keep the original cluster online for clients while setting up a new Dgraph cluster. Once the upgraded cluster is set up, you can redirect clients over to the new cluster and then take down the previous version.
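
As a rough illustration of the cutover step, the sketch below keeps clients on the original (blue) cluster until the new (green) one passes a health check. The function names are hypothetical, and the `/health` HTTP endpoint is an assumption about the server; swap in whatever readiness check you trust.

```python
from typing import Callable
from urllib.request import urlopen


def http_health_check(http_addr: str, timeout: float = 2.0) -> bool:
    """True if GET http://<addr>/health answers with status 200.

    The /health path is an assumption; adjust it for your deployment.
    """
    try:
        with urlopen(f"http://{http_addr}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as not ready.
        return False


def pick_cluster(blue: str, green: str, is_healthy: Callable[[str], bool]) -> str:
    """Return green only once it passes the health check; otherwise stay on blue."""
    return green if is_healthy(green) else blue
```

For example, `pick_cluster("blue:8080", "green:8080", http_health_check)` keeps returning the blue address until the green cluster responds. Passing the check in as a parameter makes it easy to test without a running cluster.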

Can I just clarify how you would do this? The new deployment would need to have the data replicated to it. Are you saying you’d do an export/import to the new deployment, or is it possible to do a live replication from the existing deployment?
For example, would it look like this?

  1. Deploy new dgraph servers that are pointed to the existing zero
  2. Once data is replicated point the servers to an upgraded zero
  3. Clients start connecting to the new servers

Or is there no way to do live replication?

Yes, an export/import. Live replication is not currently a feature.

Don’t connect servers and zeros running different versions. Everything should be the same version within a cluster.
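
A small pre-flight check along these lines can catch a mixed-version cluster before anything is started. This is only a sketch: the node names are hypothetical, and how you collect each node's version (image tag, deployment manifest, etc.) is up to you.

```python
from typing import Dict


def single_cluster_version(node_versions: Dict[str, str]) -> str:
    """Return the one version shared by all nodes, or raise if they differ.

    node_versions maps a node name (hypothetical, e.g. "zero-0") to the
    version it is configured to run.
    """
    versions = set(node_versions.values())
    if len(versions) != 1:
        raise ValueError(
            f"expected exactly one version in the cluster, found: {sorted(versions)}"
        )
    return versions.pop()
```

With `{"zero-0": "1.0.8", "server-0": "1.0.8"}` this returns `"1.0.8"`; with a zero on 1.0.6 and servers on 1.0.8 it raises instead of letting the mismatched nodes connect.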

So, this means that there’s no way to upgrade versions without downtime?
It would be nice to be able to upgrade without downtime, to upgrade node by node, or to have async replication when adding a new node with an upgraded version.
I’ve not found related documentation.
Thoughts?

Hey @robregonm. There can be cases where specific changes between versions would allow a rolling update that would maintain availability of the cluster in an HA setup. But this is not usually the case.

As mentioned in this thread, a blue-green deployment would be one way to perform an upgrade. The original cluster could also be set to read-only mode so both clusters would contain the same data and then read/write traffic can be directed to the new cluster.
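
One way to picture that sequence is as three phases enforced on the application side, sketched below. Everything here is hypothetical illustration (the phase names, `WritesFrozen`, and `route` are not Dgraph APIs), and it assumes your clients go through a routing layer you control; how the cluster itself is put into read-only mode is not covered.

```python
from enum import Enum


class Phase(Enum):
    BLUE_READ_WRITE = 1   # normal operation on the original cluster
    BLUE_READ_ONLY = 2    # writes frozen while export/import runs
    GREEN_READ_WRITE = 3  # all traffic moved to the upgraded cluster


class WritesFrozen(Exception):
    """Raised when a write arrives during the read-only window."""


def route(phase: Phase, is_write: bool, blue: str, green: str) -> str:
    """Pick the cluster address for a request in the given phase."""
    if phase is Phase.GREEN_READ_WRITE:
        return green
    if is_write and phase is Phase.BLUE_READ_ONLY:
        raise WritesFrozen("writes are paused until cutover completes")
    return blue
```

Reads stay available the whole time; only writes are rejected during the middle phase, which is what keeps the exported data and the original cluster in sync.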

Hi all,

I have created a feature request (FR) on GitHub to propose an improvement to the current upgrade process. Please feel free to subscribe, comment, or vote there:

Many Thanks,