Current state
In master, all the Jepsen tests are passing. However, tests sometimes cannot start due to issues related to cluster setup (e.g. running apt-get update
fails). Retrying usually fixes the issue, but it's annoying to do and it means that not all tests are guaranteed to run. Fortunately, none of these failures seem related to Dgraph itself, and they mostly happen at the beginning of a test, so retrying is not very time-consuming.
Kyle said some of these issues are fixed in the latest Jepsen master, but when I tried to merge the newest changes I ran into other issues (for example Incomplete tests in Dgraph test suite. · Issue #451 · jepsen-io/jepsen · GitHub). There has also been some refactoring which broke the Dgraph test suite. These issues should be addressed eventually, but since the tests run fine aside from the incomplete ones, I think it's better to run them as they are right now.
Proposed solution
While manually running the tests I found that running them on a fresh cluster decreases the number of flaky tests, so I have made two changes to the Jepsen script in contrib/jepsen (roughly sketched after the list below):
- Added a new command line option to destroy and create the cluster before each test is run.
- Retry incomplete tests. There's already a way to tell incomplete and failing tests apart, and my changes take advantage of that. Failing tests are not retried.
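To make the flow concrete, here is a minimal sketch of how the two changes fit together. This is not the actual code from the PR; the flag and function names (`refresh-cluster`, `runTest`, `recreateCluster`) are placeholders, but the overall flow matches the changes described above: optionally refresh the cluster, run the test, and retry only when the result is incomplete.

```go
package main

import (
	"flag"
	"fmt"
	"log"
)

// Hypothetical test outcome; the real tool already has a way to tell
// incomplete runs apart from genuine failures.
type status int

const (
	testPassed status = iota
	testFailed
	testIncomplete
)

var (
	refreshCluster = flag.Bool("refresh-cluster", false,
		"destroy and re-create the Jepsen cluster before each test")
	maxRetries = flag.Int("retries", 1,
		"how many times to retry a test that ends up incomplete")
)

// runTest is a placeholder for whatever the tool does to invoke a
// single Jepsen workload/nemesis combination.
func runTest(name string) status {
	// ... shell out to Jepsen here ...
	return testPassed
}

// recreateCluster stands in for the calls that tear down and bring up
// a fresh cluster.
func recreateCluster() error {
	// ... destroy and re-create the cluster here ...
	return nil
}

func runWithRetries(name string) status {
	for attempt := 0; ; attempt++ {
		if *refreshCluster {
			if err := recreateCluster(); err != nil {
				log.Fatalf("could not refresh cluster: %v", err)
			}
		}
		st := runTest(name)
		// Only incomplete tests are retried; a real failure is
		// reported immediately.
		if st != testIncomplete || attempt >= *maxRetries {
			return st
		}
		log.Printf("test %s was incomplete, retrying (%d/%d)",
			name, attempt+1, *maxRetries)
	}
}

func main() {
	flag.Parse()
	for _, name := range flag.Args() {
		fmt.Printf("%s: %v\n", name, runWithRetries(name))
	}
}
```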
When I ran the full test suite with those changes, only three tests (out of 36) were incomplete and all of them succeeded after only one retry.
I think retrying the incomplete tests is a more robust solution than trying to fix every possible cause of flakiness. We don't have a lot of experience with Jepsen's internals, our fixes won't be exhaustive, and they would need to be merged into Jepsen, which can be a slow process.
PR with the changes to our Jepsen tool: test: Deal with incomplete tests in Jepsen tool by martinmr · Pull Request #5804 · dgraph-io/dgraph · GitHub
Running the tests
The next step would be running the tests in TeamCity. There are 36 tests in total, so I don't think running all of them for every PR is feasible. I propose the following:
- Running the full suite of tests nightly.
- Running a small set of tests (no more than four or five) for each PR. This should serve as a basic sanity check.
The tests are independent of each other so they could be sharded and run by multiple agents.
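Because the assignment only needs to be deterministic, splitting the suite across agents can be very simple. The sketch below is one way it could look; the environment variables and test names are illustrative, not what TeamCity or the Jepsen tool actually use.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// shard returns the subset of tests that a given agent should run.
// Because the tests are independent, a round-robin split by index is
// enough to keep the agents roughly balanced.
func shard(tests []string, agentIndex, agentCount int) []string {
	var out []string
	for i, t := range tests {
		if i%agentCount == agentIndex {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// AGENT_INDEX and AGENT_COUNT are illustrative; in TeamCity these
	// would come from build parameters.
	idx, _ := strconv.Atoi(os.Getenv("AGENT_INDEX"))
	cnt, _ := strconv.Atoi(os.Getenv("AGENT_COUNT"))
	if cnt == 0 {
		cnt = 1
	}
	// The full list of 36 workload/nemesis combinations would go here;
	// these names are just examples.
	tests := []string{"bank", "delete", "long-fork", "linearizable-register"}
	for _, t := range shard(tests, idx, cnt) {
		fmt.Println(t)
	}
}
```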