Beyond the obvious goal of providing end-to-end system test coverage, there are a few less obvious goals that you should bear in mind when designing, writing and debugging your end-to-end tests. In particular, “flaky” tests, which pass most of the time but fail intermittently for difficult-to-diagnose reasons, are extremely costly: they blur our regression signals and slow down our automated merge queue. Up-front time and effort spent designing your test to be reliable is very well spent. Bear in mind that we have hundreds of tests, each running in dozens of different environments, and if any test in any test environment fails, we have to assume that we potentially have some sort of regression. So if a significant number of tests each fail even only 1% of the time, basic statistics dictates that we will almost never have a “green” regression indicator. Stated another way, a test that is only 99% reliable is just about useless in the harsh reality of a CI environment. In fact it’s worse than useless, because not only does it fail to provide a reliable regression indicator, it also costs a lot of subsequent debugging time and delays merges.
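To see the arithmetic concretely, here is a short sketch for an illustrative suite of 200 tests (the suite size and the greenRunProbability helper are assumptions for illustration, not part of any test framework):

```go
package main

import (
	"fmt"
	"math"
)

// greenRunProbability returns the probability that a run of n independent
// tests, each passing with probability p, comes back entirely green.
func greenRunProbability(n int, p float64) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// With 200 tests that are each "only" 99% reliable, a fully green
	// run happens roughly 13% of the time.
	fmt.Printf("99%% reliable:    %.1f%% of runs are green\n",
		100*greenRunProbability(200, 0.99))
	// At 99.99% per-test reliability, roughly 98% of runs are green.
	fmt.Printf("99.99%% reliable: %.1f%% of runs are green\n",
		100*greenRunProbability(200, 0.9999))
}
```

In other words, pushing each test from 99% to 99.99% reliability is the difference between a regression indicator that is almost always red and one that is usually green.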
If your test fails, its output should describe the reasons for the failure in as much detail as possible. “Timeout” is not a useful error message. “Timed out after 60 seconds waiting for pod xxx to enter running state, still in pending state” is much more useful to someone trying to figure out why your test failed and what to do about it. Specifically, a bare, unannotated assertion like Expect(err).NotTo(HaveOccurred()) generates a rather useless error. Rather, annotate your assertion with something like this:
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
On the other hand, overly verbose logging, particularly of non-error conditions, can make it unnecessarily difficult to figure out whether a test failed and, if so, why. So don’t log lots of irrelevant stuff either.
To reduce end-to-end delay and improve resource utilization when running e2e tests, we try, where possible, to run large numbers of tests in parallel against the same test cluster. This means that your test cannot assume that it is the only thing running against the cluster: avoid hard-coding the names of namespaces and other resources (generate unique names instead), avoid depending on or mutating global cluster state, and clean up whatever your test creates.
We have hundreds of e2e tests, some of which we run serially, one after the other. If each test takes just a few minutes to run, that very quickly adds up to many hours of total execution time. We try to keep the total down to a few tens of minutes at most. Therefore, try (very hard) to keep the execution time of your individual tests below 2 minutes, ideally shorter than that. Concretely, adding inappropriately long ‘sleep’ statements or other gratuitous waits to tests is a killer. If under normal circumstances your pod enters the running state within 10 seconds, and 99.9% of the time within 30 seconds, it would be gratuitous to wait 5 minutes for this to happen. Rather, just fail after 30 seconds, with a clear error message as to why your test failed (e.g. “Pod x failed to become ready after 30 seconds; it usually takes 10 seconds”). If you do have a truly legitimate reason for waiting longer than that, or for writing a test which takes longer than 2 minutes to run, comment very clearly in the code why this is necessary, and label the test as “[Slow]”, so that it’s easy to identify and avoid in test runs that are required to complete timeously (for example those that are run against every code submission before it is allowed to be merged).

Note that completing within, say, 2 minutes only when the test passes is not generally good enough. Your test should also fail in a reasonable time. We have seen tests that, for example, wait up to 10 minutes for each of several pods to become ready. Under good conditions these tests might pass within a few seconds, but if the pods never become ready (e.g. due to a system regression) they take a very long time to fail, and typically cause the entire test run to time out so that no results are produced. Again, this is a lot less useful than a test that fails reliably within a minute or two when the system is not working correctly.
Remember that your test will be run many thousands of times, at different times of day and night, probably on different cloud providers, under different load conditions. And often the underlying state of these systems is stored in eventually consistent data stores. So, for example, if a resource creation request is theoretically asynchronous, even if you observe it to be practically synchronous most of the time, write your test to assume that it’s asynchronous (e.g. make the “create” call, and poll or watch the resource until it’s in the correct state before proceeding). Similarly, don’t assume that API endpoints are 100% available. They’re not. Under high load conditions, API calls might temporarily fail or time out. In such cases it’s appropriate to back off and retry a few times before failing your test completely (in which case make the error message very clear about what happened, e.g. “Retried http://… 3 times - all failed with xxx”). Use the standard retry mechanisms provided in the libraries detailed below.
Obviously most of the above goals apply to many tests, not just yours. So we’ve developed a set of reusable test infrastructure, libraries and best practices to help you to do the right thing, or at least do the same thing as other tests, so that if that turns out to be the wrong thing, it can be fixed in one place, not hundreds, to be the right thing.
Here are a few pointers:
Test names are constructed from the Describe clause and nested It clauses. So, for example, Describe("Pods", ...) ... It("should be scheduled with cpu and memory limits") produces the sane test identifier and descriptor Pods should be scheduled with cpu and memory limits, which makes it clear what’s being tested, and hence what’s not working if it fails. Other good examples include:
CAdvisor should be healthy on every node
Daemon set should run and stop complex daemon
By contrast (and these are real examples), the following are less good test descriptors:
KubeProxy should test kube-proxy
Nodes [Disruptive] Network when a node becomes unreachable [replication controller] recreates pods scheduled on the unreachable node AND allows scheduling of pods on a node after it rejoins the cluster
An improvement might be
Unreachable nodes are evacuated and then repopulated upon rejoining [Disruptive]
Note that opening issues for specific better tooling is welcome, and code implementing that tooling is even more welcome :-).