by Shyam JVS, Google Inc (with inputs from Brian Grant, Wojciech Tyczynski & Dan Winship)
This document serves as a catalog of issues we have known about or discovered with Kubernetes services as of June 2018, focusing on their scalability and performance. Its purpose is to make this information common knowledge for the community, so we can work together on improving it. The issues are listed below in no particular order.
Iptables can be slow at packet processing when a large number of services exist. As the number of service ports increases, the KUBE-SERVICES chain gets longer, and since it is also evaluated very frequently, that can degrade performance noticeably. There have been some recent improvements in this area (#56164, #57461, #60306), but we still need to measure and come up with safe bounds for the number of services that can be ‘decently’ supported (10k seems to be a reasonable estimate from past discussions). Further improvements might be required based on those results.
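To make the cost concrete, here is a toy model (illustrative only, not real iptables or kube-proxy code) of why a long KUBE-SERVICES chain hurts: each packet is matched against the rules linearly, so the worst-case per-packet work grows with the number of service ports.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for one KUBE-SERVICES entry.
type rule struct {
	clusterIP string
	port      int
}

// linearMatch models iptables chain evaluation: every packet walks the
// rules in order until one matches. It returns how many rules were
// evaluated and whether a match was found.
func linearMatch(rules []rule, ip string, port int) (int, bool) {
	for i, r := range rules {
		if r.clusterIP == ip && r.port == port {
			return i + 1, true
		}
	}
	return len(rules), false
}

func main() {
	// 10k services, the rough bound discussed above.
	rules := make([]rule, 10000)
	for i := range rules {
		rules[i] = rule{fmt.Sprintf("10.0.%d.%d", i/256, i%256), 80}
	}
	// Worst case: the packet matches the last rule in the chain.
	last := rules[len(rules)-1]
	n, ok := linearMatch(rules, last.clusterIP, last.port)
	fmt.Println(n, ok) // 10000 true
}
```

This per-packet linear scan is the reason alternatives with O(1) lookups (e.g. IPVS, which uses hash tables) scale better in service count.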
When we run services with a large number of backends (> 100k) in our large-cluster scalability tests, we notice that kube-proxy times out while trying to do iptables-restore because it fails to acquire the iptables lock. There are at least two parts to this problem:
The iptables-restore implementation grabs the lock before it parses its input, and simply parsing tens of thousands of iptables rules takes a noticeable amount of time. So if two iptables commands start at the same time, the first might grab the lock and then start parsing its input, burning through half of the other iptables command’s --wait time before it even gets to the point of passing the rules off to the kernel.
TODO: Find out whether any issue arises here from the k8s side (relevant bug linked below).
For the first problem:
For the second problem:
iptables-restore was written to share code with the main iptables binary, and no one is working on fixing it because the official plan is to move to nftables instead. If the world ends up moving to iptables-over-eBPF rather than nftables, then this will probably need to be fixed at some point.
Currently, while serving watches, the apiserver deep-copies the deserialized endpoints objects from etcd and serializes each one separately for every kube-proxy watch it serves (5k watches in our largest clusters).
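A minimal sketch of the obvious optimization, with hypothetical names (this is not the actual apiserver code): serialize the object once per update and share the cached bytes across all watch streams, instead of re-serializing per watcher.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoints is a simplified stand-in for the Endpoints object.
type endpoints struct {
	Service string   `json:"service"`
	IPs     []string `json:"ips"`
}

// marshalCalls counts serializations so the two strategies can be compared.
var marshalCalls int

func encode(e *endpoints) []byte {
	marshalCalls++
	b, _ := json.Marshal(e)
	return b
}

// fanOutNaive models the current behaviour: one serialization per watcher.
func fanOutNaive(e *endpoints, watchers int) int {
	total := 0
	for i := 0; i < watchers; i++ {
		total += len(encode(e))
	}
	return total
}

// fanOutCached serializes once and reuses the bytes for every watcher.
func fanOutCached(e *endpoints, watchers int) int {
	cached := encode(e)
	return watchers * len(cached)
}

func main() {
	e := &endpoints{Service: "web", IPs: []string{"10.0.0.1", "10.0.0.2"}}

	marshalCalls = 0
	fanOutNaive(e, 5000)
	fmt.Println("naive marshal calls:", marshalCalls) // 5000

	marshalCalls = 0
	fanOutCached(e, 5000)
	fmt.Println("cached marshal calls:", marshalCalls) // 1
}
```

With 5k kube-proxy watches per update, caching cuts CPU spent on serialization by three orders of magnitude for this path.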
The Endpoints object for a service contains all the individual endpoints of that service. As a result, whenever even a single pod in a service is added, updated, or deleted, the whole Endpoints object (including the endpoints that didn’t change) is re-computed, written to storage, and sent to all readers. Not being able to efficiently read/update individual endpoint changes can lead (e.g. during a rolling upgrade of a service) to endpoints operations that are quadratic in the number of its elements. If you bring watches into the picture (there’s one from each kube-proxy), the situation becomes even worse, as the quadratic traffic gets multiplied further by the number of watches (usually equal to the number of nodes in the cluster).
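A back-of-envelope model of that quadratic blow-up (assumptions: roughly one Endpoints write per pod change during a rolling update, and each write re-sends the full object to every watcher):

```go
package main

import "fmt"

// entriesSent estimates the total number of endpoint entries shipped to
// watchers during a rolling update of a service with `pods` backends.
func entriesSent(pods, watchers int) int {
	writes := pods          // ~one Endpoints write per pod changed
	entriesPerWrite := pods // the whole object is re-sent each time
	return writes * entriesPerWrite * watchers
}

func main() {
	// A 1,000-pod service rolled in a 5,000-node cluster
	// (one kube-proxy watch per node):
	fmt.Println(entriesSent(1000, 5000)) // 5000000000
}
```

Five billion endpoint entries for a single rolling update of one service illustrates why the O(pods²) behaviour, multiplied by the watch count, dominates control-plane traffic at scale.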
Overall, this is a serious performance drawback affecting multiple components in the control plane (apiserver, etcd, endpoints-controller, kube-proxy). The current Endpoints API (designed at a very early stage of Kubernetes, when scalability and performance were not yet major concerns) makes it hard to solve this problem without introducing breaking changes.
We need to measure the e2e endpoints propagation latency, i.e. the time from when a service is created until its endpoints are populated, given that all of the associated pods are already running and ready. We also need to come up with reasonable SLOs for it and verify that they are satisfied at scale.
Currently we store the leader lock for master components, like the scheduler, in the corresponding endpoints object. This has the undesirable side effect of sending a notification down every kube-proxy <-> master (and kube-dns <-> master) watch every second, which is not actually required.
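The resulting overhead is easy to quantify with a rough model (assuming every leader-lock renewal is an Endpoints update that wakes every endpoints watcher):

```go
package main

import "fmt"

// spuriousEventsPerHour estimates how many unnecessary watch events the
// leader-election heartbeats generate: one per renewal, per watcher.
func spuriousEventsPerHour(renewalsPerSecond, watchers int) int {
	return renewalsPerSecond * 3600 * watchers
}

func main() {
	// One renewal per second, 5,000 kube-proxy watchers:
	fmt.Println(spuriousEventsPerHour(1, 5000)) // 18000000
}
```

Moving the lock out of the Endpoints object (e.g. into a dedicated resource that kube-proxy does not watch) would eliminate this traffic entirely.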