
Kubernetes Scalability/Performance Regressions - Case Studies & Insights

by Shyam JVS, Google Inc

February 2018

Overview

This document compiles some interesting scalability/performance regression stories from the past. They were identified, studied and fixed largely by sig-scalability. We begin by listing them, along with succinct explanations, the features/components involved, and the relevant SIGs (besides sig-scalability). For each one we also note the smallest scale, for both real and simulated (i.e. kubemark) clusters, that managed to catch the regression. At the end of the document we draw some useful insights from the case studies.

Case Studies

Each case study lists, in order: issue and brief description, details, relevant feature(s)/component(s), relevant SIG(s), smallest real cluster affected, and smallest kubemark cluster affected (cluster sizes are node counts).
#60035: Kubemark-scale fails with a couple of hollow-nodes getting pre-empted due to higher memory usage
Details: A few hollow-nodes started getting pre-empted by kubelets because of a memory shortage for running critical pods. The increase in memory usage of hollow-nodes (more specifically, of hollow kube-proxy) resulted from fixing a recent bug with endpoints in kubemark (#59823).
Relevant feature(s)/component(s):
  • Pre-emption (feature)
  • Kubelet
  • Kube-proxy mock
Relevant SIG(s): -
Smallest real cluster affected: -
Smallest kubemark cluster affected: 5000
#59823: Endpoints objects in kubemark are empty, leading to misleading performance results
Details: Endpoints objects weren’t getting populated with more than a single entry, due to conflicting node names for the same pod IP. The pod IPs were the same because of a bug in our mock docker-client, which assigned a constant IP to all fake pods. This is probably a regression that didn’t exist about a year earlier. It had significant performance implications (see the bug).
Relevant feature(s)/component(s):
  • Kubelet mock
  • Docker-client mock
  • Kube-proxy mock
  • Endpoints-controller
  • Apiserver
  • Etcd
Relevant SIG(s): sig-network
Smallest real cluster affected: -
Smallest kubemark cluster affected: 100
#56061: Apiserver memory usage increased by 10-20% after addition of admission metrics
Details: A set of admission metrics was added to the apiserver for monitoring admission plugins, webhooks, etc. Soon after that change we started seeing a 100-200MB increase in the memory usage of the apiserver on a 100-node cluster. Thanks to the resource usage checks in our performance tests, we were able to spot the regression. It was fixed later by making those metrics lighter (i.e. removing some SummaryVec metrics and reducing histogram buckets).
Relevant feature(s)/component(s):
  • Admission control (feature)
  • Apiserver
  • Prometheus
Relevant SIG(s): sig-api-machinery, sig-instrumentation
Smallest real cluster affected: 100
Smallest kubemark cluster affected: -
#55695: Metadata-proxy not able to handle too many pods per node
Details: Metadata-proxy, a newly enabled node agent for proxying metadata requests coming from pods on the node, was unable to handle load from >70 pods due to memory starvation. This violated our official k8s support for 110 pods/node.
Relevant feature(s)/component(s):
  • Metadata concealment (feature)
  • Metadata-proxy agent
Relevant SIG(s): sig-auth, sig-node
Smallest real cluster affected: -
Smallest kubemark cluster affected: 500
#55060: Increase in pod startup latency due to Duplicate Address Detection in CNI plugin
Details: An update to the Container Network Interface (CNI) library introduced a new step for Duplicate Address Detection (DAD), which delayed the CNI plugins waiting for it to finish. Since this was on the code path for container creation, it increased pod startup latency on the kubelet side by more than a second. As a result, we saw violations of our 5s pod-startup latency SLO on reasonably large clusters (where we were already close to the SLO).
Relevant feature(s)/component(s):
  • Container networking (feature)
  • Kubelet
Relevant SIG(s): sig-node, sig-network
Smallest real cluster affected: 2000 (though some effect was also seen at 100)
Smallest kubemark cluster affected: -
#54164: Kube-dns pods coming up super slowly in large clusters due to inter-pod anti-affinity
Details: Kube-dns, a default deployment for k8s clusters, introduced node-level soft inter-pod anti-affinity in order to spread its pods across different nodes. However, the O(pods^2) implementation of anti-affinity in the scheduler made their scheduling super-slow. As a result, cluster creation was failing with a timeout.
Relevant feature(s)/component(s):
  • Inter-pod anti-affinity (feature)
  • Scheduler
  • Kube-dns
Relevant SIG(s): sig-scheduling, sig-network
Smallest real cluster affected: 2000
Smallest kubemark cluster affected: -
#53327 (part): Performance tests seeing a huge drop in scheduler throughput due to one predicate slowing down
Details: One of the scheduler predicates was changed to compute a random 32-length string. That made the predicate super-slow as it started starving for randomness (especially with the predicate running for each of 1000s of pods) and reduced the scheduler throughput by ~10 times. After a few optimizations to the random pkg (eventually getting rid of the rand() call), this was fixed.
Relevant feature(s)/component(s):
  • Scheduler
Relevant SIG(s): sig-scheduling
Smallest real cluster affected: 100 (mild signal)
Smallest kubemark cluster affected: 500 (strong signal)
#53327 (part): Kubemark performance tests fail with timeout during pod deletion due to bug in kubelet mock
Details: The kubelet mock (hollow-kubelet) started behaving differently from the real kubelet due to some changes in the latter. As a result, the hollow-kubelet never managed to delete pods under one corner condition: a “DELETE pod” event is received for a pod while the kubelet is in the middle of its container creation. It was a tricky regression that needed quite some hunting before we could set the mock right.
Relevant feature(s)/component(s):
  • Kubelet
  • Kubelet mock
Relevant SIG(s): sig-node
Smallest real cluster affected: -
Smallest kubemark cluster affected: 5000 (also 500, but flakily)
#52284: CIDR allocation super slow with IP aliases
Details: This performance issue already existed, but was exposed as a regression when we turned on IP aliasing for large clusters. CIDR-allocator (part of controller-manager) performed poorly due to bad design, mainly a lack of concurrency and synchronous processing of events from shared informers. A bunch of optimizations later (#52292) fixed its performance.
Relevant feature(s)/component(s):
  • IP-aliasing (feature)
  • Controller-manager (cidr-allocator)
Relevant SIG(s): sig-network
Smallest real cluster affected: 2000
Smallest kubemark cluster affected: -
#51903: Few nodes failing to start in kubemark due to reduced PIDs limit for docker in newer COS image
Details: When the COS m60 image was introduced, some of the kubemark hollow-node pods started failing to start because docker on the host node was crossing the PID limit. This regression could have caused serious damage if rolled out to production, and our scalability tests caught it. Besides the low PID threshold, it also helped catch another issue with containerd-shim starting too many threads.
Relevant feature(s)/component(s):
  • Kubelet
  • Docker
  • Containerd-shim
Relevant SIG(s): sig-node
Smallest real cluster affected: -
Smallest kubemark cluster affected: 500
#51899 (part): “PATCH node-status” calls seeing high latency due to blocking on audit-logging
Details: Those calls are made by kubelets once every X seconds, which adds up to quite some qps for large clusters. Part of handling those calls is audit-logging them. When the default audit-log format was changed to JSON, a performance issue with the design was exposed. The update handler for those calls was doing the audit-writing synchronously (instead of buffering + asynchronous writing), which slowed down those calls by an order of magnitude.
Relevant feature(s)/component(s):
  • Audit-logging (feature)
  • Apiserver
Relevant SIG(s): sig-auth, sig-instrumentation, sig-api-machinery
Smallest real cluster affected: 2000
Smallest kubemark cluster affected: -
#51899 (part): “DELETE pods” API call latencies shot up on large cluster tests due to kubelet thundering herd
Details: A change to kubelet pod deletion resulted in delete pod API calls from kubelets being concentrated immediately after container garbage collection. When deleting large numbers (O(10k)) of pods across large numbers (O(1k)) of nodes, the resulting concentrated delete calls from the kubelets increased the latency of “DELETE pods” API calls above our target SLO of 1s.
Relevant feature(s)/component(s):
  • Container GC (feature)
  • Kubelet
Relevant SIG(s): sig-node
Smallest real cluster affected: 2000
Smallest kubemark cluster affected: -
#51099: gRPC update causing failure of API calls with large responses
Details: When the gRPC vendor library was updated to v1.5.1, the default maximum response size between the apiserver and etcd changed to 4MB. This could only be caught by scalability tests, as our regular tests run at a much smaller scale and so don’t actually encounter such large response sizes.
Relevant feature(s)/component(s):
  • gRPC framework (feature)
  • Etcd
  • Apiserver
Relevant SIG(s): sig-api-machinery
Smallest real cluster affected: 100
Smallest kubemark cluster affected: 100
#50854: Route-controller timing out while listing routes from cloud-provider
Details: Route-controller was failing to list routes from the cloud-provider API and in turn failed to create routes for the nodes. The reason was that the project in which the cluster was being created had started running another huge cluster (with O(5k) routes), which interfered with the list-routes call for this cluster due to cloud-provider-side issues.
Relevant feature(s)/component(s):
  • Controller-manager (route-controller)
  • Cloud-provider API (GCE)
Relevant SIG(s): sig-network, sig-gcp
Smallest real cluster affected: -
Smallest kubemark cluster affected: 5000 (running alongside a real 5000 cluster)
#50366: Failing to fit some pods on cluster due to accidentally increased fluentd resource request
Details: A change to how fluentd resource requests are set accidentally doubled its CPU request. This was caught by our kubemark scalability test, where we tightly fit our hollow-node pods onto a small set of nodes. With the fluentd increase, some of those pods couldn’t be scheduled due to CPU shortage. This bug was risky for production, as it could’ve preempted some of the users’ pods for fluentd (a critical pod).
Relevant feature(s)/component(s):
  • Resource requests (feature)
  • Fluentd
Relevant SIG(s): sig-instrumentation
Smallest real cluster affected: -
Smallest kubemark cluster affected: 500
#48700: Apiserver panic while logging a request in TooManyRequests handler
Details: A change in the ordering of apiserver request handlers (one of which is the TooManyRequests handler) caused a panic while instrumenting the request. Though this is not a scalability regression per se, the scenario was exposed only by our scale tests, where we actually see 429s (TooManyRequests) because of the scale at which we run the clusters (unlike normal-scale tests).
Relevant feature(s)/component(s):
  • Apiserver
Relevant SIG(s): sig-api-machinery
Smallest real cluster affected: 100
Smallest kubemark cluster affected: 500
#47419: Performance tests failing due to newly exposed high LIST API latencies
Details: After fixing a notorious bug in the instrumentation code for the ‘API request latency’ metric, we started seeing performance test failures due to high LIST call latencies. Though it seemed like a regression at first, it was actually a hidden performance issue brought to light by the fix. We then realized that LIST calls were not actually satisfying our 1s API latency SLO and tuned the SLO for them appropriately.
Relevant feature(s)/component(s):
  • Apiserver
Relevant SIG(s): sig-api-machinery, sig-instrumentation
Smallest real cluster affected: 2000
Smallest kubemark cluster affected: 5000
#45216: Upgrade to Go 1.8 resulted in significant performance regression
Details: When k8s was upgraded to go-1.8, we were seeing timeouts in our kubemark-scale tests due to a ~2x increase in the time taken to create services. After some experimenting/profiling, it seemed to originate from changes to the net/http.(*http2serverConn).serve library function, which had some extra cases added to a select statement. One of them added some logic for gracefulShutdown, which slowed down the function a lot. It was eventually fixed in a patch release by the golang team.
Relevant feature(s)/component(s):
  • Golang (net/http library)
Relevant SIG(s): -
Smallest real cluster affected: -
Smallest kubemark cluster affected: 5000
#42000: Kube-proxy backlog processing causing CPU starvation for kubelet to start new pods
Details: Kube-proxies were slow in processing endpoints updates. As a result, they built up a backlog of work while the load test (which creates many services) was running. Later, when the density test ran (where we create 1000s of pods), the kube-proxies were still busy processing the backlog from the load test and hence consuming high memory. This memory-starved the kubelets, preventing them from creating the density pods once cgroups were enabled. Before cgroups, this issue was hidden.
Relevant feature(s)/component(s):
  • Cgroups (feature)
  • Kubelet
  • Kube-proxy
Relevant SIG(s): sig-network, sig-node
Smallest real cluster affected: -
Smallest kubemark cluster affected: 500
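
The “PATCH node-status” case (#51899) contrasts synchronous audit-writing with buffering + asynchronous writing. The sketch below illustrates that general pattern only; it is not the apiserver's actual fix. The class name AsyncAuditLog, the sink callable, the buffer size and the drop-when-full policy are all assumptions made for illustration.

```python
import queue
import threading


class AsyncAuditLog:
    """Hypothetical buffered audit log: requests enqueue, a worker thread writes."""

    def __init__(self, sink, buffer_size=1000):
        # sink is any callable that performs the (possibly slow) write.
        self._sink = sink
        self._events = queue.Queue(maxsize=buffer_size)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, event):
        """Called on the request path. Never waits for the sink.

        Returns False if the buffer is full and the event was dropped
        (an assumed policy; a real system might block or batch instead).
        """
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _drain(self):
        # Background loop: write events in FIFO order until the sentinel arrives.
        while True:
            event = self._events.get()
            if event is None:
                break
            self._sink(event)

    def close(self):
        """Flush everything already buffered, then stop the worker."""
        self._events.put(None)
        self._worker.join()


if __name__ == "__main__":
    written = []
    audit = AsyncAuditLog(written.append, buffer_size=10)
    for i in range(5):
        audit.log({"verb": "PATCH", "resource": "nodes/status", "seq": i})
    audit.close()
    print(len(written))
```

With this shape the request handler pays only for an in-memory enqueue, so a slow log format or disk affects the background worker rather than the latency of each API call.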

Insights

  • On many occasions our scalability tests caught critical/risky bugs which were missed by most other tests. If not caught, those could’ve seriously jeopardized production-readiness of k8s.
  • SIG-Scalability has caught/fixed several important issues that span across various components, features and SIGs.
  • Around 60% of the time (possibly even more), we catch scalability regressions with just our medium-scale (and fast) tests, i.e. gce-100 and kubemark-500. Running them as presubmits should act as a strong shield against regressions.
  • The majority of the remaining ones are caught by our large-scale (and slow) tests, i.e. kubemark-5k and gce-2k. Making them post-submit blockers (given they’re “usually” quite healthy) should act as a second layer of protection against regressions.
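
The "around 60%" figure can be roughly reproduced from the case-study data. The tally below is one possible reading of that data, not necessarily how the figure was originally derived. It assumes that "medium-scale" means a real cluster of at most 100 nodes or a kubemark cluster of at most 500 nodes, and it treats the two partial signals noted in the case studies (#55060 at 100 real nodes, #53327 at kubemark-500 "flakily") as optional. The short labels in parentheses are added here only to tell the two-part issues apart.

```python
# Each entry: (issue, smallest real, smallest kubemark,
#              partial-signal real, partial-signal kubemark).
# None stands for "-" in the case studies.
CASES = [
    ("#60035", None, 5000, None, None),
    ("#59823", None, 100, None, None),
    ("#56061", 100, None, None, None),
    ("#55695", None, 500, None, None),
    ("#55060", 2000, None, 100, None),
    ("#54164", 2000, None, None, None),
    ("#53327 (scheduler)", 100, 500, None, None),
    ("#53327 (kubelet mock)", None, 5000, None, 500),
    ("#52284", 2000, None, None, None),
    ("#51903", None, 500, None, None),
    ("#51899 (audit-logging)", 2000, None, None, None),
    ("#51899 (thundering herd)", 2000, None, None, None),
    ("#51099", 100, 100, None, None),
    ("#50854", None, 5000, None, None),
    ("#50366", None, 500, None, None),
    ("#48700", 100, 500, None, None),
    ("#47419", 2000, 5000, None, None),
    ("#45216", None, 5000, None, None),
    ("#42000", None, 500, None, None),
]

MEDIUM_REAL = 100      # assumed threshold for gce-100
MEDIUM_KUBEMARK = 500  # assumed threshold for kubemark-500


def at_medium_scale(real, kubemark):
    """True if either cluster size is within the medium-scale thresholds."""
    return (real is not None and real <= MEDIUM_REAL) or (
        kubemark is not None and kubemark <= MEDIUM_KUBEMARK
    )


def medium_scale_share(cases, count_partial_signals):
    """Return (hits, total) for regressions visible at medium scale."""
    hits = 0
    for _, real, kubemark, partial_real, partial_kubemark in cases:
        if at_medium_scale(real, kubemark):
            hits += 1
        elif count_partial_signals and at_medium_scale(partial_real, partial_kubemark):
            hits += 1
    return hits, len(cases)


if __name__ == "__main__":
    for partial in (False, True):
        hits, total = medium_scale_share(CASES, partial)
        print(f"partial signals counted={partial}: {hits}/{total} = {hits / total:.0%}")
    # partial signals counted=False: 9/19 = 47%
    # partial signals counted=True: 11/19 = 58%
```

Under these assumptions the strict count is 9 of 19 and the count including partial signals is 11 of 19, which is close to the 60% quoted above.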