Kubernetes provides an external LoadBalancer service type, which creates a virtual external IP (in supported cloud provider environments) that can be used to load-balance traffic to the pods matching the service's pod selector.
The current implementation requires that the cloud load balancer balances traffic across all Kubernetes worker nodes, and this traffic is then equally distributed to all the backend pods for that service. Due to the DNAT required to redirect the traffic to its ultimate destination, the return path for each session MUST traverse the same node again. To ensure this, the node also performs a SNAT, replacing the source IP with its own.
This causes the service endpoint to see the session as originating from a cluster-local IP address. The original external source IP is lost.
This is not a satisfactory solution - the original external source IP MUST be preserved for a lot of applications and customer use-cases.
The double hop must be prevented by programming the external load balancer to direct traffic only to nodes that have local pods for the service. This can be accomplished in two ways, either by API calls to add/delete nodes from the LB node pool or by adding health checking to the LB and failing/passing health checks depending on the presence of local pods.
This approach requires that the Cloud LB be reprogrammed to be in sync with endpoint presence. Whenever the first service endpoint is scheduled onto a node, the node is added to the LB pool. Whenever the last service endpoint is unhealthy on a node, the node needs to be removed from the LB pool.
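The pool synchronisation described above can be sketched as a diff between the current LB pool and the set of nodes that host at least one healthy endpoint. This is only an illustration: `poolDiff` and its inputs are hypothetical names, and the actual cloud provider API calls that would add or remove nodes are deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// poolDiff is a hypothetical helper. It compares the nodes currently in the
// LB pool with the per-node count of healthy service endpoints and returns
// the nodes to add and the nodes to remove.
func poolDiff(current []string, healthyEndpointsPerNode map[string]int) (add, remove []string) {
	inPool := make(map[string]bool, len(current))
	for _, n := range current {
		inPool[n] = true
	}
	// A node joins the pool once it hosts its first healthy endpoint.
	for n, c := range healthyEndpointsPerNode {
		if c > 0 && !inPool[n] {
			add = append(add, n)
		}
	}
	// A node leaves the pool once it has no healthy endpoint left.
	for _, n := range current {
		if healthyEndpointsPerNode[n] == 0 {
			remove = append(remove, n)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Strings(add)
	sort.Strings(remove)
	return add, remove
}

func main() {
	add, remove := poolDiff(
		[]string{"node-a", "node-b"},
		map[string]int{"node-a": 0, "node-b": 2, "node-c": 1},
	)
	fmt.Println("add:", add, "remove:", remove)
}
```

Each entry in `add` or `remove` would translate into one cloud provider API call, which is where the latency and availability concerns of this approach come from.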
This is a slow operation, on the order of 30-60 seconds, and involves the Cloud Provider API path. If the API endpoint is temporarily unavailable, the datapath will remain misprogrammed until the reprogramming succeeds and the API->datapath tables are updated by the cloud provider backend.
This approach requires that all worker nodes in the cluster be programmed into the LB target pool.
To steer traffic only onto nodes that have endpoints for the service, we program the LB to perform
node healthchecks. The kube-proxy daemons running on each node will be responsible for responding
to these healthcheck requests (URL /healthz) from the cloud provider LB healthchecker. An additional
nodePort will be allocated for these healthchecks.
Because kube-proxy already watches for Service and Endpoint changes, it will maintain an in-memory lookup
table indicating the number of local endpoints for each service.
For a value of zero local endpoints, it responds with a health check failure (503 Service Unavailable),
and success (200 OK) for non-zero values.
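The responder logic above can be sketched with the standard `net/http` package. This is a minimal illustration, not kube-proxy's actual code: the `localEndpoints` table, `healthzHandler`, `probe`, and the `default/web` service key are hypothetical names, and the probe uses an in-process recorder rather than a real listener on the health check nodePort.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// localEndpoints is a hypothetical in-memory table mapping a service name to
// the number of its endpoints that run on this node.
type localEndpoints struct {
	mu     sync.RWMutex
	counts map[string]int
}

func (l *localEndpoints) set(service string, n int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[service] = n
}

func (l *localEndpoints) get(service string) int {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.counts[service]
}

// healthzHandler answers health checks for one service: 503 when the node
// has zero local endpoints for it, 200 otherwise.
func healthzHandler(table *localEndpoints, service string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if table.get(service) == 0 {
			http.Error(w, "no local endpoints", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})
}

// probe sends one in-process health check and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	table := &localEndpoints{counts: map[string]int{}}
	h := healthzHandler(table, "default/web")
	fmt.Println(probe(h)) // no local endpoints yet
	table.set("default/web", 2)
	fmt.Println(probe(h)) // two local endpoints
}
```

In a real deployment the handler would be served on the allocated health check nodePort, and the table would be updated from the Service and Endpoint watches.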
Healthchecks are programmable with a minimum period of 1 second on most cloud provider LBs, and the minimum number of failures needed to trigger a node health state change is configurable from 2 through 5.
This will allow much faster transition times on the order of 1-5 seconds, and involve no API calls to the cloud provider (and hence reduce the impact of API unreliability), keeping the time window where traffic might get directed to nodes with no local endpoints to a minimum.
The cloud provider package may choose either of these approaches. kube-proxy will provide these healthcheck responder capabilities, regardless of the cloud provider configured on a cluster.
For kube-proxy to recognize whether an endpoint is local, the EndpointAddress struct must also contain the NodeName of the node it resides on. This new string field will be read-only and populated only by the Endpoints Controller.
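As an illustration of how the NodeName field could be used, the sketch below counts the endpoints that reside on a given node. The `EndpointAddress` type here is a simplified stand-in, not the real API type, and `countLocal` is a hypothetical helper.

```go
package main

import "fmt"

// EndpointAddress is a simplified stand-in for the API struct. Only the
// NodeName field comes from this proposal; the rest is trimmed for brevity.
type EndpointAddress struct {
	IP       string
	NodeName string
}

// countLocal (hypothetical) returns how many addresses reside on node.
func countLocal(addrs []EndpointAddress, node string) int {
	n := 0
	for _, a := range addrs {
		if a.NodeName == node {
			n++
		}
	}
	return n
}

func main() {
	addrs := []EndpointAddress{
		{IP: "10.0.1.5", NodeName: "node-a"},
		{IP: "10.0.1.6", NodeName: "node-a"},
		{IP: "10.0.2.7", NodeName: "node-b"},
	}
	fmt.Println(countLocal(addrs, "node-a"), countLocal(addrs, "node-c"))
}
```

A count of this kind is what would feed the per-service lookup table that the healthcheck responder consults.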
A new annotation
service.alpha.kubernetes.io/external-traffic will be recognized
by the service controller only for services of Type LoadBalancer. Services that wish to opt in to
the new ESIPP LoadBalancer behaviour must set this annotation on the Service.
Supported values for this annotation are OnlyLocal and Global.
- OnlyLocal activates the new logic (described in this proposal) and balances locally within a node.
- Global activates the old logic of balancing traffic across the entire cluster.
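The annotation handling could look roughly like the sketch below. The function name `onlyLocal` is hypothetical, and two behaviours are assumptions rather than statements from this proposal: a missing annotation is treated like Global (consistent with the feature being opt-in), and an unrecognised value is reported as an error.

```go
package main

import "fmt"

const externalTrafficAnnotation = "service.alpha.kubernetes.io/external-traffic"

// onlyLocal (hypothetical) reports whether a service has opted in to the new
// node-local behaviour. The annotation is only honoured for LoadBalancer
// services.
func onlyLocal(serviceType string, annotations map[string]string) (bool, error) {
	if serviceType != "LoadBalancer" {
		return false, nil
	}
	v, ok := annotations[externalTrafficAnnotation]
	if !ok || v == "Global" {
		// Assumption: no annotation means the old cluster-wide behaviour.
		return false, nil
	}
	if v == "OnlyLocal" {
		return true, nil
	}
	// Assumption: unknown values are rejected rather than silently ignored.
	return false, fmt.Errorf("unsupported value %q for %s", v, externalTrafficAnnotation)
}

func main() {
	local, err := onlyLocal("LoadBalancer", map[string]string{
		externalTrafficAnnotation: "OnlyLocal",
	})
	fmt.Println(local, err)
}
```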
An additional nodePort allocation will be necessary for services that are of type LoadBalancer and
have the new annotation specified. This additional nodePort is necessary for kube-proxy to listen for
healthcheck requests on all nodes.
This NodePort will be added as an annotation to
the Service after allocation (in the alpha release). The value of this annotation may also be
specified during the Create call and the allocator will reserve that specific nodePort.
When the last endpoint on a node has gone away but the LB has not yet marked the node as unhealthy, external traffic will still be steered to that node. The worst-case window size is (N+1) * HCP, where N = minimum failed healthchecks and HCP = Health Check Period. During this window the traffic will be blackholed and not forwarded to other endpoints elsewhere in the cluster.
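The worst-case window formula can be worked through with a small helper. The function name `worstCaseWindow` is hypothetical; the example values (2 failures, 1 second period) are the minimums quoted earlier for most cloud provider LBs.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseWindow (hypothetical) computes (N+1) * HCP, where N is the minimum
// number of failed healthchecks and HCP is the health check period.
func worstCaseWindow(minFailures int, period time.Duration) time.Duration {
	return time.Duration(minFailures+1) * period
}

func main() {
	// N = 2, HCP = 1s gives a 3 second worst-case window.
	fmt.Println(worstCaseWindow(2, time.Second))
	// N = 5, HCP = 1s gives a 6 second worst-case window.
	fmt.Println(worstCaseWindow(5, time.Second))
}
```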
Internal pod to pod traffic should behave as before, with equal probability across all pods.
GCE/AWS load balancers do not provide weights for their target pools. This was not an issue with the old LB kube-proxy rules which would correctly balance across all endpoints.
With the new functionality, the external traffic will not be equally load balanced across pods, but rather equally balanced at the node level (because GCE/AWS and other external LB implementations do not have the ability for specifying the weight per node, they balance equally across all target nodes, disregarding the number of pods on each node).
We can, however, state that for NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal distribution will be seen, even without weights.
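To make the node-level balancing effect concrete, the sketch below computes the share of external traffic that a single pod receives when the LB splits traffic equally across nodes that have at least one local pod. The helper `perPodShare` is hypothetical, and it assumes that nodes without local pods fail their healthchecks and therefore receive no traffic.

```go
package main

import "fmt"

// perPodShare (hypothetical) returns, for each node, the fraction of external
// traffic received by one pod on that node. Nodes without pods get 0.
func perPodShare(podsPerNode []int) []float64 {
	// Count the nodes that are eligible to receive traffic.
	active := 0
	for _, p := range podsPerNode {
		if p > 0 {
			active++
		}
	}
	shares := make([]float64, len(podsPerNode))
	for i, p := range podsPerNode {
		if p > 0 {
			// The node gets 1/active of the traffic, split across its p pods.
			shares[i] = 1.0 / float64(active*p)
		}
	}
	return shares
}

func main() {
	// One pod on node 0, three pods on node 1, none on node 2.
	for i, s := range perPodShare([]int{1, 3, 0}) {
		fmt.Printf("node %d: %.1f%% per pod\n", i, s*100)
	}
}
```

With this layout the single pod on node 0 receives half of all external traffic, while each of the three pods on node 1 receives one sixth.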
Once the external load balancers provide weights, this functionality can be added to the LB programming path. Future Work: No support for weights is provided for the 1.4 release, but it may be added at a future date.
This feature is added as an opt-in annotation. Default behaviour of LoadBalancer type services will be unchanged for all Cloud providers. The annotation will be ignored by existing cloud provider libraries until they add support.
For the 1.4 release, this feature will be implemented for the GCE cloud provider.
Node: On the node, we expect to see the real source IP of the client. Destination IP will be the Service Virtual External IP.
Pod: For processes running inside the Pod network namespace, the source IP will be the real client source IP. The destination address will be the Pod IP.
kube-proxy listens on the health check node port for TCP health checks on :::. This allows it to respond to health checks when the destination IP is either the VM IP or the Service Virtual External IP. In practice, tcpdump traces on GCE show that the source IP is 169.254.169.254 and the destination address is the Service Virtual External IP.
TBD discuss timelines and feasibility with Kubernetes sig-aws team members
This functionality may not be introduced in OpenStack in the near term.
Note from OpenStack team member @anguslees: Underlying vendor devices might be able to do this, but we only expose full-NAT/proxy loadbalancing through the OpenStack API (LBaaS v1/v2 and Octavia). So I’m afraid this will be unsupported on OpenStack, afaics.
To be confirmed: for the 1.4 release, this feature will be implemented for the Azure cloud provider.
The cases we should test are:
1.1 Source IP Preservation
Test the main intent of this change, source IP preservation: use the all-in-one network tests container with new functionality that responds with the client IP. Verify that the container sees the external IP of the test client.
1.2 Health Check responses
Test cases use pods explicitly pinned to nodes, and add them to or delete them from nodes at random. Validate that healthchecks succeed and fail on the expected nodes as endpoints move around. Gather LB response times to endpoint changes: the time from a pod declaring ready to the Cloud LB declaring the node healthy, and vice versa.
Validate that internal cluster communications are still possible from nodes without local endpoints. This change is only for externally sourced traffic.
Validate that old and new functionality can simultaneously exist in a single cluster. Create services with and without the annotation, and validate datapath correctness.
The only part of the design that changes for beta is the API, which is upgraded from annotation-based to first class fields.
service.alpha.kubernetes.io/node-local-loadbalancer will switch to a Service object field.
Post-1.4 feature ideas. These are not fully-fleshed designs.