Networking

There are 4 distinct networking problems to solve:

  1. Highly-coupled container-to-container communications
  2. Pod-to-Pod communications
  3. Pod-to-Service communications
  4. External-to-internal communications

Model and motivation

Kubernetes deviates from the default Docker networking model (though as of Docker 1.8, Docker's network plugins are moving closer to it). The goal is for each pod to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

Dynamic port allocation, on the other hand, requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts, reuse, and exhaustion, requires non-standard approaches to naming (e.g., Consul or etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems.

Container to container

All containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. This offers simplicity (static ports are known a priori), security (ports bound to localhost are visible within the pod but never outside it), and performance. It also reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts. People running application stacks together on the same host have already figured out how to make ports not conflict and have arranged for clients to find them.

The approach does reduce isolation between containers within a pod: ports could conflict, and there can be no container-private ports. These seem to be relatively minor issues with plausible future workarounds. Besides, the premise of pods is that containers within a pod share some resources (volumes, CPU, RAM, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control which containers belong to the same pod whereas, in general, they don’t control which pods land together on a host.

Pod to pod

Because every pod gets a “real” (not machine-private) IP address, pods can communicate without proxies or translations. The pod can use well-known port numbers and can avoid the use of higher-level service discovery systems like DNS-SD, Consul, or etcd.

When any container calls ioctl(SIOCGIFADDR) (get the address of an interface), it sees the same IP address that any peer container would see its traffic coming from: each pod has its own IP address that other pods can know. By making IP addresses and ports the same both inside and outside the pods, we create a NAT-less, flat address space. Running “ip addr show” should work as expected. This enables all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to communicate through volumes (e.g., tmpfs) or IPC.
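As a rough illustration of self-registration under this model, the sketch below extracts the address a pod would advertise from the output of “ip -o -4 addr show”. The helper name non_loopback_ipv4, the interface name eth0, and the 10.244.1.5 address are assumptions made for the example, not anything defined by Kubernetes.

```shell
# Read `ip -o -4 addr show` output on stdin and print each non-loopback
# IPv4 address without its prefix length.
# (Hypothetical helper name; not part of any Kubernetes tooling.)
non_loopback_ipv4() {
  awk '$2 != "lo" { split($4, parts, "/"); print parts[1] }'
}

# Inside a pod this would typically be run as:
#   ip -o -4 addr show | non_loopback_ipv4
# The sample lines below are made-up output so the sketch runs anywhere.
printf '%s\n' \
  '1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever' \
  '3: eth0    inet 10.244.1.5/24 brd 10.244.1.255 scope global eth0\       valid_lft forever preferred_lft forever' \
  | non_loopback_ipv4
```

With the sample input this prints 10.244.1.5. Because the address space is flat and NAT-less, that is the same address a peer pod would use to reach this pod, so it can be handed directly to a registry or to other instances.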

This is different from the standard Docker model. In that model, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short, you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.

As with Docker, ports can still be published to the host node’s interface(s), but the need for this is radically diminished.

Implementation

For the Google Compute Engine cluster configuration scripts, we use advanced routing rules and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the ‘main’ IP address assigned to the VM that is NAT-ed for Internet access. The container bridge (called cbr0 to differentiate it from docker0) is set up outside of Docker proper.
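To make the “extra 256 IP addresses” per VM concrete, here is a minimal sketch of carving one /24 per node out of a larger pod address block. It assumes a /16 cluster block given by its first two octets; the 10.244 prefix and the helper name node_ip_range are hypothetical and are not taken from the cluster configuration scripts.

```shell
# Hypothetical helper: print the /24 (256 addresses) routed to node number $2,
# carved out of an assumed /16 whose first two octets are given in $1.
node_ip_range() {
  prefix="$1"   # first two octets of the assumed cluster /16, e.g. "10.244"
  index="$2"    # node number, 0..255
  if [ "$index" -lt 0 ] || [ "$index" -gt 255 ]; then
    echo "node index out of range: $index" >&2
    return 1
  fi
  printf '%s.%s.0/24\n' "$prefix" "$index"
}

# Print the ranges for the first three nodes.
for i in 0 1 2; do
  node_ip_range "10.244" "$i"
done
```

Each printed range is the kind of value that would be supplied as the destination range of a per-node route.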

Example of GCE’s advanced routing rules:

gcloud compute routes add "${NODE_NAMES[$i]}" \
  --project "${PROJECT}" \
  --destination-range "${NODE_IP_RANGES[$i]}" \
  --network "${NETWORK}" \
  --next-hop-instance "${NODE_NAMES[$i]}" \
  --next-hop-instance-zone "${ZONE}" &

GCE itself does not know anything about these IPs, though. This means that when a pod tries to egress beyond GCE’s project the packets must be SNAT’ed (masqueraded) to the VM’s IP, which GCE recognizes and allows.
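A masquerade rule of the following shape is one way such SNAT can be expressed with iptables. This is a sketch only: the 10.0.0.0/8 “do not rewrite” range, the eth0 interface name, and the helper name masquerade_rule are assumptions, and the function prints the command rather than running it so that it can be reviewed without root privileges.

```shell
# Print (do not run) an iptables masquerade rule for pod egress traffic.
# $1: destination CIDR that should NOT be rewritten (assumed default: 10.0.0.0/8)
# $2: outbound interface on the VM (assumed default: eth0)
masquerade_rule() {
  internal_cidr="${1:-10.0.0.0/8}"
  out_iface="${2:-eth0}"
  printf 'iptables -t nat -A POSTROUTING ! -d %s -o %s -j MASQUERADE\n' \
    "$internal_cidr" "$out_iface"
}

masquerade_rule
```

Traffic whose destination falls inside the internal range keeps the pod's own source IP, which preserves the NAT-less model inside the cluster. Only traffic leaving through the outbound interface toward other destinations is rewritten to the VM's address.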

Other implementations

With the primary aim of providing the IP-per-pod model, other implementations exist to serve the purpose outside of GCE:

  - OpenVSwitch with GRE/VxLAN
  - Flannel
  - L2 networks (“With Linux Bridge devices” section)
  - Weave is yet another way to build an overlay network, primarily aiming at Docker integration.
  - Calico uses BGP to enable real container IPs.
  - Cilium supports Overlay Network mode (IPv4/IPv6) and Direct Routing model (IPv6)

Pod to service

The service abstraction provides a way to group pods under a common access policy (e.g. load-balanced). The implementation of this creates a virtual IP which clients can access and which is transparently proxied to the pods in a Service. Each node runs a kube-proxy process which programs iptables rules to trap access to service IPs and redirect them to the correct backends. This provides a highly-available load-balancing solution with low performance overhead by balancing client traffic from a node on that same node.
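The trap-and-redirect idea can be sketched with iptables' statistic match, which selects a rule at random with a given probability. The function below prints rules in iptables-save style for one virtual IP. It is not kube-proxy's actual rule set: the chain name SVC-DEMO, the function name service_rules, and all addresses are hypothetical.

```shell
# Print iptables-save style nat rules that spread TCP traffic for one
# virtual IP evenly across the given backends.
# usage: service_rules VIP PORT BACKEND_IP:PORT...
# (Hypothetical names throughout; real kube-proxy chains differ.)
service_rules() {
  vip="$1"
  port="$2"
  shift 2
  remaining="$#"
  # Trap traffic addressed to the virtual IP, both forwarded and local.
  echo "-A PREROUTING -d $vip/32 -p tcp --dport $port -j SVC-DEMO"
  echo "-A OUTPUT -d $vip/32 -p tcp --dport $port -j SVC-DEMO"
  for backend in "$@"; do
    if [ "$remaining" -gt 1 ]; then
      # A rule is only reached if the earlier ones did not match, so it
      # must match with probability 1 / (backends still remaining).
      prob="$(awk -v n="$remaining" 'BEGIN { printf "%.5f", 1 / n }')"
      echo "-A SVC-DEMO -p tcp -m statistic --mode random --probability $prob -j DNAT --to-destination $backend"
    else
      # The last backend takes whatever traffic is left.
      echo "-A SVC-DEMO -p tcp -j DNAT --to-destination $backend"
    fi
    remaining=$((remaining - 1))
  done
}

service_rules 10.0.0.10 80 10.244.1.5:8080 10.244.2.7:8080 10.244.3.9:8080
```

With three backends the rules carry probabilities 1/3, 1/2, and an unconditional final rule, which gives each backend an equal share overall. Because the rules live on the node where the client runs, balancing happens on that node without an extra network hop to a central load balancer.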

External to internal

So far the discussion has been about how to access a pod or service from within the cluster. Accessing a pod from outside the cluster is trickier. We want to offer highly-available, high-performance load balancing to target Kubernetes Services. Most public cloud providers are simply not flexible enough yet.

The way this is generally implemented is to set up external load balancers (e.g. GCE’s ForwardingRules or AWS’s ELB) which target all nodes in a cluster. When traffic arrives at a node it is recognized as being part of a particular Service and routed to an appropriate backend Pod. This does mean that some traffic will get double-bounced on the network. Once cloud providers have better offerings we can take advantage of those.

Challenges and future work

Docker API

Right now, docker inspect doesn’t show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow.

External IP assignment

We want to be able to assign IP addresses externally from Docker (Docker issue #6743) so that we don’t need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across pod infra container restarts (#2801), and to facilitate pod migration. Right now, if the pod infra container dies, all the user containers must be stopped and restarted because the netns of the pod infra container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below).

IPv6

IPv6 support would be nice but requires significant internal changes in a few areas. First, pods should be able to report multiple IP addresses (Kubernetes issue #27398), and the network plugin architecture Kubernetes uses needs to allow returning IPv6 addresses too (CNI issue #245). Kubernetes code that deals with IP addresses must then be audited and fixed to support both IPv4 and IPv6 addresses and not assume IPv4. AWS started rolling out basic IPv6 support, but direct IPv6 assignment to instances doesn’t appear to be supported by other major cloud providers (e.g. GCE) yet. We’d happily take pull requests from people running Kubernetes on bare metal, though. :-)