Principles to follow when extending Kubernetes.
See also the API conventions.
- All APIs should be declarative.
- API objects should be complementary and composable, not opaque wrappers.
- The control plane should be transparent – there are no hidden internal APIs.
- The cost of API operations should be proportional to the number of objects
intentionally operated upon. Therefore, common filtered lookups must be indexed.
Beware of patterns of multiple API calls that would incur quadratic behavior.
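The indexing point can be sketched with a toy Go example (the `Pod` type, its fields, and the `PodIndex` helper are illustrative stand-ins, not the real Kubernetes types): a secondary index makes a filtered lookup cost proportional to the matching objects rather than to all objects.

```go
package main

import "fmt"

// Pod is a stand-in for an API object; the fields are illustrative,
// not the real Kubernetes types.
type Pod struct {
	Name     string
	NodeName string
}

// PodIndex maintains a secondary index from node name to pods, so that
// "pods on node X" costs O(matching pods) instead of O(all pods).
type PodIndex struct {
	byNode map[string][]*Pod
}

func NewPodIndex() *PodIndex {
	return &PodIndex{byNode: map[string][]*Pod{}}
}

func (ix *PodIndex) Add(p *Pod) {
	ix.byNode[p.NodeName] = append(ix.byNode[p.NodeName], p)
}

// PodsOnNode is the indexed lookup: cost is proportional to the number
// of objects intentionally operated upon, per the principle above.
func (ix *PodIndex) PodsOnNode(node string) []*Pod {
	return ix.byNode[node]
}

func main() {
	ix := NewPodIndex()
	ix.Add(&Pod{Name: "a", NodeName: "node-1"})
	ix.Add(&Pod{Name: "b", NodeName: "node-2"})
	ix.Add(&Pod{Name: "c", NodeName: "node-1"})
	fmt.Println(len(ix.PodsOnNode("node-1"))) // 2
}
```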
- Object status must be 100% reconstructable by observation. Any history kept
must be just an optimization and not required for correct operation.
- Cluster-wide invariants are difficult to enforce correctly. Try not to add
them. If you must have them, don’t enforce them atomically in master
components; that is contention-prone and doesn’t provide a recovery path in
the case of a bug that allows the invariant to be violated. Instead, provide a
series of checks to reduce the probability of a violation, and make every
component involved able to recover from an invariant violation.
- Low-level APIs should be designed for control by higher-level systems.
Higher-level APIs should be intent-oriented (think SLOs) rather than
implementation-oriented (think control knobs).
- Functionality must be level-based, meaning the system must operate correctly
given the desired state and the current/observed state, regardless of how many
intermediate state updates may have been missed. Edge-triggered behavior must be
just an optimization.
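A minimal Go sketch of a level-triggered step (the `reconcile` helper here is invented for illustration, not actual controller code): correctness depends only on the desired and observed state, never on which intermediate events were seen.

```go
package main

import "fmt"

// reconcile is one level-triggered control-loop step: it compares the
// desired replica count with the currently observed pods and returns the
// number to create (positive) or delete (negative). It does not care how
// many intermediate updates were missed; only the current states matter.
func reconcile(desired int, observed []string) int {
	return desired - len(observed)
}

func main() {
	// No matter which edge events were dropped, one pass over the
	// current state converges toward the desired state.
	fmt.Println(reconcile(3, []string{"pod-a"}))          // 2: create two
	fmt.Println(reconcile(1, []string{"pod-a", "pod-b"})) // -1: delete one
}
```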
- There is a CAP-like tradeoff when driving control loops via polling or
events: of high performance, reliability, and simplicity, you can pick any
two, but not all three simultaneously.
- Assume an open world: continually verify assumptions and gracefully adapt to
external events and/or actors. Example: we allow users to kill pods under
control of a replication controller; it just replaces them.
- Do not define comprehensive state machines for objects with behaviors
associated with state transitions and/or “assumed” states that cannot be
ascertained by observation.
- Don’t assume a component’s decisions will not be overridden or rejected, nor
that the component will always understand why. For example, etcd may reject
writes. Kubelet may reject pods. The scheduler may not be able to schedule
pods. Retry, but back off and/or make alternative decisions.
- Components should be self-healing. For example, if some state must be kept
(e.g., a cache), its content should be periodically refreshed, so that if an
item is erroneously stored or a deletion event is missed, the error will soon
be fixed, ideally on timescales shorter than what will attract attention.
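The periodic-refresh idea can be sketched in Go (a hypothetical `Cache` whose `resync` replaces its contents wholesale from an authoritative list; a real component would drive this from a timer):

```go
package main

import "fmt"

// Cache holds state derived from an authoritative source. resync replaces
// its contents wholesale, so an erroneously stored item or a missed
// deletion event is repaired on the next pass.
type Cache struct {
	items map[string]string
	list  func() map[string]string // authoritative source, e.g. a full LIST
}

func (c *Cache) resync() {
	fresh := c.list()
	c.items = make(map[string]string, len(fresh))
	for k, v := range fresh {
		c.items[k] = v
	}
}

func main() {
	source := map[string]string{"a": "1"}
	// The cache has a stale entry, e.g. from a missed deletion event.
	c := &Cache{
		items: map[string]string{"a": "1", "stale": "x"},
		list:  func() map[string]string { return source },
	}
	// A real component would run resync on a time.Ticker; one pass
	// suffices to show the repair.
	c.resync()
	_, stillThere := c.items["stale"]
	fmt.Println(stillThere, len(c.items)) // false 1
}
```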
- Component behavior should degrade gracefully. Prioritize actions so that the
most important activities can continue to function even when overloaded and/or
in states of partial failure.
- Only the apiserver should communicate with etcd/store, and not other
components (scheduler, kubelet, etc.).
- Compromising a single node shouldn’t compromise the cluster.
- Components should continue to do what they were last told in the absence of
new instructions (e.g., due to network partition or component outage).
- All components should keep all relevant state in memory all the time. The
apiserver should write through to etcd/store, other components should write
through to the apiserver, and they should watch for updates made by other
clients.
- Watch is preferred over polling.
- Self-hosting of all components is a goal.
- Minimize the number of dependencies, particularly those required for
steady-state operation.
- Stratify the dependencies that remain via principled layering.
- Break any circular dependencies by converting hard dependencies to soft
dependencies.
- Also accept data from other components via another source, such as local
files, which can be manually populated at bootstrap time and then
continuously updated once those other components are available.
- State should be rediscoverable and/or reconstructable.
- Make it easy to run temporary, bootstrap instances of all components in
order to create the runtime state needed to run the components in the steady
state; use a lock (master election for distributed components, file lock for
local components like Kubelet) to coordinate handoff. We call this technique
“pivoting”.
- Have a solution to restart dead components. For distributed components,
replication works well. For local components such as Kubelet, a process manager
or even a simple shell loop works.