Status: During Review
Author: Marek Grabowski (gmarek@)
The Event API change proposal is the first step towards having useful Events in the system. Another step is to formalize the Event style guide, i.e. the set of properties that developers need to ensure when adding new Events to the system. This is necessary to ensure that all components emit consistently structured Events.
Events are expected to provide important insights for the application developer/operator on the state of their application. Events relevant to cluster administrators are acceptable as well, though administrators usually also have the option of looking at component logs. Events are much more expensive than logs, so they are not expected to provide in-depth system debugging information. Instead, concentrate on things that are important from the application developer’s perspective. Events need to be either actionable or useful for understanding past or future system behavior. Events are not intended to drive automation; watching resource status should be sufficient for controllers.
The following are guidelines for adding Events to the system. They are not hard-and-fast rules, but should be considered by all contributors adding new Events and by members doing reviews.

1. Emit Events only when the state of the system changes or attempts to change. “It’s still running” Events are not interesting. Also, changes that do not add information beyond what is observable by watching the altered resources should not be duplicated as Events. Note that adding a reason for some action that can’t be inferred from the state change is considered additional information.
1. Limit Events to no more than one per change/attempt. There’s no need for Events on “About to do X” AND “Did X”/“Failed to do X”. The result is more interesting and implies an attempt.
   1. It may seem that this gets tricky with scale events, e.g. a Deployment scales a ReplicaSet, which creates/deletes Pods. For us those are 3 (or more) separate Events (3 different objects are affected), so it’s fine to emit multiple Events.
1. When an error occurs that prevents a user application from starting or from enacting other normal system behavior, such as object creation, an Event should be emitted (e.g. invalid image).
   1. Note that Events are garbage collected, so every user-actionable error needs to be surfaced via resource status as well.
   1. It’s usually OK to emit failure Events for each failure. The dedup mechanism will deal with that. The exception is failures that are frequent but typically ephemeral and automatically repairable/recoverable, such as broken socket connections. To mitigate Event spam, these should only be reported if persistent and unrepairable.
1. When a user application stops running for any reason, an Event should be emitted (e.g. Pod evicted because Node is under memory pressure).
1. If it’s a system-wide change of state that may impact currently running applications or may have a severe impact on future workload schedulability, an Event should be emitted (e.g. Node became unreachable, Failed to create route for Node).
1. If it doesn’t fit any of the above scenarios, you should consider not emitting an Event.
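The first two guidelines can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `Phase`, `Recorder`, `reportTransition`, and the `PhaseChanged` reason are hypothetical names invented for this sketch and are not part of any Kubernetes API.

```go
package main

import "fmt"

// Phase is a hypothetical, simplified object state used only in this sketch.
type Phase string

// Recorder is a hypothetical stand-in for the component that records Events.
type Recorder struct {
	Emitted []string
}

// Emit stores one Event as a "reason: note" string.
func (r *Recorder) Emit(reason, note string) {
	r.Emitted = append(r.Emitted, reason+": "+note)
}

// reportTransition emits at most one Event per observed change and stays
// silent when nothing changed. The why argument carries the reason that
// cannot be inferred from the state change itself.
func reportTransition(r *Recorder, prev, next Phase, why string) {
	if prev == next {
		return // "it's still running" is not interesting
	}
	r.Emit("PhaseChanged", fmt.Sprintf("%s -> %s (%s)", prev, next, why))
}

func main() {
	r := &Recorder{}
	reportTransition(r, "Running", "Running", "")
	reportTransition(r, "Running", "Failed", "Node is under memory pressure")
	for _, e := range r.Emitted {
		fmt.Println(e)
	}
}
```

Only the second call records an Event, and it reports the result of the change rather than an “About to do X” notice.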
The new Event API tries to use more descriptive field names to influence how Events are structured. An Event has the following fields:

* Regarding
* Related
* ReportingController
* ReportingInstance
* Action
* Reason
* Type
* Note
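As a rough illustration, the fields could be pictured as the Go struct below. Only the field names come from the list above. The field types, the `ObjectRef` helper, the `exampleEvent` function, and every value in it are assumptions made for this sketch, not the actual API definition.

```go
package main

import "fmt"

// ObjectRef is a hypothetical, simplified reference to an API object.
type ObjectRef struct {
	Kind      string
	Namespace string
	Name      string
}

// Event mirrors the field names listed above. All types are assumptions
// chosen to keep the sketch self-contained.
type Event struct {
	Regarding           ObjectRef
	Related             *ObjectRef // assumed optional, hence a pointer
	ReportingController string
	ReportingInstance   string
	Action              string
	Reason              string
	Type                string
	Note                string
}

// exampleEvent builds an illustrative Event for the eviction scenario
// mentioned in the guidelines. Every value here is made up.
func exampleEvent() Event {
	return Event{
		Regarding:           ObjectRef{Kind: "Pod", Namespace: "default", Name: "web-0"},
		Related:             &ObjectRef{Kind: "Node", Name: "node-1"},
		ReportingController: "example-controller",
		ReportingInstance:   "example-controller-node-1",
		Action:              "Evicted",
		Reason:              "NodeUnderMemoryPressure",
		Type:                "Warning",
		Note:                "Pod evicted because Node is under memory pressure",
	}
}

func main() {
	e := exampleEvent()
	fmt.Printf("%s %s/%s: %s (%s)\n",
		e.Regarding.Kind, e.Regarding.Namespace, e.Regarding.Name, e.Action, e.Reason)
}
```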
The Event should be structured so that the following sentence “makes sense”: