Date: Apr 2016
Status: Design in progress, early implementation of requirements
Users should be able to request GPU resources for their workloads, as easily as for CPU or memory. Kubernetes should keep an inventory of machines with GPU hardware, schedule containers on appropriate nodes and set up the container environment with all that’s necessary to access the GPU. All of this should eventually be supported for clusters on either bare metal or cloud providers.
An increasing number of workloads, such as machine learning and seismic survey processing, benefit from offloading computations to graphics hardware. While not as tuned as traditional, dedicated high performance computing systems built around technologies such as MPI, a Kubernetes cluster can still be a great environment for organizations that also need to run a variety of “classic” workloads, such as databases, web serving, etc.
GPU support is hard to provide extensively and will thus take time to tame completely: device nodes, drivers and supporting libraries all vary across vendors, environments and container runtimes.
Currently, this document is mostly focused on the basic use case: running GPU code on g2.2xlarge EC2 machine instances using Docker. It constitutes a narrow enough scenario that it does not require large amounts of generic code yet. GCE doesn’t support GPUs at all; bare metal systems throw a lot of extra variables into the mix.
Later sections will outline future work to support a broader set of hardware, environments and container runtimes.
Before any scheduling can occur, we need to know what’s available out there. In v0, we’ll hardcode the capacity detected by the kubelet, based on a flag, `--experimental-nvidia-gpu`. This will result in the user-defined resource `alpha.kubernetes.io/nvidia-gpu` being reported for `NodeAllocatable`, as well as a node label.
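As a rough sketch of what that reporting amounts to, here is a hedged example using today’s `k8s.io/api` packages for illustration; the actual v0 code lives in the kubelet, and the label key shown is hypothetical, since the design only calls for “a node label”:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// addGPUCapacity mimics what the kubelet could do under
// --experimental-nvidia-gpu: advertise one whole device under the
// user-defined resource name and tag the node with a label.
func addGPUCapacity(node *v1.Node) {
	gpu := resource.MustParse("1") // hardcoded single device in v0
	node.Status.Capacity["alpha.kubernetes.io/nvidia-gpu"] = gpu
	node.Status.Allocatable["alpha.kubernetes.io/nvidia-gpu"] = gpu
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels["alpha.kubernetes.io/nvidia-gpu"] = "true" // hypothetical label key
}

func main() {
	node := &v1.Node{}
	node.Status.Capacity = v1.ResourceList{}
	node.Status.Allocatable = v1.ResourceList{}
	addGPUCapacity(node)
	fmt.Println(node.Status.Allocatable, node.Labels)
}
```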
GPUs will be visible as first-class resources. In v0, we’ll only assign whole devices; sharing among multiple pods is left to future implementations. It’s probable that GPUs will exacerbate the need for a rescheduler or pod priorities, especially if the nodes in a cluster are not homogeneous. Consider these two cases:
- Only half of the machines have a GPU and they’re all busy with other workloads. The other half of the cluster is doing very little work. A GPU workload comes in, but it can’t schedule, because the devices are sitting idle on nodes that are running something else and the nodes with little load lack the hardware.
- Some or all of the machines have two graphics cards each. A number of jobs get scheduled, requesting one device per pod. The scheduler puts them all on different machines, spreading the load, perhaps by design. Then a new job comes in, requiring two devices per pod, but it can’t schedule anywhere, because all we can find, at most, is one unused device per node.
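To make the second case concrete, here is a hedged sketch of a container requesting two whole devices through the alpha resource name; the image name is hypothetical and today’s `k8s.io/api` packages are used for illustration:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	c := v1.Container{
		Name:  "trainer",
		Image: "example.com/cuda-app:latest", // hypothetical image
		Resources: v1.ResourceRequirements{
			Limits: v1.ResourceList{
				// Two whole GPUs; v0 only assigns whole devices.
				"alpha.kubernetes.io/nvidia-gpu": resource.MustParse("2"),
			},
		},
	}
	fmt.Printf("%+v\n", c.Resources.Limits)
}
```

Such a pod stays pending until a single node has two unused devices at once, which spreading can easily prevent.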
Once we know where to run the container, it’s time to set up its environment. At a minimum, we’ll need to map the host device(s) into the container. Because each manufacturer exposes different device nodes (in NVIDIA’s case, not just the per-card `/dev/nvidia0` and the control node `/dev/nvidiactl`, but also the required `/dev/nvidia-uvm`), some of the logic needs to be hardware-specific, mapping from a logical device to a list of device nodes necessary for software to talk to it.
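A minimal sketch of that logical-to-physical mapping, using the NVIDIA device nodes named in this document; the function name and shape are assumptions, not the actual kubelet interface:

```go
package main

import "fmt"

// devicesForNvidiaGPU returns the /dev entries needed to drive logical GPU i
// on an NVIDIA system: the per-card node plus the shared control nodes.
func devicesForNvidiaGPU(i int) []string {
	return []string{
		fmt.Sprintf("/dev/nvidia%d", i), // per-card device node
		"/dev/nvidiactl",                // control device, always required
		"/dev/nvidia-uvm",               // unified virtual memory device
	}
}

func main() {
	fmt.Println(devicesForNvidiaGPU(0)) // [/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm]
}
```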
Support binaries and libraries are often versioned along with the kernel module, so there should be further hooks to project those under `/bin` and some kind of `/lib` before the application is started. This can be done for Docker with the use of a versioned Docker volume, or with upcoming Kubernetes-specific hooks such as init containers and volume containers. In v0, images are expected to bundle everything they need.
The first implementation and testing ground will be for NVIDIA devices, by far the most common setup.
In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices (limited to those required to drive the first card, `nvidia0`) being mapped into the container by the dockertools library.
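In engine-api terms, that mapping boils down to something like the following hedged sketch; the paths match the walkthrough below, while the cgroup permissions are an assumption:

```go
package main

import (
	"fmt"

	"github.com/docker/engine-api/types/container"
)

func main() {
	hostConfig := container.HostConfig{}
	for _, dev := range []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"} {
		hostConfig.Resources.Devices = append(hostConfig.Resources.Devices, container.DeviceMapping{
			PathOnHost:        dev,
			PathInContainer:   dev,
			CgroupPermissions: "rwm", // read, write, mknod
		})
	}
	fmt.Printf("%+v\n", hostConfig.Resources.Devices)
}
```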
This is what happens before and after a user schedules a GPU pod:

1. Administrator installs a number of Kubernetes nodes with GPUs. The correct kernel modules and device nodes under `/dev/` are present.
2. Administrator makes sure the latest CUDA/driver versions are installed.
3. Administrator sets `--experimental-nvidia-gpu` on kubelets.
4. Kubelets update node status with information about the GPU device, in addition to cAdvisor’s usual data about CPU/memory/disk.
5. User creates a Docker image compiling their application for CUDA, bundling the necessary libraries. In v0, we ignore any versioning requirements expressed in the image through labels based on NVIDIA’s conventions.
6. User creates a pod using the image, requiring one `alpha.kubernetes.io/nvidia-gpu` resource.
7. Scheduler picks a node for the pod.
8. The kubelet notices the GPU requirement and maps the three devices. In Docker’s engine-api, this means it’ll add them to the `Resources.Devices` list.
9. Docker runs the container to completion.
10. The scheduler notices that the device is available again.
For v0, we discussed the nvidia-docker plugin at length, but decided to leave it aside initially. The plugin is an officially supported solution that would avoid a lot of new low-level code, as it takes care of functionality such as:

- driver volumes with support binaries such as `nvidia-smi` and shared libraries
- `/dev` entry names for each device, as well as control ones like `/dev/nvidiactl` and `/dev/nvidia-uvm`
The `nvidia-docker` wrapper also verifies that the CUDA version required by a given image is supported by the host drivers, through inspection of well-known image labels, if present. We should try to provide equivalent checks, either for CUDA or OpenCL.
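As a hedged sketch of what an equivalent check could look like: the label key follows nvidia-docker’s `com.nvidia.cuda.version` convention, while the host driver query and the comparison logic below are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether dotted version a is >= b, e.g. atLeast("7.5", "7.0").
func atLeast(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) && i < len(bs); i++ {
		ai, _ := strconv.Atoi(as[i])
		bi, _ := strconv.Atoi(bs[i])
		if ai != bi {
			return ai > bi
		}
	}
	return len(as) >= len(bs)
}

func main() {
	// Label as it would come back from image inspection.
	imageLabels := map[string]string{"com.nvidia.cuda.version": "7.5"}
	hostCUDA := "7.5" // assumption: queried from the host driver

	if required, ok := imageLabels["com.nvidia.cuda.version"]; ok && !atLeast(hostCUDA, required) {
		fmt.Printf("image requires CUDA %s, host only supports %s\n", required, hostCUDA)
		return
	}
	fmt.Println("CUDA version check passed")
}
```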
This is current sample output from `nvidia-docker-plugin`, wrapped for readability:

```
$ curl -s localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
--volume-driver=nvidia-docker --volume=nvidia_driver_352.68:/usr/local/nvidia:ro
```
It runs as a daemon listening for HTTP requests on port 3476. The endpoint above returns flags that need to be added to the Docker command line in order to expose GPUs to the containers. There are optional URL arguments to request specific devices if more than one is present on the system, as well as specific versions of the support software. An obvious improvement would be an additional endpoint for JSON output.
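A kubelet-side consumer of that endpoint could be as simple as the following sketch; the address and error handling are illustrative:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

func main() {
	// Query the documented /docker/cli endpoint on the plugin's default port.
	resp, err := http.Get("http://localhost:3476/docker/cli")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The flags come back as one space-separated line; print one per line.
	for _, flag := range strings.Fields(string(body)) {
		fmt.Println(flag)
	}
}
```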
The unresolved question is whether `nvidia-docker-plugin` would run standalone as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes resource API) or whether the relevant code from its `nvidia` package should be linked directly into kubelet. A partial list of tradeoffs:
| | External binary | Linked in |
|---|---|---|
| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
| Expandability | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon | Can reuse the `nvidia` package directly |
| Bloat | None | Larger kubelet, even for systems without GPUs |
| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
| (Un)Marshalling | Need to talk over JSON | None |
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
| Releases | Potentially on its own schedule | Tied to Kubernetes |
The first two tracks can progress in parallel:

- Have `setNodeStatus`/`MachineInfo` report the resource
- Show the new resource in `kubectl describe` output. Optional for non-GPU users?
Above all, we need to collect feedback from real users and use that to set priorities for any of the items below.
It makes sense to turn the output of this project (external resource plugins, etc.) into a more generic abstraction at some point.
There should be knobs for the cluster administrator to only allow certain users or roles to schedule GPU workloads. Overcommitting or sharing the same device across different pods is not considered safe; if such sharing is ever allowed, it should be possible to segregate GPU-sharing pods by user, namespace or a combination thereof.