Hardware Accelerated SSL/TLS Termination in Ingress Controllers using Kubernetes Device Plugins and RuntimeClass
Authors: Mikko Ylinen (Intel)
A Kubernetes Ingress is a way to connect cluster services to the world outside the cluster. In order to correctly route the traffic to service backends, the cluster needs an Ingress controller. The Ingress controller is responsible for setting the right destinations to backends based on the Ingress API objects’ information. The actual traffic is routed through a proxy server that is responsible for tasks such as load balancing and SSL/TLS (later “SSL” refers to both SSL or TLS ) termination. The SSL termination is a CPU heavy operation due to the crypto operations involved. To offload some of the CPU intensive work away from the CPU, OpenSSL based proxy servers can take the benefit of OpenSSL Engine API and dedicated crypto hardware. This frees CPU cycles for other things and improves the overall throughput of the proxy server.
In this blog post, we will show how easy it is to make hardware accelerated crypto available for containers running the Ingress controller proxy using some of the recently created Kubernetes building blocks: Device plugin framework and RuntimeClass. At the end, a reference setup is given using an HAproxy based Ingress controller accelerated using Intel® QuickAssist Technology cards.
About Proxies, OpenSSL Engine and Crypto Hardware
The proxy server plays a vital role in a Kubernetes Ingress Controller function. It proxies the traffic to the backends per Ingress objects routes. Under heavy traffic load, the performance becomes critical especially if the proxying involves CPU intensive operations like SSL crypto.
The OpenSSL project provides the widely adopted library for implementing the SSL protocol. Of the commonly known proxy servers used by Kubernetes Ingress controllers, Nginx and HAproxy use OpenSSL. The CNCF graduated Envoy proxy uses BoringSSL but there seems to be community interest in having OpenSSL as the alternative for it too.
The OpenSSL SSL protocol library relies on libcrypto that implements the cryptographic functions. For quite some time now (first introduced in 0.9.6 release), OpenSSL has provided an ENGINE concept that allows these cryptographic operations to be offloaded to a dedicated crypto acceleration hardware. Later, a special dynamic ENGINE enabled the crypto hardware specific pieces to be implemented in an independent loadable module that can be developed outside the OpenSSL code base and distributed separately. From the application’s perspective, this is also ideal because they don’t need to know the details of how to use the hardware, and the hardware specific module can be loaded/used when the hardware is available.
Hardware based crypto can greatly improve Cloud applications’ performance due to hardware accelerated processing in SSL operations as discussed, and can provide other crypto services like key/random number generation. Clouds can make the hardware easily available using the dynamic ENGINE and several loadable module implementations exist, for example, CloudHSM, IBMCA, or QAT Engine.
For Cloud deployments, the ideal scenario is for these modules to be shipped as part of the container workload. The workload would get scheduled on a node that provides the underlying hardware that the module needs to access. On the other hand, the workloads should run the same way and without code modifications regardless of the crypto acceleration hardware being available or not. The OpenSSL dynamic engine enables this. Figure 1 below illustrates these two scenarios using a typical Ingress Controller container as an example. The red colored boxes indicate the differences between a container with a crypto hardware engine enabled container vs. a “standard” one. It’s worth pointing out that the configuration changes shown do not necessarily require another version of the container since the configurations could be managed, e.g., using ConfigMaps.
Hardware Resources and Isolation
To be able to deploy workloads with hardware dependencies, Kubernetes provides excellent extension and configurability mechanisms. Let’s take a closer look into Kubernetes the device plugin framework (beta in 1.14) and RuntimeClass (beta in 1.14) and learn how they can be leveraged to expose crypto hardware to workloads.
The device plugin framework, first introduced in Kubernetes 1.8, provides a way for hardware vendors to register and allocate node hardware resources to Kubelets. The plugins implement the hardware specific initialization logic and resource management. The pods can request hardware resources in their PodSpec, which also guarantees the pod is scheduled on a node that can provide those resources.
The device resource allocation for containers is non-trivial. For applications dealing with security, the hardware level isolation is critical. The PCIe based crypto acceleration device functions can benefit from IO hardware virtualization, through an I/O Memory Management Unit (IOMMU), to provide the isolation: an IOMMU group the device belongs to provides the isolated resource for a workload (assuming the crypto cards do not share the IOMMU group with other devices). The number of isolated resources can be further increased if the PCIe device supports the Single-Root I/O Virtualization (SR-IOV) specification. SR-IOV allows the PCIe device to be split further to virtual functions (VF), derived from physical function (PF) devices, and each belonging to their own IOMMU group. To expose these IOMMU isolated device functions to user space and containers, the host kernel should bind them to a specific device driver. In Linux, this driver is vfio-pci and it makes each device available through a character device in user space. The kernel vfio-pci driver provides user space applications with a direct, IOMMU backed access to PCIe devices and functions, using a mechanism called PCI passthrough. The interface can be leveraged by user space frameworks, such as the Data Plane Development Kit (DPDK). Additionally, virtual machine (VM) hypervisors can provide these user space device nodes to VMs and expose them as PCI devices to the guest kernel. Assuming support from the guest kernel, the VM gets close to native performant direct access to the underlying host devices.
To advertise these device resources to Kubernetes, we can have a simple Kubernetes device plugin
that runs the initialization (i.e., binding), calls kubelet’s
Registration gRPC service, and
implements the DevicePlugin gRPC service that kubelet calls to, e.g., to
Allocate the resources
upon Pod creation.
Device Assignment and Pod Deployment
At this point, you may ask what the container could do with a VFIO device node? The answer comes after we first take a quick look into the Kubernetes RuntimeClass.
The Kubernetes RuntimeClass was created to provide better control and configurability over a variety of runtimes (an earlier blog post goes into the details of the needs, status and roadmap for it) that are available in the cluster. In essence, the RuntimeClass provides cluster users better tools to pick and use the runtime that best suits for the pod use case.
The OCI compatible Kata Containers runtime provides workloads with a hardware virtualized
isolation layer. In addition to workload isolation, the Kata Containers VM has the added
side benefit that the VFIO devices, as
Allocate’d by the device plugin, can be passed
through to the container as hardware isolated devices. The only requirement is that the
Kata Containers kernel has driver for the exposed device enabled.
That’s all it really takes to enable hardware accelerated crypto for container workloads. To summarize:
- Cluster needs a device plugin running on the node that provides the hardware
- Device plugin exposes the hardware to user space using the VFIO driver
- Pod requests the device resources and Kata Containers as the RuntimeClass in the PodSpec
- The container has the hardware adaptation library and the OpenSSL engine module
Figure 2 shows the overall setup using the Container A illustrated earlier.
Finally, we describe the necessary building blocks and steps to build a functional setup described in Figure 2 that enables hardware accelerated SSL termination in an Ingress Controller using an Intel® QuickAssist Technology (QAT) PCIe device. It should be noted that the use cases are not limited to Ingress controllers, but any OpenSSL based workload can be accelerated.
- Kubernetes 1.14 (
DevicePluginfeature gates enabled (both are
- RuntimeClass ready runtime and Kata Containers configured
- Intel® QAT driver release with the kernel drivers installed for both host kernel and Kata Containers kernel (or on a rootfs as loadable modules)
- QAT device plugin DaemonSet deployed
Ingress controller configuration and deployment:
- HAproxy-ingress ingress controller in a modified container that has
- the QAT HW HAL user space library (part of Intel® QAT SW release) and
- the OpenSSL QAT Engine built in
- Haproxy-ingress ConfigMap to enable QAT engine usage
- Haproxy-ingress deployment
runtimeClassName: kata-containers(name value depends on cluster config)
- (QAT device config file for each requested device resource with OpenSSL engine configured available in the container)
Once the building blocks are available, the hardware accelerated SSL/TLS can be tested by following the TLS termination
example steps. In order to verify the hardware is used, you can check
/sys/kernel/debug/*/fw_counters files on host as they
get updated by the Intel® QAT firmware.
Haproxy-ingress and HAproxy are used because HAproxy can be directly configured to use the OpenSSL engine using
ssl-engine <name> [algo ALGOs] configuration flag without modifications to the global openssl configuration file.
Moreover, HAproxy can offload configured algorithms using asynchronous calls (with
ssl-mode-async) to further improve performance.
Call to Action
In this blog post we have shown how Kubernetes Device Plugins and RuntimeClass can be used to provide isolated hardware access for applications in pods to offload crypto operations to hardware accelerators. Hardware accelerators can be used to speed up crypto operations and also save CPU cycles to other tasks. We demonstrated the setup using HAproxy that already supports asynchronous crypto offload with OpenSSL.
The next steps for our team is to repeat the same for Envoy (with an OpenSSL based TLS transport socket built as an extension). Furthermore, we are working to enhance Envoy to be able to offload BoringSSL asynchronous private key operations to a crypto acceleration hardware. Any review feedback or help is appreciated!
How many CPU cycles can your crypto application save for other tasks when offloading crypto processing to a dedicated accelerator?