NFTables mode for kube-proxy
A new nftables mode for kube-proxy was introduced as an alpha feature in Kubernetes 1.29. Currently in beta, it is expected to be GA as of 1.33. The new mode fixes long-standing performance problems with the iptables mode, and all users running on systems with reasonably recent kernels are encouraged to try it out. (For compatibility reasons, even once nftables becomes GA, iptables will still be the default.)
Why nftables? Part 1: data plane latency
The iptables API was designed for implementing simple firewalls, and has problems scaling up to support Service proxying in a large Kubernetes cluster with tens of thousands of Services.
In general, the ruleset generated by kube-proxy in iptables mode has a number of iptables rules proportional to the sum of the number of Services and the total number of endpoints. In particular, at the top level of the ruleset, there is one rule to test each possible Service IP (and port) that a packet might be addressed to:
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O
# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT
# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
This means that when a packet comes in, the time it takes the kernel to check it against all of the Service rules is O(n) in the number of Services. As the number of Services increases, both the average and the worst-case latency for the first packet of a new connection increase (with the difference between best-case, average, and worst-case being mostly determined by whether a given Service IP address appears earlier or later in the KUBE-SERVICES chain).
By contrast, with nftables, the normal way to write a ruleset like this is to have a single rule, using a "verdict map" to do the dispatch:
table ip kube-proxy {
    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in service-ips; and if there's a match, execute the
    # associated verdict".)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }

    ...
}
Since there's only a single rule, with a roughly O(1) map lookup, packet processing time is more or less constant regardless of cluster size, and the best/average/worst cases are very similar:
But note the huge difference in the vertical scale between the iptables and nftables graphs! In the clusters with 5000 and 10,000 Services, the p50 (median) latency for nftables is about the same as the p01 (approximately best-case) latency for iptables. In the 30,000 Service cluster, the p99 (approximately worst-case) latency for nftables manages to beat out the p01 latency for iptables by a few microseconds! Here are both sets of data together, but you may have to squint to see the nftables results:
Why nftables? Part 2: control plane latency
While the improvements to data plane latency in large clusters are great, there's another problem with iptables kube-proxy that often keeps users from even being able to grow their clusters to that size: the time it takes kube-proxy to program new iptables rules when Services and their endpoints change.
With both iptables and nftables, the total size of the ruleset as a whole (actual rules, plus associated data) is O(n) in the combined number of Services and their endpoints. Originally, the iptables backend would rewrite every rule on every update, and with tens of thousands of Services, this could grow to be hundreds of thousands of iptables rules. Starting in Kubernetes 1.26, we began improving kube-proxy so that it could skip updating most of the unchanged rules in each update, but the limitations of iptables-restore as an API meant that it was still always necessary to send an update that's O(n) in the number of Services (though with a noticeably smaller constant than it used to be). Even with those optimizations, it can still be necessary to make use of kube-proxy's minSyncPeriod config option to ensure that it doesn't spend every waking second trying to push iptables updates.
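For illustration, here is a minimal sketch of what raising that throttle looks like in a kube-proxy configuration file (the 30s value is purely illustrative, not a recommendation):

# A kube-proxy configuration sketch for the iptables backend.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: iptables
iptables:
  # Wait at least this long between full iptables resyncs, batching up any
  # Service and endpoint changes that arrive in the meantime.
  minSyncPeriod: 30s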
The nftables APIs allow for doing much more incremental updates, and when kube-proxy in nftables mode does an update, the size of the update is only O(n) in the number of Services and endpoints that have changed since the last sync, regardless of the total number of Services and endpoints. The fact that the nftables API allows each nftables-using component to have its own private table also means that there is no global lock contention between components like with iptables. As a result, kube-proxy's nftables updates can be done much more efficiently than with iptables.
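For illustration, here is roughly what such an incremental update looks like at the nft command-line level. This is a hand-written sketch (the chain name and IP address are made up, and kube-proxy builds its updates as a single transaction rather than running individual commands like this), but it shows how one Service can be added or removed without touching the rest of the ruleset:

# Create a chain for the new (hypothetical) Service and register its
# cluster IP and port in the existing service-ips verdict map.
nft add chain ip kube-proxy service-EXAMPLE-chain
nft add element ip kube-proxy service-ips \
    { 172.30.0.44 . tcp . 80 : goto service-EXAMPLE-chain }

# Removing the Service later is likewise a single-element operation.
nft delete element ip kube-proxy service-ips { 172.30.0.44 . tcp . 80 }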
(Unfortunately I don't have cool graphs for this part.)
Why not nftables?
All that said, there are a few reasons why you might not want to jump right into using the nftables backend for now.
First, the code is still fairly new. While it has plenty of unit tests, performs correctly in our CI system, and has now been used in the real world by multiple users, it has not seen anything close to as much real-world usage as the iptables backend has, so we can't promise that it is as stable and bug-free as the iptables backend is.
Second, the nftables mode will not work on older Linux distributions; currently it requires a 5.13 or newer kernel. Additionally, because of bugs in early versions of the nft command line tool, you should not run kube-proxy in nftables mode on nodes that have an old (earlier than 1.0.0) version of nft in the host filesystem (or else kube-proxy's use of nftables may interfere with other uses of nftables on the system).
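You can check both prerequisites on a node with standard commands (a quick sketch; the exact output will vary by distribution):

# The kernel must be 5.13 or newer for kube-proxy's nftables mode.
uname -r

# If the host has the nft tool installed, it should be 1.0.0 or newer.
nft --version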
Third, you may have other networking components in your cluster, such as the pod network or NetworkPolicy implementation, that do not yet support kube-proxy in nftables mode. You should consult the documentation (or forums, bug tracker, etc.) for any such components to see if they have problems with nftables mode. (In many cases they will not; as long as they don't try to directly interact with or override kube-proxy's iptables rules, they shouldn't care whether kube-proxy is using iptables or nftables.) Additionally, observability and monitoring tools that have not been updated may report less data for kube-proxy in nftables mode than they do for kube-proxy in iptables mode.
Finally, kube-proxy in nftables mode is intentionally not 100% compatible with kube-proxy in iptables mode. There are a few old kube-proxy features whose default behaviors are less secure, less performant, or less intuitive than we'd like, but where we felt that changing the default would be a compatibility break. Since the nftables mode is opt-in, this gave us a chance to fix those bad defaults without breaking users who weren't expecting changes. (In particular, with nftables mode, NodePort Services are now only reachable on their nodes' default IPs, as opposed to being reachable on all IPs, including 127.0.0.1, with iptables mode.) The kube-proxy documentation has more information about this, including information about metrics you can look at to determine if you are relying on any of the changed functionality, and what configuration options are available to get more backward-compatible behavior.
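As a sketch of one such option (see the kube-proxy documentation for the full list), the nodePortAddresses field controls which node IP ranges accept NodePort traffic; the CIDRs below are purely illustrative:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables
# Accept NodePort traffic on these node IP ranges (illustrative values)
# rather than only on each node's default IP.
nodePortAddresses:
  - 192.168.1.0/24
  - 10.0.0.0/8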
Trying out nftables mode
Ready to try it out? In Kubernetes 1.31 and later, you just need to pass --proxy-mode nftables to kube-proxy (or set mode: nftables in your kube-proxy config file).
If you are using kubeadm to set up your cluster, the kubeadm documentation explains how to pass a KubeProxyConfiguration to kubeadm init. You can also deploy nftables-based clusters with kind.
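For the kubeadm case, a minimal sketch of such a configuration file (assuming the v1beta4 kubeadm config API used by recent releases; adjust the version and add your other settings as needed) might look like this:

# kubeadm-config.yaml: kubeadm documents plus a KubeProxyConfiguration.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables

You would then pass it to kubeadm with kubeadm init --config kubeadm-config.yaml.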
You can also convert existing clusters from iptables (or ipvs) mode to nftables by updating the kube-proxy configuration and restarting the kube-proxy pods. (You do not need to reboot the nodes: when restarting in nftables mode, kube-proxy will delete any existing iptables or ipvs rules, and likewise, if you later revert back to iptables or ipvs mode, it will delete any existing nftables rules.)
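For example, on a kubeadm-built cluster, where kube-proxy runs as a DaemonSet and reads its configuration from the kube-proxy ConfigMap in kube-system, the switch looks roughly like this (a sketch; the details depend on how your cluster deploys kube-proxy):

# Edit the kube-proxy configuration to set "mode: nftables"...
kubectl -n kube-system edit configmap kube-proxy

# ...then restart the kube-proxy pods so that they pick up the new mode.
kubectl -n kube-system rollout restart daemonset kube-proxy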
Future plans
As mentioned above, while nftables is now the best kube-proxy mode, it is not the default, and we do not yet have a plan for changing that. We will continue to support the iptables mode for a long time.
The future of the IPVS mode of kube-proxy is less certain: its main advantage over iptables was that it was faster, but certain aspects of the IPVS architecture and APIs were awkward for kube-proxy's purposes (for example, the fact that the kube-ipvs0 device needs to have every Service IP address assigned to it), and some parts of Kubernetes Service proxying semantics were difficult to implement using IPVS (particularly the fact that some Services had to have different endpoints depending on whether you connected to them from a local or remote client). And now, the nftables mode has the same performance as IPVS mode (actually, slightly better), without any of the downsides:
(In theory the IPVS mode also has the advantage of being able to use various other IPVS functionality, like alternative "schedulers" for balancing endpoints. In practice, this ended up not being very useful, because kube-proxy runs independently on every node, and the IPVS schedulers on each node had no way of sharing their state with the proxies on other nodes, thus thwarting the effort to balance traffic more cleverly.)
While the Kubernetes project does not have an immediate plan to drop the IPVS backend, it is probably doomed in the long run, and people who are currently using IPVS mode should try out the nftables mode instead (and file bugs if you think there is missing functionality in nftables mode that you can't work around).
Learn more
"KEP-3866: Add an nftables-based kube-proxy backend" has the history of the new feature.
"How the Tables Have Turned: Kubernetes Says Goodbye to IPTables", from KubeCon/CloudNativeCon North America 2024, talks about porting kube-proxy and Calico from iptables to nftables.
"From Observability to Performance", from KubeCon/CloudNativeCon North America 2024. (This is where the kube-proxy latency data came from; the raw data for the charts is also available.)