linux-stable/drivers
Lukas Wunner 42df6e1d22 netfilter: Introduce egress hook
Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
  on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
  such as local ARP traffic generated for clustering purposes or DHCP
  (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
  and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)

The egress hook introduced herein complements the ingress hook added by
commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key").  A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.

As an alternative or in addition to netfilter, packets can be classified
with traffic control (tc).  On ingress, packets are classified first by
tc, then by netfilter.  On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.

Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:
  tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

In this case, netfilter egress classifying is not performed when leaving
the host namespace!  That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying.  That is only logical since it hasn't
passed through netfilter ingress classifying either.

Packets can alternatively be redirected at the netfilter layer using
nft fwd.  Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.

Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress().  This ensures that
netfilter is applied both on the overlay and underlying network.
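
The flag handling described above can be modeled in user space.  The
following C sketch is a toy model, not kernel code: toy_skb, toy_dev and
toy_dev_queue_xmit() are hypothetical stand-ins, and the run counter
exists only to make the behavior observable.  It assumes the egress
order stated earlier (netfilter first, then tc) and that a tunnel device
hands the packet to its underlying device after tc has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sk_buff and struct net_device. */
struct toy_skb {
	bool nf_skip_egress;	/* models skb->nf_skip_egress */
	int nf_egress_runs;	/* how often netfilter egress saw the packet */
};

struct toy_dev {
	struct toy_dev *lower;	/* underlying device of a tunnel, else NULL */
};

/* Models the egress part of __dev_queue_xmit() as described above. */
static void toy_dev_queue_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
	assert(skb != NULL && dev != NULL);

	/* Netfilter egress runs first unless the packet is on the tc layer. */
	if (!skb->nf_skip_egress)
		skb->nf_egress_runs++;

	/* While tc egress runs, a tc redirect must not re-enter netfilter. */
	skb->nf_skip_egress = true;
	/* ... sch_handle_egress() would classify the packet here ... */
	skb->nf_skip_egress = false;	/* reverted after tc egress */

	/* A tunnel driver transmits on the underlay: recursive call. */
	if (dev->lower != NULL)
		toy_dev_queue_xmit(skb, dev->lower);
}
```

With "struct toy_dev eth = { NULL }, vxlan = { &eth };", transmitting a
fresh packet on vxlan leaves nf_egress_runs at 2 (overlay and underlay),
whereas a packet that arrives with nf_skip_egress already set, as after
a tc redirect, bypasses the hook on the first device it is queued on.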

Interaction between tc and netfilter is possible by setting and querying
skb->mark.

If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key, and the measured
throughput difference is not discernible from noise:

Before:             1537 1538 1538 1537 1538 1537 Mb/sec
After:              1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
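
To put numbers on "not discernible from noise", the rows above can be
averaged.  The C sketch below only does that arithmetic on the samples
copied from the table; the helper names mean_mbps() and mean_delta() are
made up for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NSAMPLES 6

/* Throughput samples in Mb/sec, copied from the table above. */
static const int before_plain[NSAMPLES]  = { 1537, 1538, 1538, 1537, 1538, 1537 };
static const int after_plain[NSAMPLES]   = { 1536, 1534, 1539, 1539, 1539, 1540 };
static const int before_accept[NSAMPLES] = { 1418, 1418, 1418, 1419, 1419, 1418 };
static const int after_accept[NSAMPLES]  = { 1419, 1424, 1418, 1419, 1422, 1420 };
static const int before_drop[NSAMPLES]   = { 1620, 1619, 1619, 1619, 1620, 1620 };
static const int after_drop[NSAMPLES]    = { 1616, 1624, 1625, 1624, 1622, 1619 };

/* Arithmetic mean of n throughput samples. */
static double mean_mbps(const int *samples, size_t n)
{
	long sum = 0;

	assert(samples != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += samples[i];
	return (double)sum / (double)n;
}

/* Difference of means (after - before) for one pair of rows. */
static double mean_delta(const int *before, const int *after, size_t n)
{
	return mean_mbps(after, n) - mean_mbps(before, n);
}
```

For the three pairs the deltas come to roughly +0.3, +2.0 and +2.2
Mb/sec, each well under 1% of the respective mean.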

When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured.  That is caused by checking dev->nf_hooks_egress
against NULL.
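
The two-level check described above can be sketched as follows.  This is
a user-space illustration under stated assumptions, not the kernel
implementation: sketch_dev, sketch_hooks, egress_hook_users and
sketch_nf_egress() are hypothetical names, and a plain integer stands in
for the static_key.  A real static key patches the branch itself at
runtime, so the disabled case costs less than the integer test shown
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device's list of egress hook entries. */
struct sketch_hooks {
	int nrules;
};

/* Hypothetical stand-in for struct net_device. */
struct sketch_dev {
	struct sketch_hooks *nf_hooks_egress;	/* NULL: no egress rules here */
};

/* Stand-in for the static key: number of devices with egress hooks. */
static int egress_hook_users;

enum egress_path {
	EGRESS_PATCHED_OUT,	/* no interface uses netfilter egress */
	EGRESS_NULL_CHECK_ONLY,	/* enabled elsewhere, not on this device */
	EGRESS_HOOKS_RUN,	/* this device has egress rules */
};

/* Reports which path an egress packet on 'dev' would take. */
static enum egress_path sketch_nf_egress(const struct sketch_dev *dev)
{
	assert(dev != NULL);

	if (egress_hook_users == 0)
		return EGRESS_PATCHED_OUT;
	if (dev->nf_hooks_egress == NULL)
		return EGRESS_NULL_CHECK_ONLY;
	return EGRESS_HOOKS_RUN;
}
```

The middle case is the one the paragraph above refers to: once any
interface has egress rules, packets on every other interface still pay
for the NULL comparison.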

Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
  ip link add dev foo type dummy
  ip link set dev foo up
  modprobe pktgen
  echo "add_device foo" > /proc/net/pktgen/kpktgend_3
  samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Accept all traffic with tc:
  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Drop all traffic with tc:
  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14 23:06:28 +02:00
accessibility
acpi Revert "ACPI: Add memory semantics to acpi_os_map_memory()" 2021-09-23 20:39:36 +02:00
amba
android binder: make sure fd closes complete 2021-09-14 09:02:13 +02:00
ata
atm
auxdisplay
base device property: move mac addr helpers to eth.c 2021-10-07 13:39:51 +01:00
bcma bcma: drop unneeded initialization value 2021-10-05 08:32:30 +03:00
block virtio,vdpa,vhost: features, fixes 2021-09-11 14:48:42 -07:00
bluetooth Bluetooth: Rename driver .prevent_wake to .wakeup 2021-10-01 15:46:15 -07:00
bus
cdrom
char IPMI: A couple of very minor fixes for style and rate limiting 2021-09-12 11:44:58 -07:00
clk One patch to fix an unused variable warning in a Qualcomm clk driver. 2021-09-11 10:05:56 -07:00
clocksource
comedi comedi: Fix memory leak in compat_insnlist() 2021-09-21 17:53:54 +02:00
connector
counter
cpufreq Power management fixes for 5.15-rc2 2021-09-17 12:05:04 -07:00
cpuidle - Core Frameworks 2021-09-07 12:38:59 -07:00
crypto crypto: ccp - fix resource leaks in ccp_run_aes_gcm_cmd() 2021-09-24 15:58:41 +08:00
cxl cxl for v5.15 2021-09-09 11:48:27 -07:00
dax libnvdimm for v5.15 2021-09-09 11:39:57 -07:00
dca
devfreq devfreq: use HZ macros 2021-09-08 11:50:26 -07:00
dio
dma dmaengine updates for v5.15-rc1 2021-09-09 11:07:47 -07:00
dma-buf dma-buf: DMABUF_SYSFS_STATS should depend on DMA_SHARED_BUFFER 2021-09-07 12:42:21 +05:30
edac EDAC/dmc520: Assign the proper type to dimm->edac_mode 2021-09-16 11:00:12 +02:00
eisa
extcon
firewire FireWire (IEEE 1394) subsystem updates: 2021-09-11 09:47:33 -07:00
firmware - Add the tegra3 thermal sensor and fix the compilation testing on 2021-09-11 09:20:57 -07:00
fpga fpga: dfl: Avoid reads to AFU CSRs during enumeration 2021-09-16 15:20:55 -07:00
fsi
gnss
gpio gpio fixes for v5.15-rc4 2021-09-30 12:11:35 -07:00
gpu drm fixes for 5.15-rc3 2021-09-24 10:14:19 -07:00
greybus
hid HID: amd_sfh: Fix potential NULL pointer dereference 2021-09-27 10:00:43 +02:00
hsi
hv hyperv-fixes for 5.15-rc2 2021-09-15 17:18:56 -07:00
hwmon Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
hwspinlock
hwtracing coresight: syscfg: Fix compiler warning 2021-09-14 09:03:16 +02:00
i2c
i3c
idle
iio Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
infiniband mlx4: replace mlx4_mac_to_u64() with ether_addr_to_u64() 2021-10-05 13:15:35 +01:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2021-09-11 09:08:28 -07:00
interconnect
iommu virtio,vdpa,vhost: features, fixes 2021-09-11 14:48:42 -07:00
ipack
irqchip irqchip/gic: Work around broken Renesas integration 2021-09-22 14:44:25 +01:00
isdn
leds
macintosh memblock: introduce saner 'memblock_free_ptr()' interface 2021-09-14 13:23:22 -07:00
mailbox
mcb mcb: fix error handling in mcb_alloc_bus() 2021-09-14 11:22:26 +02:00
md md: fix a lock order reversal in md_alloc 2021-09-22 08:45:58 -07:00
media media fixes for v5.15-rc4 2021-09-27 13:05:12 -07:00
memory
memstick
message
mfd - Core Frameworks 2021-09-07 12:38:59 -07:00
misc misc: bcm-vk: fix tty registration race 2021-09-21 16:17:15 +02:00
mmc mmc: renesas_sdhi: fix regression with hard reset on old SDHIs 2021-09-06 18:10:49 +02:00
most
mtd Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
mux
net netfilter: Introduce egress hook 2021-10-14 23:06:28 +02:00
nfc nfc: pn533: Constify pn533_phy_ops 2021-10-07 13:35:10 +01:00
ntb Bug fixes and clean-ups for Linux v5.15 2021-09-07 13:05:02 -07:00
nubus
nvdimm cxl for v5.15 2021-09-09 11:48:27 -07:00
nvme nvme: keep ctrl->namespaces ordered 2021-09-21 09:17:15 +02:00
nvmem nvmem: NVMEM_NINTENDO_OTP should depend on WII 2021-09-21 17:38:37 +02:00
of of: net: move of_net under net/ 2021-10-07 13:39:51 +01:00
opp
parisc parisc: Move pci_dev_is_behind_card_dino to where it is used 2021-09-09 12:44:31 +02:00
parport
pci xen: branch for v5.15-rc3 2021-09-25 15:37:31 -07:00
pcmcia
perf KVM: arm64: Fix PMU probe ordering 2021-09-20 12:43:34 +01:00
phy Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
pinctrl pinctrl: qcom: sc7280: Add PM suspend callbacks 2021-09-23 23:09:14 +02:00
platform platform/x86: gigabyte-wmi: add support for B550I Aorus Pro AX 2021-09-21 15:49:23 +02:00
pnp
power
powercap
pps
ps3
ptp ptp: ocp: Move devlink registration to be last devlink command 2021-09-27 16:32:00 +01:00
pwm
rapidio
ras
regulator regulator: max14577: Revert "regulator: max14577: Add proper module aliases strings" 2021-09-17 13:16:38 +01:00
remoteproc
reset
rpmsg
rtc rtc: cmos: Disable irq around direct invocation of cmos_interrupt() 2021-09-14 10:20:19 +02:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-09-30 14:49:21 -07:00
sbus
scsi qed: Update the TCP active termination 2 MSL timer ("TIME_WAIT") 2021-10-04 12:55:49 +01:00
sh
siox
slimbus
soc
soundwire
spi spi: Fix modalias issues 2021-09-22 11:58:24 -07:00
spmi
ssb
staging Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-09-30 14:49:21 -07:00
target scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT" 2021-09-22 00:17:29 -04:00
tc
tee
thermal thermal/drivers/tsens: Fix wrong check for tzd in irq handlers 2021-09-21 15:17:11 +02:00
thunderbolt thunderbolt: test: split up test cases in tb_test_credit_alloc_all 2021-09-06 12:27:03 -07:00
tty tty: unexport tty_ldisc_release 2021-09-14 11:18:47 +02:00
uio
usb USB-serial fixes for 5.15-rc3 2021-09-24 10:22:53 +02:00
vdpa vdpa/mlx5: Avoid executing set_vq_ready() if device is reset 2021-09-14 18:10:43 -04:00
vfio vfio/pci: add missing identifier name in argument of function prototype 2021-09-23 14:12:36 -06:00
vhost virtio,vdpa: fixes 2021-09-28 07:27:29 -07:00
video tgafb: clarify dependencies 2021-09-18 11:15:01 -07:00
virt
virtio virtio: don't fail on !of_device_is_compatible 2021-09-14 18:09:57 -04:00
visorbus
vlynq
vme
w1
watchdog watchdog/sb_watchdog: fix compilation problem due to COMPILE_TEST 2021-09-27 11:57:19 -07:00
xen xen: branch for v5.15-rc3 2021-09-25 15:37:31 -07:00
zorro
Kconfig
Makefile