No description
Find a file
Pablo Neira Ayuso d19e8bf3ea netfilter: nf_tables: GC transaction API to avoid race with control plane
commit 5f68718b34 upstream.

The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.

The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.

We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get following deadlock:

   cpu 1                                cpu2
     GC work                            transaction comes in , lock nft mutex
       `acquire nft mutex // BLOCKS
                                        transaction asks to remove the set
                                        set destruction calls cancel_work_sync()

cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.

This patch adds a new API that deals with garbage collection in two
steps:

1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
   so they are not visible via lookup. Annotate current GC sequence in
   the GC transaction. Enqueue GC transaction work as soon as it is
   full. If ruleset is updated, then GC transaction is aborted and
   retried later.

2) GC work grabs the mutex. If GC sequence has changed then this GC
   transaction lost race with control plane, abort it as it contains
   stale references to objects and let GC try again later. If the
   ruleset is intact, then this GC transaction deactivates and removes
   the elements and it uses call_rcu() to destroy elements.

Note that no elements are removed from GC lockless path, the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore too. There is a new set->dead flag that is set on to
abort the GC transaction to deal with set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.

To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.

Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.

We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
its called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.

This gives us the choice of ABBA deadlock or UaF.

To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.

Set backends are adapted to use the GC transaction API in a follow up
patch entitled:

  ("netfilter: nf_tables: use gc transaction API in set backends")

This is joint work with Florian Westphal.

Fixes: cfed7e1b1f ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06 13:18:02 +02:00
arch x86/purgatory: Remove LTO flags 2023-09-23 11:10:01 +02:00
block block: don't add or resize partition on the disk with GENHD_FL_NO_PART 2023-09-19 12:23:03 +02:00
certs certs/blacklist_hashes.c: fix const confusion in certs blacklist 2022-06-22 14:22:01 +02:00
crypto crypto: lrw,xts - Replace strlcpy with strscpy 2023-09-23 11:09:55 +02:00
Documentation perf/smmuv3: Enable HiSilicon Erratum 162001900 quirk for HIP08/09 2023-09-23 11:09:55 +02:00
drivers ata: libahci: clear pending interrupt status 2023-10-06 13:18:01 +02:00
fs ext4: do not let fstrim block system suspend 2023-10-06 13:18:01 +02:00
include netfilter: nf_tables: GC transaction API to avoid race with control plane 2023-10-06 13:18:02 +02:00
init x86/mm: Initialize text poking earlier 2023-08-08 19:58:33 +02:00
io_uring io_uring: break iopolling on signal 2023-09-19 12:22:54 +02:00
ipc ipc/sem: Fix dangling sem_array access in semtimedop race 2022-12-08 11:28:45 +01:00
kernel tracing: Have event inject files inc the trace array ref count 2023-10-06 13:18:02 +02:00
lib kobject: Add sanity check for kset->kobj.ktype in kset_register() 2023-09-23 11:09:59 +02:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm mm/vmalloc: add a safer version of find_vm_area() for debug 2023-09-19 12:22:50 +02:00
net netfilter: nf_tables: GC transaction API to avoid race with control plane 2023-10-06 13:18:02 +02:00
samples samples/hw_breakpoint: fix building without module unloading 2023-09-23 11:10:01 +02:00
scripts kconfig: fix possible buffer overflow 2023-09-19 12:22:56 +02:00
security smackfs: Prevent underflow in smk_set_cipso() 2023-09-19 12:22:39 +02:00
sound ALSA: hda: intel-dsp-cfg: add LunarLake support 2023-09-23 11:09:57 +02:00
tools selftests: tracing: Fix to unmount tracefs for recovering environment 2023-09-23 11:10:01 +02:00
usr usr/include/Makefile: add linux/nfc.h to the compile-test coverage 2022-02-01 17:27:15 +01:00
virt KVM: Grab a reference to KVM for VM and vCPU stats file descriptors 2023-08-03 10:22:40 +02:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: add Andrej Shadura 2021-10-18 20:22:03 -10:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS iio: stx104: Move to addac subdirectory 2023-08-26 14:23:27 +02:00
Makefile Linux 5.15.133 2023-09-23 11:10:03 +02:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.