Commit graph

1200679 commits

Author SHA1 Message Date
YiFei Zhu
d1a02358d4 bpf: Non-atomically allocate freelist during prefill
In internal testing of test_maps, we sometimes observed failures like:
  test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *):
    Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed.
where the errno is ENOMEM. After some troubleshooting and enabling
the warnings, we saw:
  [   91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  [   91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G                 N 6.1.38-smp-DEV #7
  [   91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023
  [   91.304721] Call Trace:
  [   91.304724]  <TASK>
  [   91.304730]  [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88
  [   91.304737]  [<ffffffffa7ef83f8>] dump_stack+0x10/0x18
  [   91.304738]  [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870
  [   91.304741]  [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20
  [   91.304743]  [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0
  [   91.304746]  [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0
  [   91.304747]  [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650
  [   91.304750]  [<ffffffffa751d6e0>] map_create+0x140/0x2e0
  [   91.304752]  [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0
  [   91.304753]  [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30
  [   91.304754]  [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80
  [   91.304756]  [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

This makes sense, because in atomic context, percpu allocation does
not create new chunks; new chunks are only created in non-atomic
contexts. If all percpu chunks are full during prefill, -ENOMEM is
returned immediately upon the next unit_alloc.

Prefill phase does not actually run in atomic context, so we can
use this fact to allocate non-atomically with GFP_KERNEL instead
of GFP_NOWAIT. This avoids the immediate -ENOMEM.

GFP_NOWAIT has to be used in unit_alloc when bpf program runs
in atomic context. Even if bpf program runs in non-atomic context,
in most cases, rcu read lock is enabled for the program so
GFP_NOWAIT is still needed. This is often also the case for
BPF_MAP_UPDATE_ELEM syscalls.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 09:41:10 -07:00
Yonghong Song
a76584fc9f selftests/bpf: Enable test test_progs-cpuv4 for gcc build kernel
Currently, test_progs-cpuv4 is generated for clang-built kernels
when bpf cpu=v4 is supported by the clang compiler.
Enable test_progs-cpuv4 for gcc-built kernels as well.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728055745.2285202-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 08:54:05 -07:00
Yonghong Song
09fedc7318 bpf: Fix compilation warning with -Wparentheses
The kernel test robot reported compilation warnings when -Wparentheses is
added to KBUILD_CFLAGS with gcc compiler. The following is the error message:

  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5901:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s64_max >= 0 == s64_min >= 0) {
        ~~~~~~~~^~~~
  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5965:14:
    error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s32_min >= 0 == s32_max >= 0) {
        ~~~~~~~~^~~~

To fix the issue, add proper parentheses for the above '>=' condition
to silence the warning/error.

I tried a few clang compilers like clang16 and clang18 and they do not emit
such warnings with -Wparentheses.
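
A minimal sketch of the parenthesized form, outside the kernel. The
function name same_sign_s64 is hypothetical and not taken from
verifier.c; only the shape of the comparison follows the message above.
The parentheses do not change the result, since '>=' already binds
tighter than '=='; they only state the grouping explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when both bounds lie on the same side of
 * zero (zero counts as non-negative). The extra parentheses make the
 * intended grouping explicit, which is what -Wparentheses asks for. */
bool same_sign_s64(int64_t smin, int64_t smax)
{
    assert(smin <= smax); /* callers pass an ordered range */
    return (smax >= 0) == (smin >= 0);
}
```

For example, same_sign_s64(-5, -1) and same_sign_s64(0, 3) are true,
while same_sign_s64(-1, 1) is false.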

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307281133.wi0c4SqG-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230728055740.2284534-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 08:54:04 -07:00
Alexei Starovoitov
f7e6bd33d1 Merge branch 'bpf-support-new-insns-from-cpu-v4'
Yonghong Song says:

====================
bpf: Support new insns from cpu v4

In a previous discussion ([1]), it was agreed that we should introduce
cpu version 4 (llvm flag -mcpu=v4), which contains some instructions
that can simplify code, make code easier to understand, fix
existing problems, or simply provide feature completeness. More
specifically, the following new insns are proposed:
  . sign extended load
  . sign extended mov
  . bswap
  . signed div/mod
  . ja with 32-bit offset

This patch set adds kernel support for the insns proposed in [1] except
BPF_ST, which already has full kernel support. Besides the insns
proposed above, LLVM will generate the BPF_ST insn as well under -mcpu=v4.
The llvm patch ([2]) has been merged into llvm-project 'main' branch.

The patchset implements interpreter, jit and verifier support for these new
insns.

For this patch set, I tested cpu v2/v3/v4 and all selftests passed.
I also ran the selftests introduced in this patch set with the following
settings, in addition to normal jit testing (bpf_jit_enable = 1 and
bpf_jit_harden = 0):
  - bpf_jit_enable = 0
  - bpf_jit_enable = 1 and bpf_jit_harden = 1
and both runs passed.

  [1] https://lore.kernel.org/bpf/4bfe98be-5333-1c7e-2f6d-42486c8ec039@meta.com/
  [2] https://reviews.llvm.org/D144829

Changelogs:
  v4 -> v5:
   . for v4, patch 8/17 missed in mailing list and patchwork, so resend.
   . rebase on top of master
  v3 -> v4:
   . some minor asm syntax adjustment based on llvm change.
   . add clang version and target arch guard for new tests
     so they can still compile with old llvm compilers.
   . some changes to the bpf doc.
  v2 -> v3:
   . add missed disasm change from v2.
   . handle signed load of ctx fields properly.
   . fix some interpreter sdiv/smod error when bpf_jit_enable = 0.
   . fix some verifier range bounding errors.
   . add more C tests.
  RFCv1 -> v2:
   . add more verifier supports for signed extend load and mov insns.
   . rename some insn names to be more consistent with intel practice.
   . add cpuv4 test runner for test progs.
   . add more unit and C tests.
   . add documentation.
====================

Link: https://lore.kernel.org/r/20230728011143.3710005-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
245d4c40c0 docs/bpf: Add documentation for new instructions
Add documentation in instruction-set.rst for new instruction encoding
and their corresponding operations. Also removed the question
related to 'no BPF_SDIV' in bpf_design_QA.rst since we have
BPF_SDIV insn now.

Cc: bpf@ietf.org
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011342.3724411-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
0c606571ae selftests/bpf: Test ldsx with more complex cases
The following ldsx cases are tested:
  - signed readonly map value
  - read/write map value
  - probed memory
  - not-narrowed ctx field access
  - narrowed ctx field access.

Without previous proper verifier/jit handling, the test will fail.

If cpuv4 is not supported either by compiler or by jit,
the test will be skipped.

  # ./test_progs -t ldsx_insn
  #113/1   ldsx_insn/map_val and probed_memory:SKIP
  #113/2   ldsx_insn/ctx_member_sign_ext:SKIP
  #113/3   ldsx_insn/ctx_member_narrow_sign_ext:SKIP
  #113     ldsx_insn:SKIP
  Summary: 1/0 PASSED, 3 SKIPPED, 0 FAILED

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011336.3723434-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
613dad4980 selftests/bpf: Add unit tests for new gotol insn
Add unit tests for gotol insn.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011329.3721881-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
de1c26809e selftests/bpf: Add unit tests for new sdiv/smod insns
Add unit tests for sdiv/smod insns.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011321.3720500-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
79dbabc175 selftests/bpf: Add unit tests for new bswap insns
Add unit tests for bswap insns.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011314.3720109-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
f02ec3ff3f selftests/bpf: Add unit tests for new sign-extension mov insns
Add unit tests for movsx insns.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011309.3719295-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:17 -07:00
Yonghong Song
147c8f4470 selftests/bpf: Add unit tests for new sign-extension load insns
Add unit tests for new ldsx insns. The test includes sign-extension
with a single value or with a value range.

If cpuv4 is not supported due to
  (1) older compiler, e.g., less than clang version 18, or
  (2) test runner test_progs and test_progs-no_alu32 which tests
      cpu v2 and v3, or
  (3) non-x86_64 arch not supporting new insns in jit yet,
a dummy program is added with below output:
  #318/1   verifier_ldsx/cpuv4 is not supported by compiler or jit, use a dummy test:OK
  #318     verifier_ldsx:OK
to indicate the test passed with a dummy test instead of actually
testing cpuv4. I am using a dummy prog to avoid changing the
verifier testing infrastructure. Once clang 18 is widely available
and other architectures support cpuv4, at least for CI run,
the dummy program can be removed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011304.3719139-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:16 -07:00
Yonghong Song
a5d0c26a27 selftests/bpf: Add a cpuv4 test runner for cpu=v4 testing
Similar to no-alu32 runner, if clang compiler supports -mcpu=v4,
a cpuv4 runner is created to test bpf programs compiled with
-mcpu=v4.

The following are num-of-insn statistics for each new
instruction based on existing selftests, excluding the subsequent
cpuv4 insn-specific tests.

   insn pattern                # of instructions
   reg = (s8)reg               4
   reg = (s16)reg              4
   reg = (s32)reg              144
   reg = *(s8 *)(reg + off)    13
   reg = *(s16 *)(reg + off)   14
   reg = *(s32 *)(reg + off)   15215
   reg = bswap16 reg           142
   reg = bswap32 reg           38
   reg = bswap64 reg           14
   reg s/= reg                 0
   reg s%= reg                 0
   gotol <offset>              58

Note that in the llvm -mcpu=v4 implementation, the compiler is a little
bit conservative about generating the 'gotol' insn (32-bit branch offset)
as it does not precisely count the number of insns (e.g., some insns are
debug insns, etc.). Compared to the old 'goto' insn, the newer 'gotol'
insn should have comparable verification states.

With current patch set, all selftests passed with -mcpu=v4
when running test_progs-cpuv4 binary. The -mcpu=v3 and -mcpu=v2 run
are also successful.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011250.3718252-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:16 -07:00
Yonghong Song
86180493a2 selftests/bpf: Fix a test_verifier failure
The following test_verifier subtest failed due to
new encoding for BSWAP.

  $ ./test_verifier
  ...
  #99/u invalid 64-bit BPF_END FAIL
  Unexpected success to load!
  verification time 215 usec
  stack depth 0
  processed 3 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  #99/p invalid 64-bit BPF_END FAIL
  Unexpected success to load!
  verification time 198 usec
  stack depth 0
  processed 3 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

Tighten the test so it still reports a failure.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011244.3717464-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:16 -07:00
Yonghong Song
f835bb6222 bpf: Add kernel/bpftool asm support for new instructions
Add asm support for new instructions so kernel verifier and bpftool
xlated insn dumps can have proper asm syntax for new instructions.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:54:02 -07:00
Yonghong Song
4cd58e9af8 bpf: Support new 32bit offset jmp instruction
Add interpreter/jit/verifier support for 32bit offset jmp instruction.
If a conditional jmp instruction needs more than 16bit offset,
it can be simulated with a conditional jmp + a 32bit jmp insn.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011231.3716103-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
7058e3a31e bpf: Fix jit blinding with new sdiv/smov insns
Handle new insns properly in bpf_jit_blind_insn() function.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011225.3715812-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
ec0e2da95f bpf: Support new signed div/mod instructions.
Add interpreter/jit support for new signed div/mod insns.
The new signed div/mod instructions are encoded with
unsigned div/mod instructions plus insn->off == 1.
Also add basic verifier support to ensure new insns get
accepted.
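
A userspace sketch of the encoding described above, assuming a toy
decoder rather than the kernel interpreter. The names toy_div64 and
toy_mod64 are hypothetical; the 'off' parameter stands in for
insn->off. Zero divisors and the INT64_MIN / -1 case are deliberately
left out of scope and guarded by asserts.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 64-bit divide: off == 0 selects the existing unsigned operation,
 * off == 1 the new signed one. Converting a large uint64_t to int64_t
 * is implementation-defined in older C standards; GCC and Clang wrap. */
uint64_t toy_div64(uint64_t dst, uint64_t src, int off)
{
    assert(off == 0 || off == 1);
    assert(src != 0); /* zero divisors are not modelled here */
    if (off == 1) {
        assert(!((int64_t)dst == INT64_MIN && (int64_t)src == -1));
        return (uint64_t)((int64_t)dst / (int64_t)src);
    }
    return dst / src;
}

/* Toy 64-bit modulo with the same off convention. */
uint64_t toy_mod64(uint64_t dst, uint64_t src, int off)
{
    assert(off == 0 || off == 1);
    assert(src != 0);
    if (off == 1) {
        assert(!((int64_t)dst == INT64_MIN && (int64_t)src == -1));
        return (uint64_t)((int64_t)dst % (int64_t)src);
    }
    return dst % src;
}
```

The same bit pattern gives different results depending on off: the
pattern for -8 divided by 2 yields -4 when off == 1, but a large
positive quotient when off == 0.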

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011219.3714605-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
0845c3db7b bpf: Support new unconditional bswap instruction
The existing 'be' and 'le' insns do a conditional bswap
depending on host endianness. This patch implements
unconditional bswap insns.
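
The difference can be sketched in userspace C, assuming GCC or Clang
builtins. The names toy_bswap and toy_to_be are hypothetical and this
is not the interpreter code: toy_bswap always reverses the bytes, while
toy_to_be only does so on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional swap: reverses the low 'width' bits bytewise on every
 * host and zeroes the rest. */
uint64_t toy_bswap(uint64_t v, int width)
{
    assert(width == 16 || width == 32 || width == 64);
    switch (width) {
    case 16:
        return __builtin_bswap16((uint16_t)v);
    case 32:
        return __builtin_bswap32((uint32_t)v);
    default:
        return __builtin_bswap64(v);
    }
}

/* Conditional conversion to big-endian: a swap on little-endian hosts,
 * a plain truncation to 'width' bits on big-endian hosts. */
uint64_t toy_to_be(uint64_t v, int width)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return toy_bswap(v, width);
#else
    assert(width == 16 || width == 32 || width == 64);
    return width == 64 ? v : v & ((1ULL << width) - 1);
#endif
}
```

On any host, toy_bswap(0x1234, 16) is 0x3412; the result of
toy_to_be(0x1234, 16) depends on the byte order of the machine running
it.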

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011213.3712808-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
1f1e864b65 bpf: Handle sign-extension ctx member accesses
Currently, if user accesses a ctx member with signed types,
the compiler will generate an unsigned load followed by
necessary left and right shifts.

With the introduction of sign-extension load, compiler may
just emit a ldsx insn instead. Let us do a final movsx sign
extension to the final unsigned ctx load result to
satisfy original sign extension requirement.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011207.3712528-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
8100928c88 bpf: Support new sign-extension mov insns
Add interpreter/jit support for new sign-extension mov insns.
The original 'MOV' insn is extended to support reg-to-reg
signed version for both ALU and ALU64 operations. For ALU mode,
the insn->off value of 8 or 16 indicates sign-extension
from 8- or 16-bit value to 32-bit value. For ALU64 mode,
the insn->off value of 8/16/32 indicates sign-extension
from 8-, 16- or 32-bit value to 64-bit value.
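
The off-based selection above can be modelled in plain C. This is a
sketch under assumptions, not the interpreter or jit code: toy_movsx64
and toy_movsx32 are hypothetical names, and 'off' stands in for
insn->off.

```c
#include <assert.h>
#include <stdint.h>

/* ALU64 mode: sign-extend the low 8, 16 or 32 bits of src to 64 bits.
 * Narrowing an out-of-range value to a signed type is
 * implementation-defined in older C standards; GCC and Clang wrap. */
int64_t toy_movsx64(uint64_t src, int off)
{
    assert(off == 8 || off == 16 || off == 32);
    switch (off) {
    case 8:
        return (int8_t)src;
    case 16:
        return (int16_t)src;
    default:
        return (int32_t)src;
    }
}

/* ALU mode: sign-extend the low 8 or 16 bits of src to a 32-bit value. */
uint32_t toy_movsx32(uint32_t src, int off)
{
    assert(off == 8 || off == 16);
    if (off == 8)
        return (uint32_t)(int32_t)(int8_t)src;
    return (uint32_t)(int32_t)(int16_t)src;
}
```

For instance, toy_movsx64(0xff, 8) gives -1 and toy_movsx32(0x80, 8)
gives 0xffffff80, while values with a clear sign bit pass through
unchanged.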

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011202.3712300-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Yonghong Song
1f9a1ea821 bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In the verifier, besides
deducing proper bounds for the dst_reg, probed memory access
is also properly handled.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Jose E. Marchesi
10d78a66a5 bpf, docs: fix BPF_NEG entry in instruction-set.rst
This patch fixes the documentation of the BPF_NEG instruction to
denote that it does not use the source register operand.

Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Acked-by: Dave Thaler <dthaler@microsoft.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230726092543.6362-1-jose.marchesi@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 17:31:49 -07:00
Arnd Bergmann
63e2da3b7f bpf: work around -Wuninitialized warning
Splitting these out into separate helper functions means that we
actually pass an uninitialized variable into another function call
if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT
is disabled:

kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
  200 |         dec_active(c, flags);

Avoid this by passing the flags by reference, so they either get
initialized and dereferenced through a pointer, or the pointer never
gets accessed at all.
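
The by-reference pattern can be sketched in userspace as follows. All
names here (toy_cache, toy_inc_active, toy_dec_active, toy_add_obj) are
hypothetical, and the save_irq field is a runtime stand-in for the
build-time config switch; the sketch only shows that the flags word is
either written and later read through the pointer, or never touched.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cache {
    bool save_irq;          /* stand-in for the config switch */
    int active;
    unsigned long restored; /* records what the "restore" step read */
};

void toy_inc_active(struct toy_cache *c, unsigned long *flags)
{
    if (c->save_irq)
        *flags = 0x200; /* models saving the interrupt state */
    c->active++;
    assert(c->active == 1);
}

void toy_dec_active(struct toy_cache *c, unsigned long *flags)
{
    c->active--;
    if (c->save_irq)
        c->restored = *flags; /* read only if it was written above */
}

void toy_add_obj(struct toy_cache *c)
{
    unsigned long flags; /* intentionally not initialized here */

    toy_inc_active(c, &flags);
    toy_dec_active(c, &flags);
}
```

Because only the address of flags crosses the call boundary, no
uninitialized value is ever passed by value, which is the situation the
compiler warned about.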

Fixes: 18e027b1c7 ("bpf: Factor out inc/dec of active flag into helpers.")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-25 17:14:18 -07:00
Colin Ian King
13fd5e14af selftests/xsk: Fix spelling mistake "querrying" -> "querying"
There is a spelling mistake in an error message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20230720104815.123146-1-colin.i.king@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 14:46:25 -07:00
Martin KaFai Lau
36397a1895 Merge branch 'Add SO_REUSEPORT support for TC bpf_sk_assign'
Lorenz Bauer says:

====================
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.

I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.

Joint work with Daniel Borkmann.

Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
---
Changes in v6:
- Reject unhashed UDP sockets in bpf_sk_assign to avoid ref leak
- Link to v5: https://lore.kernel.org/r/20230613-so-reuseport-v5-0-f6686a0dbce0@isovalent.com

Changes in v5:
- Drop reuse_sk == sk check in inet[6]_steal_sock (Kuniyuki)
- Link to v4: https://lore.kernel.org/r/20230613-so-reuseport-v4-0-4ece76708bba@isovalent.com

Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent.com

Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent.com

Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)

---
Daniel Borkmann (1):
      selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 14:07:08 -07:00
Daniel Borkmann
22408d58a4 selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
We use two programs to check that the new reuseport logic is executed
appropriately.

The first is a TC clsact program which bpf_sk_assigns
the skb to a UDP or TCP socket created by user space. Since the test
communicates via lo we see both directions of packets in the eBPF.
Traffic ingressing to the reuseport socket is identified by looking
at the destination port. For TCP, we additionally need to make sure
that we only assign the initial SYN packets towards our listening
socket. The network stack then creates a request socket which
transitions to ESTABLISHED after the 3WHS.

The second is a reuseport program which shares the fact that
it has been executed with user space. This tells us that the delayed
lookup mechanism is working.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-8-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 14:02:47 -07:00
Lorenz Bauer
9c02bec959 bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.

The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:

    sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
    bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed

Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.

Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straightforward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.

Fixes: 8e368dc72e ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:55:55 -07:00
Lorenz Bauer
6c886db2e7 net: remove duplicate sk_lookup helpers
Now that inet[6]_lookup_reuseport are parameterised on the ehashfn
we can remove two sk_lookup helpers.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-6-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:44 -07:00
Lorenz Bauer
2a61776366 net: document inet[6]_lookup_reuseport sk_state requirements
The current implementation was extracted from inet[6]_lhash2_lookup
in commit 80b373f74f ("inet: Extract helper for selecting socket
from reuseport group") and commit 5df6531292 ("inet6: Extract helper
for selecting socket from reuseport group"). In the original context,
sk is always in TCP_LISTEN state and so did not have a separate check.

Add documentation that specifies which sk_state are valid to pass to
the function.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-5-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:44 -07:00
Lorenz Bauer
0f495f7617 net: remove duplicate reuseport_lookup functions
There are currently four copies of reuseport_lookup: one each for
(TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of
those functions as well. This is already the case for sk_lookup
helpers (inet,inet6,udp4,udp6)_lookup_run_bpf.

There are two differences between the reuseport_lookup helpers:

1. They call different hash functions depending on protocol
2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED

Move the check for sk_state into the caller and use the INDIRECT_CALL
infrastructure to cut down the helpers to one per IP version.
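
A rough userspace sketch of the idea, assuming a plain function pointer
in place of the kernel's INDIRECT_CALL macros. Everything here is
hypothetical: the toy_ names, the hash formulas, and the modulo-based
selection are illustrative only and do not reflect the real hash
functions or the real reuseport selection logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One lookup helper, parameterised on the protocol's hash function. */
typedef uint32_t (*toy_ehashfn_t)(uint32_t laddr, uint16_t lport,
                                  uint32_t faddr, uint16_t fport);

uint32_t toy_tcp_ehashfn(uint32_t laddr, uint16_t lport,
                         uint32_t faddr, uint16_t fport)
{
    return (laddr ^ faddr) + (uint32_t)lport * 31u + fport;
}

uint32_t toy_udp_ehashfn(uint32_t laddr, uint16_t lport,
                         uint32_t faddr, uint16_t fport)
{
    return (laddr + faddr) ^ (((uint32_t)lport << 16) | fport);
}

/* Returns the index of the selected socket within a group of nsocks.
 * The caller, not this helper, is expected to check sk_state first. */
size_t toy_lookup_reuseport(size_t nsocks, toy_ehashfn_t hashfn,
                            uint32_t laddr, uint16_t lport,
                            uint32_t faddr, uint16_t fport)
{
    assert(nsocks > 0 && hashfn != NULL);
    return hashfn(laddr, lport, faddr, fport) % nsocks;
}
```

A TCP caller would pass toy_tcp_ehashfn and a UDP caller
toy_udp_ehashfn, so one helper per IP version serves both protocols.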

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:44 -07:00
Lorenz Bauer
ce796e60b3 net: export inet_lookup_reuseport and inet6_lookup_reuseport
Rename the existing reuseport helpers for IPv4 and IPv6 so that they
can be invoked in the follow up commit. Export them so that building
DCCP and IPv6 as a module works.

No change in functionality.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-3-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:44 -07:00
Lorenz Bauer
67312adc96 bpf: reject unhashed sockets in bpf_sk_assign
The semantics for bpf_sk_assign are as follows:

    sk = some_lookup_func()
    bpf_sk_assign(skb, sk)
    bpf_sk_release(sk)

That is, the sk is not consumed by bpf_sk_assign. The function
therefore needs to make sure that sk lives long enough to be
consumed from __inet_lookup_skb. The path through the stack for a
TCPv4 packet is roughly:

  netif_receive_skb_core: takes RCU read lock
    __netif_receive_skb_core:
      sch_handle_ingress:
        tcf_classify:
          bpf_sk_assign()
      deliver_ptype_list_skb:
        deliver_skb:
          ip_packet_type->func == ip_rcv:
            ip_rcv_core:
            ip_rcv_finish_core:
              dst_input:
                ip_local_deliver:
                  ip_local_deliver_finish:
                    ip_protocol_deliver_rcu:
                      tcp_v4_rcv:
                        __inet_lookup_skb:
                          skb_steal_sock

The existing helper takes advantage of the fact that everything
happens in the same RCU critical section: for sockets with
SOCK_RCU_FREE set bpf_sk_assign never takes a reference.
skb_steal_sock then checks SOCK_RCU_FREE again and does sock_put
if necessary.

This approach assumes that SOCK_RCU_FREE is never set on a sk
between bpf_sk_assign and skb_steal_sock, but this invariant is
violated by unhashed UDP sockets. A new UDP socket is created
in TCP_CLOSE state but without SOCK_RCU_FREE set. That flag is only
added in udp_lib_get_port() which happens when a socket is bound.

When bpf_sk_assign was added it wasn't possible to access unhashed
UDP sockets from BPF, so this wasn't a problem. This changed
in commit 0c48eefae7 ("sock_map: Lift socket state restriction
for datagram sockets"), but the helper wasn't adjusted accordingly.
The following sequence of events will therefore lead to a refcount
leak:

1. Add socket(AF_INET, SOCK_DGRAM) to a sockmap.
2. Pull socket out of sockmap and bpf_sk_assign it. Since
   SOCK_RCU_FREE is not set we increment the refcount.
3. bind() or connect() the socket, setting SOCK_RCU_FREE.
4. skb_steal_sock will now set refcounted = false due to
   SOCK_RCU_FREE.
5. tcp_v4_rcv() skips sock_put().

Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
This matches the behaviour of __inet_lookup_skb which is ultimately
the goal of bpf_sk_assign().
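
The leak sequence above can be replayed with a toy model. This is a
userspace sketch under assumptions, not kernel code: toy_sock and the
toy_ functions are hypothetical, rcu_free models SOCK_RCU_FREE, and the
reject_unhashed parameter toggles the fix so both behaviours can be
compared.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sock {
    int refcnt;
    bool rcu_free; /* models SOCK_RCU_FREE */
    bool hashed;   /* models the socket having been bound */
};

/* Models the assign step: takes a reference only when the socket is
 * not RCU freed. Returns false if the socket is rejected. */
bool toy_sk_assign(struct toy_sock *sk, bool reject_unhashed)
{
    assert(sk->refcnt > 0);
    if (reject_unhashed && !sk->hashed)
        return false;
    if (!sk->rcu_free)
        sk->refcnt++;
    return true;
}

/* Models bind(): the socket becomes hashed and RCU freed. */
void toy_bind(struct toy_sock *sk)
{
    sk->hashed = true;
    sk->rcu_free = true;
}

/* Models the steal step plus the receive path: the reference is only
 * dropped when the flag says the socket is refcounted. */
void toy_steal_and_deliver(struct toy_sock *sk)
{
    bool refcounted = !sk->rcu_free;

    if (refcounted)
        sk->refcnt--;
}
```

Without the rejection, assigning an unbound socket, binding it, and
then delivering leaves refcnt one higher than it started. With the
rejection, the assign step fails and refcnt is unchanged.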

Fixes: cf7fbe660f ("bpf: Add socket assign support")
Cc: Joe Stringer <joe@cilium.io>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-2-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:43 -07:00
Lorenz Bauer
f0ea27e7bf udp: re-score reuseport groups when connected sockets are present
Contrary to TCP, UDP reuseport groups can contain TCP_ESTABLISHED
sockets. To support these properly we remember whether a group has
a connected socket and skip the fast reuseport early-return. In
effect we continue scoring all reuseport sockets and then choose the
one with the highest score.

The current code fails to re-calculate the score for the result of
lookup_reuseport. According to Kuniyuki Iwashima:

    1) SO_INCOMING_CPU is set
       -> selected sk might have +1 score

    2) BPF prog returns ESTABLISHED and/or SO_INCOMING_CPU sk
       -> selected sk will have more than 8

  Using the old score could trigger more lookups depending on the
  order that sockets are created.

    sk -> sk (SO_INCOMING_CPU) -> sk (ESTABLISHED)
    |     |
    `-> select the next SO_INCOMING_CPU sk
          |
          `-> select itself (We should save this lookup)

Fixes: efc6b6f6c3 ("udp: Improve load balancing for SO_REUSEPORT.")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-1-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:51:43 -07:00
Yonghong Song
7b2b20125f MAINTAINERS: Replace my email address
Switch from corporate email address to linux.dev address.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230725054100.1013421-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 10:11:36 -07:00
Daniel Borkmann
dc644b540a tcx: Fix splat in ingress_destroy upon tcx_entry_free
On qdisc destruction, the ingress_destroy() needs to update the correct
entry, that is, tcx_entry_update must NULL the dev->tcx_ingress pointer.
Therefore, fix the typo.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: syzbot+bdcf141f362ef83335cf@syzkaller.appspotmail.com
Reported-by: syzbot+b202b7208664142954fa@syzkaller.appspotmail.com
Reported-by: syzbot+14736e249bce46091c18@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+bdcf141f362ef83335cf@syzkaller.appspotmail.com
Tested-by: syzbot+b202b7208664142954fa@syzkaller.appspotmail.com
Tested-by: syzbot+14736e249bce46091c18@syzkaller.appspotmail.com
Tested-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20230721233330.5678-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-24 11:42:35 -07:00
David S. Miller
5322a27c0d Merge branch 'ionic-FLR-support'
Shannon Nelson says:

====================
ionic: add FLR support

Add support for handling and recovering from a PCI FLR event.
This patchset first moves some code around to make it usable
from multiple paths, then adds the PCI error handler callbacks
for reset_prepare and reset_done.

Example test:
    echo 1 > /sys/bus/pci/devices/0000:2a:00.0/reset

v4:
 - don't remove ionic_dev_teardown() in ionic_probe() in patch 2/4
 - remove clear_bit() change from patch 3/4

v3:
Link: https://lore.kernel.org/netdev/20230717170001.30539-1-shannon.nelson@amd.com/
 - removed first patch, it is already merged into net

v2:
Link: https://lore.kernel.org/netdev/20230713192936.45152-1-shannon.nelson@amd.com/
 - removed redundant pci_save/restore_state() calls
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:42:34 +01:00
Shannon Nelson
a79b559e99 ionic: add FLR recovery support
Add support for the PCI reset handlers in order to manage an FLR event.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:42:34 +01:00
Shannon Nelson
30d2e07396 ionic: pull out common bits from fw_up
Pull out some code from ionic_lif_handle_fw_up() that can be
used in the coming FLR recovery patch.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:42:34 +01:00
Shannon Nelson
0de38d9f1d ionic: extract common bits from ionic_probe
Pull out some chunks of code from ionic_probe() that will
be common in rebuild paths.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:42:34 +01:00
Shannon Nelson
87d7a9f373 ionic: extract common bits from ionic_remove
Pull out a chunk of code from ionic_remove() that will
be common in teardown paths.

Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:42:33 +01:00
David S. Miller
67a8976780 Merge branch 'phy-motorcomm-driver-strength'
Samin Guo says:

====================
Add motorcomm phy pad-driver-strength-cfg support

The motorcomm phy (YT8531) supports the ability to adjust the drive
strength of the rx_clk/rx_data, and the default strength may not be
suitable for all boards. So add configurable options to better match
the boards (e.g. StarFive VisionFive 2).

The first patch adds the dt-binding description, and the second patch adds
YT8531's parsing and settings for pad-driver-strength-cfg.

Changes since v4:
Patch 1:
- Removed register-related DS(3b) values and added vol descriptions (by Andrew Lunn)
- Dropped the type and added '-microamp' suffix. (by Rob Herring)
Patch 2:
- Return -EINVAL if the value is in DT but is invalid (by Andrew Lunn)

Changes since v3:
Patch 1:
- Used current values instead of register values
- Added units and numerical descriptions of driver-strength
Patch 2:
- Added a lookup table listing the valid values in the schema (by Andrew Lunn)

Changes since v2:
Patch 2:
- Readjusted the order of YT8531_RGMII_xxx to below YTPHY_PAD_DRIVE_STRENGTH_REG (by Frank Sae)
- Reversed Christmas tree, sort these longest first, shortest last (by Andrew Lunn)
- Rebased on tag v6.4

Changes since v1:
Patch 1:
- Renamed "rx-xxx-driver-strength" to "motorcomm,rx-xxx-driver-strength" (by Frank Sae)
Patch 2:
- Added default values for rxc/rxd driver strength (by Frank Sae/Andrew Lunn)
- Added range checking when val is in DT (by Frank Sae/Andrew Lunn)

Previous versions:
v1 - https://patchwork.kernel.org/project/netdevbpf/cover/20230426063541.15378-1-samin.guo@starfivetech.com
v2 - https://patchwork.kernel.org/project/netdevbpf/cover/20230505090558.2355-1-samin.guo@starfivetech.com
v3 - https://patchwork.kernel.org/project/netdevbpf/cover/20230526090502.29835-1-samin.guo@starfivetech.com
v4 - https://patchwork.kernel.org/project/netdevbpf/cover/20230714101406.17686-1-samin.guo@starfivetech.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:36:45 +01:00
Samin Guo
7a561e9351 net: phy: motorcomm: Add pad drive strength cfg support
The motorcomm phy (YT8531) supports the ability to adjust the drive
strength of the rx_clk/rx_data, and the default strength may not be
suitable for all boards. So add configurable options to better match
the boards (e.g. StarFive VisionFive 2).

When we configure the drive strength, we need to read the current
LDO voltage value to ensure that it is a legal value at that LDO
voltage.
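
A rough sketch of the check described above, assuming a per-voltage lookup
table. The table contents, enum ldo_voltage and drive_strength_to_index()
are made up for illustration and are not taken from the driver or the
YT8531 datasheet:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical voltage selector; the driver reads this from the PHY. */
enum ldo_voltage { LDO_1V8, LDO_3V3 };

#define DS_ENTRIES 4

/* Illustrative microamp values only, NOT real YT8531 numbers. */
static const unsigned int ds_table[2][DS_ENTRIES] = {
	[LDO_1V8] = { 1000, 2000, 3000, 4000 },
	[LDO_3V3] = { 3000, 4000, 5000, 6000 },
};

/*
 * Return the table index for the requested drive strength at the given
 * LDO voltage, or -EINVAL if the value is not legal at that voltage.
 */
static int drive_strength_to_index(enum ldo_voltage vol, unsigned int microamp)
{
	for (size_t i = 0; i < DS_ENTRIES; i++) {
		if (ds_table[vol][i] == microamp)
			return (int)i;
	}
	return -EINVAL;
}
```

The point of keying the table on the voltage is that the same microamp
value can be legal at one LDO voltage and rejected at the other.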

Reviewed-by: Hal Feng <hal.feng@starfivetech.com>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:36:45 +01:00
Samin Guo
79e71d9569 dt-bindings: net: motorcomm: Add pad driver strength cfg
The motorcomm phy (YT8531) supports the ability to adjust the drive
strength of the rx_clk/rx_data.

The YT8531 RGMII LDO voltage supports 1.8V/3.3V, and the
LDO voltage can be configured with hardware pull-up resistors to match
the SOC voltage (usually 1.8V). Software can read register 0xA001 to
obtain the current LDO voltage value.

Reviewed-by: Hal Feng <hal.feng@starfivetech.com>
Signed-off-by: Samin Guo <samin.guo@starfivetech.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 10:36:45 +01:00
Eric Dumazet
f5f80e32de ipv6: remove hard coded limitation on ipv6_pinfo
IPv6 inet sockets are supposed to have a "struct ipv6_pinfo"
field at the end of their definition, so that inet6_sk_generic()
can derive from socket size the offset of the "struct ipv6_pinfo".

This is very fragile, and prevents adding bigger alignment
in sockets, because inet6_sk_generic() does not work
if the compiler adds padding after the ipv6_pinfo component.
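
The fragility can be shown with a small userspace sketch. struct pinfo and
struct demo_sock are hypothetical stand-ins (not kernel types), and the
64-byte alignment is an assumed example of "bigger alignment":

```c
#include <stddef.h>

/* Stand-in for struct ipv6_pinfo. */
struct pinfo {
	int a;
	int b;
};

/*
 * Stand-in for an IPv6 socket type. The over-aligned first member raises
 * the alignment of the whole struct, so the compiler pads the tail after
 * the last member.
 */
struct demo_sock {
	_Alignas(64) char hdr[40];
	struct pinfo inet6; /* last member, as the old scheme requires */
};

/* Old scheme: assume pinfo sits exactly at the end of the object. */
static size_t offset_from_size(void)
{
	return sizeof(struct demo_sock) - sizeof(struct pinfo);
}

/* Robust alternative: record the member offset explicitly. */
static size_t offset_explicit(void)
{
	return offsetof(struct demo_sock, inet6);
}
```

Once tail padding exists, the two helpers disagree, and the size-derived
offset points into the padding rather than at the member.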

We are currently working on a patch series to reorganize
TCP structures for better data locality and found issues
similar to the one fixed in commit f5d547676c
("tcp: fix tcp_inet6_sk() for 32bit kernels")

Alternative would be to force an alignment on "struct ipv6_pinfo",
greater or equal to __alignof__(any ipv6 sock) to ensure there is
no padding. This does not look great.

v2: fix typo in mptcp_proto_v6_init() (Paolo)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Chao Wu <wwchao@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-24 09:39:31 +01:00
Patrick Rohr
1671bcfd76 net: add sysctl accept_ra_min_rtr_lft
This change adds a new sysctl accept_ra_min_rtr_lft to specify the
minimum acceptable router lifetime in an RA. If the received RA router
lifetime is less than the configured value (and not 0), the RA is
ignored.
This is useful for mobile devices, whose battery life can be impacted
by networks that configure RAs with a short lifetime. On such networks,
the device should never gain IPv6 provisioning and should attempt to
drop RAs via hardware offload, if available.
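
The acceptance rule above can be sketched as a predicate. ra_is_ignored()
and its parameter names are hypothetical and only mirror the wording of
this message, not the actual ndisc code:

```c
#include <stdbool.h>

/*
 * rtr_lft:     router lifetime carried in the received RA, in seconds
 * min_rtr_lft: configured accept_ra_min_rtr_lft value, in seconds
 *
 * A lifetime of 0 is never filtered by this rule; any other lifetime
 * below the configured minimum causes the RA to be ignored.
 */
static bool ra_is_ignored(unsigned int rtr_lft, unsigned int min_rtr_lft)
{
	return rtr_lft != 0 && rtr_lft < min_rtr_lft;
}
```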

Signed-off-by: Patrick Rohr <prohr@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:51:24 +01:00
justinstitt@google.com
5c9f7b04aa net: dsa: remove deprecated strncpy
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

Even call sites utilizing length-bounded destination buffers should
switch over to using `strtomem` or `strtomem_pad`. In this case,
however, the compiler is unable to determine the size of the `data`
buffer which renders `strtomem` unusable. Due to this, `strscpy`
should be used.

It should be noted that most call sites already zero-initialize the
destination buffer. However, I've opted to use `strscpy_pad` to maintain
the same exact behavior that `strncpy` produced (zero-padded tail up to
`len`).
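
For readers unfamiliar with the semantics, here is a userspace
approximation of what `strscpy_pad` provides. sketch_strscpy_pad() is a
hypothetical reimplementation written for this note, not the kernel
function:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (capacity size), always NUL-terminating and
 * zero-filling the unused tail. Returns the number of characters copied,
 * or -E2BIG if size is 0 or the source had to be truncated.
 */
static long sketch_strscpy_pad(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	if (size == 0)
		return -E2BIG;

	/* Length of src, but never look past what could fit in dst. */
	while (len < size && src[len] != '\0')
		len++;

	if (len == size) {
		/* Truncated: keep size - 1 characters plus the terminator. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, size - len); /* terminator and zero padding */
	return (long)len;
}
```

Unlike `strncpy`, the truncated case still yields a terminated string and
is reported to the caller through the return value.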

Also see [3].

[1]: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
[2]: elixir.bootlin.com/linux/v6.3/source/net/ethtool/ioctl.c#L1944
[3]: manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html

Link: https://github.com/KSPP/linux/issues/90
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:45:46 +01:00
David S. Miller
2e60314c28 Merge branch 'process-connector-bug-fixes-and-enhancements'
Anjali Kulkarni says:

====================
Process connector bug fixes & enhancements

Oracle DB is trying to solve a performance overhead problem it has been
facing for the past 10 years and using this patch series, we can fix this
issue.

Oracle DB runs on a large scale with 100000s of short-lived processes,
starting up and exiting quickly. A process monitoring DB daemon which
tracks and cleans up after processes that have died without a proper exit
needs notifications only when a process died with a non-zero exit code
(which should be rare).

Due to the pmon architecture, which is distributed, each process is
independent and has minimal interaction with pmon. Hence fd based
solutions to track a process's spawning and exit cannot be used. Pmon
needs to detect the abnormal death of a process so it can clean up after it.
Currently it resorts to checking /proc every few seconds. Other methods
we tried, such as using a system call to reduce the above overhead, were
not accepted upstream.

With this change, we add event based filtering to proc connector module
so that DB can only listen to the events it is interested in. A new
event type PROC_EVENT_NONZERO_EXIT is added, which is only sent by kernel
to a listening application when any process exiting has a non-zero exit
status.

This change will give Oracle DB substantial performance savings - it takes
50ms to scan about 8K PIDs in /proc, about 500ms for 100K PIDs. DB does
this check every 3 secs, so over an hour (1200 scans) we save about
10 minutes for 100K PIDs.

With this, a client can register to listen for only exit or fork or a mix or
all of the events. This greatly enhances performance - currently, we
need to listen to all events, and there are 9 different types of events.
For example, handling 3 types of events - 8K-forks + 8K-exits + 8K-execs takes
200ms, whereas handling 2 types - 8K-forks + 8K-exits takes about 150ms,
and handling just one type - 8K exits takes about 70ms.

Measuring the time using pidfds for monitoring 8K process exits took 4
times longer - 200ms, as compared to 70ms using only exit notifications
of proc connector. Hence, we cannot use pidfd for our use case.

This kind of a new event could also be useful to other applications like
Google's lmkd daemon, which needs a killed process's exit notification.

This patch series is organized as follows -

Patch 1 : Needed for patch 3 to work.
Patch 2 : Needed for patch 3 to work.
Patch 3 : Fixes some bugs in proc connector, details in the patch.
Patch 4 : Adds event based filtering for performance enhancements.
Patch 5 : Allow non-root users access to proc connector events.
Patch 6 : Selftest code for proc connector.

v9->v10 changes:
- Rebased to net-next, re-compiled and re-tested.

v8->v9 changes:
- Added sha1 ("title") of reversed patch as suggested by Eric Dumazet.

v7->v8 changes:
- Fixed an issue pointed by Liam Howlett in v7.

v6->v7 changes:
- Incorporated Liam Howlett's comments on v6
- Incorporated Kalesh Anakkur Purayil's comments

v5->v6 changes:
- Incorporated Liam Howlett's comments
- Removed FILTER define from proc_filter.c and added a "-f" run-time
  option to run new filter code.
- Made proc_filter.c a selftest in tools/testing/selftests/connector

v4->v5 changes:
- Change the cover letter
- Fix a small issue in proc_filter.c

v3->v4 changes:
- Fix comments by Jakub Kicinski to incorporate root access changes
  within bind call of connector

v2->v3 changes:
- Fix comments by Jakub Kicinski to separate netlink (patch 2) (after
  layering) from connector fixes (patch 3).
- Minor fixes suggested by Jakub.
- Add new multicast group level permissions check at netlink layer.
  Split this into netlink & connector layers (patches 6 & 7)

v1->v2 changes:
- Fix comments by Jakub Kicinski to keep layering within netlink and
  update kdocs.
- Move non-root users access patch last in series so remaining patches
  can go in first.

v->v1 changes:
- Changed commit log in patch 4 as suggested by Christian Brauner
- Changed patch 4 to make more fine grained access to non-root users
- Fixed warning in cn_proc.c,
  Reported-by: kernel test robot <lkp@intel.com>
- Fixed some existing warnings in cn_proc.c
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:34:22 +01:00
Anjali Kulkarni
73a29531f4 connector/cn_proc: Selftest for proc connector
Run as ./proc_filter -f to run new filter code. Run without "-f" to run
usual proc connector code without the new filtering code.

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:34:22 +01:00
Anjali Kulkarni
bfdfdc2f3b connector/cn_proc: Allow non-root users access
There were a couple of reasons for not allowing non-root users access
initially. One is that at some point there was no proper receive buffer
management in place for netlink multicast. But that should be long
fixed. See link below for more context.

Second is that some of the messages may contain data that is root only. But
this should be handled with a finer granularity, which is being done at the
protocol layer.  The only problematic protocols are nf_queue and the
firewall netlink. Hence, this restriction for non-root access was relaxed
for NETLINK_ROUTE initially:
https://lore.kernel.org/all/20020612013101.A22399@wotan.suse.de/

This restriction has also been removed for following protocols:
NETLINK_KOBJECT_UEVENT, NETLINK_AUDIT, NETLINK_SOCK_DIAG,
NETLINK_GENERIC, NETLINK_SELINUX.

Since process connector messages are not sensitive (process fork, exit
notifications etc.), and anyone can read /proc data, we can allow non-root
access here. However, since process event notification is not the only
consumer of NETLINK_CONNECTOR, we can make this change even more
fine grained than the protocol level, by checking for multicast group
within the protocol.

Allow non-root access for NETLINK_CONNECTOR via NL_CFG_F_NONROOT_RECV
but add new bind function cn_bind(), which allows non-root access only
for CN_IDX_PROC multicast group.

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:34:22 +01:00
Anjali Kulkarni
743acf351b connector/cn_proc: Performance improvements
This patch adds the capability to filter messages sent by the proc
connector on the event type supplied in the message from the client
to the connector. The client can register to listen for an event type
given in struct proc_input.

This event based filtering will greatly enhance performance - handling
8K exits takes about 70ms, whereas 8K-forks + 8K-exits takes about 150ms
& handling 8K-forks + 8K-exits + 8K-execs takes 200ms. There are currently
9 different types of events, and we need to listen to all of them. Also,
measuring the time using pidfds for monitoring 8K process exits took
much longer - 200ms, as compared to 70ms using only exit notifications of
proc connector.

We also add a new event type - PROC_EVENT_NONZERO_EXIT, which is
only sent by kernel to a listening application when an exiting process
has a non-zero exit status. This will help clients like Oracle DB,
where a monitoring process wants notifications for non-zero process exits
so it can clean up after them.

This kind of a new event could also be useful to other applications like
Google's lmkd daemon, which needs a killed process's exit notification.

The patch takes care that existing clients using the old mechanism of not
sending the event type work without any changes.

The cn_filter function checks whether the event type being notified via
proc connector matches the event type requested by the client, before
sending (match) or dropping (no match) a packet.
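
The filtering decision can be sketched roughly as follows. The EV_* bit
values and should_deliver() are hypothetical and chosen for illustration;
the real event constants and the cn_filter logic are defined by the patch
itself. The sketch assumes that a client which registered no event type
is treated as an old-style client and receives everything:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event bits; NOT the values from linux/cn_proc.h. */
#define EV_FORK         0x1u
#define EV_EXEC         0x2u
#define EV_EXIT         0x4u
#define EV_NONZERO_EXIT 0x8u

/*
 * requested: mask of event types the client registered for (0 = none given)
 * event:     type of the event being notified (a single EV_* bit)
 * exit_code: exit status, only meaningful for EV_EXIT
 *
 * Returns true to send the packet to this client, false to drop it.
 */
static bool should_deliver(uint32_t requested, uint32_t event, int exit_code)
{
	/* Assumed compatibility path: no filter means deliver everything. */
	if (requested == 0)
		return true;

	/* Exits with a non-zero status also satisfy NONZERO_EXIT listeners. */
	if (event == EV_EXIT && exit_code != 0 &&
	    (requested & EV_NONZERO_EXIT))
		return true;

	return (requested & event) != 0;
}
```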

Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-23 11:34:22 +01:00