Commit graph

1138403 commits

Author SHA1 Message Date
Kumar Kartikeya Dwivedi
b7ff97925b bpf: Allow locking bpf_spin_lock in inner map values
There is no need to restrict users from locking bpf_spin_lock in map
values of inner maps. Each inner map lookup gets a unique reg->id
assigned to the returned PTR_TO_MAP_VALUE which will be preserved after
the NULL check. Distinct lookups into different inner maps get unique
IDs, and distinct lookups into the same inner map also get unique IDs.

Hence, lift the restriction by removing the check that returns
-ENOTSUPP in map_in_map.c. Later commits will add comprehensive test
cases to ensure that invalid cases are rejected.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
d0d78c1df9 bpf: Allow locking bpf_spin_lock global variables
Global variables reside in maps accessible using direct_value_addr
callbacks, so giving each load instruction's rewrite a unique reg->id
disallows us from holding locks which are global.

The reason for preserving reg->id as a unique value for registers that
may point to spin lock is that two separate lookups are treated as two
separate memory regions, and any possible aliasing is ignored for the
purposes of spin lock correctness.

This is not great, especially for global variables, which are served
from maps that have max_entries == 1, i.e. lookups always lead to
pointers into the same map value.

So refactor the active_spin_lock into a 'active_lock' structure which
represents the lock identity, and instead of the reg->id, remember two
fields, a pointer and the reg->id. The pointer will store reg->map_ptr
or reg->btf. It's only necessary to distinguish for the id == 0 case of
global variables, but always setting the pointer to a non-NULL value and
using the pointer to check whether the lock is held simplifies code in
the verifier.

This is generic enough to allow it for global variables, map lookups,
and allocated objects at the same time.

Note that while whether a lock is held can be answered by just comparing
active_lock.ptr to NULL, determining whether a register points to the
same held lock requires comparing _both_ ptr and id.

Finally, as a result of this refactoring, pseudo load instructions are
not given a unique reg->id, as they are doing lookup for the same map
value (max_entries is never greater than 1).

Essentially, we consider that the tuple of (ptr, id) will always be
unique for any kind of argument to bpf_spin_{lock,unlock}.
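
A minimal user-space sketch of the (ptr, id) identity check described
above. The struct and function names are hypothetical stand-ins chosen
for illustration, not the verifier's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lock identity: ptr would hold
 * reg->map_ptr or reg->btf, id would hold reg->id (0 for global
 * variables).
 */
struct lock_identity {
	const void *ptr;
	uint32_t id;
};

/* Whether any lock is held can be answered from ptr alone. */
static bool lock_is_held(const struct lock_identity *held)
{
	return held->ptr != NULL;
}

/* Whether a register refers to the held lock needs both ptr and id. */
static bool is_same_lock(const struct lock_identity *held,
			 const void *reg_ptr, uint32_t reg_id)
{
	return lock_is_held(held) &&
	       held->ptr == reg_ptr && held->id == reg_id;
}
```

With this shape, two lookups into the same map that carry different ids
compare as different locks, while two loads of the same global variable
(id 0, same ptr) compare as the same lock.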

Note that this can be extended in the future to also remember offset
used for locking, so that we can introduce multiple bpf_spin_lock fields
in the same allocation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
4e814da0d5 bpf: Allow locking bpf_spin_lock in allocated objects
Allow locking a bpf_spin_lock in an allocated object, in addition to
already supported map value pointers. The handling is similar to that of
map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
as well, and adjusting process_spin_lock to work with them and remember
the id in verifier state.

Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
preserve reg->id, that will be used in env->cur_state->active_spin_lock
to remember the currently held spin lock.

Also update the comment describing bpf_spin_lock implementation details
to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
865ce09a49 bpf: Verify ownership relationships for user BTF types
Ensure that there can be no ownership cycles among different types by
way of having owning objects that can hold some other type as their
element. For instance, a map value can only hold allocated objects, but
these are allowed to have another bpf_list_head. To prevent unbounded
recursion while freeing resources, elements of bpf_list_head in local
kptrs can never have a bpf_list_head which are part of list in a map
value. Later patches will verify this by having dedicated BTF selftests.

Also, to make runtime destruction easier, once btf_struct_metas is fully
populated, we can stash the metadata of the value type directly in the
metadata of the list_head fields, as that allows easier access to the
value type's layout to destruct it at runtime from the btf_field entry
of the list head itself.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
8ffa5cc142 bpf: Recognize lock and list fields in allocated objects
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in
an allocated object.

Also update btf_struct_access to reject direct access to these special
fields.

A bpf_list_head allows implementing map-in-map style use cases, where an
allocated object with bpf_list_head is linked into a list in a map
value. This would require embedding a bpf_list_node, support for which
is also included. The bpf_spin_lock is used to protect the bpf_list_head
and other data.

While we don't strictly need to hold a bpf_spin_lock while touching the
bpf_list_head in such objects, since having access to the object means
having complete ownership of it, the locking constraint is still kept
and may be conditionally lifted in the future.

Note that the specification of such types can be done just like map
values, e.g.:

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
	struct bpf_list_node node;
};

struct map_value {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(foo, node);
};

To recognize such types in user BTF, we build a btf_struct_metas array
of metadata items corresponding to each BTF ID. This is done once during
the btf_parse stage to avoid redoing it each time the verifier needs to
inspect the metadata.

Moreover, the computed metadata needs to be passed to some helpers in
future patches which requires allocating them and storing them in the
BTF that is pinned by the program itself, so that valid access can be
assumed to such data during program runtime.

A key thing to note is that once a btf_struct_meta is available for a
type, both the btf_record and btf_field_offs should be available. It is
critical that btf_field_offs is available in case special fields are
present, as we extensively rely on special fields being zeroed out in
map values and allocated objects in later patches. The code ensures that
by bailing out in case of errors and ensuring both are available
together. If the record is not available, the special fields won't be
recognized, so not having both is also fine (in terms of being a
verification error and not a runtime bug).

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
282de143ea bpf: Introduce allocated objects support
Introduce support for representing pointers to objects allocated by the
BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
This is indicated by the presence of MEM_ALLOC type flag in reg->type to
avoid having to check btf_is_kernel when trying to match argument types
in helpers.

Whenever walking such types, any pointers being walked will always yield
a SCALAR instead of pointer. In the future we might permit kptr inside
such allocated objects (either kernel or program allocated), and it will
then form a PTR_TO_BTF_ID of the respective type.

For now, such allocated objects will always be referenced in verifier
context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
to such objects, as long as fields that are special are not touched
(support for which will be added in subsequent patches). Note that once
such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
to it.

No PROBE_MEM handling is therefore done for loads into this type unless
PTR_UNTRUSTED is part of the register type, since they can never be in
an undefined state, and their lifetime will always be valid.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
f73e601aaf bpf: Populate field_offs for inner_map_meta
Far too much code simply assumes that both btf_record and btf_field_offs
are set to valid pointers together, or both are unset. They go hand in
hand, as btf_record describes the special fields and btf_field_offs is
a compact representation for runtime copying/zeroing.

It is very difficult to make this clear in the code when the only
exception to this universal invariant is inner_map_meta which is used
as reg->map_ptr in the verifier. This is simply a bug waiting to happen,
as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE
is coming from an inner map, and if we ever end up using field_offs for
any reason in the future, we will silently ignore the special fields for
inner map case (as NULL is not an error but unset field_offs).

Hence, simply copy field_offs from inner map together with btf_record.

While at it, refactor code to unwind properly on errors with gotos.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi
d48995723c bpf: Free inner_map_meta when btf_record_dup fails
Whenever btf_record_dup fails, we must free inner_map_meta that was
allocated before.

This fixes a memory leak (in case of errors) during inner map creation.
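
The error path can be sketched in user-space C as follows. This is an
illustration under assumptions: all names are hypothetical, NULL stands
in for the kernel's error pointers, and the live_metas counter exists
only so the sketch can show that nothing is leaked.

```c
#include <assert.h>
#include <stdlib.h>

struct record { int cnt; };                  /* stand-in for btf_record */
struct map_meta { struct record *record; };  /* stand-in for inner_map_meta */

static int live_metas;  /* instrumentation for this sketch only */

/* Duplicate a record; simulate_failure forces the error path. */
static struct record *record_dup(const struct record *src,
				 int simulate_failure)
{
	struct record *dup;

	if (simulate_failure)
		return NULL;
	dup = malloc(sizeof(*dup));
	if (dup)
		*dup = *src;
	return dup;
}

static struct map_meta *map_meta_alloc(const struct record *src,
				       int simulate_failure)
{
	struct map_meta *meta = calloc(1, sizeof(*meta));

	if (!meta)
		return NULL;
	live_metas++;

	meta->record = record_dup(src, simulate_failure);
	if (!meta->record)
		goto free_meta;
	return meta;

free_meta:
	/* Without this unwind step the meta allocated above would leak. */
	free(meta);
	live_metas--;
	return NULL;
}
```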

Fixes: aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi
d7f5ef653c bpf: Do btf_record_free outside map_free callback
Since the commit being fixed, we now miss freeing btf_record for local
storage maps which will have a btf_record populated in case they have
bpf_spin_lock element.

This was missed because I made the choice of offloading the job to free
kptr_off_tab (now btf_record) to the map_free callback when adding
support for kptrs.

Revisiting the reason for this decision, there is the possibility that
the btf_record gets used inside map_free callback (e.g. in case of maps
embedding kptrs) to iterate over them and free them, hence freeing it
before the map_free callback would leak special field memory and cause
invalid memory accesses. The btf_record keeps module references, which
is critical to ensure the dtor call made for referenced kptr is safe to
do.

If doing it after map_free callback, the map area is already freed, so
we cannot access bpf_map structure anymore.

To fix this and prevent such lapses in future, move bpf_map_free_record
out of the map_free callback, and do it after map_free by remembering
the btf_record pointer. There is no need to access bpf_map structure in
that case, and we can avoid missing this case when support for new map
types is added for other special fields.
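
The ordering described above can be sketched in user-space C. All names
are hypothetical stand-ins, and the records_freed counter is
instrumentation for the sketch only.

```c
#include <assert.h>
#include <stdlib.h>

struct record { int nfields; };  /* stand-in for btf_record */

struct map {
	struct record *record;
	/* Frees the map itself; may still read map->record while running. */
	void (*map_free)(struct map *map);
};

static int records_freed;  /* instrumentation for this sketch only */

static void record_free(struct record *rec)
{
	free(rec);
	records_freed++;
}

static void simple_map_free(struct map *map)
{
	free(map);
}

static void map_free_deferred(struct map *map)
{
	/* Remember the record before the map memory goes away. */
	struct record *rec = map->record;

	map->map_free(map);
	/* The map must not be dereferenced from here on. */
	record_free(rec);
}
```

Because the record is freed in one place after the callback, a new map
type does not need to remember to free it in its own map_free.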

Since a btf_record and its btf_field_offs are used together, for
consistency delay freeing of field_offs as well. While not a problem
right now, a lot of code assumes that either both record and field_offs
are set or neither is.

Note that in case of map of maps (outer maps), inner_map_meta->record is
only used during verification, not to free fields in map value, hence we
simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
never touch map->inner_map_meta in bpf_map_free_deferred.

Add a comment making note of these details.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi
c237bfa528 bpf: Fix early return in map_check_btf
Instead of returning directly with -EOPNOTSUPP for the timer case, we
need to free the btf_record before returning to userspace.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Björn Töpel
98b2afc8a6 selftests/bpf: Pass target triple to get_sys_includes macro
When cross-compiling [1], the get_sys_includes make macro should use
the target system include path, and not the build host's system include
path.

Make clang honor the CROSS_COMPILE triple.

[1] e.g. "ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- make"

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Link: https://lore.kernel.org/bpf/20221115182051.582962-2-bjorn@kernel.org
2022-11-17 15:14:35 -08:00
Björn Töpel
c4525f05ca selftests/bpf: Explicitly pass RESOLVE_BTFIDS to sub-make
When cross-compiling selftests/bpf, the resolve_btfids binary ends up
in a different directory than the regular resolve_btfids
builds. Populate RESOLVE_BTFIDS for sub-make, so it can find the
binary.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20221115182051.582962-1-bjorn@kernel.org
2022-11-17 15:14:34 -08:00
Hou Tao
3af43ba4c6 bpf: Pass map file to .map_update_batch directly
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
the target map file, then it invokes generic_map_update_batch() to do
batch update. generic_map_update_batch() will get the target map file
by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().

The problem is that the map file returned by the second fdget() may be
NULL or a totally different file compared to the map file obtained in
bpf_map_do_batch(). The reason is that the first fdget() only guarantees
the liveness of struct file instead of file descriptor, and the file
descriptor may be released by concurrent close() through pick_file().

It doesn't cause any problem for now, because maps with batch update
support don't use map file in .map_fd_get_ptr() ops. But it is better to
fix the potential access of an invalid map file.

Using __bpf_map_get() again in generic_map_update_batch() cannot fix
the problem, because batch.map_fd may be closed and reopened, and the
returned map file may differ from the map file obtained in
bpf_map_do_batch(), so just pass the map file directly to
.map_update_batch() in bpf_map_do_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
2022-11-17 17:12:35 +01:00
Daniel Müller
383f1a8df8 bpf/docs: Include blank lines between bullet points in bpf_devel_QA.rst
Commit 26a9b433cf ("bpf/docs: Document how to run CI without patch
submission") caused a warning to be generated when compiling the
documentation:

 > bpf_devel_QA.rst:55: WARNING: Unexpected indentation.
 > bpf_devel_QA.rst:56: WARNING: Block quote ends without a blank line

This change fixes the problem by inserting the required blank lines.

Fixes: 26a9b433cf ("bpf/docs: Document how to run CI without patch submission")
Reported-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/bpf/20221116174358.2744613-1-deso@posteo.net
2022-11-17 16:26:42 +01:00
Wang Yufen
c453e64cbc selftests/bpf: fix memory leak of lsm_cgroup
kmemleak reports this issue:

unreferenced object 0xffff88810b7835c0 (size 32):
  comm "test_progs", pid 270, jiffies 4294969007 (age 1621.315s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    03 00 00 00 03 00 00 00 0f 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000376cdeab>] kmalloc_trace+0x27/0x110
    [<000000003bcdb3b6>] selinux_sk_alloc_security+0x66/0x110
    [<000000003959008f>] security_sk_alloc+0x47/0x80
    [<00000000e7bc6668>] sk_prot_alloc+0xbd/0x1a0
    [<0000000002d6343a>] sk_alloc+0x3b/0x940
    [<000000009812a46d>] unix_create1+0x8f/0x3d0
    [<000000005ed0976b>] unix_create+0xa1/0x150
    [<0000000086a1d27f>] __sock_create+0x233/0x4a0
    [<00000000cffe3a73>] __sys_socket_create.part.0+0xaa/0x110
    [<0000000007c63f20>] __sys_socket+0x49/0xf0
    [<00000000b08753c8>] __x64_sys_socket+0x42/0x50
    [<00000000b56e26b3>] do_syscall_64+0x3b/0x90
    [<000000009b4871b8>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

The issue occurs in the following scenarios:

unix_create1()
  sk_alloc()
    sk_prot_alloc()
      security_sk_alloc()
        call_int_hook()
          hlist_for_each_entry()
            entry1->hook.sk_alloc_security
            <-- selinux_sk_alloc_security() succeeded,
            <-- sk->security alloced here.
            entry2->hook.sk_alloc_security
            <-- bpf_lsm_sk_alloc_security() failed
      goto out_free;
        ...    <-- the sk->security not freed, memleak

The core problem is that the LSM is not yet fully stacked (work is
actively going on in this space) which means that some LSM hooks do
not support multiple LSMs at the same time. To fix, skip the
"EPERM" test when it runs in environments that already have
non-bpf LSMs installed.

Fixes: dca85aac88 ("selftests/bpf: lsm_cgroup functional test")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Cc: Stanislav Fomichev <sdf@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/1668482980-16163-1-git-send-email-wangyufen@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-16 23:00:56 -08:00
Eduard Zingerman
5b1d640800 selftests/bpf: allow unpriv bpf for selftests by default
Enable unprivileged bpf for selftests kernel by default.
This forces CI to run test_verifier tests in both privileged
and unprivileged modes.

The test_verifier.c:do_test uses sysctl kernel.unprivileged_bpf_disabled
to decide whether to run or to skip test cases in unprivileged mode.
The CONFIG_BPF_UNPRIV_DEFAULT_OFF controls the default value of the
kernel.unprivileged_bpf_disabled.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20221116015456.2461135-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 21:38:04 -08:00
Tiezhu Yang
df9c41e9db bpftool: Check argc first before "file" in do_batch()
If batch is given more than 2 parameters, checking argc first lets
do_batch() return immediately, with no need to call is_prefix() on
"file" (a small overhead) before checking argc. It is better to check
"file" only when batch is given exactly 2 parameters.
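
A minimal sketch of the check ordering, assuming argv[0] is expected to
be "file" followed by one path. Both function names are hypothetical,
and is_prefix_of() is only a stand-in for bpftool's own prefix matcher.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in matcher: true when arg is a non-empty prefix of word. */
static bool is_prefix_of(const char *arg, const char *word)
{
	size_t len = strlen(arg);

	return len > 0 && len <= strlen(word) &&
	       strncmp(arg, word, len) == 0;
}

/* Returns 0 when the arguments look like "file <path>", -1 otherwise. */
static int check_batch_args(int argc, const char **argv)
{
	if (argc < 2)
		return -1;	/* too few parameters */
	if (argc > 2)
		return -1;	/* too many; argv is never inspected */
	if (!is_prefix_of(argv[0], "file"))
		return -1;	/* expected "file" */
	return 0;
}
```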

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/1668517207-11822-1-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 21:36:24 -08:00
Donald Hunter
e0eb60829a docs/bpf: Fix sample code in MAP_TYPE_ARRAY docs
Remove mistaken & from code example in MAP_TYPE_ARRAY docs

Fixes: 1cfa97b30c ("bpf, docs: Document BPF_MAP_TYPE_ARRAY")
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20221115095910.86407-1-donald.hunter@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 21:33:47 -08:00
Alexei Starovoitov
6373ef1c5e Merge branch 'propagate nullness information for reg to reg comparisons'
Eduard Zingerman says:

====================

This patchset adds the ability to propagate nullness information for
branches of register to register equality compare instructions. The
following rules are used:
 - suppose register A may be null
 - suppose register B is not null
 - for JNE A, B, ... - A is not null in the false branch
 - for JEQ A, B, ... - A is not null in the true branch

E.g. for program like below:

  r6 = skb->sk;
  r7 = sk_fullsock(r6);
  r0 = sk_fullsock(r6);
  if (r0 == 0) return 0;    (a)
  if (r0 != r7) return 0;   (b)
  *r7->type;                (c)
  return 0;

It is safe to dereference r7 at point (c), because of (a) and (b).

The utility of this change came up while working on BPF Clang backend
issue [1]. Specifically, while debugging an issue with the selftest
`test_sk_lookup.c`. This test has the following structure:

    int access_ctx_sk(struct bpf_sk_lookup *ctx __CTX__)
    {
        struct bpf_sock *sk1 = NULL, *sk2 = NULL;
        ...
        sk1 = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
        if (!sk1)           // (a)
            goto out;
        ...
        if (ctx->sk != sk1) // (b)
            goto out;
        ...
        if (ctx->sk->family != AF_INET ||     // (c)
            ctx->sk->type != SOCK_STREAM ||
            ctx->sk->state != BPF_TCP_LISTEN)
            goto out;
            ...
    }

- at (a) `sk1` is checked to be not null;
- at (b) `ctx->sk` is verified to be equal to `sk1`;
- at (c) `ctx->sk` is accessed w/o nullness check.

Currently the Global Value Numbering pass considers expressions `sk1`
and `ctx->sk` to be identical at point (c) and replaces `ctx->sk` with
`sk1` (not expressions themselves but corresponding SSA values).
Since `sk1` is known to be not null after (b), the verifier allows
execution of the program.

However, such optimization is not guaranteed to happen. When it does
not happen, the verifier reports an error.

Changelog:
v2 -> v3:
 - verifier tests are updated with correct error message for
   unprivileged mode (pointer comparisons are forbidden in
   unprivileged mode).

v1 -> v2:
 - after investigation described in [2] as suggested by John, Daniel
   and Shung-Hsi, function `type_is_pointer` is removed, calls to this
   function are replaced by `__is_pointer_value(false, src_reg)`.

RFC -> v1:
 - newly added if block in `check_cond_jmp_op` is moved down to keep
   `make_ptr_not_null_reg` actions together;
 - tests rewritten to have a single `r0 = 0; exit;` block.

[1]   https://reviews.llvm.org/D131633#3722231
[2]   https://lore.kernel.org/bpf/bad8be826d088e0d180232628160bf932006de89.camel@gmail.com/
[RFC] https://lore.kernel.org/bpf/20220822094312.175448-1-eddyz87@gmail.com/
[v1]  https://lore.kernel.org/bpf/20220826172915.1536914-1-eddyz87@gmail.com/
[v2]  https://lore.kernel.org/bpf/20221106214921.117631-1-eddyz87@gmail.com/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 17:38:36 -08:00
Eduard Zingerman
4741c371aa selftests/bpf: check nullness propagation for reg to reg comparisons
Verify that nullness information is propagated in the branches of
register to register JEQ and JNE operations.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221115224859.2452988-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 17:38:36 -08:00
Eduard Zingerman
befae75856 bpf: propagate nullness information for reg to reg comparisons
Propagate nullness information for branches of register to register
equality compare instructions. The following rules are used:
- suppose register A may be null
- suppose register B is not null
- for JNE A, B, ... - A is not null in the false branch
- for JEQ A, B, ... - A is not null in the true branch

E.g. for program like below:

  r6 = skb->sk;
  r7 = sk_fullsock(r6);
  r0 = sk_fullsock(r6);
  if (r0 == 0) return 0;    (a)
  if (r0 != r7) return 0;   (b)
  *r7->type;                (c)
  return 0;

It is safe to dereference r7 at point (c), because of (a) and (b).
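
The two rules can be written down as a small decision function. This is
a sketch only: the enum, struct and function names are hypothetical and
do not correspond to verifier code.

```c
#include <assert.h>
#include <stdbool.h>

enum jmp_op { OP_JEQ, OP_JNE };  /* hypothetical opcode names */

/* In which branch, if any, register A becomes known to be non-null. */
struct nullness_update {
	bool a_not_null_in_true_branch;
	bool a_not_null_in_false_branch;
};

static struct nullness_update propagate_nullness(enum jmp_op op,
						 bool a_maybe_null,
						 bool b_not_null)
{
	struct nullness_update out = { false, false };

	/* The rules only apply when A may be null and B is not null. */
	if (!a_maybe_null || !b_not_null)
		return out;

	if (op == OP_JEQ)
		out.a_not_null_in_true_branch = true;	/* A == B holds */
	else
		out.a_not_null_in_false_branch = true;	/* A != B failed */
	return out;
}
```

In the example program, (b) is a JNE with r7 as A and r0 as B, so r7 is
known to be non-null on the fall-through path that reaches (c).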

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221115224859.2452988-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 17:38:36 -08:00
Toke Høiland-Jørgensen
32637e3300 bpf: Expand map key argument of bpf_redirect_map to u64
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare for this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.

Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 09:00:27 -08:00
Toke Høiland-Jørgensen
14d898f3c1 dev: Move received_rps counter next to RPS members in softnet data
Move the received_rps counter value next to the other RPS-related members
in softnet_data. This closes two four-byte holes in the structure, making
room for another pointer in the first two cache lines without bumping the
xmit struct to its own line.
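
How moving one four-byte member closes two holes can be illustrated with
a pair of toy structs. The layouts and member names below are
hypothetical and much smaller than softnet_data; the padding comments
assume a typical 64-bit ABI with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Before: each 4-byte counter is followed by a 4-byte hole. */
struct layout_before {
	void *queue;
	unsigned int processed;		/* 4 bytes + 4-byte hole */
	void *rps_list;
	unsigned int received_rps;	/* 4 bytes + 4-byte hole */
	void *xmit;
};

/* After: the two counters share one 8-byte slot, leaving no holes. */
struct layout_after {
	void *queue;
	unsigned int processed;
	unsigned int received_rps;
	void *rps_list;
	void *xmit;
};

/* Bytes saved by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct layout_before) - sizeof(struct layout_after);
}
```

On such an ABI the trailing pointer moves 8 bytes closer to the start of
the struct, which is the kind of room the commit frees up.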

Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-2-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 09:00:27 -08:00
Daniel Müller
26a9b433cf bpf/docs: Document how to run CI without patch submission
This change documents the process for running the BPF CI before
submitting a patch to the upstream mailing list, similar to what happens
if a patch is sent to bpf@vger.kernel.org: it builds kernel and
selftests and runs the latter on different architectures (but it notably
does not cover stylistic checks such as cover letter verification).

Running BPF CI this way can help achieve better test coverage ahead of
patch submission than merely running locally (say, using
tools/testing/selftests/bpf/vmtest.sh), as additional architectures may
be covered as well.

Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221114211501.2068684-1-deso@posteo.net
2022-11-15 15:33:27 +01:00
Kumar Kartikeya Dwivedi
6728aea721 bpf: Refactor btf_struct_access
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
894f2a8b16 bpf: Rename MEM_ALLOC to MEM_RINGBUF
Currently, verifier uses MEM_ALLOC type tag to specially tag memory
returned from bpf_ringbuf_reserve helper. However, this is currently
only used for this purpose and there is an implicit assumption that it
only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
in check_func_arg_reg_off).

Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
relationship and instead open the use of MEM_ALLOC for more generic
allocations made for user types.

Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.

Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
2de2669b4e bpf: Rename RET_PTR_TO_ALLOC_MEM
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
imply that it carries MEM_ALLOC, while only the latter does. This causes
confusion during code review leading to conclusions like that the return
value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.

Rename it to make it clear MEM_ALLOC needs to be tacked on top of
RET_PTR_TO_MEM.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
f0c5941ff5 bpf: Support bpf_list_head in map values
Add the support on the map side to parse, recognize, verify, and build
metadata table for a new special field of the type struct bpf_list_head.
To parameterize the bpf_list_head for a certain value type and the
list_node member it will accept in that value type, we use BTF
declaration tags.

The definition of bpf_list_head in a map value will be done as follows:

struct foo {
	struct bpf_list_node node;
	int data;
};

struct map_value {
	struct bpf_list_head head __contains(foo, node);
};

Then, the bpf_list_head only allows adding to the list 'head' using the
bpf_list_node 'node' for the type struct foo.

The 'contains' annotation is a BTF declaration tag composed of three
parts, "contains:name:node" where the name is then used to look up the
type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
the lookup. The node defines the name of the member in this type that
has the type struct bpf_list_node, which is actually used for linking
into the linked list. For now, the 'kind' part is hardcoded as struct.
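
Splitting such a tag string can be sketched as below. The helper is
hypothetical and written for user space; it is not the kernel's parser
and only illustrates the "contains:name:node" shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Split "contains:<name>:<node>" into name and node. Returns false for
 * malformed tags or when a part does not fit into its buffer.
 */
static bool parse_contains_tag(const char *tag,
			       char *name, size_t name_sz,
			       char *node, size_t node_sz)
{
	static const char prefix[] = "contains:";
	const char *value_type, *sep, *node_name;
	size_t name_len, node_len;

	if (strncmp(tag, prefix, sizeof(prefix) - 1) != 0)
		return false;
	value_type = tag + sizeof(prefix) - 1;

	sep = strchr(value_type, ':');
	if (!sep)
		return false;
	name_len = (size_t)(sep - value_type);
	node_name = sep + 1;
	node_len = strlen(node_name);

	if (name_len == 0 || node_len == 0)
		return false;
	if (name_len >= name_sz || node_len >= node_sz)
		return false;

	memcpy(name, value_type, name_len);
	name[name_len] = '\0';
	memcpy(node, node_name, node_len + 1);	/* includes the NUL */
	return true;
}
```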

This allows building intrusive linked lists in BPF, using container_of
to obtain pointer to entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.
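
For readers unfamiliar with intrusive lists, here is a user-space sketch
of recovering the entry from its embedded node. struct list_node is a
stand-in for bpf_list_node, the macro omits the kernel's type checking,
and foo_from_node() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of container_of, without type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_node {		/* stand-in for struct bpf_list_node */
	struct list_node *next, *prev;
};

struct foo {
	int data;
	struct list_node node;	/* embedded at a fixed offset */
};

/* Map a pointer to the embedded node back to its containing entry. */
static struct foo *foo_from_node(struct list_node *n)
{
	return container_of(n, struct foo, node);
}
```

The fixed offset of 'node' inside struct foo is what lets the entry type
be derived from a node pointer.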

For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing items is done with
a TODO that will be resolved in a future patch.

Note that the bpf_list_head_free function moves the list out to a local
variable under the lock and releases it, doing the actual draining of
the list items outside the lock. Besides avoiding holding the lock for
too long, which would pessimize other concurrent list operations, this
is also necessary for deadlock prevention: unless every function called
in the critical section is notrace, a fentry/fexit program could attach
and call bpf_map_update_elem again on the map, acquiring the same lock
if the key matches and causing a deadlock. While this requires some
special effort on the part of the BPF programmer to trigger and is
highly unlikely to occur in practice, it is always better if we can
avoid such a condition.
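
The "detach under the lock, drain outside it" idea can be sketched in
userspace C as below. This is not the kernel implementation: the
locked_list type, the atomic_flag spin lock standing in for
bpf_spin_lock, and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct item {
	struct item *next;
	int data;
};

struct locked_list {
	atomic_flag lock;	/* minimal spin lock (stand-in only) */
	struct item *first;
};

static void list_lock(struct locked_list *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock,
						 memory_order_acquire))
		;
}

static void list_unlock(struct locked_list *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Push a new item at the front; returns 0, or -1 if allocation fails. */
static int list_push(struct locked_list *l, int data)
{
	struct item *it = malloc(sizeof(*it));

	if (!it)
		return -1;
	it->data = data;
	list_lock(l);
	it->next = l->first;
	l->first = it;
	list_unlock(l);
	return 0;
}

/* Detach the whole list under the lock, then free the items without
 * holding it. Returns the number of items freed.
 */
static int list_drain(struct locked_list *l)
{
	struct item *it, *next;
	int freed = 0;

	list_lock(l);
	it = l->first;		/* move the list out to a local variable */
	l->first = NULL;
	list_unlock(l);

	for (; it; it = next) {	/* drain outside the critical section */
		next = it->next;
		free(it);
		freed++;
	}
	return freed;
}
```

Nothing that runs while freeing the items can re-enter the critical
section, since the lock has already been released by then.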

While notrace would prevent this, doing the draining outside the lock
has advantages of its own, hence it is used to also fix the deadlock
related problem.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
e5feed0f64 bpf: Fix copy_map_value, zero_map_value
The current offset needs to also skip over the already copied region in
addition to the size of the next field. This case manifests where there
are gaps between adjacent special fields.

It was observed that for a map value with size 48, having fields at:
off:  0, 16, 32
size: 4, 16, 16

The current code does:

memcpy(dst + 0, src + 0, 0)
memcpy(dst + 4, src + 4, 12)
memcpy(dst + 20, src + 20, 12)
memcpy(dst + 36, src + 36, 12)

With the fix, it is done correctly as:

memcpy(dst + 0, src + 0, 0)
memcpy(dst + 4, src + 4, 12)
memcpy(dst + 32, src + 32, 0)
memcpy(dst + 48, src + 48, 0)
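
A userspace sketch of the corrected loop, assuming fields sorted by
offset and non-overlapping, is shown below. The struct field type and
the copy_skipping_fields name are hypothetical; only the offset
arithmetic mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one special field inside a value. */
struct field {
	size_t off;
	size_t sz;
};

/* Copy 'size' bytes from src to dst, skipping every special field.
 * 'fields' must be sorted by offset and must not overlap.
 */
static void copy_skipping_fields(void *dst, const void *src, size_t size,
				 const struct field *fields, size_t cnt)
{
	size_t curr_off = 0;
	size_t i;

	for (i = 0; i < cnt; i++) {
		size_t next_off = fields[i].off;
		size_t sz;

		assert(next_off >= curr_off &&
		       next_off + fields[i].sz <= size);
		sz = next_off - curr_off;	/* gap before this field */
		memcpy((char *)dst + curr_off,
		       (const char *)src + curr_off, sz);
		/* Skip the copied gap as well as the field itself. */
		curr_off += sz + fields[i].sz;
	}
	/* Copy whatever follows the last special field. */
	memcpy((char *)dst + curr_off, (const char *)src + curr_off,
	       size - curr_off);
}
```

With fields at offsets 0, 16, 32 and sizes 4, 16, 16 in a 48 byte
value, only bytes 4 through 15 are copied, matching the fixed sequence
of memcpy calls listed above.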

Fixes: 4d7d7f69f4 ("bpf: Adapt copy_map_value for multiple offset case")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
2d57725257 bpf: Remove BPF_MAP_OFF_ARR_MAX
In f71b2f6417 ("bpf: Refactor map->off_arr handling"), map->off_arr
was refactored to be btf_field_offs. The number of field offsets is
equal to the maximum possible number of fields, limited by
BTF_FIELDS_MAX. Hence, reuse BTF_FIELDS_MAX as spin_lock and timer no
longer need to be handled specially for offset sorting, fix the comment,
and remove the incorrect WARN_ON, as rec->cnt can never exceed this
value. The reason to keep a separate constant was that it was always 2
more than the total kptrs. This is no longer the case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi
1f6d52f1a8 bpf: Remove local kptr references in documentation
We don't want to commit to a specific name for these. Simply call them
allocated objects coming from bpf_obj_new, which is completely clear in
itself.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Andrii Nakryiko
de763fbb2c Merge branch 'libbpf: Fixed various checkpatch issues'
Kang Minchul says:

====================
This patch series contains various checkpatch fixes
in btf.c, libbpf.c, ringbuf.c.

I know these are trivial but some issues are hard to ignore
and I think these checkpatch issues are accumulating.

v1 -> v2: changed cover letter message.
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-11-14 11:43:20 -08:00
Kang Minchul
b486d19a0a libbpf: checkpatch: Fixed code alignments in ringbuf.c
Fixed some checkpatch issues in ringbuf.c

Signed-off-by: Kang Minchul <tegongkang@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221113190648.38556-4-tegongkang@gmail.com
2022-11-14 11:43:17 -08:00
Kang Minchul
e3ba8e4e8c libbpf: Fixed various checkpatch issues in libbpf.c
Fixed following checkpatch issues:

WARNING: Block comments use a trailing */ on a separate line
+        * other BPF program's BTF object */

WARNING: Possible repeated word: 'be'
+        * name. This is important to be be able to find corresponding BTF

ERROR: switch and case should be at the same indent
+       switch (ext->kcfg.sz) {
+               case 1: *(__u8 *)ext_val = value; break;
+               case 2: *(__u16 *)ext_val = value; break;
+               case 4: *(__u32 *)ext_val = value; break;
+               case 8: *(__u64 *)ext_val = value; break;
+               default:

ERROR: trailing statements should be on next line
+               case 1: *(__u8 *)ext_val = value; break;

ERROR: trailing statements should be on next line
+               case 2: *(__u16 *)ext_val = value; break;

ERROR: trailing statements should be on next line
+               case 4: *(__u32 *)ext_val = value; break;

ERROR: trailing statements should be on next line
+               case 8: *(__u64 *)ext_val = value; break;

ERROR: code indent should use tabs where possible
+                }$

WARNING: please, no spaces at the start of a line
+                }$

WARNING: Block comments use a trailing */ on a separate line
+        * for faster search */

ERROR: code indent should use tabs where possible
+^I^I^I^I^I^I        &ext->kcfg.is_signed);$

WARNING: braces {} are not necessary for single statement blocks
+       if (err) {
+               return err;
+       }

ERROR: code indent should use tabs where possible
+^I^I^I^I        sizeof(*obj->btf_modules), obj->btf_module_cnt + 1);$
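
For illustration, a switch statement laid out the way the switch/case
and trailing-statement messages above ask for might look like the
sketch below. The store_sized function is a hypothetical userspace
stand-in using stdint types, not the libbpf code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Store 'value' into a destination that is 'sz' bytes wide.
 * Returns 0 on success, -1 for an unsupported size.
 */
static int store_sized(void *dst, int sz, uint64_t value)
{
	switch (sz) {
	case 1:
		*(uint8_t *)dst = value;
		break;
	case 2:
		*(uint16_t *)dst = value;
		break;
	case 4:
		*(uint32_t *)dst = value;
		break;
	case 8:
		*(uint64_t *)dst = value;
		break;
	default:
		return -1;
	}
	return 0;
}
```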

Signed-off-by: Kang Minchul <tegongkang@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221113190648.38556-3-tegongkang@gmail.com
2022-11-14 11:43:06 -08:00
Kang Minchul
c7694ac340 libbpf: checkpatch: Fixed code alignments in btf.c
Fixed some checkpatch issues in btf.c

Signed-off-by: Kang Minchul <tegongkang@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221113190648.38556-2-tegongkang@gmail.com
2022-11-14 11:42:53 -08:00
Maryam Tahhan
e662c77536 bpf, docs: Fixup cpumap sphinx >= 3.1 warning
Fixup bpf_map_update_elem() declaration to use a single line.

Reported-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Akira Yokosawa <akiyks@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221113103327.3287482-1-mtahhan@redhat.com
2022-11-14 19:10:57 +01:00
David Michael
dfd0afbf15 libbpf: Fix uninitialized warning in btf_dump_dump_type_data
GCC 11.3.0 fails to compile btf_dump.c due to the following error,
which seems to originate in btf_dump_struct_data where the returned
value would be uninitialized if btf_vlen returns zero.

btf_dump.c: In function ‘btf_dump_dump_type_data’:
btf_dump.c:2363:12: error: ‘err’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2363 |         if (err < 0)
      |            ^
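
The class of problem can be sketched as follows: a result variable that
is only assigned inside a loop stays unset when the loop runs zero
times. The visit_members and fail_on_third names are hypothetical and
only illustrate the pattern, not the actual btf_dump.c code.

```c
#include <assert.h>

/* Hypothetical per-member callback: returns 0 or a negative error. */
typedef int (*member_fn)(int idx);

/* Visit 'n' members and return the last callback result.
 * Initializing 'err' matters when n == 0: the loop body never runs, so
 * without the initializer an indeterminate value would be returned.
 */
static int visit_members(int n, member_fn fn)
{
	int err = 0;
	int i;

	for (i = 0; i < n; i++) {
		err = fn(i);
		if (err < 0)
			return err;
	}
	return err;
}

/* Example callback that fails on the third member (index 2). */
static int fail_on_third(int idx)
{
	return idx == 2 ? -1 : 0;
}
```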

Fixes: 920d16af9b ("libbpf: BTF dumper support for typed data")
Signed-off-by: David Michael <fedora.dm0@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/87zgcu60hq.fsf@gmail.com
2022-11-14 19:02:32 +01:00
David S. Miller
f12ed9c048 mlx5-updates-2022-11-12
Misc updates to mlx5 driver
 
 1) Support enhanced CQE compression, on ConnectX6-Dx
    Reduce irq rate, cpu utilization and latency.
 
 2) Connection tracking: Optimize the pre_ct table lookup for rules
    installed on chain 0.
 
 3) implement ethtool get_link_ext_stats for PHY down events
 
 4) Expose device vhca_id to debugfs
 
 5) misc cleanups and trivial changes

Merge tag 'mlx5-updates-2022-11-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2022-11-12

Misc updates to mlx5 driver

1) Support enhanced CQE compression, on ConnectX6-Dx
   Reduce irq rate, cpu utilization and latency.

2) Connection tracking: Optimize the pre_ct table lookup for rules
   installed on chain 0.

3) implement ethtool get_link_ext_stats for PHY down events

4) Expose device vhca_id to debugfs

5) misc cleanups and trivial changes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:35:28 +00:00
Shenwei Wang
6970ef27ff net: fec: add xdp and page pool statistics
Added xdp and page pool statistics.
In order to keep the implementation simple and compatible, the patch
uses 32-bit integers to record the XDP statistics.

Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:28:55 +00:00
David S. Miller
4d1bbdf57b Merge branch 'sparx5-sorted-VCAP-rules'
Steen Hegelund says:

====================
net: Add support for sorted VCAP rules in Sparx5

This provides support for adding Sparx5 VCAP rules in sorted order, VCAP
rule counters and TC filter matching on ARP frames.

It builds on top of the initial IS2 VCAP support found in these series:

https://lore.kernel.org/all/20221020130904.1215072-1-steen.hegelund@microchip.com/
https://lore.kernel.org/all/20221109114116.3612477-1-steen.hegelund@microchip.com/

Functionality
=============

When a new VCAP rule is added the driver will now ensure that the rule is
inserted in sorted order, and when a rule is removed, the remaining rules
will be moved to keep the sorted order and remove any gaps in the VCAP
address space.

A VCAP rule is ordered using these 3 values:

 - Rule size: the count of VCAP addresses used by the rule.  The largest
   rule has the highest priority.

 - Rule User: The rules are ordered by the user enumeration

 - Priority: The priority provided in the flower filter.  The lowest value
   has the highest priority.

A VCAP instance may contain the counter as part of the VCAP cache area, and
this counter may be one or more bits in width.  This type of counter
automatically increments its value when the rule is hit.

Other VCAP instances have a dedicated counter area outside of the VCAP and
in this case the rule must contain the counter id to be able to locate the
counter value and cause the counter to be incremented.  In this case there
must also be a VCAP rule action that sets the counter id.

The Sparx5 IS2 VCAP uses a dedicated counter area with 32bit counters.

This series adds support for getting VCAP rule counters and provides these
via the TC statistic interface.

This only supports packet counters, not byte counters.

Finally the series adds support for the ARP frame dissector and configures
the Sparx5 IS2 VCAP to generate the ARP keyset when ARP traffic is
received.

Delivery:
=========

This is the current plan for delivering the full VCAP feature set of Sparx5:

- DebugFS support for inspecting rules
- TC protocol all support
- Sparx5 IS0 VCAP support
- TC policer and drop action support (depends on the Sparx5 QoS support
  upstreamed separately)
- Sparx5 ES0 VCAP support
- TC flower template support
- TC matchall filter support for mirroring and policing ports
- TC flower filter mirror action support
- Sparx5 ES2 VCAP support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:17 +00:00
Steen Hegelund
dccc30cc49 net: microchip: sparx5: Add KUNIT test of counters and sorted rules
This tests the insertion, moving and deletion of rules and checks that the
unused VCAP addresses are initialized correctly.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:17 +00:00
Steen Hegelund
40e7fe18ab net: microchip: sparx5: Add support for TC flower filter statistics
This provides flower filter packet statistics (bytes are not supported) via
the dedicated IS2 counter feature.

All rules having the same TC cookie will contribute to the packet
statistics for the filter as they are considered to be part of the same TC
flower filter.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:17 +00:00
Steen Hegelund
f13230a474 net: microchip: sparx5: Add support for IS2 VCAP rule counters
This adds API methods to set and get a rule counter.

A VCAP instance may contain the counter as part of the VCAP cache area, and
this counter may be one or more bits in width.  This type of counter
automatically increments its value when the rule is hit.

Other VCAP instances have a dedicated counter area outside of the VCAP and
in this case the rule must contain the counter id to be able to locate the
counter value.  In this case there must also be a rule action that updates
the counter using the rule id when the rule is hit.

The Sparx5 IS2 VCAP uses a dedicated counter area.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:17 +00:00
Steen Hegelund
990e483981 net: microchip: sparx5: Add/delete rules in sorted order
This adds sorting criteria to rule insertion and deletion.

The criteria are (in the listed order):

- Rule size (largest size first)
- User (based on an enumerated user value)
- Priority (highest priority first, aka lowest value)
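
The ordering above can be expressed as a three-level comparator, as in
the sketch below. It sorts an array with qsort() rather than inserting
into the VCAP in place as the driver does, and the struct rule layout,
its field names, and the ascending order of the user value are
assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical rule record; field names are illustrative only. */
struct rule {
	int size;	/* number of VCAP addresses used */
	int user;	/* enumerated user value */
	int priority;	/* lower value means higher priority */
};

/* qsort() comparator: largest size first, then user (assumed
 * ascending), then lowest priority value first.
 */
static int rule_cmp(const void *a, const void *b)
{
	const struct rule *x = a, *y = b;

	if (x->size != y->size)
		return x->size > y->size ? -1 : 1;
	if (x->user != y->user)
		return x->user < y->user ? -1 : 1;
	if (x->priority != y->priority)
		return x->priority < y->priority ? -1 : 1;
	return 0;
}

/* Sort 'cnt' rules in place according to rule_cmp(). */
static void sort_rules(struct rule *rules, size_t cnt)
{
	qsort(rules, cnt, sizeof(*rules), rule_cmp);
}
```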

When a rule is deleted the other rules may need to be moved to fill the gap
to use the available VCAP address space in the best possible way.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:17 +00:00
Steen Hegelund
3a344f99bb net: microchip: sparx5: Add support for TC flower ARP dissector
This adds support for Sparx5 for dissecting TC ARP flower filter keys and
sets up the Sparx5 IS2 VCAP to generate the ARP keyset for ARP frames.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:16 +00:00
Steen Hegelund
70ea86a0df net: flow_offload: add support for ARP frame matching
This adds a new flow_rule_match_arp function that allows drivers
to be able to dissect ARP frames.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:16 +00:00
xu xin
2fd450cd83 ipasdv4/tcp_ipv4: remove redundant assignment
The value of 'st->state' has already been verified to be
TCP_SEQ_STATE_LISTENING, so assigning TCP_SEQ_STATE_LISTENING to it again
is unnecessary and can be removed.

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:20:15 +00:00
David S. Miller
8a30b30b26 Merge branch 'ibmvnic-affinity-hints'
Nick Child says:

====================
ibmvnic: Introduce affinity hint support

This is a patchset to do 3 things to improve ibmvnic performance:
    1. Assign affinity hints to ibmvnic queue irq's
    2. Update affinity hints on cpu hotplug events
    3. Introduce transmit packet steering (XPS)

NOTE: If irqbalance is running, you need to stop it from overriding
  our affinity hints. To do this you can do one of:
   - systemctl stop irqbalance
   - ban the ibmvnic module irqs
      - you must have the latest irqbalance v9.2, the banmod argument was broken before this
      - in /etc/sysconfig/irqbalance -> IRQBALANCE_ARGS="--banmod=ibmvnic"
      - systemctl restart irqbalance
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 10:47:07 +00:00
Nick Child
df8f66d02d ibmvnic: Update XPS assignments during affinity binding
Transmit Packet Steering (XPS) maps cpu numbers to transmit
queues. By running the same connection on the same set of cpu's,
contention for the queue and cache miss rate can be minimized.
When assigning a cpu mask for a transmit queue's irq number, assign
the same cpu mask as the set of cpu's that XPS should use for that
queue.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 10:47:07 +00:00
Nick Child
92125c3a60 ibmvnic: Add hotpluggable CPU callbacks to reassign affinity hints
When CPU's are added and removed, ibmvnic devices will reassign
hint values. Introduce a new cpu hotplug state CPUHP_IBMVNIC_DEAD
to signal to ibmvnic devices that the CPU has been removed and it
is time to reset affinity hint assignments. On the other hand,
when CPU's are being added, add a state instance to
CPUHP_AP_ONLINE_DYN which will trigger a reassignment of affinity
hints once the new CPU's are online. This implementation is based
on the virtio_net driver.

Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 10:47:07 +00:00