Commit graph

1029588 commits

Author SHA1 Message Date
Kurt Kanzenbach
2b477d057e igc: Integrate flex filter into ethtool ops
Use the flex filter mechanism to extend the current ethtool filter
operations by intercoperating the user data. This allows to match
eight more bytes within a Ethernet frame in addition to macs, ether
types and vlan.

The matching pattern looks like this:

 * dest_mac [6]
 * src_mac [6]
 * tpid [2]
 * vlan tci [2]
 * ether type [2]
 * user data [8]

This can be used to match Profinet traffic classes by FrameID range.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:07:33 -07:00
Kurt Kanzenbach
6574631b50 igc: Add possibility to add flex filter
The Intel i225 NIC has the possibility to add flex filters which can
match up to the first 128 byte of a packet. These filters are useful
for all kind of packet matching. One particular use case is Profinet,
as the different traffic classes are distinguished by the frame id
range which cannot be matched by any other means.

Add code to configure and enable flex filters.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-07-16 14:07:24 -07:00
Bill Wendling
919d527956 bnx2x: remove unused variable 'cur_data_offset'
Fix the clang build warning:

  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c:1862:13: error: variable 'cur_data_offset' set but not used [-Werror,-Wunused-but-set-variable]
        dma_addr_t cur_data_offset;

Signed-off-by: Bill Wendling <morbo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:56:55 -07:00
Christophe JAILLET
a99f030b24 net: switchdev: Simplify 'mlxsw_sp_mc_write_mdb_entry()'
Use 'bitmap_alloc()/bitmap_free()' instead of hand-writing it.
This makes the code less verbose.

Also, use 'bitmap_alloc()' instead of 'bitmap_zalloc()' because the bitmap
is fully overridden by a 'bitmap_copy()' call just after its allocation.

While at it, remove an extra and unneeded space.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:54:18 -07:00
Yajun Deng
f79a3bcb1a net/sched: Remove unnecessary if statement
It has been deal with the 'if (err' statement in rtnetlink_send()
and rtnl_unicast(). so remove unnecessary if statement.

v2: use the raw name rtnetlink_send().

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:46:35 -07:00
Yajun Deng
cfdf0d9ae7 rtnetlink: use nlmsg_notify() in rtnetlink_send()
The netlink_{broadcast, unicast} don't deal with 'if (err > 0' statement
but nlmsg_{multicast, unicast} do. The nlmsg_notify() contains them.
so use nlmsg_notify() instead. so that the caller wouldn't deal with
'if (err > 0' statement.

v2: use nlmsg_notify() will do well.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:46:35 -07:00
Haiyue Wang
63a9192b8f gve: fix the wrong AdminQ buffer overflow check
The 'tail' pointer is also free-running count, so it needs to be masked
as 'adminq_prod_cnt' does, to become an index value of AdminQ buffer.

Fixes: 5cdad90de6 ("gve: Batch AQ commands for creating and destroying queues.")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16 10:41:40 -07:00
David S. Miller
82a1ffe57e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-07-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 45 non-merge commits during the last 15 day(s) which contain
a total of 52 files changed, 3122 insertions(+), 384 deletions(-).

The main changes are:

1) Introduce bpf timers, from Alexei.

2) Add sockmap support for unix datagram socket, from Cong.

3) Fix potential memleak and UAF in the verifier, from He.

4) Add bpf_get_func_ip helper, from Jiri.

5) Improvements to generic XDP mode, from Kumar.

6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 22:40:10 -07:00
Alexei Starovoitov
c50524ec4e Merge branch 'sockmap: add sockmap support for unix datagram socket'
Cong Wang says:

====================

From: Cong Wang <cong.wang@bytedance.com>

This is the last patchset of the original large patchset. In the
previous patchset, a new BPF sockmap program BPF_SK_SKB_VERDICT
was introduced and UDP began to support it too. In this patchset,
we add BPF_SK_SKB_VERDICT support to Unix datagram socket, so that
we can finally splice Unix datagram socket and UDP socket. Please
check each patch description for more details.

To see the big picture, the previous patchsets are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=1e0ab70778bd86a90de438cc5e1535c115a7c396
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=89d69c5d0fbcabd8656459bc8b1a476d6f1efee4

and this patchset is available here:
https://github.com/congwang/linux/tree/sockmap3
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
v5: lift socket state check for dgram
    remove ->unhash() case
    add retries for EAGAIN in all test cases
    remove an unused parameter of __unix_dgram_recvmsg()
    rebase on the latest bpf-next

v4: fix af_unix disconnect case
    add unix_unhash()
    split out two small patches
    reduce u->iolock critical section
    remove an unused parameter of __unix_dgram_recvmsg()

v3: fix Kconfig dependency
    make unix_read_sock() static
    fix a UAF in unix_release()
    add a missing header unix_bpf.c

v2: separate out from the original large patchset
    rebase to the latest bpf-next
    clean up unix_read_sock()
    export sock_map_close()
    factor out some helpers in selftests for code reuse
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-15 18:17:51 -07:00
Cong Wang
a2ffda38dc selftests/bpf: Add test cases for redirection between udp and unix
Add two test cases to ensure redirection between udp and unix
work bidirectionally.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-12-xiyou.wangcong@gmail.com
2021-07-15 18:17:51 -07:00
Cong Wang
5ea905dd43 selftests/bpf: Add a test case for unix sockmap
Add a test case to ensure redirection between two AF_UNIX
datagram sockets work.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-11-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
0626bc2ff6 selftests/bpf: Factor out add_to_sockmap()
Factor out a common helper add_to_sockmap() which adds two
sockets into a sockmap.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-10-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
d950625c81 selftests/bpf: Factor out udp_socketpair()
Factor out a common helper udp_socketpair() which creates
a pair of connected UDP sockets.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-9-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
9825d866ce af_unix: Implement unix_dgram_bpf_recvmsg()
We have to implement unix_dgram_bpf_recvmsg() to replace the
original ->recvmsg() to retrieve skmsg from ingress_msg.

AF_UNIX is again special here because the lack of
sk_prot->recvmsg(). I simply add a special case inside
unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
c63829182c af_unix: Implement ->psock_update_sk_prot()
Now we can implement unix_bpf_update_proto() to update
sk_prot, especially prot->close().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
c7272e15f0 af_unix: Add a dummy ->close() for sockmap
Unlike af_inet, unix_proto is very different, it does not even
have a ->close(). We have to add a dummy implementation to
satisfy sockmap. Normally it is just a nop, it is introduced only
for sockmap to replace it.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-6-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
83301b5367 af_unix: Set TCP_ESTABLISHED for datagram sockets too
Currently only unix stream socket sets TCP_ESTABLISHED,
datagram socket can set this too when they connect to its
peer socket. At least __ip4_datagram_connect() does the same.

This will be used to determine whether an AF_UNIX datagram
socket can be redirected to in sockmap.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-5-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
29df44fa52 af_unix: Implement ->read_sock() for sockmap
Implement ->read_sock() for AF_UNIX datagram socket, it is
pretty much similar to udp_read_sock().

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-4-xiyou.wangcong@gmail.com
2021-07-15 18:17:50 -07:00
Cong Wang
0c48eefae7 sock_map: Lift socket state restriction for datagram sockets
TCP and other connection oriented sockets have accept()
for each incoming connection on the server side, hence
they can just insert those fd's from accept() to sockmap,
which are of course established.

Now with datagram sockets begin to support sockmap and
redirection, the restriction is no longer applicable to
them, as they have no accept(). So we have to lift this
restriction for them. This is fine, because inside
bpf_sk_redirect_map() we still have another socket status
check, sock_map_redirect_allowed(), as a guard.

This also means they do not have to be removed from
sockmap when disconnecting.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com
2021-07-15 18:17:49 -07:00
Cong Wang
17edea21b3 sock_map: Relax config dependency to CONFIG_NET
Currently sock_map still has Kconfig dependency on CONFIG_INET,
but there is no actual functional dependency on it after we
introduce ->psock_update_sk_prot().

We have to extend it to CONFIG_NET now as we are going to
support AF_UNIX.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
2021-07-15 18:17:49 -07:00
Alexei Starovoitov
1554a080e7 Merge branch 'Add bpf_get_func_ip helper'
Jiri Olsa says:

====================
Add bpf_get_func_ip helper that returns IP address of the
caller function for trampoline and krobe programs.

There're 2 specific implementation of the bpf_get_func_ip
helper, one for trampoline progs and one for kprobe/kretprobe
progs.

The trampoline helper call is replaced/inlined by the verifier
with simple move instruction. The kprobe/kretprobe is actual
helper call that returns prepared caller address.

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/get_func_ip

v4 changes:
  - dropped jit/x86 check for get_func_ip tracing check [Alexei]
  - added code to bpf_get_func_ip_tracing [Alexei]
    and tested that it works without inlining [Alexei]
  - changed has_get_func_ip to check_get_func_ip [Andrii]
  - replaced test assert loop with explicit asserts [Andrii]
  - adde bpf_program__attach_kprobe_opts function
    and use it for offset setup [Andrii]
  - used bpf_program__set_autoload(false) for test6 [Andrii]
  - added Masami's ack

v3 changes:
  - resend with Masami in cc and v3 in each patch subject

v2 changes:
  - use kprobe_running to get kprobe instead of cpu var [Masami]
  - added support to add kprobe on function+offset
    and test for that [Alan]
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-15 17:59:27 -07:00
Jiri Olsa
8237e75420 selftests/bpf: Add test for bpf_get_func_ip in kprobe+offset probe
Adding test for bpf_get_func_ip in kprobe+ofset probe.
Because of the offset value it's arch specific, enabling
the new test only for x86_64 architecture.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-9-jolsa@kernel.org
2021-07-15 17:59:23 -07:00
Alan Maguire
a2488b5f48 libbpf: Allow specification of "kprobe/function+offset"
kprobes can be placed on most instructions in a function, not
just entry, and ftrace and bpftrace support the function+offset
notification for probe placement.  Adding parsing of func_name
into func+offset to bpf_program__attach_kprobe() allows the
user to specify

SEC("kprobe/bpf_fentry_test5+0x6")

...for example, and the offset can be passed to perf_event_open_probe()
to support kprobe attachment.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-8-jolsa@kernel.org
2021-07-15 17:59:14 -07:00
Jiri Olsa
ac0ed48829 libbpf: Add bpf_program__attach_kprobe_opts function
Adding bpf_program__attach_kprobe_opts that does the same
as bpf_program__attach_kprobe, but takes opts argument.

Currently opts struct holds just retprobe bool, but we will
add new field in following patch.

The function is not exported, so there's no need to add
size to the struct bpf_program_attach_kprobe_opts for now.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-7-jolsa@kernel.org
2021-07-15 17:59:14 -07:00
Jiri Olsa
5d8b583d04 selftests/bpf: Add test for bpf_get_func_ip helper
Adding test for bpf_get_func_ip helper for fentry, fexit,
kprobe, kretprobe and fmod_ret programs.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-6-jolsa@kernel.org
2021-07-15 17:59:14 -07:00
Jiri Olsa
9ffd9f3ff7 bpf: Add bpf_get_func_ip helper for kprobe programs
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs,
so it's now possible to call bpf_get_func_ip from both kprobe and
kretprobe programs.

Taking the caller's address from 'struct kprobe::addr', which is
defined for both kprobe and kretprobe.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
2021-07-15 17:59:09 -07:00
Jiri Olsa
9b99edcae5 bpf: Add bpf_get_func_ip helper for tracing programs
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
specifically for all trampoline attach types.

The trampoline's caller IP address is stored in (ctx - 8) address.
so there's no reason to actually call the helper, but rather fixup
the call instruction and return [ctx - 8] value directly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
2021-07-15 17:58:41 -07:00
Jiri Olsa
1e37392ccc bpf: Enable BPF_TRAMP_F_IP_ARG for trampolines with call_get_func_ip
Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.

The BPF_TRAMP_F_IP_ARG adds extra 3 instructions to trampoline code
and is used only by programs with bpf_get_func_ip helper, which is
added in following patch and sets call_get_func_ip bit.

This patch ensures that BPF_TRAMP_F_IP_ARG flag is used only for
trampolines that have programs with call_get_func_ip set.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
2021-07-15 17:16:06 -07:00
Jiri Olsa
7e6f3cd89f bpf, x86: Store caller's ip in trampoline stack
Storing caller's ip in trampoline's stack. Trampoline programs
can reach the IP in (ctx - 8) address, so there's no change in
program's arguments interface.

The IP address is takes from [fp + 8], which is return address
from the initial 'call fentry' call to trampoline.

This IP address will be returned via bpf_get_func_ip helper
helper, which is added in following patches.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-2-jolsa@kernel.org
2021-07-15 17:16:06 -07:00
Daniel Borkmann
7628317192 Merge branch 'bpf-timers'
Alexei Starovoitov says:

====================
The first request to support timers in bpf was made in 2013 before sys_bpf
syscall was added. That use case was periodic sampling. It was address with
attaching bpf programs to perf_events. Then during XDP development the timers
were requested to do garbage collection and health checks. They were worked
around by implementing timers in user space and triggering progs with
BPF_PROG_RUN command. The user space timers and perf_event+bpf timers are not
armed by the bpf program. They're done asynchronously vs program execution.
The XDP program cannot send a packet and arm the timer at the same time. The
tracing prog cannot record an event and arm the timer right away. This large
class of use cases remained unaddressed. The jiffy based and hrtimer based
timers are essential part of the kernel development and with this patch set
the hrtimer based timers will be available to bpf programs.

TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
to make sure bpf progs cannot crash the kernel.

v6->v7:
- address Andrii's comments and add his Acks.

v5->v6:
- address code review feedback from Martin and add his Acks.
- add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
second loop in map_free callbacks.
- add cond_resched_rcu.

v4->v5:
- Martin noticed the following issues:
. prog could be reallocated bpf_patch_insn_data().
Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
during insn patching.
. Added missing rcu_read_lock.
. Removed redundant record_map.
- Discovered few bugs with stress testing:
. One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
while another is trying to do something with the timer like bpf_timer_start/set_callback.
Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
so both cpus spin forever. The same problem existed in bpf_timer_cancel().
One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
the timer callback that tries to do bpf_timer_*() helper on the same timer.
The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
Because of this had to add callback_fn != NULL check to bpf_timer_cb().
Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
bpf_timer_cb is in softirq.
. Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
Fixed similar to bpf_map_put which is irq ready.
- Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
make the main logic more clear, since Martin and Yonghong brought up this concern.

v3->v4:
1.
Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
suggested by Martin. That makes bpf timer api match one to one to
kernel hrtimer api and provides greater flexibility.
2.
Martin also discovered the following issue with uref approach:
bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
rm /sys/fs/bpf/xdp_timer
nc -6 ::1 8888
bpftool net detach xdpgeneric dev lo
The timer callback stays active in the kernel though the prog was detached
and map usercnt == 0.
It happened because 'bpftool prog load' pinned the prog only.
The map usercnt went to zero. Subsequent attach and runs didn't
affect map usercnt. The timer was able to start and bpf_prog_inc itself.
When the prog was detached the prog stayed active.
To address this issue added
if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
to the first patch.
Which means that timers are allowed only in the maps that are held
by user space with open file descriptor or maps pinned in bpffs.
3.
Discovered that timers in inner maps were broken.
The inner map pointers are dynamic. Therefore changed bpf_timer_init()
to accept explicit map pointer supplied by the program instead
of hidden map pointer supplied by the verifier.
To make sure that pointer to a timer actually belongs to that map
added the verifier check in patch 3.
4.
Addressed Yonghong's feedback. Improved comments and added
dynamic in_nmi() check.
Added Acks.

v2->v3:
The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
called to make sure callback code doesn't disappear when timer is active and
drop refcnt when timer cb is done. That led to a ton of race conditions between
callback running and concurrent bpf_timer_init/start/cancel on another cpu,
and concurrent bpf_map_update/delete_elem, and map destroy.

Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
timers that bpf prog armed in a link list and canceled them when prog refcnt
went to zero. The race conditions disappeared, but timers in map-in-map could
not be supported cleanly, since timers in inner maps have inner map's life time
and don't match prog's life time.

This v3 approach makes timers to be owned by maps. It allows timers in inner
maps to be supported from the start. This apporach relies on "user refcnt"
scheme used in prog_array that stores bpf programs for bpf_tail_call. The
bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
callback does decrement the refcnt. The ops->map_release_uref is
responsible for cancelling the timers and dropping prog refcnt when user space
reference to a map is dropped. That addressed all the races and simplified
locking.

Andrii presented a use case where specifying callback_fn in bpf_timer_init()
is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
typically is called outside for timer callback, while bpf_timer_start() most
likely will be called from the callback.
timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
infinite loop to the verifier. The verifier had to be made smarter to recognize
such async callbacks. Patches 7,8,9 addressed that.

Patch 1 and 2 refactoring.
Patch 3 implements bpf timer helpers and locking.
Patch 4 implements map side of bpf timer support.
Patch 5 prevent pointer mismatch in bpf_timer_init.
Patch 6 adds support for BTF in inner maps.
Patch 7 teaches check_cfg() pass to understand async callbacks.
Patch 8 teaches do_check() pass to understand async callbacks.
Patch 9 teaches check_max_stack_depth() pass to understand async callbacks.
Patches 10 and 11 are the tests.

v1->v2:
- Addressed great feedback from Andrii and Toke.
- Fixed race between parallel bpf_timer_*() ops.
- Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
- Disallowed mmap and global timers.
- Allow spin_lock and bpf_timer in an element.
- Fixed memory leaks due to map destruction and LRU eviction.
- A ton more tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-07-15 22:37:02 +02:00
Alexei Starovoitov
61f71e746c selftests/bpf: Add a test with bpf_timer in inner map.
Check that map-in-map supports bpf timers.

Check that indirect "recursion" of timer callbacks works:
timer_cb1() { bpf_timer_set_callback(timer_cb2); }
timer_cb2() { bpf_timer_set_callback(timer_cb1); }

Check that
  bpf_map_release
    htab_free_prealloced_timers
      bpf_timer_cancel_and_free
        hrtimer_cancel
works while timer cb is running.
"while true; do ./test_progs -t timer_mim; done"
is a great stress test. It caught missing timer cancel in htab->extra_elems.

timer_mim_reject.c is a negative test that checks
that timer<->map mismatch is prevented.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-12-alexei.starovoitov@gmail.com
2021-07-15 22:31:11 +02:00
Alexei Starovoitov
3540f7c6b9 selftests/bpf: Add bpf_timer test.
Add bpf_timer test that creates timers in preallocated and
non-preallocated hash, in array and in lru maps.
Let array timer expire once and then re-arm it for 35 seconds.
Arm lru timer into the same callback.
Then arm and re-arm hash timers 10 times each.
At the last invocation of prealloc hash timer cancel the array timer.
Force timer free via LRU eviction and direct bpf_map_delete_elem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-11-alexei.starovoitov@gmail.com
2021-07-15 22:31:11 +02:00
Alexei Starovoitov
7ddc80a476 bpf: Teach stack depth check about async callbacks.
Teach max stack depth checking algorithm about async callbacks
that don't increase bpf program stack size.
Also add sanity check that bpf_tail_call didn't sneak into async cb.
It's impossible, since PTR_TO_CTX is not available in async cb,
hence the program cannot contain bpf_tail_call(ctx,...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
bfc6bb74e4 bpf: Implement verifier support for validation of async callbacks.
bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
and pass them into the helpers as function callbacks.
In case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as special kind
of jump by pushing verifier state into stack and let pop_stack() process it.

Special care needs to be taken during state pruning.
The call insn doing bpf_timer_set_callback has to be a prune_point.
Otherwise short timer callbacks might not have prune points in front of
bpf_timer_set_callback() which means is_state_visited() will be called
after this call insn is processed in __check_func_call(). Which means that
another async_cb state will be pushed to be walked later and the verifier
will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch the
infinite loop detection will trigger false positive. To recognize
this case mark such states as in_async_callback_fn.
To distinguish infinite loop in async callback vs the same callback called
with different arguments for different map and timer add async_entry_cnt
to bpf_func_state.

Enforce return zero from async callbacks.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
86fc6ee6e2 bpf: Relax verifier recursion check.
In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
    bpf_timer_set_callback(.., timer_cb);
}

the 'timer_cb' is a pointer to a function.
ld_imm64 insn is used to carry this pointer.
bpf_pseudo_func() returns true for such ld_imm64 insn.

Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
Relax control flow check to allow such "recursion" that is seen as an infinite
loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
bpf_timer_set_callback() is done in the follow up patch.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
40ec00abf1 bpf: Remember BTF of inner maps.
BTF is required for 'struct bpf_timer' to be recognized inside map value.
The bpf timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:

struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
if (inner_map)
    timer = bpf_map_lookup_elem(&inner_map, ...);

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
3e8ce29850 bpf: Prevent pointer mismatch in bpf_timer_init.
bpf_timer_init() arguments are:
1. pointer to a timer (which is embedded in map element).
2. pointer to a map.
Make sure that pointer to a timer actually belongs to that map.

Use map_uid (which is unique id of inner map) to reject:
inner_map1 = bpf_map_lookup_elem(outer_map, key1)
inner_map2 = bpf_map_lookup_elem(outer_map, key2)
if (inner_map1 && inner_map2) {
    timer = bpf_map_lookup_elem(inner_map1);
    if (timer)
        // mismatch would have been allowed
        bpf_timer_init(timer, inner_map2);
}

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
68134668c1 bpf: Add map side support for bpf timers.
Restrict bpf timers to array, hash (both preallocated and kmalloced), and
lru map types. The per-cpu maps with timers don't make sense, since 'struct
bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
the number of timers depends on number of possible cpus and timers would not be
accessible from all cpus. lpm map support can be added in the future.
The timers in inner maps are supported.

The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
bpf_timer in a given map element.

Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
that map element indeed contains 'struct bpf_timer'.

Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
map element data is reused in preallocated htab and lru maps.

Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
map element. There could be one of each, but not more than one. Due to 'one
bpf_timer in one element' restriction do not support timers in global data,
since global data is a map of single element, but from bpf program side it's
seen as many global variables and restriction of single global timer would be
odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
timers, since user space could have corrupted mmap element and crashed the
kernel. The maps with timers cannot be readonly. Due to these restrictions
search for bpf_timer in datasec BTF in case it was placed in the global data to
report clear error.

The previous patch allowed 'struct bpf_timer' as a first field in a map
element only. Relax this restriction.

Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
the timer when lru map deletes an element as a part of it eviction algorithm.

Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
The timer operation are done through helpers only.
This is similar to 'struct bpf_spin_lock'.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
b00628b1c7 bpf: Introduce bpf timers.
Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
in hash/array/lru maps as a regular field and helpers to operate on it:

// Initialize the timer.
// First 4 bits of 'flags' specify clockid.
// Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);

// Configure the timer to call 'callback_fn' static function.
long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);

// Arm the timer to expire 'nsec' nanoseconds from the current time.
long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);

// Cancel the timer and wait for callback_fn to finish if it was running.
long bpf_timer_cancel(struct bpf_timer *timer);

Here is how BPF program might look like:
struct map_elem {
    int counter;
    struct bpf_timer timer;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000);
    __type(key, int);
    __type(value, struct map_elem);
} hmap SEC(".maps");

static int timer_cb(void *map, int *key, struct map_elem *val);
/* val points to particular map element that contains bpf_timer. */

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(test1, int a)
{
    struct map_elem *val;
    int key = 0;

    val = bpf_map_lookup_elem(&hmap, &key);
    if (val) {
        bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
        bpf_timer_set_callback(&val->timer, timer_cb);
        bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */, 0);
    }
}

This patch adds helper implementations that rely on hrtimers
to call bpf functions as timers expire.
The following patches add necessary safety checks.

Only programs with CAP_BPF are allowed to use bpf_timer.

The amount of timers used by the program is constrained by
the memcg recorded at map creation time.

The bpf_timer_init() helper needs explicit 'map' argument because inner maps
are dynamic and not known at load time. While the bpf_timer_set_callback() is
receiving hidden 'aux->prog' argument supplied by the verifier.

The prog pointer is needed to do refcnting of bpf program to make sure that
program doesn't get freed while the timer is armed. This approach relies on
"user refcnt" scheme used in prog_array that stores bpf programs for
bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
paired with bpf_timer_cancel() that will drop the prog refcnt. The
ops->map_release_uref is responsible for cancelling the timers and dropping
prog refcnt when user space reference to a map reaches zero.
This uref approach is done to make sure that Ctrl-C of user space process will
not leave timers running forever unless the user space explicitly pinned a map
that contained timers in bpffs.

bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
have user references (is not held by open file descriptor from user space and
not pinned in bpffs).

The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
and free the timer if given map element had it allocated.
"bpftool map update" command can be used to cancel timers.

The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
'__u64 :64' has 1 byte alignment of 8 byte padding.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
c1b3fed319 bpf: Factor out bpf_spin_lock into helpers.
Move ____bpf_spin_lock/unlock into helpers to make it more clear
that quadruple underscore bpf_spin_lock/unlock are irqsave/restore variants.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Alexei Starovoitov
d809e134be bpf: Prepare bpf_prog_put() to be called from irq context.
Currently bpf_prog_put() is called from the task context only.
With addition of bpf timers the timer related helpers will start calling
bpf_prog_put() from irq-saved region and in rare cases might drop
the refcnt to zero.
To address this case, first, convert bpf_prog_free_id() to be irq-save
(this is similar to bpf_map_free_id), and, second, defer non irq
appropriate calls into work queue.
For example:
bpf_audit_prog() is calling kmalloc and wake_up_interruptible,
bpf_prog_kallsyms_del_all()->bpf_ksym_del()->spin_unlock_bh().
They are not safe with irqs disabled.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com
2021-07-15 22:31:10 +02:00
Tobias Klauser
de587d564f selftests/bpf: Remove unused variable in tc_tunnel prog
The variable buf is unused since commit 005edd1656 ("selftests/bpf:
convert bpf tunnel test to BPF_ADJ_ROOM_MAC"). Remove it to fix the
following warning:

    test_tc_tunnel.c:531:7: warning: unused variable 'buf' [-Wunused-variable]

Fixes: 005edd1656 ("selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210713102719.8890-1-tklauser@distanz.ch
2021-07-15 21:54:43 +02:00
Rocco Yue
87117baf4f ipv6: remove unnecessary local variable
The local variable "struct net *net" in the two functions of
inet6_rtm_getaddr() and inet6_dump_addr() are actually useless,
so remove them.

Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 10:26:03 -07:00
Richard Laing
5c2c853159 bus: mhi: pci-generic: configurable network interface MRU
The MRU value used by the MHI MBIM network interface affects
the throughput performance of the interface. Different modem
models use different default MRU sizes based on their bandwidth
capabilities. Large values generally result in higher throughput
for larger packet sizes.

In addition if the MRU used by the MHI device is larger than that
specified in the MHI net device the data is fragmented and needs
to be re-assembled which generates a (single) warning message about
the fragmented packets. Setting the MRU on both ends avoids the
extra processing to re-assemble the packets.

This patch allows the documented MRU for a modem to be automatically
set as the MHI net device MRU avoiding fragmentation and improving
throughput performance.

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 10:14:44 -07:00
He Fengqing
75f0fc7b48 bpf: Fix potential memleak and UAF in the verifier.
In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to
insert new instructions, then use adjust_insn_aux_data() to adjust
insn_aux_data. If the old env->prog have no enough room for new inserted
instructions, we use bpf_prog_realloc to construct new_prog and free the
old env->prog.

There have two errors here. First, if adjust_insn_aux_data() return
ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
return ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
been freed in bpf_prog_realloc, but we will use it in bpf_check().

So in this patch, we make the adjust_insn_aux_data() never fails. In
bpf_patch_insn_data(), we first pre-malloc memory for the new
insn_aux_data, then call bpf_patch_insn_single() to insert new
instructions, at last call adjust_insn_aux_data() to adjust
insn_aux_data.

Fixes: 8041902dae ("bpf: adjust insn_aux_data when patching insns")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
2021-07-14 18:31:24 -07:00
Kuniyuki Iwashima
f170acda7f bpf: Fix a typo of reuseport map in bpf.h.
Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo
in bpf.h.

Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210714124317.67526-1-kuniyu@amazon.co.jp
2021-07-14 18:12:05 -07:00
Alexei Starovoitov
cf2c6f0863 bpf: Sync tools/include/uapi/linux/bpf.h
Commit 47316f4a30 missed updating tools/.../bpf.h.
Sync it.

Fixes: 47316f4a30 ("bpf: Support input xdp_md context in BPF_PROG_TEST_RUN")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-07-14 17:36:20 -07:00
Linus Torvalds
8096acd744 Networking fixes for 5.14-rc2, including fixes from bpf and netfilter.
Current release - regressions:
 
  - sock: fix parameter order in sock_setsockopt()
 
 Current release - new code bugs:
 
  - netfilter: nft_last:
      - fix incorrect arithmetic when restoring last used
      - honor NFTA_LAST_SET on restoration
 
 Previous releases - regressions:
 
  - udp: properly flush normal packet at GRO time
 
  - sfc: ensure correct number of XDP queues; don't allow enabling the
         feature if there isn't sufficient resources to Tx from any CPU
 
  - dsa: sja1105: fix address learning getting disabled on the CPU port
 
  - mptcp: addresses a rmem accounting issue that could keep packets
         in subflow receive buffers longer than necessary, delaying
 	MPTCP-level ACKs
 
  - ip_tunnel: fix mtu calculation for ETHER tunnel devices
 
  - do not reuse skbs allocated from skbuff_fclone_cache in the napi
    skb cache, we'd try to return them to the wrong slab cache
 
  - tcp: consistently disable header prediction for mptcp
 
 Previous releases - always broken:
 
  - bpf: fix subprog poke descriptor tracking use-after-free
 
  - ipv6:
       - allocate enough headroom in ip6_finish_output2() in case
         iptables TEE is used
       - tcp: drop silly ICMPv6 packet too big messages to avoid
         expensive and pointless lookups (which may serve as a DDOS
 	vector)
       - make sure fwmark is copied in SYNACK packets
       - fix 'disable_policy' for forwarded packets (align with IPv4)
 
  - netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state
 
  - netfilter: conntrack: do not mark RST in the reply direction coming
       after SYN packet for an out-of-sync entry
 
  - mptcp: cleanly handle error conditions with MP_JOIN and syncookies
 
  - mptcp: fix double free when rejecting a join due to port mismatch
 
  - validate lwtstate->data before returning from skb_tunnel_info()
 
  - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
 
  - mt76: mt7921: continue to probe driver when fw already downloaded
 
  - bonding: fix multiple issues with offloading IPsec to (thru?) bond
 
  - stmmac: ptp: fix issues around Qbv support and setting time back
 
  - bcmgenet: always clear wake-up based on energy detection
 
 Misc:
 
  - sctp: move 198 addresses from unusable to private scope
 
  - ptp: support virtual clocks and timestamping
 
  - openvswitch: optimize operation for key comparison
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDu3mMACgkQMUZtbf5S
 Irsjxg//UwcPJMYFmXV+fGkEsWYe1Kf29FcUDEeANFtbltfAcIfZ0GoTbSDRnrVb
 HcYAKcm4XRx5bWWdQrQsQq/yiLbnS/rSLc7VRB+uRHWRKl3eYcaUB2rnCXsxrjGw
 wQJgOmztDCJS4BIky24iQpF/8lg7p/Gj2Ih532gh93XiYo612FrEJKkYb2/OQfYX
 GkbnZ0kL2Y1SV+bhy6aT5azvhHKM4/3eA4fHeJ2p8e2gOZ5ni0vpX0xEzdzKOCd0
 vwR/Wu3h/+2QuFYVcSsVguuM++JXACG8MAS/Tof78dtNM4a3kQxzqeh5Bv6IkfTu
 rokENLq4pjNRy+nBAOeQZj8Jd0K0kkf/PN9WMdGQtplMoFhjjV25R6PeRrV9wwPo
 peozIz2MuQo7Kfof1D+44h2foyLfdC28/Z0CvRbDpr5EHOfYynvBbrnhzIGdQp6V
 xgftKTOdgz2Djgg8HiblZund1FA44OYerddVAASrIsnSFnIz1VLVQIsfV+GLBwwc
 FawrIZ6WfIjzRSrDGOvDsbAQI47T/1jbaPJeK6XgjWkQmjEd6UtRWRZLYCxemQEw
 4HP3sWC96BOehuD8ylipVE1oFqrxCiOB/fZxezXqjo8dSX3NLdak4cCHTHoW5SuZ
 eEAxQRaBliKd+P7hoy9cZ57CAu3zUa8kijfM5QRlCAHF+zSxaPs=
 =QFnb
 -----END PGP SIGNATURE-----

Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski.
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - sock: fix parameter order in sock_setsockopt()

  Current release - new code bugs:

   - netfilter: nft_last:
       - fix incorrect arithmetic when restoring last used
       - honor NFTA_LAST_SET on restoration

  Previous releases - regressions:

   - udp: properly flush normal packet at GRO time

   - sfc: ensure correct number of XDP queues; don't allow enabling the
     feature if there isn't sufficient resources to Tx from any CPU

   - dsa: sja1105: fix address learning getting disabled on the CPU port

   - mptcp: addresses a rmem accounting issue that could keep packets in
     subflow receive buffers longer than necessary, delaying MPTCP-level
     ACKs

   - ip_tunnel: fix mtu calculation for ETHER tunnel devices

   - do not reuse skbs allocated from skbuff_fclone_cache in the napi
     skb cache, we'd try to return them to the wrong slab cache

   - tcp: consistently disable header prediction for mptcp

  Previous releases - always broken:

   - bpf: fix subprog poke descriptor tracking use-after-free

   - ipv6:
       - allocate enough headroom in ip6_finish_output2() in case
         iptables TEE is used
       - tcp: drop silly ICMPv6 packet too big messages to avoid
         expensive and pointless lookups (which may serve as a DDOS
         vector)
       - make sure fwmark is copied in SYNACK packets
       - fix 'disable_policy' for forwarded packets (align with IPv4)

   - netfilter: conntrack:
       - do not renew entry stuck in tcp SYN_SENT state
       - do not mark RST in the reply direction coming after SYN packet
         for an out-of-sync entry

   - mptcp: cleanly handle error conditions with MP_JOIN and syncookies

   - mptcp: fix double free when rejecting a join due to port mismatch

   - validate lwtstate->data before returning from skb_tunnel_info()

   - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path

   - mt76: mt7921: continue to probe driver when fw already downloaded

   - bonding: fix multiple issues with offloading IPsec to (thru?) bond

   - stmmac: ptp: fix issues around Qbv support and setting time back

   - bcmgenet: always clear wake-up based on energy detection

  Misc:

   - sctp: move 198 addresses from unusable to private scope

   - ptp: support virtual clocks and timestamping

   - openvswitch: optimize operation for key comparison"

* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
  net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
  sfc: add logs explaining XDP_TX/REDIRECT is not available
  sfc: ensure correct number of XDP queues
  sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
  net: fddi: fix UAF in fza_probe
  net: dsa: sja1105: fix address learning getting disabled on the CPU port
  net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
  net: Use nlmsg_unicast() instead of netlink_unicast()
  octeontx2-pf: Fix uninitialized boolean variable pps
  ipv6: allocate enough headroom in ip6_finish_output2()
  net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
  net: bridge: multicast: fix MRD advertisement router port marking race
  net: bridge: multicast: fix PIM hello router port marking race
  net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
  dsa: fix for_each_child.cocci warnings
  virtio_net: check virtqueue_add_sgs() return value
  mptcp: properly account bulk freed memory
  selftests: mptcp: fix case multiple subflows limited by server
  mptcp: avoid processing packet if a subflow reset
  mptcp: fix syncookie process if mptcp can not_accept new subflow
  ...
2021-07-14 09:24:32 -07:00
Christian Brauner
d1d488d813 fs: add vfs_parse_fs_param_source() helper
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.

Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14 09:19:06 -07:00
Christian Brauner
3b0462726e cgroup: verify that source is a string
The following sequence can be used to trigger a UAF:

    int fscontext_fd = fsopen("cgroup");
    int fd_null = open("/dev/null, O_RDONLY);
    int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
    close_range(3, ~0U, 0);

The cgroup v1 specific fs parser expects a string for the "source"
parameter.  However, it is perfectly legitimate to e.g.  specify a file
descriptor for the "source" parameter.  The fs parser doesn't know what
a filesystem allows there.  So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.

This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value.  Access to that union is
guarded by the param->type member.  Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.

Fix this by verifying that param->type is actually a string and report
an error if not.

In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.

Fixes: 8d2451f499 ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14 09:19:06 -07:00