Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback,
fix missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGWf08ACgkQMUZtbf5S
Irt+lxAAj8FAoLoSmQKUK3LttLLh0ZQQXu8Riey+wrP8Z9Yp8xWXIaVRF1c0vCE6
clbrF+mLfk6Wvv/RzOgwyBMHvK+djr/oVDNSmjlRvss4MLDfOQZhUV8V4XpvF4Up
hI7wyKfHtd7niosNqil6wklJFpLU8WyIAWrPSIPE6JlPkJmcm3GUGsliwEPwdLY1
yl7z4zsxigjA+hKxYqNQX6tixF3xnbDUbAnWshrSPL89melwz4GMao45qmcxJEVr
EipPhKifk0hT067jG08FMXcKBFKt6rKk7SVNo4mtq8Tl6HleJBj8fdaJAjSdFahB
+rYJ0sDZwGoDL5CxZ5mD3fM1cDgh4WFEM0z//0b/bZhoPDRKEpLr9LPuv+N6+/rA
8D98EHsvyNjlFgdyd8celMstiGtBn4YLEoLNYYh9Qibgm0XsCuv0yox7g0AOLMmQ
QiBmh2EnaXNPQ8PRZNMK3VH5ol2KoYWL6yrpJYV+wOWVLfezwlSsjkPSfW5pF9FG
hU0iQBp/YTCdCadR9YLj8qfDWDUAkCN7WpqIu9EA9FXJcYjJVaix0MA/tAVlzXyR
xlB7cU6O5NABcs/+04zPkKLwKbVYNMqgvKE+FVDVm+BKxo0UMxcmz/Np/ZYxfhkF
bwKplaiPb2H4D6t0sdxqaeYirPwt1BcleLilae6vHG1jO90H9Vw=
=tlqV
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, mac80211.
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing
progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback, fix
missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to
ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks"
* tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
ptp: ocp: Fix a couple NULL vs IS_ERR() checks
net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
ipv6: check return value of ipv6_skip_exthdr
e100: fix device suspend/resume
devlink: Don't throw an error if flash notification sent before devlink visible
page_pool: Revert "page_pool: disable dma mapping support..."
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port()
octeontx2-af: debugfs: don't corrupt user memory
NFC: add NCI_UNREG flag to eliminate the race
NFC: reorder the logic in nfc_{un,}register_device
NFC: reorganize the functions in nci_request
tipc: check for null after calling kmemdup
i40e: Fix display error code in dmesg
i40e: Fix creation of first queue by omitting it if is not power of two
i40e: Fix warning message and call stack during rmmod i40e driver
i40e: Fix ping is lost after configuring ADq on VF
i40e: Fix changing previously set num_queue_pairs for PFs
i40e: Fix NULL ptr dereference on VSI filter sync
i40e: Fix correct max_pkt_size on VF RX queue
...
The offset value is used in pointer math on skb->data.
Since ipv6_skip_exthdr may return -1 the pointer to uh and th
may not point to the actual udp and tcp headers and potentially
overwrite other stuff. This is why I think this should be checked.
EDIT: added {}'s, thanks Kees
Signed-off-by: Jordy Zomer <jordy@pwning.systems>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlxsw driver calls to various devlink flash routines even before
users can get any access to the devlink instance itself. For example,
mlxsw_core_fw_rev_validate() one of such functions.
__mlxsw_core_bus_device_register
-> mlxsw_core_fw_rev_validate
-> mlxsw_core_fw_flash
-> mlxfw_firmware_flash
-> mlxfw_status_notify
-> devlink_flash_update_status_notify
-> __devlink_flash_update_notify
-> WARN_ON(...)
It causes to the WARN_ON to trigger warning about devlink not registered.
Fixes: cf53021740 ("devlink: Notify users when objects are accessible")
Reported-by: Danielle Ratson <danieller@nvidia.com>
Tested-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d00e60ee54.
As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
in 32-bit systems, which breaks the bootup proceess when a
ethernet driver is using page pool with PP_FLAG_DMA_MAP flag.
As we were hoping we had no active consumers for such system
when we removed the dma mapping support, and LPAE seems like
a common feature for 32 bits system, so revert it.
1. https://www.spinics.net/lists/netdev/msg779890.html
Fixes: d00e60ee54 ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Tested-by: "kernelci.org bot" <bot@kernelci.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two sites that calls queue_work() after the
destroy_workqueue() and lead to possible UAF.
The first site is nci_send_cmd(), which can happen after the
nci_close_device as below
nfcmrvl_nci_unregister_dev | nfc_genl_dev_up
nci_close_device |
flush_workqueue |
del_timer_sync |
nci_unregister_device | nfc_get_device
destroy_workqueue | nfc_dev_up
nfc_unregister_device | nci_dev_up
device_del | nci_open_device
| __nci_request
| nci_send_cmd
| queue_work !!!
Another site is nci_cmd_timer, awaked by the nci_cmd_work from the
nci_send_cmd.
... | ...
nci_unregister_device | queue_work
destroy_workqueue |
nfc_unregister_device | ...
device_del | nci_cmd_work
| mod_timer
| ...
| nci_cmd_timer
| queue_work !!!
For the above two UAF, the root cause is that the nfc_dev_up can race
between the nci_unregister_device routine. Therefore, this patch
introduce NCI_UNREG flag to easily eliminate the possible race. In
addition, the mutex_lock in nci_close_device can act as a barrier.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211116152732.19238-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a potential UAF between the unregistration routine and the NFC
netlink operations.
The race that cause that UAF can be shown as below:
(FREE) | (USE)
nfcmrvl_nci_unregister_dev | nfc_genl_dev_up
nci_close_device |
nci_unregister_device | nfc_get_device
nfc_unregister_device | nfc_dev_up
rfkill_destory |
device_del | rfkill_blocked
... | ...
The root cause for this race is concluded below:
1. The rfkill_blocked (USE) in nfc_dev_up is supposed to be placed after
the device_is_registered check.
2. Since the netlink operations are possible just after the device_add
in nfc_register_device, the nfc_dev_up() can happen anywhere during the
rfkill creation process, which leads to data race.
This patch reorder these actions to permit
1. Once device_del is finished, the nfc_dev_up cannot dereference the
rfkill object.
2. The rfkill_register need to be placed after the device_add of nfc_dev
because the parent device need to be created first. So this patch keeps
the order but inject device_lock to prevent the data race.
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: be055b2f89 ("NFC: RFKILL support")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20211116152652.19217-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a possible data race as shown below:
thread-A in nci_request() | thread-B in nci_close_device()
| mutex_lock(&ndev->req_lock);
test_bit(NCI_UP, &ndev->flags); |
... | test_and_clear_bit(NCI_UP, &ndev->flags)
mutex_lock(&ndev->req_lock); |
|
This race will allow __nci_request() to be awaked while the device is
getting removed.
Similar to commit e2cb6b891a ("bluetooth: eliminate the potential race
condition when removing the HCI controller"). this patch alters the
function sequence in nci_request() to prevent the data races between the
nci_close_device().
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Link: https://lore.kernel.org/r/20211115145600.8320-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Without dropping dst, the packets sent from local mirred/redirected
to ingress will may still use the old dst. ip_rcv() will drop it as
the old dst is for output and its .input is dst_discard.
This patch is to fix by also dropping dst for those packets that are
mirred or redirected from egress to ingress in act_mirred.
Note that we don't drop it for the direction change from ingress to
egress, as on which there might be a user case attaching a metadata
dst by act_tunnel_key that would be used later.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* bad dont-reorder check
* throughput LED trigger for various new(ish) paths
* radiotap header generation
* locking assertions in mac80211 with monitor mode
* radio statistics
* don't try to access IV when not present
* call stop_ap for P2P_GO as well as we should
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmGT1iIACgkQB8qZga/f
l8QDhA/+JeKZeXQ+Ct8kh34xMN4VxCUeGlJNY1aH2mTlJFSwxqJ4XH7eCiSTezXB
ZEZ46wDLOpUF9epGl4NLzFkR5+Sanpn3LJOMLBDYQwkl3xcvNiQEZ0y4T8PayJRR
kSs9ina3zxvZvBjCYX1KKdFatZiE+X5qwYyaK8T4ZjNQ4vjQpeX/V3eFpFeMDjuX
fzIUN27yiESZG4heB2BkyeTR/jb4hG/gCbdGfNxa9MbF0OGJMJkuwH9MNHRQR2cx
NvEPoq2IcooZfbqWnVk9lxWzRpm8Buio+gth9Knd13AGDl4Mez4+n0KxAEhafjhE
ljUYekjh+55iSkt7eczW7wjvxmdGC+lfGN7GUop9mw4LU6NvpFCKXW5MqaH8XBy4
UIkcblXsMKkD8sFbMGzNIl7pT4JGdXkxlorxg7xI5s02nMFppnWJ7mrXQmcBkBG7
5JLkaXahiQzQYq/VGepCPMYYwKQqbHXts9TPMq/W3aIeR8QxA3uINZ5P9eGF9qn8
xTcvEQPYvIid1BIorO5/0V/VgE9VNDP1w8aKvn8bJjOmbWfuE68Wx+K6LlHy+E+O
/MhjiRwnCvvTfdi6vPUjTxis8Xh5e9xfYmcqyRVIQujLmhta31FEJ/q+R+A1A4Rs
RsesqpLgEOo3rkSciAAfItmRf3SVgJlIuZcfe16XFw0s45oYEck=
=aD/1
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Couple of fixes:
* bad dont-reorder check
* throughput LED trigger for various new(ish) paths
* radiotap header generation
* locking assertions in mac80211 with monitor mode
* radio statistics
* don't try to access IV when not present
* call stop_ap for P2P_GO as well as we should
* tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
mac80211: fix throughput LED trigger
mac80211: fix monitor_sdata RCU/locking assertions
mac80211: drop check for DONT_REORDER in __ieee80211_select_queue
mac80211: fix radiotap header generation
mac80211: do not access the IV when it was stripped
nl80211: fix radio statistics in survey dump
cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
====================
Link: https://lore.kernel.org/r/20211116160845.157214-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where verifier went overly conservative rejecting
previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It turns out the skb's in sock receive queue could have bad checksums, as
both ->poll() and ->recvmsg() validate checksums. We have to do the same
for ->read_sock() path too before they are redirected in sockmap.
Fixes: d7f571188e ("udp: Implement ->read_sock() for sockmap")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com
The link_id is supposed to be unique, but smcr_next_link_id() doesn't
skip the used link_id as expected. So the patch fixes this.
Fixes: 026c381fb4 ("net/smc: introduce link_idx for link group array")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_clone_lock() needs to call sock_inuse_add(1) before entering the
sk_free_unlock_clone() error path, for __sk_free() from sk_free() from
sk_free_unlock_clone() calls sock_inuse_add(-1).
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 648845ab7e ("sock: Move the socket inuse to namespace.")
Signed-off-by: David S. Miller <davem@davemloft.net>
The MSG_CRYPTO msgs are always encrypted and sent to other nodes
for keys' deployment. But when receiving in peers, if those nodes
do not validate it and make sure it's encrypted, one could craft
a malicious MSG_CRYPTO msg to deploy its key with no need to know
other nodes' keys.
This patch is to do that by checking TIPC_SKB_CB(skb)->decrypted
and discard it if this packet never got decrypted.
Note that this is also a supplementary fix to CVE-2021-43267 that
can be triggered by an unencrypted malicious MSG_CRYPTO msg.
Fixes: 1ef6f7c939 ("tipc: add automatic session key exchange")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When kmemdup called failed and register_net_sysctl return NULL, should
return ENOMEM instead of ENOBUFS
Signed-off-by: liuguoqiang <liuguoqiang@uniontech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SMC fallback is incomplete currently. There may be some
wait queue entries remaining in smc socket->wq, which should
be removed to clcsocket->wq during the fallback.
For example, in nginx/wrk benchmark, this issue causes an
all-zeros test result:
server: nginx -g 'daemon off;'
client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
Running 5s test @ http://11.200.15.93/index.html
1 threads and 1 connections
Thread Stats Avg Stdev Max ± Stdev
Latency 0.00us 0.00us 0.00us -nan%
Req/Sec 0.00 0.00 0.00 -nan%
0 requests in 5.00s, 0.00B read
Requests/sec: 0.00
Transfer/sec: 0.00B
The reason for this all-zeros result is that when wrk used SMC
to replace TCP, it added an eppoll_entry into smc socket->wq
and expected to be notified if epoll events like EPOLL_IN/
EPOLL_OUT occurred on the smc socket.
However, once a fallback occurred, wrk switches to use clcsocket.
Now it is clcsocket->wq instead of smc socket->wq which will
be woken up. The eppoll_entry remaining in smc socket->wq does
not work anymore and wrk stops the test.
This patch fixes this issue by removing remaining wait queue
entries from smc socket->wq to clcsocket->wq during the fallback.
Link: https://www.spinics.net/lists/netdev/msg779769.html
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some functions, like tipc_crypto_start use inconsisten GFP flags
when allocating memory. The mentioned function use GFP_ATOMIC to
to alloc a crypto instance, and then calls alloc_ordered_workqueue()
which allocates memory with GFP_KERNEL. tipc_aead_init() function
even uses GFP_KERNEL and GFP_ATOMIC interchangeably.
No doc comment specifies what context a function is designed to
work in, but the flags should at least be consistent within a function.
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: tipc-discussion@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The codepaths for rx with decap offload and tx with itxq were not updating
the counters for the throughput led trigger.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211113063415.55147-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since commit a05829a722 ("cfg80211: avoid holding the RTNL when
calling the driver") we've not only been protecting the pointer
to monitor_sdata with the RTNL, but also with the wiphy->mtx. This
is relevant in a number of lockdep assertions, e.g. the one we hit
in ieee80211_set_monitor_channel(). However, we're now protecting
all the assignments/dereferences, even the one in interface iter,
with the wiphy->mtx, so switch over the lockdep assertions to that
lock.
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20211112135143.cb8e8ceffef3.Iaa210f16f6904c8a7a24954fb3396da0ef86ec08@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When __ieee80211_select_queue is called, skb->cb has not been cleared yet,
which means that info->control.flags can contain garbage.
In some cases this leads to IEEE80211_TX_CTRL_DONT_REORDER being set, causing
packets marked for other queues to randomly end up in BE instead.
This flag only needs to be checked in ieee80211_select_queue_80211, since
the radiotap parser is the only piece of code that sets it
Fixes: 66d06c8473 ("mac80211: adhere to Tx control flag that prevents frame reordering")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211110212201.35452-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In commit 8c89f7b3d3 ("mac80211: Use flex-array for radiotap header
bitmap") we accidentally pointed the position to the wrong place, so
we overwrite a present bitmap, and thus cause all kinds of trouble.
To see the issue, note that the previous code read:
pos = (void *)(it_present + 1);
The requirement now is that we need to calculate pos via it_optional,
to not trigger the compiler hardening checks, as:
pos = (void *)&rthdr->it_optional[...];
Rewriting the original expression, we get (obviously, since that just
adds "+ x - x" terms):
pos = (void *)(it_present + 1 + rthdr->it_optional - rthdr->it_optional)
and moving the "+ rthdr->it_optional" outside to be used as an array:
pos = (void *)&rthdr->it_optional[it_present + 1 - rthdr->it_optional];
The original is off by one, fix it.
Cc: stable@vger.kernel.org
Fixes: 8c89f7b3d3 ("mac80211: Use flex-array for radiotap header bitmap")
Reported-by: Sid Hayn <sidhayn@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Sid Hayn <sidhayn@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211109100203.c61007433ed6.I1dade57aba7de9c4f48d68249adbae62636fd98c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_get_keyid() will return false value if IV has been stripped,
such as return 0 for IP/ARP frames due to LLC header, and return -EINVAL
for disassociation frames due to its length... etc. Don't try to access
it if it's not present.
Signed-off-by: Xing Song <xing.song@mediatek.com>
Link: https://lore.kernel.org/r/20211101024657.143026-1-xing.song@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Even if userspace specifies the NL80211_ATTR_SURVEY_RADIO_STATS
attribute, we cannot get the statistics because we're not really
parsing the incoming attributes properly any more.
Fix this by passing the attrbuf to nl80211_prepare_wdev_dump()
and filling it there, if given, and using a local version only
if no output is desired.
Since I'm touching it anyway, make nl80211_prepare_wdev_dump()
static.
Fixes: 50508d941c ("cfg80211: use parallel_ops for genl")
Reported-by: Jan Fuchs <jf@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20211029092539.2851b4799386.If9736d4575ee79420cbec1bd930181e1d53c7317@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the userspace tools switch from NL80211_IFTYPE_P2P_GO to
NL80211_IFTYPE_ADHOC via send_msg(NL80211_CMD_SET_INTERFACE), it
does not call the cleanup cfg80211_stop_ap(), this leads to the
initialization of in-use data. For example, this path re-init the
sdata->assigned_chanctx_list while it is still an element of
assigned_vifs list, and makes that linked list corrupt.
Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com>
Reported-by: syzbot+bbf402b783eeb6d908db@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20211027173722.777287-1-phind.uet@gmail.com
Cc: stable@vger.kernel.org
Fixes: ac800140c2 ("cfg80211: .stop_ap when interface is going down")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch reverts two prior patches, e7310c9402
("security: implement sctp_assoc_established hook in selinux") and
7c2ef0240e ("security: add sctp_assoc_established hook"), which
create the security_sctp_assoc_established() LSM hook and provide a
SELinux implementation. Unfortunately these two patches were merged
without proper review (the Reviewed-by and Tested-by tags from
Richard Haines were for previous revisions of these patches that
were significantly different) and there are outstanding objections
from the SELinux maintainers regarding these patches.
Work is currently ongoing to correct the problems identified in the
reverted patches, as well as others that have come up during review,
but it is unclear at this point in time when that work will be ready
for inclusion in the mainline kernel. In the interest of not keeping
objectionable code in the kernel for multiple weeks, and potentially
a kernel release, we are reverting the two problematic patches.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in 5.7 are now enabled by default. This should greatly speed up things
like rm, tar and rsync. To opt out, wsync mount option can be used.
Other than that we have a pile of bug fixes all across the filesystem
from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
Luis.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmGORY8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5XmB/0SRQTW+BRAhYvSD/Ib2/c6uVZGnmJU
MUCO/uuD4fZvfxyMVb3qnAzHIGh3YlFWa/dgSZNStvLmY0L9P0MIvyaYolVuC4Tu
7llX1I+yckTns9VmiULBNy9D812eRY282nMbRzikMGPO1eb6Yqo3r50AdTvkam/R
Qs5pfwRqLerbP7VUv4vSsrBflwVyHOrCFZaUUVTu4f2kKz/FzZd/FCw5VV51KYzq
ygN+8eFotf5zluMX0tl0FXVgJ13N9vbg+YNxyOzHmOAV0AhSmmtV8vJ+j+m7+sOj
b7wNl5AuI5Ogeg8PHSGsXOHBVn6IbhdGDcougEVL+1gHkxxOQ5hfBHWQ
=22Vd
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"One notable change here is that async creates and unlinks introduced
in 5.7 are now enabled by default. This should greatly speed up things
like rm, tar and rsync. To opt out, wsync mount option can be used.
Other than that we have a pile of bug fixes all across the filesystem
from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
Luis"
* tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client:
ceph: add a new metric to keep track of remote object copies
libceph, ceph: move ceph_osdc_copy_from() into cephfs code
ceph: clean-up metrics data structures to reduce code duplication
ceph: split 'metric' debugfs file into several files
ceph: return the real size read when it hits EOF
ceph: properly handle statfs on multifs setups
ceph: shut down mount on bad mdsmap or fsmap decode
ceph: fix mdsmap decode when there are MDS's beyond max_mds
ceph: ignore the truncate when size won't change with Fx caps issued
ceph: don't rely on error_string to validate blocklisted session.
ceph: just use ci->i_version for fscache aux info
ceph: shut down access to inode when async create fails
ceph: refactor remove_session_caps_cb
ceph: fix auth cap handling logic in remove_session_caps_cb
ceph: drop private list from remove_session_caps_cb
ceph: don't use -ESTALE as special return code in try_get_cap_refs
ceph: print inode numbers instead of pointer values
ceph: enable async dirops by default
libceph: drop ->monmap and err initialization
ceph: convert to noop_direct_IO
TCP Receive zerocopy iterates through the SKB queue via
tcp_recv_skb(), acquiring a pointer to an SKB and an offset within
that SKB to read from. From there, it iterates the SKB frags array to
determine which offset to start remapping pages from.
However, this is built on the assumption that the offset read so far
within the SKB is smaller than the SKB length. If this assumption is
violated, we can attempt to read an invalid frags array element, which
would cause a fault.
tcp_recv_skb() can cause such an SKB to be returned when the TCP FIN
flag is set. Therefore, we must guard against this occurrence inside
skb_advance_frag().
One way that we can reproduce this error follows:
1) In a receiver program, call getsockopt(TCP_ZEROCOPY_RECEIVE) with:
char some_array[32 * 1024];
struct tcp_zerocopy_receive zc = {
.copybuf_address = (__u64) &some_array[0],
.copybuf_len = 32 * 1024,
};
2) In a sender program, after a TCP handshake, send the following
sequence of packets:
i) Seq = [X, X+4000]
ii) Seq = [X+4000, X+5000]
iii) Seq = [X+4000, X+5000], Flags = FIN | URG, urgptr=1000
(This can happen without URG, if we have a signal pending, but URG is
a convenient way to reproduce the behaviour).
In this case, the following event sequence will occur on the receiver:
tcp_zerocopy_receive():
-> receive_fallback_to_copy() // copybuf_len >= inq
-> tcp_recvmsg_locked() // reads 5000 bytes, then breaks due to URG
-> tcp_recv_skb() // yields skb with skb->len == offset
-> tcp_zerocopy_set_hint_for_skb()
-> skb_advance_to_frag() // will returns a frags ptr. >= nr_frags
-> find_next_mappable_frag() // will dereference this bad frags ptr.
With this patch, skb_advance_to_frag() will no longer return an
invalid frags pointer, and will return NULL instead, fixing the issue.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Link: https://lore.kernel.org/r/20211111235215.2605384-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch reverts two prior patches, e7310c9402
("security: implement sctp_assoc_established hook in selinux") and
7c2ef0240e ("security: add sctp_assoc_established hook"), which
create the security_sctp_assoc_established() LSM hook and provide a
SELinux implementation. Unfortunately these two patches were merged
without proper review (the Reviewed-by and Tested-by tags from
Richard Haines were for previous revisions of these patches that
were significantly different) and there are outstanding objections
from the SELinux maintainers regarding these patches.
Work is currently ongoing to correct the problems identified in the
reverted patches, as well as others that have come up during review,
but it is unclear at this point in time when that work will be ready
for inclusion in the mainline kernel. In the interest of not keeping
objectionable code in the kernel for multiple weeks, and potentially
a kernel release, we are reverting the two problematic patches.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Fix a crash in the buffer pool allocator when a buffer is double
freed. It is possible to trigger this behavior not only from a faulty
driver, but also from user space like this: Create a zero-copy AF_XDP
socket. Load an XDP program that will issue XDP_DROP for all
packets. Put the same umem buffer into the fill ring multiple times,
then bind the socket and send some traffic. This will crash the kernel
as the XDP_DROP action triggers one call to xsk_buff_free()/xp_free()
for every packet dropped. Each call will add the corresponding buffer
entry to the free_list and increase the free_list_cnt. Some entries
will have been added multiple times due to the same buffer being
freed. The buffer allocation code will then traverse this broken list
and since the same buffer is in the list multiple times, it will try
to delete the same buffer twice from the list leading to a crash.
The fix for this is just to test that the buffer has not been added
before in xp_free(). If it has been, just return from the function and
do not put it in the free_list a second time.
Note that this bug was not present in the code before the commit
referenced in the Fixes tag. That code used one list entry per
allocated buffer, so multiple frees did not have any side effects. But
the commit below optimized the usage of the pool and only uses a
single entry per buffer in the umem, meaning that multiple
allocations/frees of the same buffer will also only use one entry,
thus leading to the problem.
Fixes: 47e4075df3 ("xsk: Batched buffer allocation for the pool")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20211111075707.21922-1-magnus.karlsson@gmail.com
and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different
from the tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit operations
to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
to access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
=Ttgq
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different from the
tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the
workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit
operations to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs"
* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
selftests/net: udpgso_bench_rx: fix port argument
net: wwan: iosm: fix compilation warning
cxgb4: fix eeprom len when diagnostics not implemented
net: fix premature exit from NAPI state polling in napi_disable()
net/smc: fix sk_refcnt underflow on linkdown and fallback
net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
gve: fix unmatched u64_stats_update_end()
net: ethernet: lantiq_etop: Fix compilation error
selftests: forwarding: Fix packet matching in mirroring selftests
vsock: prevent unnecessary refcnt inc for nonblocking connect
net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
net: stmmac: allow a tc-taprio base-time of zero
selftests: net: test_vxlan_under_vrf: fix HV connectivity test
net: hns3: allow configure ETS bandwidth of all TCs
net: hns3: remove check VF uc mac exist when set by PF
net: hns3: fix some mac statistics is always 0 in device version V2
net: hns3: fix kernel crash when unload VF while it is being reset
net: hns3: sync rx ring head in echo common pull
net: hns3: fix pfc packet number incorrect after querying pfc parameters
...
Commit 719c571970 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.
Prior to this commit, napi_disable() was polling n->state until
none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) is set and then
always flip them. Now there's a possibility to get away with the
NAPIF_STATE_SCHE unset as 'continue' drops us to the cmpxchg()
call with an uninitialized variable, rather than straight to
another round of the state check.
Error path looks like:
napi_disable():
unsigned long val, new; /* new is uninitialized */
do {
val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
NAPIF_STATE_SCHED is set */
if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
usleep_range(20, 200);
continue; /* go straight to the condition check */
}
new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
writes garbage */
napi_enable():
do {
val = READ_ONCE(n->state);
BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>
while the typical BUG splat is like:
[ 172.652461] ------------[ cut here ]------------
[ 172.652462] kernel BUG at net/core/dev.c:6937!
[ 172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G I 5.15.0 #42
[ 172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ 172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[ 172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[ 172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[ 172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[ 172.782403] Call Trace:
[ 172.784857] <TASK>
[ 172.786963] ice_up_complete+0x6f/0x210 [ice]
[ 172.791349] ice_xdp+0x136/0x320 [ice]
[ 172.795108] ? ice_change_mtu+0x180/0x180 [ice]
[ 172.799648] dev_xdp_install+0x61/0xe0
[ 172.803401] dev_xdp_attach+0x1e0/0x550
[ 172.807240] dev_change_xdp_fd+0x1e6/0x220
[ 172.811338] do_setlink+0xee8/0x1010
[ 172.814917] rtnl_setlink+0xe5/0x170
[ 172.818499] ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[ 172.823732] ? security_capable+0x36/0x50
< snip >
Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.
From v1 [0]:
- just use a for-loop to simplify both the fix and the existing
code (Eric).
[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
Fixes: 719c571970 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
6aiufVWKNNRQef9y
=yKBl
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
xdr-related cleanup from Chuck"
* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd4: remove obselete comment
nfsd: document server-to-server-copy parameters
NFSD:fix boolreturn.cocci warning
nfsd: update create verifier comment
SUNRPC: Change return value type of .pc_encode
SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
NFSD: Save location of NFSv4 COMPOUND status
SUNRPC: Change return value type of .pc_decode
SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
SUNRPC: De-duplicate .pc_release() call sites
SUNRPC: Simplify the SVC dispatch code path
SUNRPC: Capture value of xdr_buf::page_base
SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
svcrdma: Split svcrmda_wc_{read,write} tracepoints
svcrdma: Split the svcrdma_wc_send() tracepoint
svcrdma: Split the svcrdma_wc_receive() tracepoint
NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
NFSD: Initialize pointer ni with NULL and not plain integer 0
NFSD: simplify struct nfsfh
...
Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other tracing
improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
=n8lj
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
- Optimisations for READDIR and the 'ls -l' style workload
- Further replacements of dprintk() with tracepoints and other
tracing improvements
- Ensure we re-probe NFSv4 server capabilities when the user does a
"mount -o remount"
Bugfixes:
- Fix an Oops in pnfs_mark_request_commit()
- Fix up deadlocks in the commit code
- Fix regressions in NFSv2/v3 attribute revalidation due to the
change_attr_type optimisations
- Fix some dentry verifier races
- Fix some missing dentry verifier settings
- Fix a performance regression in nfs_set_open_stateid_locked()
- SUNRPC was sending multiple SYN calls when re-establishing a TCP
connection.
- Fix multiple NFSv4 issues due to missing sanity checking of server
return values
- Fix a potential Oops when FREE_STATEID races with an unmount
Cleanups:
- Clean up the labelled NFS code
- Remove unused header <linux/pnfs_osd_xdr.h>"
* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
NFS: Remove the nfs4_label argument from decode_getattr_*() functions
NFS: Remove the nfs4_label argument from nfs_setsecurity
NFS: Remove the nfs4_label argument from nfs_fhget()
NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
NFS: Remove the nfs4_label argument from nfs_instantiate()
NFS: Remove the nfs4_label from the nfs_setattrres
NFS: Remove the nfs4_label from the nfs4_getattr_res
NFS: Remove the f_label from the nfs4_opendata and nfs_openres
NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
NFS: Remove the label from the nfs4_lookup_res struct
NFS: Remove the nfs4_label from the nfs4_link_res struct
NFS: Remove the nfs4_label from the nfs4_create_res struct
NFS: Remove the nfs4_label from the nfs_entry struct
NFS: Create a new nfs_alloc_fattr_with_label() function
NFS: Always initialise fattr->label in nfs_fattr_alloc()
NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4: Remove unnecessary 'minor version' check
NFSv4: Fix potential Oops in decode_op_map()
...
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
We got the following WARNING when running ab/nginx
test with RDMA link flapping (up-down-up).
The reason is when smc_sock fallback and at linkdown
happens simultaneously, we may got the following situation:
__smc_lgr_terminate()
--> smc_conn_kill()
--> smc_close_active_abort()
smc_sock->sk_state = SMC_CLOSED
sock_put(smc_sock)
smc_sock was set to SMC_CLOSED and sock_put() been called
when terminate the link group. But later application call
close() on the socket, then we got:
__smc_release():
if (smc_sock->fallback)
smc_sock->sk_state = SMC_CLOSED
sock_put(smc_sock)
Again we set the smc_sock to CLOSED through it's already
in CLOSED state, and double put the refcnt, so the following
warning happens:
refcount_t: underflow; use-after-free.
WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0
Modules linked in:
CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
RIP: 0010:refcount_warn_saturate+0x8d/0xf0
Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48
RSP: 0018:ffffc90000527e50 EFLAGS: 00010286
RAX: 0000000000000026 RBX: ffff8881300df2c0 RCX: 0000000000000027
RDX: 0000000000000000 RSI: ffff88813bd58040 RDI: ffff88813bd58048
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000001
R10: ffff8881300df2c0 R11: ffffc90000527c78 R12: ffff8881300df340
R13: ffff8881300df930 R14: ffff88810b3dad80 R15: ffff8881300df4f8
FS: 00007f739de8fb80(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000a01b008 CR3: 0000000111b64003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
smc_release+0x353/0x3f0
__sock_release+0x3d/0xb0
sock_close+0x11/0x20
__fput+0x93/0x230
task_work_run+0x65/0xa0
exit_to_user_mode_prepare+0xf9/0x100
syscall_exit_to_user_mode+0x27/0x190
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This patch adds check in __smc_release() to make
sure we won't do an extra sock_put() and set the
socket to CLOSED when its already in CLOSED state.
Fixes: 51f1de79ad (net/smc: replace sock_put worker by socket refcounting)
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently vosck_connect() increments sock refcount for nonblocking
socket each time it's called, which can lead to memory leak if
it's called multiple times because connect timeout function decrements
sock refcount only once.
Fixes it by making vsock_connect() return -EALREADY immediately when
sock state is already SS_CONNECTING.
Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2021-11-09
We've added 7 non-merge commits during the last 3 day(s) which contain
a total of 10 files changed, 174 insertions(+), 48 deletions(-).
The main changes are:
1) Various sockmap fixes, from John and Jussi.
2) Fix out-of-bound issue with bpf_pseudo_func, from Martin.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg
bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
bpf, sockmap: Remove unhash handler for BPF sockmap usage
bpf, sockmap: Use stricter sk state checks in sk_lookup_assign
bpf: selftest: Trigger a DCE on the whole subprog
bpf: Stop caching subprog index in the bpf_pseudo_func insn
====================
Link: https://lore.kernel.org/r/20211109215702.38350-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- fix syzcaller uninitialized value usage after missing error check
- add module autoloading based on transport name
- convert cached reads to use netfs helpers
- adjust readahead based on transport msize
- and many, many checkpatch.pl warning fixes...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmGJK4IACgkQq06b7GqY
5nAKOQ/+NO41If4p93P65j78pF7EIuGolIRnG3OgLV9M92v3sGQVChMoX68GPzrf
mOWbhAL7lHdvzJBdrYjxa5rGOq9zzjPHJ/toI/93pLOqFHZlw15GV/nq0wLZZoc+
fvWCc1co6aL3suAeVUXM3bwUAZHRkCssDSKajbMrLFikIkifSoroM4tzAvktEXXJ
4nM5xtTi6p3r8HLffTHNTqoEmA08VQm41hqRZ5XYLhieFUMH9p8b5cbWhycROn2D
aNLgXcnXohvhJpbrkJ16fxdWGCUGaxAxK6gso8wMwBns+/7jq+uBeVypdIji00LL
KDzQGs8VVThXyMwtB40rYzCTfYqXxJB2qoR+dJd6hh48lmvwdjDF/nGe5aFIYjJ/
251xm2yojs3BQs4/v7lJOA4RT9K0r57ZjzhsnMqSNgnJVa6ufZel76wZPxsjwfR2
KQeoNRd2ftPaIfduscaHl6Ay9vMj4oGah/T2wuAI6hpV8Y9iB6XTA27dbxTCfq1Y
AC2yzWvno3jgSYaKVdJ4TfPoZ4HynTQveTuF8zJakmUPJ6k0oJTku0t8ij3+gF+C
Y0KtOFMFw3FRZe4va4UJRHqHxqOhqT5BAP5BmaDKYv8myLcEqQWJmKshx+wxxsoF
astAXY79zkS3+n90c5pn0aqdLJk0KWsIN+46gp7c/CYlLH1PCMI=
=1g7l
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Fixes, netfs read support and checkpatch rewrite:
- fix syzcaller uninitialized value usage after missing error check
- add module autoloading based on transport name
- convert cached reads to use netfs helpers
- adjust readahead based on transport msize
- and many, many checkpatch.pl warning fixes..."
* tag '9p-for-5.16-rc1' of git://github.com/martinetd/linux:
9p: fix a bunch of checkpatch warnings
9p: set readahead and io size according to maxsize
9p p9mode2perm: remove useless strlcpy and check sscanf return code
9p v9fs_parse_options: replace simple_strtoul with kstrtouint
9p: fix file headers
fs/9p: fix indentation and Add missing a blank line after declaration
fs/9p: fix warnings found by checkpatch.pl
9p: fix minor indentation and codestyle
fs/9p: cleanup: opening brace at the beginning of the next line
9p: Convert to using the netfs helper lib to do reads and caching
fscache_cookie_enabled: check cookie is valid before accessing it
net/9p: autoload transport modules
9p/net: fix missing error check in p9_check_errors
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
Move core_kernel_data() into sections.h and rename it to
is_kernel_core_data(), also make it return bool value, then update all the
callers.
Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current conversion of skb->data_end reads like this:
; data_end = (void*)(long)skb->data_end;
559: (79) r1 = *(u64 *)(r2 +200) ; r1 = skb->data
560: (61) r11 = *(u32 *)(r2 +112) ; r11 = skb->len
561: (0f) r1 += r11
562: (61) r11 = *(u32 *)(r2 +116)
563: (1f) r1 -= r11
But similar to the case in 84f44df664 ("bpf: sock_ops sk access may stomp
registers when dst_reg = src_reg"), the code will read an incorrect skb->len
when src == dst. In this case we end up generating this xlated code:
; data_end = (void*)(long)skb->data_end;
559: (79) r1 = *(u64 *)(r1 +200) ; r1 = skb->data
560: (61) r11 = *(u32 *)(r1 +112) ; r11 = (skb->data)->len
561: (0f) r1 += r11
562: (61) r11 = *(u32 *)(r1 +116)
563: (1f) r1 -= r11
... where line 560 is the reading 4B of (skb->data + 112) instead of the
intended skb->len Here the skb pointer in r1 gets set to skb->data and the
later deref for skb->len ends up following skb->data instead of skb.
This fixes the issue similarly to the patch mentioned above by creating an
additional temporary variable and using to store the register when dst_reg =
src_reg. We name the variable bpf_temp_reg and place it in the cb context for
sk_skb. Then we restore from the temp to ensure nothing is lost.
Fixes: 16137b09a6 ("bpf: Compute data_end dynamically with JIT code")
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-6-john.fastabend@gmail.com
Strparser is reusing the qdisc_skb_cb struct to stash the skb message handling
progress, e.g. offset and length of the skb. First this is poorly named and
inherits a struct from qdisc that doesn't reflect the actual usage of cb[] at
this layer.
But, more importantly strparser is using the following to access its metadata.
(struct _strp_msg *)((void *)skb->cb + offsetof(struct qdisc_skb_cb, data))
Where _strp_msg is defined as:
struct _strp_msg {
struct strp_msg strp; /* 0 8 */
int accum_len; /* 8 4 */
/* size: 12, cachelines: 1, members: 2 */
/* last cacheline: 12 bytes */
};
So we use 12 bytes of ->data[] in struct. However in BPF code running parser
and verdict the user has read capabilities into the data[] array as well. Its
not too problematic, but we should not be exposing internal state to BPF
program. If its really needed then we can use the probe_read() APIs which allow
reading kernel memory. And I don't believe cb[] layer poses any API breakage by
moving this around because programs can't depend on cb[] across layers.
In order to fix another issue with a ctx rewrite we need to stash a temp
variable somewhere. To make this work cleanly this patch builds a cb struct
for sk_skb types called sk_skb_cb struct. Then we can use this consistently
in the strparser, sockmap space. Additionally we can start allowing ->cb[]
write access after this.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-5-john.fastabend@gmail.com
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
We do not need to handle unhash from BPF side we can simply wait for the
close to happen. The original concern was a socket could transition from
ESTABLISHED state to a new state while the BPF hook was still attached.
But, we convinced ourself this is no longer possible and we also improved
BPF sockmap to handle listen sockets so this is no longer a problem.
More importantly though there are cases where unhash is called when data is
in the receive queue. The BPF unhash logic will flush this data which is
wrong. To be correct it should keep the data in the receive queue and allow
a receiving application to continue reading the data. This may happen when
tcp_abort() is received for example. Instead of complicating the logic in
unhash simply moving all this to tcp_close() hook solves this.
Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-3-john.fastabend@gmail.com
In order to fix an issue with sockets in TCP sockmap redirect cases we plan
to allow CLOSE state sockets to exist in the sockmap. However, the check in
bpf_sk_lookup_assign() currently only invalidates sockets in the
TCP_ESTABLISHED case relying on the checks on sockmap insert to ensure we
never SOCK_CLOSE state sockets in the map.
To prepare for this change we flip the logic in bpf_sk_lookup_assign() to
explicitly test for the accepted cases. Namely, a tcp socket in TCP_LISTEN
or a udp socket in TCP_CLOSE state. This also makes the code more resilent
to future changes.
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-2-john.fastabend@gmail.com
This patch moves ceph_osdc_copy_from() function out of libceph code into
cephfs. There are no other users for this function, and there is the need
(in another patch) to access internal ceph_osd_request struct members.
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Call to build_initial_monmap() is one stone two birds. Explicitly it
initializes err variable. Implicitly it initializes ->monmap via call to
kzalloc(). We should only declare err and ->monmap is taken care of by
ceph_monc_init() prototype.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>