__nft_release_hooks() is called from pre_netns exit path which
unregisters the hooks, then the NETDEV_UNREGISTER event is triggered
which unregisters the hooks again.
[ 565.221461] WARNING: CPU: 18 PID: 193 at net/netfilter/core.c:495 __nf_unregister_net_hook+0x247/0x270
[...]
[ 565.246890] CPU: 18 PID: 193 Comm: kworker/u64:1 Tainted: G E 5.18.0-rc7+ #27
[ 565.253682] Workqueue: netns cleanup_net
[ 565.257059] RIP: 0010:__nf_unregister_net_hook+0x247/0x270
[...]
[ 565.297120] Call Trace:
[ 565.300900] <TASK>
[ 565.304683] nf_tables_flowtable_event+0x16a/0x220 [nf_tables]
[ 565.308518] raw_notifier_call_chain+0x63/0x80
[ 565.312386] unregister_netdevice_many+0x54f/0xb50
Unregister and destroy netdev hook from netns pre_exit via kfree_rcu
so the NETDEV_UNREGISTER path see unregistered hooks.
Fixes: 767d1216bf ("netfilter: nftables: fix possible UAF over chains from packet path in netns")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
clean_net() runs in workqueue while walking over the lists, grab mutex.
Fixes: 767d1216bf ("netfilter: nftables: fix possible UAF over chains from packet path in netns")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add several sanity checks for nft_set_desc_concat_parse():
- validate desc->field_count not larger than desc->field_len array.
- field length cannot be larger than desc->field_len (ie. U8_MAX)
- total length of the concatenation cannot be larger than register array.
Joint work with Florian Westphal.
Fixes: f3a2181e16 ("netfilter: nf_tables: Support for sets with multiple ranged fields")
Reported-by: <zhangziming.zzm@antgroup.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RFC 9131 changes default behaviour of handling RX of NA messages when the
corresponding entry is absent in the neighbour cache. The current
implementation is limited to accept just unsolicited NAs. However, the
RFC is more generic where it also accepts solicited NAs. Both types
should result in adding a STALE entry for this case.
Expand accept_untracked_na behaviour to also accept solicited NAs to
be compliant with the RFC and rename the sysctl knob to
accept_untracked_na.
Fixes: f9a2fb7331 ("net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131")
Signed-off-by: Arun Ajith S <aajith@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220530101414.65439-1-aajith@arista.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Fix the following build warning when CONFIG_IPV6 is not set:
In function ‘fortify_memcpy_chk’,
inlined from ‘tcp_md5_do_add’ at net/ipv4/tcp_ipv4.c:1210:2:
./include/linux/fortify-string.h:328:4: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
328 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220526101213.2392980-1-zhanggenjian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmKSSbcTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXgJyCACeyMOcFws5lyqqdk0R0zGr2KFfKsJn
YQR9nvldT2p/1y0ykvU208UIq0HHmXOb9pD8gOUzGYGp4XlEaC1f4V37mmzgLcRu
vL/HcFqBl2cQEfaQxiXZrmsIIVszwbc57EGqpl93cS2er4hp/NXmredKCId7Mpt8
FjxjgVGzdhEUKbJZYjkDM5pYAnJ9QVwuK3MaarKMK86Oj1P5YtKgIb4ZSt/NHvsC
Mukx3nivSH29XfK3fRsFDJUQr9WNYh1cmTtyhB0tWVXQCYFc4angZRtCJwyXzkp2
P5GBIQoMZcXX2XWkUBTtA1w5g/aZZsBExb3YGhQjsQP+jb6MtDnvOEo9
=Z2E+
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Harden hv_sock driver (Andrea Parri)
- Harden Hyper-V PCI driver (Andrea Parri)
- Fix multi-MSI for Hyper-V PCI driver (Jeffrey Hugo)
- Fix Hyper-V PCI to reduce boot time (Dexuan Cui)
- Remove code for long EOL'ed Hyper-V versions (Michael Kelley, Saurabh
Sengar)
- Fix balloon driver error handling (Shradha Gupta)
- Fix a typo in vmbus driver (Julia Lawall)
- Ignore vmbus IMC device (Michael Kelley)
- Add a new error message to Hyper-V DRM driver (Saurabh Sengar)
* tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (28 commits)
hv_balloon: Fix balloon_probe() and balloon_remove() error handling
scsi: storvsc: Removing Pre Win8 related logic
Drivers: hv: vmbus: fix typo in comment
PCI: hv: Fix synchronization between channel callback and hv_pci_bus_exit()
PCI: hv: Add validation for untrusted Hyper-V values
PCI: hv: Fix interrupt mapping for multi-MSI
PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
drm/hyperv: Remove support for Hyper-V 2008 and 2008R2/Win7
video: hyperv_fb: Remove support for Hyper-V 2008 and 2008R2/Win7
scsi: storvsc: Remove support for Hyper-V 2008 and 2008R2/Win7
Drivers: hv: vmbus: Remove support for Hyper-V 2008 and Hyper-V 2008R2/Win7
x86/hyperv: Disable hardlockup detector by default in Hyper-V guests
drm/hyperv: Add error message for fb size greater than allocated
PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
Drivers: hv: vmbus: Refactor the ring-buffer iterator functions
Drivers: hv: vmbus: Accept hv_sock offers in isolated guests
hv_sock: Add validation for untrusted Hyper-V values
hv_sock: Copy packets sent by Hyper-V out of the ring buffer
hv_sock: Check hv_pkt_iter_first_raw()'s return value
...
Use ida_alloc()/ida_free() instead of deprecated
ida_simple_get()/ida_simple_remove() .
Signed-off-by: keliu <liuke94@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"struct smc_cdc_tx_pend **" can not directly convert
to "struct smc_wr_tx_pend_priv *".
Fixes: 2bced6aefa ("net/smc: put slot when connection is killed")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reports:
BUG: KASAN: slab-out-of-bounds in __list_del_entry_valid+0xcc/0xf0 lib/list_debug.c:42
[..]
list_del include/linux/list.h:148 [inline]
cttimeout_net_exit+0x211/0x540 net/netfilter/nfnetlink_cttimeout.c:617
No reproducer so far. Looking at recent changes in this area
its clear that the free_head must not be at the end of the
structure because nf_ct_timeout structure has variable size.
Reported-by: <syzbot+92968395eedbdbd3617d@syzkaller.appspotmail.com>
Fixes: 78222bacfc ("netfilter: cttimeout: decouple unlink and free on netns destruction")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
syzbot reports following warn:
WARNING: CPU: 0 PID: 3600 at net/netfilter/nfnetlink.c:703 nfnetlink_unbind+0x357/0x3b0 net/netfilter/nfnetlink.c:694
The syzbot generated program does this:
socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3
setsockopt(3, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, [1], 4) = 0
... which triggers 'WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0)' check.
Instead of counting, just enable reporting for every bind request
and check if we still have listeners on unbind.
While at it, also add the needed bounds check on nfnl_group2type[]
access.
Reported-by: <syzbot+4903218f7fba0a2d6226@syzkaller.appspotmail.com>
Reported-by: <syzbot+afd2d80e495f96049571@syzkaller.appspotmail.com>
Fixes: 2794cdb0b9 ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of a virtual kernel address use a pointer of the associated
struct page as second parameter of gnttab_end_foreign_access().
Most users have that pointer available already and are creating the
virtual address from it, risking problems in case the memory is
located in highmem.
gnttab_end_foreign_access() itself won't need to get the struct page
from the address again.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static tools
- Clean the some of the kernel caps, reducing the historical stealth uAPI
leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core to
free them
- Several rc quality bug fixes for hfi1
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYo+NxgAKCRCFwuHvBreF
YbqSAQDJ+QolaATUvOQUPLbuLopUCJLe95VS15Kl3SNXiVUUFAEA8DLL1s6+WShd
AgypUxGHipx5BAytrn45/WiwuDeEbQ8=
=jgTl
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static
tools
- Clean the some of the kernel caps, reducing the historical stealth
uAPI leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs
layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core
to free them
- Several rc quality bug fixes for hfi1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits)
RDMA/rtrs-clt: Fix one kernel-doc comment
RDMA/hfi1: Remove all traces of diagpkt support
RDMA/hfi1: Consolidate software versions
RDMA/hfi1: Remove pointless driver version
RDMA/hfi1: Fix potential integer multiplication overflow errors
RDMA/hfi1: Prevent panic when SDMA is disabled
RDMA/hfi1: Prevent use of lock before it is initialized
RDMA/rxe: Fix an error handling path in rxe_get_mcg()
IB/core: Fix typo in comment
RDMA/core: Fix typo in comment
IB/hf1: Fix typo in comment
IB/qib: Fix typo in comment
IB/iser: Fix typo in comment
RDMA/mlx4: Avoid flush_scheduled_work() usage
IB/isert: Avoid flush_scheduled_work() usage
RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr()
RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq()
RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx()
RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx()
RDMA/irdma: Add SW mechanism to generate completions on error
...
We introduce "courteous server" in this release. Previously NFSD
would purge open and lock state for an unresponsive client after
one lease period (typically 90 seconds). Now, after one lease
period, another client can open and lock those files and the
unresponsive client's lease is purged; otherwise if the unrespon-
sive client's open and lock state is uncontended, the server retains
that open and lock state for up to 24 hours, allowing the client's
workload to resume after a lengthy network partition.
A longstanding issue with NFSv4 file creation is also addressed.
Previously a file creation can fail internally, returning an error
to the client, but leave the newly created file in place as an
artifact. The file creation code path has been reorganized so that
internal failures and race conditions are less likely to result in
an unwanted file creation.
A fault injector has been added to help exercise paths that are run
during kernel metadata cache invalidation. These caches contain
information maintained by user space about exported filesystems.
Many of our test workloads do not trigger cache invalidation.
There is one patch that is needed to support PREEMPT_RT and a fix
for an ancient "sleep while spin-locked" splat that seems to have
become easier to hit since v5.18-rc3.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKPliAACgkQM2qzM29m
f5dB3BAAorPa2L8xu5P1Ge1oTNogNSOVRkLPDzEkfEwK07ZM2qvz78eMZGkMziJ/
strorvBWl3SWBlVtTePgNpJUjgYQ75MRRwaX7Qh2WuHeRKm1JlZm0/NId3+zKgbh
N40QI20jdswWcNDuhidxVFFWurd09GlcM4z1cu8gZLbfthkiUOjZoPiLkXeNcvhk
7wC9GiueWxHefYQQDAKh1nQS/L0GG1EkzJdJo7WUVAldZ9qVY9LpmJVMRqrBBbta
XrFYfpeY1zFFDY4Qolyz5PUJSeQuDj9PctlhoZ6B1hp56PD/6yaqVhYXiPxtlALj
tITtktfiekULZkgfvfvyzssCv+wkbYiaEBZcSSCauR7dkGOmBmajO+cf7vpsERgE
fbCU8DWGk78SMeehdCrO+26cV37VP+8c2t2Txq/rG5Eq4ZoCi++Hj5poRboFLqb+
oom+0Ee0LfcAKXkxH5gWTPTblHo49GzGitPZtRzTgZ9uFnVwvEaJ4+t0ij0J8JpL
HuVtWrg5/REhqpEvOSwF0sRmkYWLTu7KdueGn/iZ8xUi7GHEue01NsVkClohKJcR
WOjWrbNCNF/LJaG88MX0z5u7IO7s9bOHphd7PJ92vR+4YsehW3uRhk+rNi2ZBqQz
hzULfu8BiaicV9fdB/hDcMmKQD6U6due2AVVPtxTf5XY+CHQNRY=
=phE1
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"We introduce 'courteous server' in this release. Previously NFSD would
purge open and lock state for an unresponsive client after one lease
period (typically 90 seconds). Now, after one lease period, another
client can open and lock those files and the unresponsive client's
lease is purged; otherwise if the unresponsive client's open and lock
state is uncontended, the server retains that open and lock state for
up to 24 hours, allowing the client's workload to resume after a
lengthy network partition.
A longstanding issue with NFSv4 file creation is also addressed.
Previously a file creation can fail internally, returning an error to
the client, but leave the newly created file in place as an artifact.
The file creation code path has been reorganized so that internal
failures and race conditions are less likely to result in an unwanted
file creation.
A fault injector has been added to help exercise paths that are run
during kernel metadata cache invalidation. These caches contain
information maintained by user space about exported filesystems. Many
of our test workloads do not trigger cache invalidation.
There is one patch that is needed to support PREEMPT_RT and a fix for
an ancient 'sleep while spin-locked' splat that seems to have become
easier to hit since v5.18-rc3"
* tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits)
NFSD: nfsd_file_put() can sleep
NFSD: Add documenting comment for nfsd4_release_lockowner()
NFSD: Modernize nfsd4_release_lockowner()
NFSD: Fix possible sleep during nfsd4_release_lockowner()
nfsd: destroy percpu stats counters after reply cache shutdown
nfsd: Fix null-ptr-deref in nfsd_fill_super()
nfsd: Unregister the cld notifier when laundry_wq create failed
SUNRPC: Use RMW bitops in single-threaded hot paths
NFSD: Clean up the show_nf_flags() macro
NFSD: Trace filecache opens
NFSD: Move documenting comment for nfsd4_process_open2()
NFSD: Fix whitespace
NFSD: Remove dprintk call sites from tail of nfsd4_open()
NFSD: Instantiate a struct file when creating a regular NFSv4 file
NFSD: Clean up nfsd_open_verified()
NFSD: Remove do_nfsd_create()
NFSD: Refactor NFSv4 OPEN(CREATE)
NFSD: Refactor NFSv3 CREATE
NFSD: Refactor nfsd_create_setattr()
NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create()
...
When cloning a packet-based limit expression, copy the cost value as
well. Otherwise the new limit is not functional anymore.
Fixes: 3b9e2ea6c1 ("netfilter: nft_limit: move stateful fields out of expression data")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since 3e135cd499 ("netfilter: nft_dynset: dynamic stateful expression
instantiation"), it is possible to attach stateful expressions to set
elements.
cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate
and destroy phase") introduces conditional destruction on the object to
accomodate transaction semantics.
nft_expr_init() calls expr->ops->init() first, then check for
NFT_STATEFUL_EXPR, this stills allows to initialize a non-stateful
lookup expressions which points to a set, which might lead to UAF since
the set is not properly detached from the set->binding for this case.
Anyway, this combination is non-sense from nf_tables perspective.
This patch fixes this problem by checking for NFT_STATEFUL_EXPR before
expr->ops->init() is called.
The reporter provides a KASAN splat and a poc reproducer (similar to
those autogenerated by syzbot to report use-after-free errors). It is
unknown to me if they are using syzbot or if they use similar automated
tool to locate the bug that they are reporting.
For the record, this is the KASAN splat.
[ 85.431824] ==================================================================
[ 85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20
[ 85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776
[ 85.434756]
[ 85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G W 5.18.0+ #2
[ 85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Fixes: 0b2d8a7b63 ("netfilter: nf_tables: add helper functions for expression handling")
Reported-and-tested-by: Aaron Adams <edg-e@nccgroup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Yuwei reported that plain reuse of DELAY_PROBE_TIME to rearm work queue
in neigh_managed_work is problematic if user explicitly configures the
DELAY_PROBE_TIME to 0 for a neighbor table. Such misconfig can then hog
CPU to 100% processing the system work queue. Instead, set lower interval
bound to HZ which is totally sufficient. Yuwei is additionally looking
into making the interval separately configurable from DELAY_PROBE_TIME.
Reported-by: Yuwei Wang <wangyuweihx@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/netdev/797c3c53-ce1b-9f60-e253-cda615788f4a@iogearbox.net
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/3b8c5aa906c52c3a8c995d1b2e8ccf650ea7c716.1653432794.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the process of checking whether RDMAv2 is available, the current
implementation first sets ini->smcrv2.ib_dev_v2, and then allocates
smc buf desc and register rmb, but the latter may fail. In this case,
the pointer should be reset.
Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220525085408.812273-1-liuyacan@corp.netease.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding
to very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space
to remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections
that have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting.
This makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core
from Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKNMPQACgkQMUZtbf5S
IrsRARAAuDyYs6jFYB3p+xazZdOnbF4iAgVv71+DQGvmsCl6CB9OrsNZMlvE85OL
Q3gjcRbgjrkN4lhgI8DmiGYbsUJnAvVjFdNjccz1Z/vTLYvuIM0ol54MUp5S+9WY
StncOJkOGJxxR/Gi5gzVmejPDsysU3Jik+hm/fpIcz8pybXxAsFKU5waY5qfl+/T
TZepfV0VCfqRDjqcF1qA5+jJZNU8pdodQlZ1+mh8bwu6Jk1ZkWkj6Ov8MWdwQldr
LnPeK/9hIGzkdJYHZfajxA3t8D0K5CHzSuih2bJ9ry8ZXgVBkXEThew778/R5izW
uB0YZs9COFlrIP7XHjtRTy/2xHOdYIPlj2nWhVdfuQDX8Crvt4VRN6EZ1rjko1ZJ
WanfG6WHF8NH5pXBRQbh3kIMKBnYn6OIzuCfCQSqd+niHcxFIM4vRiggeXI5C5TW
vJgEWfK6X+NfDiFVa3xyCrEmp5ieA/pNecpwd8rVkql+MtFAAw4vfsotLKOJEAru
J/XL6UE+YuLqIJV9ACZ9x1AFXXAo661jOxBunOo4VXhXVzWS9lYYz5r5ryIkgT/8
/Fr0zjANJWgfIuNdIBtYfQ4qG+LozGq038VA06RhFUAZ5tF9DzhqJs2Q2AFuWWBC
ewCePJVqo1j2Ceq2mGonXRt47OEnlePoOxTk9W+cKZb7ZWE+zEo=
=Wjii
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding to
very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space to
remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections that
have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting. This
makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core from
Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection"
* tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits)
ptp: ocp: Add firmware header checks
ptp: ocp: fix PPS source selector debugfs reporting
ptp: ocp: add .init function for sma_op vector
ptp: ocp: vectorize the sma accessor functions
ptp: ocp: constify selectors
ptp: ocp: parameterize input/output sma selectors
ptp: ocp: revise firmware display
ptp: ocp: add Celestica timecard PCI ids
ptp: ocp: Remove #ifdefs around PCI IDs
ptp: ocp: 32-bit fixups for pci start address
Revert "net/smc: fix listen processing for SMC-Rv2"
ath6kl: Use cc-disable-warning to disable -Wdangling-pointer
selftests/bpf: Dynptr tests
bpf: Add dynptr data slices
bpf: Add bpf_dynptr_read and bpf_dynptr_write
bpf: Dynptr support for ring buffers
bpf: Add bpf_dynptr_from_mem for local dynptrs
bpf: Add verifier support for dynptrs
bpf: Suppress 'passing zero to PTR_ERR' warning
bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
...
Fix the following coccicheck warning:
net/ceph/crush/mapper.c:1077:8-9: WARNING opportunity for swap()
by using swap() for the swapping of variable values and drop
the tmp variable that is not needed any more.
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This KUnit update for Linux 5.19-rc1 consists of several fixes, cleanups,
and enhancements to tests and framework:
- introduces _NULL and _NOT_NULL macros to pointer error checks
- reworks kunit_resource allocation policy to fix memory leaks when
caller doesn't specify free() function to be used when allocating
memory using kunit_add_resource() and kunit_alloc_resource() funcs.
- adds ability to specify suite-level init and exit functions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmKLw4QACgkQCwJExA0N
Qxz9wRAA3PonJESDAFF2sXTDzQurEXdWoJHqNvO0JCObku8SDODEI7nozXOD0MBC
ASAXiX3HuNI0yESF27xECqu3xbe8KsYOtCN8vco/sYUroVGmzgAt/atsvrSUv2Oh
sEQbjrTMwkMUjL5ECvjR2dArd6bQew7PPBkl3HqOpyysL3b/EAMEAY0DmDXrrrwB
+oNvXGVAR1Tczg4ahcSSwDdZl1C41kREj5f8S/4+kohMdIjCUPWOAYnaWHpVdAOJ
C+LWkPSJ5IpgjU2urDX2kNfg32UxIJpFI009ovytBmwCbd+GEs24u7gtgtksPM2s
YypoPEqC40gxkbY99omojtADiDdZlKqlIipCTWYe/CpzgBD+WQ4PVqMGM4ZprP9w
Hrc6ulVmd8hZ4F9QQ3oN6W9L6pBCgdXtPPCsQtGoUTbw7r79BP67PjJ6Ko+usn3s
Jy0FR5LvzYBjykoJzKSIaJ8ONaX34DB6w5rB+q5mBGwPKPHWo3eAZVZDPEMVo3Z7
D9TW5UliGBt2y5YJZbPbSnhdJPMPHSK5ef9hIy0wYjVJFafirdgrQhgbWbVxalRT
eZz1edcs1sdU7GAzfMA/v+NqAAA3bFIUVr2b+GTc+4zzWhq+cwI2SNikgyhETv/f
xKq8Xek8EkOIdaa2lu9chTPT4sG7A6991EkRqfc7rL1IptkPiS8=
=DzVQ
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
"Several fixes, cleanups, and enhancements to tests and framework:
- introduce _NULL and _NOT_NULL macros to pointer error checks
- rework kunit_resource allocation policy to fix memory leaks when
caller doesn't specify free() function to be used when allocating
memory using kunit_add_resource() and kunit_alloc_resource() funcs.
- add ability to specify suite-level init and exit functions"
* tag 'linux-kselftest-kunit-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (41 commits)
kunit: tool: Use qemu-system-i386 for i386 runs
kunit: fix executor OOM error handling logic on non-UML
kunit: tool: update riscv QEMU config with new serial dependency
kcsan: test: use new suite_{init,exit} support
kunit: tool: Add list of all valid test configs on UML
kunit: take `kunit_assert` as `const`
kunit: tool: misc cleanups
kunit: tool: minor cosmetic cleanups in kunit_parser.py
kunit: tool: make parser stop overwriting status of suites w/ no_tests
kunit: tool: remove dead parse_crash_in_log() logic
kunit: tool: print clearer error message when there's no TAP output
kunit: tool: stop using a shell to run kernel under QEMU
kunit: tool: update test counts summary line format
kunit: bail out of test filtering logic quicker if OOM
lib/Kconfig.debug: change KUnit tests to default to KUNIT_ALL_TESTS
kunit: Rework kunit_resource allocation policy
kunit: fix debugfs code to use enum kunit_status, not bool
kfence: test: use new suite_{init/exit} support, add .kunitconfig
kunit: add ability to specify suite-level init and exit functions
kunit: rename print_subtest_{start,end} for clarity (s/subtest/suite)
...
One may want to have DF set on large packets to support discovering
path mtu and limiting the size of generated packets (hence not
setting the XFRM_STATE_NOPMTUDISC tunnel flag), while still
supporting networks that are incapable of carrying even minimal
sized IPv6 frames (post encapsulation).
Having IPv4 Don't Frag bit set on encapsulated IPv6 frames that
are not larger than the minimum IPv6 mtu of 1280 isn't useful,
because the resulting ICMP Fragmentation Required error isn't
actionable (even assuming you receive it) because IPv6 will not
drop it's path mtu below 1280 anyway. While the IPv4 stack
could prefrag the packets post encap, this requires the ICMP
error to be successfully delivered and causes a loss of the
original IPv6 frame (thus requiring a retransmit and latency
hit). Luckily with IPv4 if we simply don't set the DF flag,
we'll just make further fragmenting the packets some other
router's problems.
We'll still learn the correct IPv4 path mtu through encapsulation
of larger IPv6 frames.
I'm still not convinced this patch is entirely sufficient to make
everything happy... but I don't see how it could possibly
make things worse.
See also recent:
4ff2980b6b 'xfrm: fix tunnel model fragmentation behavior'
and friends
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Lina Wang <lina.wang@mediatek.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Maciej Zenczykowski <maze@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This reverts commit 4dc2a5a8f6.
A non-zero return value from pfkey_broadcast() does not necessarily mean
an error occurred as this function returns -ESRCH when no registered
listener received the message. In particular, a call with
BROADCAST_PROMISC_ONLY flag and null one_sk argument can never return
zero so that this commit in fact prevents processing any PF_KEY message.
One visible effect is that racoon daemon fails to find encryption
algorithms like aes and refuses to start.
Excluding -ESRCH return value would fix this but it's not obvious that
we really want to bail out here and most other callers of
pfkey_broadcast() also ignore the return value. Also, as pointed out by
Steffen Klassert, PF_KEY is kind of deprecated and newer userspace code
should use netlink instead so that we should only disturb the code for
really important fixes.
v2: add a comment explaining why is the return value ignored
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKKpM8ACgkQSfxwEqXe
A6726w/+OJimGd4arvpSmdn+vxepSyDLgKfwM0x5zprRVd16xg8CjJr4eMonTesq
YvtJRqpetb53MB+sMhutlvQqQzrjtf2MBkgPwF4I2gUrk7vLD45Q+AGdGhi/rUwz
wHGA7xg1FHLHia2M/9idSqi8QlZmUP4u4l5ZnMyTUHiwvRD6XOrWKfqvUSawNzyh
hCWlTUxDrjizsW5YpsJX/MkRadSC8loJEk5ByZebow6nRPfurJvqfrcOMgHyNrbY
pOZ/CGPxcetMqotL2TuuJt5wKmenqYhIWGAp3YM2SWWgU2ueBZekW8AYeMfgUcvh
LWV93RpSuAnE5wsdjIULvjFnEDJBf8ihfMnMrd9G5QjQu44tuKWfY2MghLSpYzaR
V6UFbRmhrqhqiStHQXOvk1oqxtpbHlc9zzJLmvPmDJcbvzXQ9Opk5GVXAmdtnHnj
M/ty3wGWxucY6mHqT8MkCShSSslbgEtc1pEIWHdrUgnaiSVoCVBEO+9LqLbjvOTm
XA/6YtoiCE5FasK51pir1zVb2GORQn0v8HnuAOsusD/iPAlRQ/G5jZkaXbwRQI6j
atYL1svqvSKn5POnzqAlMUXfMUr19K5xqJdp7i6qmlO1Vq6Z+tWbCQgD1JV+Wjkb
CMyvXomFCFu4aYKGRE2SBRnWLRghG3kYHqEQ15yTPMQerxbUDNg=
=SUr3
-----END PGP SIGNATURE-----
Merge tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"These updates continue to refine the work began in 5.17 and 5.18 of
modernizing the RNG's crypto and streamlining and documenting its
code.
New for 5.19, the updates aim to improve entropy collection methods
and make some initial decisions regarding the "premature next" problem
and our threat model. The cloc utility now reports that random.c is
931 lines of code and 466 lines of comments, not that basic metrics
like that mean all that much, but at the very least it tells you that
this is very much a manageable driver now.
Here's a summary of the various updates:
- The random_get_entropy() function now always returns something at
least minimally useful. This is the primary entropy source in most
collectors, which in the best case expands to something like RDTSC,
but prior to this change, in the worst case it would just return 0,
contributing nothing. For 5.19, additional architectures are wired
up, and architectures that are entirely missing a cycle counter now
have a generic fallback path, which uses the highest resolution
clock available from the timekeeping subsystem.
Some of those clocks can actually be quite good, despite the CPU
not having a cycle counter of its own, and going off-core for a
stamp is generally thought to increase jitter, something positive
from the perspective of entropy gathering. Done very early on in
the development cycle, this has been sitting in next getting some
testing for a while now and has relevant acks from the archs, so it
should be pretty well tested and fine, but is nonetheless the thing
I'll be keeping my eye on most closely.
- Of particular note with the random_get_entropy() improvements is
MIPS, which, on CPUs that lack the c0 count register, will now
combine the high-speed but short-cycle c0 random register with the
lower-speed but long-cycle generic fallback path.
- With random_get_entropy() now always returning something useful,
the interrupt handler now collects entropy in a consistent
construction.
- Rather than comparing two samples of random_get_entropy() for the
jitter dance, the algorithm now tests many samples, and uses the
amount of differing ones to determine whether or not jitter entropy
is usable and how laborious it must be. The problem with comparing
only two samples was that if the cycle counter was extremely slow,
but just so happened to be on the cusp of a change, the slowness
wouldn't be detected. Taking many samples fixes that to some
degree.
This, combined with the other improvements to random_get_entropy(),
should make future unification of /dev/random and /dev/urandom
maybe more possible. At the very least, were we to attempt it again
today (we're not), it wouldn't break any of Guenter's test rigs
that broke when we tried it with 5.18. So, not today, but perhaps
down the road, that's something we can revisit.
- We attempt to reseed the RNG immediately upon waking up from system
suspend or hibernation, making use of the various timestamps about
suspend time and such available, as well as the usual inputs such
as RDRAND when available.
- Batched randomness now falls back to ordinary randomness before the
RNG is initialized. This provides more consistent guarantees to the
types of random numbers being returned by the various accessors.
- The "pre-init injection" code is now gone for good. I suspect you
in particular will be happy to read that, as I recall you
expressing your distaste for it a few months ago. Instead, to avoid
a "premature first" issue, while still allowing for maximal amount
of entropy availability during system boot, the first 128 bits of
estimated entropy are used immediately as it arrives, with the next
128 bits being buffered. And, as before, after the RNG has been
fully initialized, it winds up reseeding anyway a few seconds later
in most cases. This resulted in a pretty big simplification of the
initialization code and let us remove various ad-hoc mechanisms
like the ugly crng_pre_init_inject().
- The RNG no longer pretends to handle the "premature next" security
model, something that various academics and other RNG designs have
tried to care about in the past. After an interesting mailing list
thread, these issues are thought to be a) mainly academic and not
practical at all, and b) actively harming the real security of the
RNG by delaying new entropy additions after a potential compromise,
making a potentially bad situation even worse. As well, in the
first place, our RNG never even properly handled the premature next
issue, so removing an incomplete solution to a fake problem was
particularly nice.
This allowed for numerous other simplifications in the code, which
is a lot cleaner as a consequence. If you didn't see it before,
https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a
thread worth skimming through.
- While the interrupt handler received a separate code path years ago
that avoids locks by using per-cpu data structures and a faster
mixing algorithm, in order to reduce interrupt latency, input and
disk events that are triggered in hardirq handlers were still
hitting locks and more expensive algorithms. Those are now
redirected to use the faster per-cpu data structures.
- Rather than having the fake-crypto almost-siphash-based random32
implementation be used right and left, and in many places where
cryptographically secure randomness is desirable, the batched
entropy code is now fast enough to replace that.
- As usual, numerous code quality and documentation cleanups. For
example, the initialization state machine now uses enum symbolic
constants instead of just hard coding numbers everywhere.
- Since the RNG initializes once, and then is always initialized
thereafter, a pretty heavy amount of code used during that
initialization is never used again. It is now completely cordoned
off using static branches and it winds up in the .text.unlikely
section so that it doesn't reduce cache compactness after the RNG
is ready.
- A variety of functions meant for waiting on the RNG to be
initialized were only used by vsprintf, and in not a particularly
optimal way. Replacing that usage with a more ordinary setup made
it possible to remove those functions.
- A cleanup of how we warn userspace about the use of uninitialized
/dev/urandom and uninitialized get_random_bytes() usage.
Interestingly, with the change you merged for 5.18 that attempts to
use jitter (but does not block if it can't), the majority of users
should never see those warnings for /dev/urandom at all now, and
the one for in-kernel usage is mainly a debug thing.
- The file_operations struct for /dev/[u]random now implements
.read_iter and .write_iter instead of .read and .write, allowing it
to also implement .splice_read and .splice_write, which makes
splice(2) work again after it was broken here (and in many other
places in the tree) during the set_fs() removal. This was a bit of
a last minute arrival from Jens that hasn't had as much time to
bake, so I'll be keeping my eye on this as well, but it seems
fairly ordinary. Unfortunately, read_iter() is around 3% slower
than read() in my tests, which I'm not thrilled about. But Jens and
Al, spurred by this observation, seem to be making progress in
removing the bottlenecks on the iter paths in the VFS layer in
general, which should remove the performance gap for all drivers.
- Assorted other bug fixes, cleanups, and optimizations.
- A small SipHash cleanup"
* tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits)
random: check for signals after page of pool writes
random: wire up fops->splice_{read,write}_iter()
random: convert to using fops->write_iter()
random: convert to using fops->read_iter()
random: unify batched entropy implementations
random: move randomize_page() into mm where it belongs
random: remove mostly unused async readiness notifier
random: remove get_random_bytes_arch() and add rng_has_arch_random()
random: move initialization functions out of hot pages
random: make consistent use of buf and len
random: use proper return types on get_random_{int,long}_wait()
random: remove extern from functions in header
random: use static branch for crng_ready()
random: credit architectural init the exact amount
random: handle latent entropy and command line from random_init()
random: use proper jiffies comparison macro
random: remove ratelimiting for in-kernel unseeded randomness
random: move initialization out of reseeding hot path
random: avoid initializing twice in credit race
random: use symbolic constants for crng_init states
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmKKlIAeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC3oH/iPm/fLG2sJut8My
sU0RC9K+6ESV5h2Qy6k00/lqKstlu4EvBjw4V8vYpx3Q2+hbSFMn2SeWqqqT3Lkk
Zb8KINCFuuyMtdCBb42PV0zhUf5pCQF7ocm/Ae4jllDHtPmqk3WJ6IGtZBK5JBlw
z6RR/wKt0y0MRj9eZyPyYjOee2L2vuVh4tgnexK/4L8g2ZtMMRThhvUzSMWG4zxR
STYYNp0uFcfT1Vt85+ODevFH4TvdECAj+SqAegN+seHLM17YY7M0/WiIYpxGRv8P
lIpDQl4PBU8EBkpI5hkpJ/3qPincbuVOMLsYfxFtpcjjG12vGjFp2krGpS3TedZQ
3mvaJ7c=
=vLke
-----END PGP SIGNATURE-----
Merge tag 'v5.18' into rdma.git for-next
Following patches have dependencies.
Resolve the merge conflict in
drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names
for the fs functions following linux-next:
https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-05-23
We've added 113 non-merge commits during the last 26 day(s) which contain
a total of 121 files changed, 7425 insertions(+), 1586 deletions(-).
The main changes are:
1) Speed up symbol resolution for kprobes multi-link attachments, from Jiri Olsa.
2) Add BPF dynamic pointer infrastructure e.g. to allow for dynamically sized ringbuf
reservations without extra memory copies, from Joanne Koong.
3) Big batch of libbpf improvements towards libbpf 1.0 release, from Andrii Nakryiko.
4) Add BPF link iterator to traverse links via seq_file ops, from Dmitrii Dolgov.
5) Add source IP address to BPF tunnel key infrastructure, from Kaixi Fan.
6) Refine unprivileged BPF to disable only object-creating commands, from Alan Maguire.
7) Fix JIT blinding of ld_imm64 when they point to subprogs, from Alexei Starovoitov.
8) Add BPF access to mptcp_sock structures and their meta data, from Geliang Tang.
9) Add new BPF helper for access to remote CPU's BPF map elements, from Feng Zhou.
10) Allow attaching 64-bit cookie to BPF link of fentry/fexit/fmod_ret, from Kui-Feng Lee.
11) Follow-ups to typed pointer support in BPF maps, from Kumar Kartikeya Dwivedi.
12) Add busy-poll test cases to the XSK selftest suite, from Magnus Karlsson.
13) Improvements in BPF selftest test_progs subtest output, from Mykola Lysenko.
14) Fill bpf_prog_pack allocator areas with illegal instructions, from Song Liu.
15) Add generic batch operations for BPF map-in-map cases, from Takshak Chahande.
16) Make bpf_jit_enable more user friendly when permanently on 1, from Tiezhu Yang.
17) Fix an array overflow in bpf_trampoline_get_progs(), from Yuntao Wang.
====================
Link: https://lore.kernel.org/r/20220523223805.27931-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Add support for Realtek 8761BUV
- Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk
- Add support for RTL8852C
- Add a new PID/VID 0489/e0c8 for MT7921
- Add support for Qualcomm WCN785x
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmKL8N4ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKfy6EACPSdWECxGv+20lpBmgRbfF
1Ahu2rFfEtK5iDaTiFLqBOczGkaYnhjq9/Er4VsYHMH1DVedOKoTskzfa2V68j0L
xyCU44E0x314YI/B2sWoMAYejkPX265r0292KzZUpXM2s+S0Doh3Z52glMrjh13C
GbaTfukocBZ6HNIiXp8xQLIaD6MbKGhs2VSWuoXLQqASujrixm1ucglsHNbjfL6e
huSL/SoxKukX8qhlO19pJniROLKq0aC5tQSk66ihvhHWlNBTNJYkg6HoJ8vrB81d
lBjVjPjElbTljQIC+PeOSCSeOn+ELizsUdGgWxiZ2fgzgQvko0fDSnGSnpRSCnNH
zxSBs/pn3cgVS6rSIIzFXsnhdBgAoiwqqhwy+bKsi3nnIXGriZMn82Dpr7z/+QWB
di2wKlZSXw8LEhDaBxi8espe1NkdhgSYWI8c2hh40PLkydBpGXwag9FBkMM4cgsU
4p/riOd776EEVtwXI+lDcAdMFcbMLTf094qZWi1of93VUgDPoy8TSy3ZdshupZah
8UQyeKKYb3lxXIXd/i6C/jhYgPvo/8ZWv+OwysPpAzvvyhAagEKS1czSGJ01yjsy
1k+ebPY4CIHBBT23UFYKxD6Tgn9SAiqKtf0XtbYzCV2UD+QtUadc8MeGFtYb+0j1
ybtkiTT++lU3M9KfAMtH8A==
=XFUz
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for Realtek 8761BUV
- Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk
- Add support for RTL8852C
- Add a new PID/VID 0489/e0c8 for MT7921
- Add support for Qualcomm WCN785x
* tag 'for-net-next-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (26 commits)
Bluetooth: hci_sync: use hci_skb_event() helper
Bluetooth: eir: Add helpers for managing service data
Bluetooth: hci_sync: Fix attempting to suspend with unfiltered passive scan
Bluetooth: MGMT: Add conditions for setting HCI_CONN_FLAG_REMOTE_WAKEUP
Bluetooth: btmtksdio: fix the reset takes too long
Bluetooth: btmtksdio: fix possible FW initialization failure
Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event
Bluetooth: btbcm: Add entry for BCM4373A0 UART Bluetooth
Bluetooth: btusb: Add a new PID/VID 0489/e0c8 for MT7921
Bluetooth: btusb: Add 0x0bda:0x8771 Realtek 8761BUV devices
Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ERR_DATA_REPORTING for QCA
Bluetooth: core: Fix missing power_on work cancel on HCI close
Bluetooth: btusb: add support for Qualcomm WCN785x
Bluetooth: protect le accept and resolv lists with hdev->lock
Bluetooth: use hdev lock for accept_list and reject_list in conn req
Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring
Bluetooth: btrtl: Add support for RTL8852C
Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN for QCA
Bluetooth: Print broken quirks
Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk
...
====================
Link: https://lore.kernel.org/r/20220523204151.3327345-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The handling of connection failures shall be handled by the request
completion callback as already done by hci_cs_le_create_conn, also make
sure to use hci_conn_failed instead of hci_le_conn_failed as the later
don't actually call hci_conn_del to cleanup.
Link: https://github.com/bluez/bluez/issues/340
Fixes: 8e8b92ee60 ("Bluetooth: hci_sync: Add hci_le_create_conn_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKotMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmVwEACo7qBTjrrneZEwlYUWrSr45QtDNsQHPWjv
aoK1dBLVH4ZjoZoOTI/aYcRgd5IJYo1P6I9tUrolM/+N3adM4UTEVC7i2PYDOaL3
WUm/YT2aSLiyHaHQON7SMyGSVU8kfM9YvJAGbj7ohalO9A2VVtHfUAmcAtBdgWqv
Dl/Uu6vbogOl19xztAwN4nvwqljA+GUMnbHJ/oeASzrMzYMOdQ0q3UsQbEt+pTXt
rBzv8fCsrKsT2uBc59Bi3eFKeBMM6ERzux/40TlqcOnXf3KUCK7nM4VaRgPbvXdt
GOOYfYs+j9L8SSEedvdKyYNq4vVwWgYfTRAKMNB0FPiOaTGZuUthqkgRZGYY8AA9
+lJWxa+mzPmWEOmL+E44kt0OwtKDHX72ccEJUD7PHhTp0g87yKZfS6mXRNYLSxm7
IYt7N1x3cOp0lrwUTvLDnSPOTuYOSEiB2JZtfkf+y3SuI5SWowIcudKOuO5p7G1r
IpAROsZrpHzMf/eniINoX3IrqBSqr254jzwq+9IgUaw/ky76oPYqM1dWP9BnVxCg
PXgvfT5zj6xrU43TxTeIPU92JoAqhMeXi6dcyoiAAf9+8Vih+sbmLzAdJbYb5F2v
G0ISy31+x/Goi43fQS59HzS/MNXJplcmy2mxKUYBT7/ZoJ2A26Q8SukTWD+U8sDn
XIrV4HEOUQ==
=PUw1
-----END PGP SIGNATURE-----
Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block
Pull io_uring 'more data in socket' support from Jens Axboe:
"To be able to fully utilize the 'poll first' support in the core
io_uring branch, it's advantageous knowing if the socket was empty
after a receive. This adds support for that"
* tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block:
io_uring: return hint on whether more data is available after receive
tcp: pass back data left in socket after receive
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKorgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm0eEACdTzhm7h5cXn9KjIvWLkdocAb/NOL8GYPn
Q1mY1SqKQFZvs/fyKHkkZEiIBPxhvN6snVFXMpb4LDmPYeeH4GTUlNomrGTIjvf/
j6SnZN4lCs9A2NlE+iDVWnFQOPQFALza2Y9BhC5xzay326qnKlO+0fQv3C1vXXrc
/PNLqxQr7+GmO0a0PJnS6mGWGj6qF7nLqilB9apnKsTK6BKbJEec6ciKreqxU6ME
WHaux11uIAbcf8rc6C/2myEK0k6jCOAue3vZ0lizygf+8klUCl2vMqV5BLwCBlXG
/e7hBsUUrGr0CG0fryqhQQTUxsZLshioBbQH1vttSeZCli46mmWWAhPNy3/jb1ZU
72bazA84Fe9ney9uVZvZoMoBsG+6t6UOatqND13MeRFAXnkRr0jZRuau2iBxgqAr
OINJW+IVPU7IrCD+S4lV1/LCdhLhYcob8/zfKmIrdHMQnWG/gLonVpYJIBCyLDAv
2jvHFIPJuSMUSGVjRKCb16LLNV6u7YG6VOWbKuippxfJxDdwA3TOtOhvTJIpYq0u
TotPgpZ7bfcr4xDsGgD9mZS8E7jwsL/G0/MwsnixELykEXuhd++sgoTbr+RyUYdV
45Hm6DsxlytjzOb/5uQrqhwrso05eVt14K74XApPa3fWKL8aWCh1jGSdo3CSbIyW
iHwss919Ag==
=nb5i
-----END PGP SIGNATURE-----
Merge tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block
Pull io_uring socket() support from Jens Axboe:
"This adds support for socket(2) for io_uring. This is handy when using
direct / registered file descriptors with io_uring.
Outside of those two patches, a small series from Dylan on top that
improves the tracing by providing a text representation of the opcode
rather than needing to decode this by reading the header file every
time.
That sits in this branch as it was the last opcode added (until it
wasn't...)"
* tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block:
io_uring: use the text representation of ops in trace
io_uring: rename op -> opcode
io_uring: add io_uring_get_opcode
io_uring: add type to op enum
io_uring: add socket(2) support
net: add __sys_socket_file()
This instance is the only opencoded version of the macro, so have it
follow suit.
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
I noticed CPU pipeline stalls while using perf.
Once an svc thread is scheduled and executing an RPC, no other
processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are
not needed outside the svc thread scheduler.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If found, register the DSA internally allocated slave_mii_bus with an OF
"mdio" child object. It can save some drivers from creating their
custom internal MDIO bus.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the process of checking whether RDMAv2 is available, the current
implementation first sets ini->smcrv2.ib_dev_v2, and then allocates
smc buf desc, but the latter may fail. Unfortunately, the caller
will only check the former. In this case, a NULL pointer reference
will occur in smc_clc_send_confirm_accept() when accessing
conn->rmb_desc.
This patch does two things:
1. Use the return code to determine whether V2 is available.
2. If the return code is NODEV, continue to check whether V1 is
available.
Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same trigger condition as commit 86434744. When setsockopt runs
in parallel to a connect(), and switch the socket into fallback
mode. Then the sk_refcnt is incremented in smc_connect(), but
its state stay in SMC_INIT (NOT SMC_ACTIVE). This cause the
corresponding sk_refcnt decrement in __smc_release() will not be
performed.
Fixes: 86434744fe ("net/smc: add fallback check to connect()")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most protocol-specific pointers in struct net_device are under
a respective ifdef. Wireless is the notable exception. Since
there's a sizable number of custom-built kernels for datacenter
workloads which don't build wireless it seems reasonable to
ifdefy those pointers as well.
While at it move IPv4 and IPv6 pointers up, those are special
for obvious reasons.
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the decision on when to generate an IDLE ACK by keeping a count of the
number of packets we've received, but not yet soft-ACK'd, and the number of
packets we've processed, but not yet hard-ACK'd, rather than trying to keep
track of which DATA sequence numbers correspond to those points.
We then generate an ACK when either counter exceeds 2. The counters are
both cleared when we transcribe the information into any sort of ACK packet
for transmission. IDLE and DELAY ACKs are skipped if both counters are 0
(ie. no change).
Fixes: 805b21b929 ("rxrpc: Send an ACK after every few DATA packets we receive")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The previousPacket field in the rx ACK packet should never go backwards -
it's now the highest DATA sequence number received, not the last on
received (it used to be used for out of sequence detection).
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix accidental overlapping of Rx-phase ACK accounting with Tx-phase ACK
accounting through variables shared between the two. call->acks_* members
refer to ACKs received in the Tx phase and call->ackr_* members to ACKs
sent/to be sent during the Rx phase.
Fixes: 1a2391c30c ("rxrpc: Fix detection of out of order acks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
rxrpc has a timer to trigger resending of unacked data packets in a call.
This is not cancelled when a client call switches to the receive phase on
the basis that most calls don't last long enough for it to ever expire.
However, if it *does* expire after we've started to receive the reply, we
shouldn't then go into trying to retransmit or pinging the server to find
out if an ack got lost.
Fix this by skipping the resend code if we're into receiving the reply to a
client call.
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump
up the sysctl), but whilst the preallocation circular buffers have 32 slots
in them, one of them has to be a dead slot because we're using CIRC_CNT().
This means that listen(rxrpc_sock, 32) will cause an oops when the socket
is closed because rxrpc_service_prealloc_one() allocated one too many calls
and rxrpc_discard_prealloc() won't then be able to get rid of them because
it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries
to abort them, but oopses because call->peer isn't yet set.
Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match
the ring capacity.
BUG: kernel NULL pointer dereference, address: 0000000000000086
...
RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc]
Call Trace:
<TASK>
? __wake_up_common_lock+0x7a/0x90
? rxrpc_notify_socket+0x8e/0x140 [rxrpc]
? rxrpc_abort_call+0x4c/0x60 [rxrpc]
rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc]
rxrpc_release+0xc9/0x1c0 [rxrpc]
__sock_release+0x37/0xa0
sock_close+0x11/0x20
__fput+0x89/0x240
task_work_run+0x59/0x90
do_exit+0x319/0xaa0
Fixes: 00e907127e ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html
Signed-off-by: David S. Miller <davem@davemloft.net>
If a client's address changes, say if it is NAT'd, this can disrupt an in
progress operation. For most operations, this is not much of a problem,
but StoreData can be different as some servers modify the target file as
the data comes in, so if a store request is disrupted, the file can get
corrupted on the server.
The problem is that the server doesn't recognise packets that come after
the change of address as belonging to the original client and will bounce
them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if
the packet number falls within the initial sequence number window of a call
or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting
it. In both cases, firstPacket will be 1 and previousPacket will be 0 in
the ACK information.
Fix this by the following means:
(1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1
and previousPacket as 0, assume this indicates that the server saw the
incoming packets from a different peer and thus as a different call.
Fail the call with error -ENETRESET.
(2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the
first packet has been hard-ACK'd. If it hasn't been hard-ACK'd, the
ACK packet will cause it to get retransmitted, so the call will just
be repeated.
(3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of
the operation.
(4) Prioritise the error code over things like -ECONNRESET as the server
did actually respond.
(5) Make writeback treat -ENETRESET as a retryable error and make it
redirty all the pages involved in a write so that the VM will retry.
Note that there is still a circumstance that I can't easily deal with: if
the operation is fully received and processed by the server, but the reply
is lost due to address change. There's no way to know if the op happened.
We can examine the server, but a conflicting change could have been made by
a third party - and we can't tell the difference. In such a case, a
message like:
kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a)
will be logged to dmesg on the next op to touch the file and the client
will reset the inode state, including invalidating clean parts of the
pagecache.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1
Signed-off-by: David S. Miller <davem@davemloft.net>
The RX_USER_ABORT code should really only be used to indicate that the user
of the rxrpc service (ie. userspace) implicitly caused a call to be aborted
- for instance if the AF_RXRPC socket is closed whilst the call was in
progress. (The user may also explicitly abort a call and specify the abort
code to use).
Change some of the points of generation to use other abort codes instead:
(1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see
ENOMEM and EFAULT during received data delivery and abort with
RX_CALL_DEAD in the default case.
(2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a
reply.
(3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had
heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't.
(4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not
completed successfully or been aborted.
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
If at the end of rxrpc sendmsg() or rxrpc_kernel_send_data() the call that
was being given data was aborted remotely or otherwise failed, return an
error rather than returning the amount of data buffered for transmission.
The call (presumably) did not complete, so there's not much point
continuing with it. AF_RXRPC considers it "complete" and so will be
unwilling to do anything else with it - and won't send a notification for
it, deeming the return from sendmsg sufficient.
Not returning an error causes afs to incorrectly handle a StoreData
operation that gets interrupted by a change of address due to NAT
reconfiguration.
This doesn't normally affect most operations since their request parameters
tend to fit into a single UDP packet and afs_make_call() returns before the
server responds; StoreData is different as it involves transmission of a
lot of data.
This can be triggered on a client by doing something like:
dd if=/dev/zero of=/afs/example.com/foo bs=1M count=512
at one prompt, and then changing the network address at another prompt,
e.g.:
ifconfig enp6s0 inet 192.168.6.2 && route add 192.168.6.1 dev enp6s0
Tracing packets on an Auristor fileserver looks something like:
192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001
192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538)
192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538)
192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001
<ARP exchange for 192.168.6.2>
192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0)
192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0)
192.168.6.1 -> 192.168.6.2 RX 107 ACK Exceeds Window Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001
192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001
192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 29321 Call: 4 Source Port: 7000 Destination Port: 7001
The Auristor fileserver logs code -453 (RXGEN_SS_UNMARSHAL), but the abort
code received by kafs is -5 (RX_PROTOCOL_ERROR) as the rx layer sees the
condition and generates an abort first and the unmarshal error is a
consequence of that at the application layer.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004810.html # v1
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a locking issue with the per-netns list of calls in rxrpc. The
pieces of code that add and remove a call from the list use write_lock()
and the calls procfile uses read_lock() to access it. However, the timer
callback function may trigger a removal by trying to queue a call for
processing and finding that it's already queued - at which point it has a
spare refcount that it has to do something with. Unfortunately, if it puts
the call and this reduces the refcount to 0, the call will be removed from
the list. Unfortunately, since the _bh variants of the locking functions
aren't used, this can deadlock.
================================
WARNING: inconsistent lock state
5.18.0-rc3-build4+ #10 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b
{SOFTIRQ-ON-W} state was registered at:
...
Possible unsafe locking scenario:
CPU0
----
lock(&rxnet->call_lock);
<Interrupt>
lock(&rxnet->call_lock);
*** DEADLOCK ***
1 lock held by ksoftirqd/2/25:
#0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d
Changes
=======
ver #2)
- Changed to using list_next_rcu() rather than rcu_dereference() directly.
Fixes: 17926a7932 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Move to using refcount_t rather than atomic_t for refcounts in rxrpc.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the list of in-use local UDP endpoints in the current network
namespace to be viewed in /proc.
To aid with this, the endpoint list is converted to an hlist and RCU-safe
manipulation is used so that the list can be read with only the RCU
read lock held.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have one tcp bind table (bhash) which hashes by port
number only. In the socket bind path, we check for bind conflicts by
traversing the specified port's inet_bind2_bucket while holding the
bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
In instances where there are tons of sockets hashed to the same port
at different addresses, checking for a bind conflict is time-intensive
and can cause softirq cpu lockups, as well as stops new tcp connections
since __inet_inherit_port() also contests for the spinlock.
This patch proposes adding a second bind table, bhash2, that hashes by
port and ip address. Searching the bhash2 table leads to significantly
faster conflict resolution and less time holding the spinlock.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When user sets skb_defer_max to 1 the kick threshold is 0
(half of 1). If we increment queue length before the check
the kick will never happen, and the skb may get stranded.
This is likely harmless but can be avoided by moving the
increment after the check. This way skb_defer_max == 1
will always kick. Still a silly config to have, but
somehow that feels more correct.
While at it drop a comment which seems to be outdated
or confusing, and wrap the defer_count write with
a WRITE_ONCE() since it's read on the fast path
that avoids taking the lock.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220518185522.2038683-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch implements a new struct bpf_func_proto, named
bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP,
and a new helper bpf_skc_to_mptcp_sock(), which invokes another new
helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct
mptcp_sock from a given subflow socket.
v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c,
remove build check for CONFIG_BPF_JIT
v5: Drop EXPORT_SYMBOL (Martin)
Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmKHsGETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6Q8B/97dkJamfa0rfcenW8qnb6Rx2DI6QmE
vEV2et8Qvrjxr9s10ylTaiH7veYG5Cgb986ufDN1Af52uDx1VdW7TOz4cD7Umx8G
QsjzviREL3VfN7Ag3WY0SsI5cjQ/iRJfjMJx/fB4G5bMkor1ouH32sQNtmcVLS6D
HHQZqVL7xP0ORV0lFvBns5EVUCsLHAKjoPGiLprmm7lwlhOo3e60WHBbBHTD9Isc
SrO8Gz5QiHYyVS6eksgYOZj0Tg5qLFKtKWXXxb1nyF8fLHcQU0S/zicf4AQKDj7i
5HOagl3S3Gmu+0g/wnWF9YyG3yoTVgfEfZ38XAh1rlwJJOkb1rbeFKAb
=tZDh
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a nasty use-after-free, marked for stable"
* tag 'ceph-for-5.18-rc8' of https://github.com/ceph/ceph-client:
libceph: fix misleading ceph_osdc_cancel_request() comment
libceph: fix potential use-after-free on linger ping and resends
Looks like the IPv6 version of the patch under Fixes was
a copy/paste of the IPv4 but hit the wrong spot.
It is tcp_v6_rcv() which uses drop_reason as a boolean, and
needs to be protected against reason == 0 before calling free.
tcp_v6_do_rcv() has a pretty straightforward flow.
The resulting warning looks like this:
WARNING: CPU: 1 PID: 0 at net/core/skbuff.c:775
Call Trace:
tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1767)
ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:438)
ip6_input_finish (include/linux/rcupdate.h:726)
ip6_input (include/linux/netfilter.h:307)
Fixes: f8319dfd1b ("net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20220520021347.2270207-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, misc
updates and fallout fixes from recent Florian's code rewritting (from
last pull request):
1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi.
2) Avoid unnecessary GC with a timestamp in conncount, from William Tu
and Yifeng Sun.
3) Remove TCP conntrack debugging, from Florian Westphal.
4) Fix compilation warning in ctnetlink, from Florian.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"
netfilter: conntrack: remove pr_debug callsites from tcp tracker
netfilter: nf_conncount: reduce unnecessary GC
netfilter: Use l3mdev flow key when re-routing mangled packets
====================
Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MPTCP socket's conn_list (list of subflows) requires the socket lock
to access. The MP_FAIL timeout code added such an access, where it would
check the list of subflows both in timer context and (later) in workqueue
context where the socket lock is held.
Rather than check the list twice, remove the check in the timeout
handler and only depend on the check in the workqueue. Also remove the
MPTCP_FAIL_NO_RESPONSE flag, since mptcp_mp_fail_no_response() has
insignificant overhead and can be checked on each worker run.
Fixes: 49fa1919d6 ("mptcp: reset subflow when MP_FAIL doesn't respond")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MP_FAIL timeout (waiting for a peer to respond to an MP_FAIL with
another MP_FAIL) is implemented using the MPTCP socket's sk_timer. That
timer is also used at MPTCP socket close, so it's important to not have
the two timer users interfere with each other.
At MPTCP socket close, all subflows are orphaned before sk_timer is
manipulated. By checking the SOCK_DEAD flag on the subflows, each
subflow can determine if the timer is safe to alter without acquiring
any MPTCP-level lock. This replaces code that was using the
mptcp_data_lock and MPTCP-level socket state checks that did not
correctly protect the timer.
Fixes: 49fa1919d6 ("mptcp: reset subflow when MP_FAIL doesn't respond")
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mentioned helper requires the msk socket lock, and the
current callers don't own it nor can't acquire it, so the
access is racy.
All the current callers are really checking for infinite mapping
fallback, and the latter condition is explicitly tracked by
the relevant msk variable: we can safely remove the caller usage
- and the caller itself.
The issue is present since MP_FAIL implementation, but the
fix only applies since the infinite fallback support, ence the
somewhat unexpected fixes tag.
Fixes: 0530020a7c ("mptcp: track and update contiguous data status")
Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch improves TCP PRR loss recovery behavior for a corner
case. Previously during PRR conservation-bound mode, it strictly
sends the amount equals to the amount newly acked or s/acked.
The patch changes s.t. PRR may send additional amount that was banked
previously (e.g. application-limited) in the conservation-bound
mode, similar to the slow-start mode. This unifies and simplifies the
algorithm further and may improve the recovery latency. This change
still follow the general packet conservation design principle and
always keep inflight/cwnd below the slow start threshold set
by the congestion control module.
PRR is described in RFC 6937. We'll include this change in the
latest revision rfc6937-bis as well.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220519003410.2531936-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Artem points out that skb may try to take over the skb and
queue it to its own list. Unlink the skb before calling out.
Fixes: b1a2c17863 ("tls: rx: clear ctx->recv_pkt earlier")
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmKGpZwTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCtfkuQ2KDTXR4SB/4nr48LlHedhSR57kOl1Zm7gKhu60nK
EoNPGLODMsaWv4QdYXzU6HP5B8gUUb36DUlKHv7pH642UonPgfXudxpRKGuam35d
zxkapZbcs/A5LfF7yrFZYdBm+wUiepGHDmauhy3BJ2cR/bxkw7fL+yea0HCLWnjQ
qdt5KXGhnnWzvojKJsX6a4/tZpClruRbv1czrGYf9DjqSC+2Zf7t1tLxO0wk/zHi
kRQR2qpXR6YiBqwb/f64qXM0DB3KEQk7CAN8Z3V+4RX5f112EGFjaHVGufpMGIOh
dS+4u5v/kOChnR/91RqqR5o0fuw8XSHyb2PAcI2eLs/2e5xMUoPhIjfQ
=69VA
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.19-20220519' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2022-05-19
Oliver Hartkopp contributes a patch for the ISO-TP CAN protocol to
update the validation of address information during bind.
The next patch is by Jakub Kicinski and converts the CAN network
drivers from netif_napi_add() to the netif_napi_add_weight() function.
Another patch by Oliver Hartkopp removes obsolete CAN specific LED
support.
Vincent Mailhol's patch for the mcp251xfd driver fixes a
-Wunaligned-access warning by clang-14.
* tag 'linux-can-next-for-5.19-20220519' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: mcp251xfd: silence clang's -Wunaligned-access warning
can: can-dev: remove obsolete CAN LED support
can: can-dev: move to netif_napi_add_weight()
can: isotp: isotp_bind(): do not validate unused address information
====================
Link: https://lore.kernel.org/r/20220519202308.1435903-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With commit 2aa39889c4 ("can: isotp: isotp_bind(): return -EINVAL on
incorrect CAN ID formatting") the bind() syscall returns -EINVAL when
the given CAN ID needed to be sanitized. But in the case of an unconfirmed
broadcast mode the rx CAN ID is not needed and may be uninitialized from
the caller - which is ok.
This patch makes sure the result of an inproper CAN ID format is only
provided when the address information is needed.
Link: https://lore.kernel.org/all/20220517145653.2556-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Second set of patches for v5.19 and most likely the last one. rtw89
got support for 8852ce devices and mt76 now supports Wireless Ethernet
Dispatch.
Major changes:
cfg80211/mac80211
* support disabling EHT mode
rtw89
* add support for Realtek 8852ce devices
mt76
* Wireless Ethernet Dispatch support for flow offload
* non-standard VHT MCS10-11 support
* mt7921 AP mode support
* mt7921 ipv6 NS offload support
ath11k
* enable keepalive during WoWLAN suspend
* implement remain-on-channel support
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmKGYt4RHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZuK2gf/ZswLtwE2CIwrEhz/Q0MDtxUvw8ulRhKl
d+1PC+bCd/VArMESjpu7le+WNAZ1OPBWdh1pgkDm8QpCQZYe7/hRED82DB/Jw3Cl
KmOx2nr6Xb4uEN+yjqZrSXzA+Hrysy24bCQRG2CJKjdToe/fwTuRiz8WIcPKtxio
b/d/Kz0LpSoHTlU1PzqIsXulN8QUKJA4kRw70rJHAlMJVYiTBuAD+AmXfbhHD8uX
t2CJDH2fykDd1CAWFQwcmI++2tS+xclYL81vDg3aEinQJ9aNcDz06qSE5qr2H+K5
lUYy42yc+ONkIIh8LlxrLgZie7oSmkrb7aA0Zc+F0SWp/B6ZO/k8FA==
=aILH
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v5.19
Second set of patches for v5.19 and most likely the last one. rtw89
got support for 8852ce devices and mt76 now supports Wireless Ethernet
Dispatch.
Major changes:
cfg80211/mac80211
- support disabling EHT mode
rtw89
- add support for Realtek 8852ce devices
mt76
- Wireless Ethernet Dispatch support for flow offload
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 ipv6 NS offload support
ath11k
- enable keepalive during WoWLAN suspend
- implement remain-on-channel support
* tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (135 commits)
iwlwifi: mei: fix potential NULL-ptr deref
iwlwifi: mei: clear the sap data header before sending
iwlwifi: mvm: remove vif_count
iwlwifi: mvm: always tell the firmware to accept MCAST frames in BSS
iwlwifi: mvm: add OTP info in case of init failure
iwlwifi: mvm: fix assert 1F04 upon reconfig
iwlwifi: fw: init SAR GEO table only if data is present
iwlwifi: mvm: clean up authorized condition
iwlwifi: mvm: use NULL instead of ERR_PTR when parsing wowlan status
iwlwifi: pcie: simplify MSI-X cause mapping
rtw89: pci: only mask out INT indicator register for disable interrupt v1
rtw89: convert rtw89_band to nl80211_band precisely
rtw89: 8852c: update txpwr tables to HALRF_027_00_052
rtw89: cfo: check mac_id to avoid out-of-bounds
rtw89: 8852c: set TX antenna path
rtw89: add ieee80211::sta_rc_update ops
wireless: Fix Makefile to be in alphabetical order
mac80211: refactor freeing the next_beacon
cfg80211: fix kernel-doc for cfg80211_beacon_data
mac80211: minstrel_ht: support ieee80211_rate_status
...
====================
Link: https://lore.kernel.org/r/20220519153334.8D051C385AA@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds helpers for accessing and appending service data (0x16) ad
type.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Clean up: There is one caller. The @cpu argument can be made
implicit now that a get_cpu/put_cpu pair is no longer needed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_xprt_enqueue() disables preemption via get_cpu() and then asks
for a pool of a specific CPU (current) via svc_pool_for_cpu().
While preemption is disabled, svc_xprt_enqueue() acquires
svc_pool::sp_lock with bottom-halfs disabled, which can sleep on
PREEMPT_RT.
Disabling preemption is not required here. The pool is protected with a
lock so the following list access is safe even cross-CPU. The following
iteration through svc_pool::sp_all_threads is under RCU-readlock and
remaining operations within the loop are atomic and do not rely on
disabled-preemption.
Use raw_smp_processor_id() as the argument for the requested CPU in
svc_pool_for_cpu().
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cache deferral injection stress-tests the cache deferral logic as
well as upper layer protocol deferred request handlers. This
facility is for developers and professional testers to ensure
coverage of the rqst deferral code paths. To date, we haven't
had an adequate way to ensure these code paths are covered
during testing, short of temporary code changes to force their
use.
A file called /sys/kernel/debug/fail_sunrpc/ignore-cache-wait
enables administrators to disable cache deferral injection while
allowing other types of sunrpc errors to be injected. The default
setting is that cache deferral injection is enabled (ignore=false).
To enable support for cache deferral injection,
CONFIG_FAULT_INJECTION, CONFIG_FAULT_INJECTION_DEBUG_FS, and
CONFIG_SUNRPC_DEBUG must all be set to "Y".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
netfilter subtrees.
Notably this reverts a recent TCP/DCCP netns-related change
to address a possible UaF.
Current release - regressions:
- tcp: revert "tcp/dccp: get rid of inet_twsk_purge()"
- xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown
Previous releases - regressions:
- netfilter: flowtable: fix TCP flow teardown
- can: revert "can: m_can: pci: use custom bit timings for Elkhart Lake"
- xfrm: check encryption module availability consistency
- eth: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf()
- eth: mlx5: initialize flow steering during driver probe
- eth: ice: fix crash when writing timestamp on RX rings
Previous releases - always broken:
- mptcp: fix checksum byte order
- eth: lan966x: fix assignment of the MAC address
- eth: mlx5: remove HW-GRO from reported features
- eth: ftgmac100: disable hardware checksum on AST2600
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmKGAYYSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkrt8P/2GyYNQT7q0h3Plsxc/m1tIUCPiERROE
zIU0R2QVc64xpkMISeVb3YYpa3eqhtQsNWgt7Xsr1NRXBmyx60dvGpS81w8Gnxuo
ruA7SxnH6OA0usviiYPmeGP9emvCEkO5YRW5kxl1Cpum19yNxjfZKJ6ARk0IDp/D
C1S91PYtF9s25Yytrlpv9lVVBvTHQxg2EQocZHxO+7/j2O8jJP/NAYltpVaRNC2W
gLcOWTAujrjAfpdsBhJsWXv4dTCQOAgnIXYP9P1JdFMAZtkXoYQUjaXP7dsaAXHw
iE9FBRkqDKVhj94CxR6VPOSo0kVvOuBfkc1eJeZ74lvahkHBq4EyiVCo6/JhNQTd
/bi/mTeUlI9yYyu/j9lMDy4CwOuiB69Dl4vNR/G5C1rF7l1vQkZr50pnD96MePwu
9fR5+ipZsDhj5c77OMiraqnnOyWXVtD2YCZCCw80a9/aWG4zxcIDtnNQIfqAACvx
0wNgG2bPSKRablytep1Qs84Vvupaa1cC2eTBbA+6LzQqk3CR9/YMUSD6MXitxQyD
RJYbm5QMqdW2QH8zE21E+8wzIPeN9m66lJFppuntuB+I/CHWAnP/CmdbWysR3FQ+
5ZisPh4PUqb1VIzGKUbym/D9FB20Vc8zq6oQa8LqiIOODUrxQMg3F2O43OWsYsn3
TDNCwo5BQ/Z8
=C848
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, xfrm and netfilter subtrees.
Notably this reverts a recent TCP/DCCP netns-related change to address
a possible UaF.
Current release - regressions:
- tcp: revert "tcp/dccp: get rid of inet_twsk_purge()"
- xfrm: set dst dev to blackhole_netdev instead of loopback_dev in
ifdown
Previous releases - regressions:
- netfilter: flowtable: fix TCP flow teardown
- can: revert "can: m_can: pci: use custom bit timings for Elkhart
Lake"
- xfrm: check encryption module availability consistency
- eth: vmxnet3: fix possible use-after-free bugs in
vmxnet3_rq_alloc_rx_buf()
- eth: mlx5: initialize flow steering during driver probe
- eth: ice: fix crash when writing timestamp on RX rings
Previous releases - always broken:
- mptcp: fix checksum byte order
- eth: lan966x: fix assignment of the MAC address
- eth: mlx5: remove HW-GRO from reported features
- eth: ftgmac100: disable hardware checksum on AST2600"
* tag 'net-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
ptp: ocp: change sysfs attr group handling
selftests: forwarding: fix missing backslash
netfilter: nf_tables: disable expression reduction infra
netfilter: flowtable: move dst_check to packet path
netfilter: flowtable: fix TCP flow teardown
net: ftgmac100: Disable hardware checksum on AST2600
igb: skip phy status check where unavailable
nfc: pn533: Fix buggy cleanup order
mptcp: Do TCP fallback on early DSS checksum failure
mptcp: fix checksum byte order
net: af_key: check encryption module availability consistency
net: af_key: add check for pfkey_broadcast in function pfkey_process
net/mlx5: Drain fw_reset when removing device
net/mlx5e: CT: Fix setting flow_source for smfs ct tuples
net/mlx5e: CT: Fix support for GRE tuples
net/mlx5e: Remove HW-GRO from reported features
net/mlx5e: Properly block HW GRO when XDP is enabled
net/mlx5e: Properly block LRO when XDP is enabled
net/mlx5e: Block rx-gro-hw feature in switchdev mode
...
TLS device offload copies sendfile data to a bounce buffer before
transmitting. It allows to maintain the valid MAC on TLS records when
the file contents change and a part of TLS record has to be
retransmitted on TCP level.
In many common use cases (like serving static files over HTTPS) the file
contents are not changed on the fly. In many use cases breaking the
connection is totally acceptable if the file is changed during
transmission, because it would be received corrupted in any case.
This commit allows to optimize performance for such use cases to
providing a new optional mode of TLS sendfile(), in which the extra copy
is skipped. Removing this copy improves performance significantly, as
TLS and TCP sendfile perform the same operations, and the only overhead
is TLS header/trailer insertion.
The new mode can only be enabled with the new socket option named
TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards
compatibility with existing applications that rely on the copying
behavior.
The new mode is safe, meaning that unsolicited modifications of the file
being sent can't break integrity of the kernel. The worst thing that can
happen is sending a corrupted TLS record, which is in any case not
forbidden when using regular TCP sockets.
Sockets other than TLS device offload are not affected by the new socket
option. The actual status of zerocopy sendfile can be queried with
sock_diag.
Performance numbers in a single-core test with 24 HTTPS streams on
nginx, under 100% CPU load:
* non-zerocopy: 33.6 Gbit/s
* zerocopy: 79.92 Gbit/s
CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:
br1
/ \
/ \
/ \
br0.11 wlan0
|
br0
/ | \
p1 p2 p3
br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has a
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.
A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating that the switch has
dealt with forwarding the frame out ports p2 and p3 as needed. This
flag instructs the software bridge it does not need to pass the frame
back down again. However, the flag is not getting reset when the frame
is passed upwards. As a result br1 sees the flag, wrongly interprets
it, and fails to forward the frame to wlan0.
When passing a frame upwards, clear the flag. This is the Rx
equivalent of br_switchdev_frame_unmark() in br_dev_xmit().
Fixes: f1c2eddf4c ("bridge: switchdev: Use an helper to clear forward mark")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Reduce number of hardware offload retries from flowtable datapath
which might hog system with retries, from Felix Fietkau.
2) Skip neighbour lookup for PPPoE device, fill_forward_path() already
provides this and set on destination address from fill_forward_path for
PPPoE device, also from Felix.
4) When combining PPPoE on top of a VLAN device, set info->outdev to the
PPPoE device so software offload works, from Felix.
5) Fix TCP teardown flowtable state, races with conntrack gc might result
in resetting the state to ESTABLISHED and the time to one day. Joint
work with Oz Shlomo and Sven Auhagen.
6) Call dst_check() from flowtable datapath to check if dst is stale
instead of doing it from garbage collector path.
7) Disable register tracking infrastructure, either user-space or
kernel need to pre-fetch keys inconditionally, otherwise register
tracking assumes data is already available in register that might
not well be there, leading to incorrect reductions.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: disable expression reduction infra
netfilter: flowtable: move dst_check to packet path
netfilter: flowtable: fix TCP flow teardown
netfilter: nft_flow_offload: fix offload with pppoe + vlan
net: fix dev_fill_forward_path with pppoe + bridge
netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices
netfilter: flowtable: fix excessive hw offload attempts after failure
====================
Link: https://lore.kernel.org/r/20220518213841.359653-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cancel_request() never guaranteed that after its return the OSD
client would be completely done with the OSD request. The callback
(if specified) can still be invoked and a ref can still be held.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
request_reinit() is not only ugly as the comment rightfully suggests,
but also unsafe. Even though it is called with osdc->lock held for
write in all cases, resetting the OSD request refcount can still race
with handle_reply() and result in use-after-free. Taking linger ping
as an example:
handle_timeout thread handle_reply thread
down_read(&osdc->lock)
req = lookup_request(...)
...
finish_request(req) # unregisters
up_read(&osdc->lock)
__complete_request(req)
linger_ping_cb(req)
# req->r_kref == 2 because handle_reply still holds its ref
down_write(&osdc->lock)
send_linger_ping(lreq)
req = lreq->ping_req # same req
# cancel_linger_request is NOT
# called - handle_reply already
# unregistered
request_reinit(req)
WARN_ON(req->r_kref != 1) # fires
request_init(req)
kref_init(req->r_kref)
# req->r_kref == 1 after kref_init
ceph_osdc_put_request(req)
kref_put(req->r_kref)
# req->r_kref == 0 after kref_put, req is freed
<further req initialization/use> !!!
This happens because send_linger_ping() always (re)uses the same OSD
request for watch ping requests, relying on cancel_linger_request() to
unregister it from the OSD client and rip its messages out from the
messenger. send_linger() does the same for watch/notify registration
and watch reconnect requests. Unfortunately cancel_request() doesn't
guarantee that after it returns the OSD client would be completely done
with the OSD request -- a ref could still be held and the callback (if
specified) could still be invoked too.
The original motivation for request_reinit() was inability to deal with
allocation failures in send_linger() and send_linger_ping(). Switching
to using osdc->req_mempool (currently only used by CephFS) respects that
and allows us to get rid of request_reinit().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Either userspace or kernelspace need to pre-fetch keys inconditionally
before comparisons for this to work. Otherwise, register tracking data
is misleading and it might result in reducing expressions which are not
yet registers.
First expression is also guaranteed to be evaluated always, however,
certain expressions break before writing data to registers, before
comparing the data, leaving the register in undetermined state.
This patch disables this infrastructure by now.
Fixes: b2d306542f ("netfilter: nf_tables: do not reduce read-only expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes sporadic IPv6 packet loss when flow offloading is enabled.
IPv6 route GC and flowtable GC are not synchronized.
When dst_cache becomes stale and a packet passes through the flow before
the flowtable GC teardowns it, the packet can be dropped.
So, it is necessary to check dst every time in packet path.
Fixes: 227e1e4d0d ("netfilter: nf_flowtable: skip device lookup from interface index")
Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses three possible problems:
1. ct gc may race to undo the timeout adjustment of the packet path, leaving
the conntrack entry in place with the internal offload timeout (one day).
2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
timeout is reached before the flow offload del.
3. tcp ct is always set to ESTABLISHED with a very long timeout
in flow offload teardown/delete even though the state might be already
CLOSED. Also as a remark we cannot assume that the FIN or RST packet
is hitting flow table teardown as the packet might get bumped to the
slow path in nftables.
This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.
Moreover, teturn the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchonrnously free the flow table and
hardware offload entries.
Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on
which is also misleading since the flow is back to classic conntrack
path.
If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().
This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.
flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow
path might have already update the state to CLOSE/FIN.
Joint work with Oz and Sven.
Fixes: 1e5b2471bc ("netfilter: nf_flow_table: teardown flow timeout race")
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.
The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.
Stop👏making👏bespoke👏random👏number👏generators👏.
Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.
The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.
There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.
If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).
This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.
Fixes: dd8bcd1768 ("mptcp: validate the data checksum")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.
As a result, the wire encoding for little endian host is
wrong, and that causes interoperabilty interoperability
issues with other implementation or host with different endianness.
Address the issue writing in the packet the unmodified __sum16 value.
MPTCP checksum is disabled by default, interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275
Fixes: c5b39e26d0 ("mptcp: send out checksum for DSS")
Fixes: 390b95a5fb ("mptcp: receive checksum for DSS")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2022-05-18
1) Fix "disable_policy" flag use when arriving from different devices.
From Eyal Birger.
2) Fix error handling of pfkey_broadcast in function pfkey_process.
From Jiasheng Jiang.
3) Check the encryption module availability consistency in pfkey.
From Thomas Bartschies.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel
produces invalid pfkey acquire messages, when these encryption modules are disabled. This
happens because the availability of the algos wasn't checked in all necessary functions.
This patch adds these checks.
Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If skb_clone() returns null pointer, pfkey_broadcast() will
return error.
Therefore, it should be better to check the return value of
pfkey_broadcast() and return error if fails.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After merging the net-next tree, today's linux-next build (powerpc
ppc64_defconfig) produced this warning:
nf_conntrack_netlink.c:1717 warning: 'ctnetlink_dump_one_entry' defined but not used
Fixes: 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
There are sleep in atomic context bugs when the request to secure
element of st-nci is timeout. The root cause is that nci_skb_alloc
with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is
a timer handler. The call paths that could trigger bugs are shown below:
(interrupt context 1)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_skb_alloc(..., GFP_KERNEL) //may sleep
(interrupt context 2)
st_nci_se_wt_timeout
nci_hci_send_event
nci_hci_send_data
nci_send_data
nci_queue_tx_data_frags
nci_skb_alloc(..., GFP_KERNEL) //may sleep
This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to
GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: ed06aeefda ("nfc: st-nci: Rename st21nfcb to st-nci")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220517012530.75714-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.2 us 10.6 us (-0.6 us)
2B 11.2 us 10.7 us (-0.5 us)
4B 11.3 us 10.7 us (-0.6 us)
8B 11.2 us 10.6 us (-0.6 us)
16B 11.3 us 10.7 us (-0.6 us)
32B 11.3 us 10.6 us (-0.7 us)
64B 11.2 us 11.2 us (0 us)
128B 11.2 us 11.2 us (0 us)
256B 11.2 us 11.2 us (0 us)
512B 11.4 us 11.3 us (-0.1 us)
1KB 11.4 us 11.5 us (0.1 us)
2KB 11.5 us 11.5 us (0 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.9 us 11.2 us (-0.7 us)
2B 11.7 us 11.2 us (-0.5 us)
4B 11.7 us 11.3 us (-0.4 us)
8B 11.6 us 11.2 us (-0.4 us)
16B 11.7 us 11.3 us (-0.4 us)
32B 11.7 us 11.3 us (-0.4 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.4 us (-0.4 us)
1KB 11.9 us 11.4 us (-0.5 us)
2KB 12.1 us 11.5 us (-0.6 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The global blackhole_netdev has replaced pernet loopback_dev to become the
one given to the object that holds an netdev when ifdown in many places of
ipv4 and ipv6 since commit 8d7017fd62 ("blackhole_netdev: use
blackhole_netdev to invalidate dst entries").
Especially after commit faab39f63c ("net: allow out-of-order netdev
unregistration"), it's no longer safe to use loopback_dev that may be
freed before other netdev.
This patch is to set dst dev to blackhole_netdev instead of loopback_dev
in ifdown.
v1->v2:
- add Fixes tag as Eric suggested.
Fixes: faab39f63c ("net: allow out-of-order netdev unregistration")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/e8c87482998ca6fcdab214f5a9d582899ec0c648.1652665047.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmKCr/0THG1rbEBwZW5n
dXRyb25peC5kZQAKCRCtfkuQ2KDTXbVWCAClh9ywNEPiSxXsB2k9EdVoijz0hv6I
y6YLFPfHxc8w0ewOnBxJOBeLhVjIE2K3n81NaEudheBBImWV8CGlFPHu5sI4S01b
aFcAuLT6SOat52xzkh44WGGjW517+EZM1Of1xzPhKdnR0nLvva52SpJlYwgLrkCN
O6r30cDuU558H4g2BNxavbbNFPzL7HhkBpHmgWQg4q2p7LvLuTE6sN2CvNEgUWYM
nIyK6P/FIyy3l7M7E873WhIEGNGRnAXRBEPXivRmKTlw0wizcvPxlWKm4YxTA0WL
30fw58Hy4x6SZD+7+sP0QhdjiYOofgMtZFZn/LaJYE+rfaYLsFUH3A/2
=tUrE
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2022-05-16
the first 2 patches are by me and target the CAN raw protocol. The 1st
removes an unneeded assignment, the other one adds support for
SO_TXTIME/SCM_TXTIME.
Oliver Hartkopp contributes 2 patches for the ISOTP protocol. The 1st
adds support for transmission without flow control, the other let's
bind() return an error on incorrect CAN ID formatting.
Geert Uytterhoeven contributes a patch to clean up ctucanfd's Kconfig
file.
Vincent Mailhol's patch for the slcan driver uses the proper function
to check for invalid CAN frames in the xmit callback.
The next patch is by Geert Uytterhoeven and makes the interrupt-names
of the renesas,rcar-canfd dt bindings mandatory.
A patch by my update the ctucanfd dt bindings to include the common
CAN controller bindings.
The last patch is by Akira Yokosawa and fixes a breakage the
ctucanfd's documentation.
* tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure'
dt-bindings: can: ctucanfd: include common CAN controller bindings
dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required
can: slcan: slc_xmit(): use can_dropped_invalid_skb() instead of manual check
can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD
can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
can: isotp: add support for transmission without flow control
can: raw: add support for SO_TXTIME/SCM_TXTIME
can: raw: raw_sendmsg(): remove not needed setting of skb->sk
====================
Link: https://lore.kernel.org/r/20220516202625.1129281-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While preparing for Clang randstruct support (which duplicated many of
the warnings the randstruct GCC plugin warned about), one strange one
remained only for the randstruct GCC plugin. Eliminating this rids
the plugin of the last exception.
It seems the plugin is happy to dereference individual members of
a cross-struct cast, but it is upset about casting to a whole object
pointer. This only manifests in one place in the kernel, so just replace
the variable with individual member accesses. There is no change in
executable instruction output.
Drop the last exception from the randstruct GCC plugin.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: netdev@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/lkml/20220511022217.58586-1-kuniyu@amazon.co.jp
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/lkml/20220511151542.4cb3ff17@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Support this via passthrough to the underlying tcp listener socket.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/271
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>