Commit graph

1188059 commits

Author SHA1 Message Date
Ricardo Ribalda
20188baceb powerpc/purgatory: remove PGO flags
If profile-guided optimization is enabled, the purgatory ends up with
multiple .text sections.  This is not supported by kexec and crashes the
system.

Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-3-b05c520b7296@chromium.org
Fixes: 930457057a ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: <stable@vger.kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ricardo Ribalda
97b6b9cbba x86/purgatory: remove PGO flags
If profile-guided optimization is enabled, the purgatory ends up with
multiple .text sections.  This is not supported by kexec and crashes the
system.

Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-2-b05c520b7296@chromium.org
Fixes: 930457057a ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ricardo Ribalda
8652d44f46 kexec: support purgatories with .text.hot sections
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7.

When upreving llvm I realised that kexec stopped working on my test
platform.

The reason seems to be that due to PGO there are multiple .text sections
on the purgatory, and kexec does not supports that.


This patch (of 4):

Clang16 links the purgatory text in two sections when PGO is in use:

  [ 1] .text             PROGBITS         0000000000000000  00000040
       00000000000011a1  0000000000000000  AX       0     0     16
  [ 2] .rela.text        RELA             0000000000000000  00003498
       0000000000000648  0000000000000018   I      24     1     8
  ...
  [17] .text.hot.        PROGBITS         0000000000000000  00003220
       000000000000020b  0000000000000000  AX       0     0     1
  [18] .rela.text.hot.   RELA             0000000000000000  00004428
       0000000000000078  0000000000000018   I      24    17     8

And both of them have their range [sh_addr ... sh_addr+sh_size] on the
area pointed by `e_entry`.

This causes that image->start is calculated twice, once for .text and
another time for .text.hot. The second calculation leaves image->start
in a random location.

Because of this, the system crashes immediately after:

kexec_core: Starting new kernel

Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org
Fixes: 930457057a ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Philipp Rudo <prudo@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Peter Xu
5543d3c43c mm/uffd: allow vma to merge as much as possible
We used to not pass in the pgoff correctly when register/unregister uffd
regions, it caused incorrect behavior on vma merging and can cause
mergeable vmas being separate after ioctls return.

For example, when we have:

  vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)

Then someone unregisters uffd on range (5-9), it should logically become:

  vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)

But with current code we'll have:

  vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)

This patch allows such merge to happen correctly before ioctl returns.

This behavior seems to have existed since the 1st day of uffd.  Since
pgoff for vma_merge() is only used to identify the possibility of vma
merging, meanwhile here what we did was always passing in a pgoff smaller
than what we should, so there should have no other side effect besides not
merging it.  Let's still tentatively copy stable for this, even though I
don't see anything will go wrong besides vma being split (which is mostly
not user visible).

Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Peter Xu
270aa01062 mm/uffd: fix vma operation where start addr cuts part of vma
Patch series "mm/uffd: Fix vma merge/split", v2.

This series contains two patches that fix vma merge/split for userfaultfd
on two separate issues.

Patch 1 fixes a regression since 6.1+ due to something we overlooked when
converting to maple tree apis.  The plan is we use patch 1 to replace the
commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to
vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring
uffd vma operations back aligned with the rest code again.

Patch 2 fixes a long standing issue that vma can be left unmerged even if
we can for either uffd register or unregister.

Many thanks to Lorenzo on either noticing this issue from the assert
movement patch, looking at this problem, and also provided a reproducer on
the unmerged vma issue [1].

[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e


This patch (of 2):

It seems vma merging with uffd paths is broken with either
register/unregister, where right now we can feed wrong parameters to
vma_merge() and it's found by recent patch which moved asserts upwards in
vma_merge() by Lorenzo Stoakes:

https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/

It's possible that "start" is contained within vma but not clamped to its
start.  We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to prev. 
As we loop, each subsequent VMA will be clamped to the start.

This patch will eliminate the report and make sure vma_merge() calls will
become legal again.

One thing to mention is that the "Fixes: 29417d292bd0" below is there only
to help explain where the warning can start to trigger, the real commit to
fix should be 69dbe6daf1.  Commit 29417d292b helps us to identify the
issue, but unfortunately we may want to keep it in Fixes too just to ease
kernel backporters for easier tracking.

Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Arnd Bergmann
bde1597d0f radix-tree: move declarations to header
The xarray.c file contains the only call to radix_tree_node_rcu_free(),
and it comes with its own extern declaration for it.  This means the
function definition causes a missing-prototype warning:

lib/radix-tree.c:288:6: error: no previous prototype for 'radix_tree_node_rcu_free' [-Werror=missing-prototypes]

Instead, move the declaration for this function to a new header that can
be included by both, and do the same for the radix_tree_node_cachep
variable that has the same underlying problem but does not cause a warning
with gcc.

[zhangpeng.00@bytedance.com: fix building radix tree test suite]
  Link: https://lkml.kernel.org/r/20230521095450.21332-1-zhangpeng.00@bytedance.com
Link: https://lkml.kernel.org/r/20230516194212.548910-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ryusuke Konishi
2f012f2bac nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.

This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.

Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().

Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:49 -07:00
Jens Axboe
fd37b88400 io_uring/io-wq: don't clear PF_IO_WORKER on exit
A recent commit gated the core dumping task exit logic on current->flags
remaining consistent in terms of PF_{IO,USER}_WORKER at task exit time.
This exposed a problem with the io-wq handling of that, which explicitly
clears PF_IO_WORKER before calling do_exit().

The reasons for this manual clear of PF_IO_WORKER is historical, where
io-wq used to potentially trigger a sleep on exit. As the io-wq thread
is exiting, it should not participate any further accounting. But these
days we don't need to rely on current->flags anymore, so we can safely
remove the PF_IO_WORKER clearing.

Reported-by: Zorro Lang <zlang@redhat.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/ZIZSPyzReZkGBEFy@dread.disaster.area/
Fixes: f9010dbdce ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-12 11:16:09 -07:00
Linus Torvalds
ace9e12da2 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSHScAACgkQxWXV+ddt
 WDvoLA/8CxGfC9i/zO2odxbV1id8JiubGyi2Q28ANE3ygwRBI2dh7u2TBTv9aKPF
 Bzm6VsafG2OwMuwu08jO3t98+QrxU9vb6YCzCPL4t+8IDLJhwpz6zdH/Lvl3RnyV
 nz+aKHi2vfTRKt1Cf4uB5dVzPM3QVHYi3vidt15Suf2nhKnXimu0FVGXabQfd44z
 cCE4ep8IkLshcrsEOwVQj44isRXztJza3D6P7zPfu0NB5Bue7VJNBI4JoGOAT8UQ
 8c+V1U6EbMARWcdbk4Vm34IoAAxcQW6MNnHG83+ie2OpuKJ9g7oNXMTPL73gntNr
 DtC38Vr8gbpXJFmqOCwD8+9f3jP2pX6LjJT0IR6eGJbCleWd6JPlvnfJ+QHdb/vE
 LblDjH84O0Js+0iPKOSKzglfrKZPYDEnIBUwbZQICj/8+aHPU1Y4eTRcv52bVnpa
 1umdz19Sjh0HjuX4k44E/fLgGnLw+ezxhe6WQ7RdDrnr4+9tXpz0z/ZsatIgl1Pc
 wfS5Y2XBIdzKBIF8FxAEL3xCXd6byOsMMhSRu6J7W8Tgw5dnvKiQLRCK+FIpBRru
 WZ7vrNKz67marmqcIp0Hpoipd5+ib6pAdZs69GAvk4bWvVoLZ0Vuyb3lQr5fg6Vm
 Xn1iwcYoWjlAYrpVW31dlaVCfoewm96qbzNa3XqA87I/6frGFcc=
 =ABpK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A  more fixes and regression fixes:

   - in subpage mode, fix crash when repairing metadata at the end of
     a stripe

   - properly enable async discard when remounting from read-only to
     read-write

   - scrub regression fixes:
      - respect read-only scrub when attempting to do a repair
      - fix reporting of found errors, the stats don't get properly
        accounted after a stripe repair"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: also report errors hit during the initial read
  btrfs: scrub: respect the read-only flag during repair
  btrfs: properly enable async discard when switching from RO->RW
  btrfs: subpage: fix a crash in metadata repair path
2023-06-12 10:53:35 -07:00
Uwe Kleine-König
3a2cb45ca0 net: mlxsw: i2c: Switch back to use struct i2c_driver's .probe()
After commit b8a1a4cd5a ("i2c: Provide a temporary .probe_new()
call-back type"), all drivers being converted to .probe_new() and then
commit 03c835f498 ("i2c: Switch .probe() to not take an id parameter")
convert back to (the new) .probe() to be able to eventually drop
.probe_new() from struct i2c_driver.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:56:41 +01:00
Daniel Golle
98c485eaf5 net: phy: add driver for MediaTek SoC built-in GE PHYs
Some of MediaTek's Filogic SoCs come with built-in gigabit Ethernet
PHYs which require calibration data from the SoC's efuse.
Despite the similar design the driver doesn't share any code with the
existing mediatek-ge.c.
Add support for such PHYs by introducing a new driver with basic
support for MediaTek SoCs MT7981 and MT7988 built-in 1GE PHYs.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:55:04 +01:00
David S. Miller
a89dc58703 mlx5-updates-2023-06-09
1) Embedded CPU Virtual Functions
 2) Lightweight local SFs
 
 Daniel Jurgens says:
 ====================
 Embedded CPU Virtual Functions
 
 This series enables the creation of virtual functions on Bluefield (the
 embedded CPU platform). Embedded CPU virtual functions (EC VFs). EC VF
 creation, deletion and management interfaces are the same as those for
 virtual functions in a server with a Connect-X NIC.
 
 When using EC VFs on the ARM the creation of virtual functions on the
 host system is still supported. Host VFs eswitch vports occupy a range
 of 1..max_vfs, the EC VF vport range is max_vfs+1..max_ec_vfs.
 
 Every function (PF, ECPF, VF, EC VF, and subfunction) has a function ID
 associated with it. Prior to this series the function ID and the eswitch
 vport were the same. That is no longer the case, the EC VF function ID
 range is 1..max_ec_vfs. When querying or setting the capabilities of an
 EC VF function an new bit must be set in the query/set HCA cap
 structure.
 
 This is a high level overview of the changes made:
 	- Allocate vports for EC VFs if they are enabled.
 	- Create representors and devlink ports for the EC VF vports.
 	- When querying/setting HCA caps by vport break the assumption
 	  that function ID is the same a vport number and adjust
 	  accordingly.
 	- Create a new type of page, so that when SRIOV on the ARM is
 	  disabled, but remains enabled on the host, the driver can
 	  wait for the correct pages.
 	- Update SRIOV code to support EC VF creation/deletion.
 
 ===================
 
 Lightweight local SFs:
 
 Last 3 patches form Shay Drory:
 
 SFs are heavy weight and by default they come with the full package of
 ConnectX features. Usually users want specialized SFs for one specific
 purpose and using devlink users will almost always override the set of
 advertises features of an SF and reload it.
 
 Shay Drory says:
 ================
 In order to avoid the wasted time and resources on the reload, local SFs
 will probe without any auxiliary sub-device, so that the SFs can be
 configured prior to its full probe.
 
 The defaults of the enable_* devlink params of these SFs are set to
 false.
 
 Usage example:
 Create SF:
 $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
 $ devlink port function set pci/0000:08:00.0/32768 \
                hw_addr 00:00:00:00:00:11 state active
 
 Enable ETH auxiliary device:
 $ devlink dev param set auxiliary/mlx5_core.sf.1 \
               name enable_eth value true cmode driverinit
 
 Now, in order to fully probe the SF, use devlink reload:
 $ devlink dev reload auxiliary/mlx5_core.sf.1
 
 At this point the user have SF devlink instance with auxiliary device
 for the Ethernet functionality only.
 
 ================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmSD1KUACgkQSD+KveBX
 +j5JMwf/SGo0DwXm0GTu+RlUmoUpCYG40QjJwNwJ9WUNIB/zMpnGrx59dtdhuVUb
 3gD8T9MVT08NXjQIzQReAZu9hm9MvM6jvTnUG++fFyjjlHuI/9hXWBgz1aSPcQOC
 ZR2ZHuFZsTTRWNe3JU5v9tuHOE0c6xZBAzPdDU3Lib0LvhIbZGK2eiSb11RIGU3V
 FTUskLUPMo5ObUNS55lgcTFquK90Qy41LUv0DX3MpbOY/YnDBiycikopwDoYtYQ6
 9/kOM+VOprW+BgalLyGHJ77BDg7O/k1ZxnL6J7mGn7tAhOkKX5+06xGfZnEifI2F
 nTuXYI5O8TDBdZYpmO+sBOh+SVyUrg==
 =H9kD
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

mlx5-updates-2023-06-09

1) Embedded CPU Virtual Functions
2) Lightweight local SFs

Daniel Jurgens says:
====================
Embedded CPU Virtual Functions

This series enables the creation of virtual functions on Bluefield (the
embedded CPU platform). Embedded CPU virtual functions (EC VFs). EC VF
creation, deletion and management interfaces are the same as those for
virtual functions in a server with a Connect-X NIC.

When using EC VFs on the ARM the creation of virtual functions on the
host system is still supported. Host VFs eswitch vports occupy a range
of 1..max_vfs, the EC VF vport range is max_vfs+1..max_ec_vfs.

Every function (PF, ECPF, VF, EC VF, and subfunction) has a function ID
associated with it. Prior to this series the function ID and the eswitch
vport were the same. That is no longer the case, the EC VF function ID
range is 1..max_ec_vfs. When querying or setting the capabilities of an
EC VF function an new bit must be set in the query/set HCA cap
structure.

This is a high level overview of the changes made:
	- Allocate vports for EC VFs if they are enabled.
	- Create representors and devlink ports for the EC VF vports.
	- When querying/setting HCA caps by vport break the assumption
	  that function ID is the same a vport number and adjust
	  accordingly.
	- Create a new type of page, so that when SRIOV on the ARM is
	  disabled, but remains enabled on the host, the driver can
	  wait for the correct pages.
	- Update SRIOV code to support EC VF creation/deletion.

===================

Lightweight local SFs:

Last 3 patches form Shay Drory:

SFs are heavy weight and by default they come with the full package of
ConnectX features. Usually users want specialized SFs for one specific
purpose and using devlink users will almost always override the set of
advertises features of an SF and reload it.

Shay Drory says:
================
In order to avoid the wasted time and resources on the reload, local SFs
will probe without any auxiliary sub-device, so that the SFs can be
configured prior to its full probe.

The defaults of the enable_* devlink params of these SFs are set to
false.

Usage example:
Create SF:
$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
$ devlink port function set pci/0000:08:00.0/32768 \
               hw_addr 00:00:00:00:00:11 state active

Enable ETH auxiliary device:
$ devlink dev param set auxiliary/mlx5_core.sf.1 \
              name enable_eth value true cmode driverinit

Now, in order to fully probe the SF, use devlink reload:
$ devlink dev reload auxiliary/mlx5_core.sf.1

At this point the user have SF devlink instance with auxiliary device
for the Ethernet functionality only.

================
2023-06-12 11:41:57 +01:00
David S. Miller
73f49f8cc1 Merge branch 'tcp-tx-headless'
Eric Dumazet says:

====================
tcp: tx path fully headless

This series completes transition of TCP stack tx path
to headless packets : All payload now reside in page frags,
never in skb->head.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:38:55 +01:00
Eric Dumazet
5882efff88 tcp: remove size parameter from tcp_stream_alloc_skb()
Now all tcp_stream_alloc_skb() callers pass @size == 0, we can
remove this parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:38:54 +01:00
Eric Dumazet
b4a2439713 tcp: remove some dead code
Now all skbs in write queue do not contain any payload in skb->head,
we can remove some dead code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:38:54 +01:00
Eric Dumazet
fbf934068f tcp: let tcp_send_syn_data() build headless packets
tcp_send_syn_data() is the last component in TCP transmit
path to put payload in skb->head.

Switch it to use page frags, so that we can remove dead
code later.

This allows to put more payload than previous implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:38:54 +01:00
David S. Miller
f2f069da4c Merge branch 'ethtool-extack'
Jakub Kicinski says:

====================
net: support extack in dump and simplify ethtool uAPI

Ethtool currently requires header nest to be always present even if
it doesn't have to carry any attr for a given request. This inflicts
unnecessary pain on the users.

What makes it worse is that extack was not working in dump's ->start()
callback. Address both of those issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:32:45 +01:00
Jakub Kicinski
500e1340d1 net: ethtool: don't require empty header nests
Ethtool currently requires a header nest (which is used to carry
the common family options) in all requests including dumps.

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 64 (48) nl_flags = 0x300 nl_type = 2
	error: -22      extack: {'msg': 'request header missing'}

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get \
           --json '{"header":{}}';  )
  [{'combined-count': 1,
    'combined-max': 1,
    'header': {'dev-index': 2, 'dev-name': 'enp1s0'}}]

Requiring the header nest to always be there may seem nice
from the consistency perspective, but it's not serving any
practical purpose. We shouldn't burden the user like this.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:32:45 +01:00
Jakub Kicinski
5ab8c41cef netlink: support extack in dump ->start()
Commit 4a19edb60d ("netlink: Pass extack to dump handlers")
added extack support to netlink dumps. It was focused on rtnl
and since rtnl does not use ->start(), ->done() callbacks
it ignored those. Genetlink on the other hand uses ->start()
extensively, for parsing and input validation.

Pass the extact in via struct netlink_dump_control and link
it to cb for the time of ->start(). Both struct netlink_dump_control
and extack itself live on the stack so we can't keep the same
extack for the duration of the dump. This means that the extack
visible in ->start() and each ->dump() callbacks will be different.
Corner cases like reporting a warning message in DONE across dump
calls are still not supported.

We could put the extack (for dumps) in the socket struct,
but layering makes it slightly awkward (extack pointer is decided
before the DO / DUMP split).

The genetlink dump error extacks are now surfaced:

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 64 (48) nl_flags = 0x300 nl_type = 2
	error: -22	extack: {'msg': 'request header missing'}

Previously extack was missing:

  $ cli.py --spec netlink/specs/ethtool.yaml --dump channels-get
  lib.ynl.NlError: Netlink error: Invalid argument
  nl_len = 36 (20) nl_flags = 0x100 nl_type = 2
	error: -22

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:32:44 +01:00
David S. Miller
238131684f Merge branch 'ynl-ethtool'
Jakub Kicinski says:

====================
tools: ynl: generate code for the ethtool family

And finally ethtool support. Thanks to Stan's work the ethtool family
spec is quite complete, so there is a lot of operations to support.

I chickened out of stats-get support, they require at the very least
type-value support on a u64 scalar. Type-value is an arrangement where
a u16 attribute is encoded directly in attribute type. Code gen can
support this if the inside is a nest, we just throw in an extra
field into that nest to carry the attr type. But a little more coding
is needed to for a scalar, because first we need to turn the scalar
into a struct with one member, then we can add the attr type.

Other than that ethtool required event support (notification which
does not share contents with any GET), but the previous series
already added that to the codegen.

I haven't tested all the ops here, and a few I tried seem to work.
====================

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
f561ff232a tools: ynl: add sample for ethtool
Configuring / reading ring sizes and counts is a fairly common
operation for ethtool netlink. Present a sample doing that with
YNL:

$ ./ethtool
Channels:
    enp1s0: combined 1
   eni1np1: combined 1
   eni2np1: combined 1
Rings:
    enp1s0: rx 256 tx 256
   eni1np1: rx 0 tx 0
   eni2np1: rx 0 tx 0

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
2d7be507d6 tools: ynl: generate code for the ethtool family
Generate the protocol code for ethtool. Skip the stats
for now, they are the only outlier in terms of complexity.
Stats are a sort-of semi-polymorphic (attr space of a nest
depends on value of another attr) or a type-value-scalar,
depending on how one wants to look at it...
A challenge for another time.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
68335713d2 netlink: specs: ethtool: mark pads as pads
Pad is a separate type. Even though in practice they can
only be a u32 the value should be discarded.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
709d0c3b3d netlink: specs: ethtool: untangle stats-get
Code gen for stats is a bit of a challenge, but from looking
at the attrs I think that the format isn't quite right.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
37c8522227 netlink: specs: ethtool: untangle UDP tunnels and cable test a bit
UDP tunnel and cable test messages have a lot of nests,
which do not match the names of the enum entries in C uAPI.
Some of the structure / nesting also looks wrong.

Untangle this a little bit based on the names, comments and
educated guesses, I haven't actually tested the results.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
180ad45527 netlink: specs: ethtool: add empty enum stringset
C does not allow defining structures and enums with the same name.
Since enum ethtool_stringset exists in the uAPI we need to include
at least a stub of it in the spec. This will trigger name collision
avoidance in the code gen.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
2c9d47a095 tools: ynl-gen: resolve enum vs struct name conflicts
Ethtool has an attribute set called stringset, from which
we'll generate struct ethtool_stringset. Unfortunately,
the old ethtool header declares enum ethtool_stringset
(the same name), to which compilers object.

This seems unavoidable. Check struct names against known
constants and append an underscore if conflict is detected.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
dddc9f53da tools: ynl-gen: don't generate enum types if unnamed
If attr set or enum has empty enum name we need to use u32 or int
as function arguments and struct members.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
d4813b11d6 netlink: specs: ethtool: add C render hints
Most of the C enum names are guessed correctly, but there
is a handful of corner cases we need to name explicitly.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:03 +01:00
Jakub Kicinski
ed2042cc77 netlink: specs: support setting prefix-name per attribute
Ethtool's PSE PoDL has a attr nest with different prefixes:

/* Power Sourcing Equipment */
enum {
	ETHTOOL_A_PSE_UNSPEC,
	ETHTOOL_A_PSE_HEADER,			/* nest - _A_HEADER_* */
	ETHTOOL_A_PODL_PSE_ADMIN_STATE,		/* u32 */
	ETHTOOL_A_PODL_PSE_ADMIN_CONTROL,	/* u32 */
	ETHTOOL_A_PODL_PSE_PW_D_STATUS,		/* u32 */

Header has a prefix of ETHTOOL_A_PSE_ and other attrs prefix of
ETHTOOL_A_PODL_PSE_ we can't cover them uniformly.
If PODL was after PSE life would be easy.

Now we either need to add prefixes to attr names which is yucky
or support setting prefix name per attr.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:02 +01:00
Jakub Kicinski
33eedb0071 tools: ynl-gen: record extra args for regen
ynl-regen needs to know the arguments used to generate a file.
Record excluded ops and, while at it, user headers.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:02 +01:00
Jakub Kicinski
008bcd6835 tools: ynl-gen: support excluding tricky ops
The ethtool family has a small handful of quite tricky ops
and a lot of simple very useful ops. Teach ynl-gen to skip
ops so that we can bypass the tricky ones.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 11:01:02 +01:00
Rob Herring
b30a1f305b mdio: mdio-mux-mmioreg: Use of_property_read_reg() to parse "reg"
Use the recently added of_property_read_reg() helper to get the
untranslated "reg" address value.

Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:56:01 +01:00
Krzysztof Kozlowski
61ab5a060a dt-bindings: net: drop unneeded quotes
Cleanup bindings dropping unneeded quotes. Once all these are fixed,
checking for this can be enabled in yamllint.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:50:47 +01:00
David S. Miller
ba47545c75 Merge branch 'SCM_PIDFD-SCM_PEERPIDFD'
Alexander Mikhalitsyn says:

====================
Add SCM_PIDFD and SO_PEERPIDFD

1. Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.

2. Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.

3. Add SCM_PIDFD / SO_PEERPIDFD kselftest

Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/

Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this and Luca Boccassi for testing and reviewing this.

=== Motivation behind this patchset

Eric Dumazet raised a question:
> It seems that we already can use pidfd_open() (since linux-5.3), and
> pass the resulting fd in af_unix SCM_RIGHTS message ?

Yes, it's possible, but it means that from the receiver side we need
to trust the sent pidfd (in SCM_RIGHTS),
or always use combination of SCM_RIGHTS+SCM_CREDENTIALS, then we can
extract pidfd from SCM_RIGHTS,
then acquire plain pid from pidfd and after compare it with the pid
from SCM_CREDENTIALS.

A few comments from other folks regarding this.

Christian Brauner wrote:

>Let me try and provide some of the missing background.

>There are a range of use-cases where we would like to authenticate a
>client through sockets without being susceptible to PID recycling
>attacks. Currently, we can't do this as the race isn't fully fixable.
>We can only apply mitigations.

>What this patchset will allows us to do is to get a pidfd without the
>client having to send us an fd explicitly via SCM_RIGHTS. As that's
>already possibly as you correctly point out.

>But for protocols like polkit this is quite important. Every message is
>standalone and we would need to force a complete protocol change where
>we would need to require that every client allocate and send a pidfd via
>SCM_RIGHTS. That would also mean patching through all polkit users.

>For something like systemd-journald where we provide logging facilities
>and want to add metadata to the log we would also immensely benefit from
>being able to get a receiver-side controlled pidfd.

>With the message type we envisioned we don't need to change the sender
>at all and can be safe against pid recycling.

>Link: https://gitlab.freedesktop.org/polkit/polkit/-/merge_requests/154
>Link: https://uapi-group.org/kernel-features

Lennart Poettering wrote:

>So yes, this is of course possible, but it would mean the pidfd would
>have to be transported as part of the user protocol, explicitly sent
>by the sender. (Moreover, the receiver after receiving the pidfd would
>then still have to somehow be able to prove that the pidfd it just
>received actually refers to the peer's process and not some random
>process. – this part is actually solvable in userspace, but ugly)

>The big thing is simply that we want that the pidfd is associated
>*implicity* with each AF_UNIX connection, not explicitly. A lot of
>userspace already relies on this, both in the authentication area
>(polkit) as well as in the logging area (systemd-journald). Right now
>using the PID field from SO_PEERCREDS/SCM_CREDENTIALS is racy though
>and very hard to get right. Making this available as pidfd too, would
>solve this raciness, without otherwise changing semantics of it all:
>receivers can still enable the creds stuff as they wish, and the data
>is then implicitly appended to the connections/datagrams the sender
>initiates.

>Or to turn this around: things like polkit are typically used to
>authenticate arbitrary dbus methods calls: some service implements a
>dbus method call, and when an unprivileged client then issues that
>call, it will take the client's info, go to polkit and ask it if this
>is ok. If we wanted to send the pidfd as part of the protocol we
>basically would have to extend every single method call to contain the
>client's pidfd along with it as an additional argument, which would be
>a massive undertaking: it would change the prototypes of basically
>*all* methods a service defines… And that's just ugly.

>Note that Alex' patch set doesn't expose anything that wasn't exposed
>before, or attach, propagate what wasn't before. All it does, is make
>the field already available anyway (the struct ucred .pid field)
>available also in a better way (as a pidfd), to solve a variety of
>races, with no effect on the protocol actually spoken within the
>AF_UNIX transport. It's a seamless improvement of the status quo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:45:50 +01:00
Alexander Mikhalitsyn
97154bcf4d af_unix: Kconfig: make CONFIG_UNIX bool
Let's make CONFIG_UNIX a bool instead of a tristate.
We've decided to do that during discussion about SCM_PIDFD patchset [1].

[1] https://lore.kernel.org/lkml/20230524081933.44dc8bea@kernel.org/

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:45:50 +01:00
Alexander Mikhalitsyn
ec80f48825 selftests: net: add SCM_PIDFD / SO_PEERPIDFD test
Basic test to check consistency between:
- SCM_CREDENTIALS and SCM_PIDFD
- SO_PEERCRED and SO_PEERPIDFD

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:45:50 +01:00
Alexander Mikhalitsyn
7b26952a91 net: core: add getsockopt SO_PEERPIDFD
Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd.
This thing is direct analog of SO_PEERCRED which allows to get plain PID.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:45:50 +01:00
Alexander Mikhalitsyn
5e2ff6704a scm: add SO_PASSPIDFD and SCM_PIDFD
Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS,
but it contains pidfd instead of plain pid, which allows programmers not
to care about PID reuse problem.

We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because
it depends on a pidfd_prepare() API which is not exported to the kernel
modules.

Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/

Big thanks to Christian Brauner and Lennart Poettering for productive
discussions about this.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Tested-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 10:45:49 +01:00
David S. Miller
55d7c91406 Merge branch 'mlxsw-cleanups'
Petr Machata says:

====================
mlxsw: Cleanups in router code

This patchset moves some router-related code from spectrum.c to
spectrum_router.c where it should be. It also simplifies handlers of
netevent notifications.

- Patch #1 caches router pointer in a dedicated variable. This obviates the
  need to access the same as mlxsw_sp->router, making lines shorter, and
  permitting a future patch to add code that fits within 80 character
  limit.

- Patch #2 moves IP / IPv6 validation notifier blocks from spectrum.c
  to spectrum_router, where the handlers are anyway.

- In patch #3, pass router pointer to scheduler of deferred work directly,
  instead of having it deduce it on its own.

- This makes the router pointer available in the handler function
  mlxsw_sp_router_netevent_event(), so in patch #4, use it directly,
  instead of finding it through mlxsw_sp_port.

- In patch #5, extend mlxsw_sp_router_schedule_work() so that the
  NETEVENT_NEIGH_UPDATE handler can use it directly instead of inlining
  equivalent code.

- In patches #6 and #7, add helpers for two common operations involving
  a backing netdev of a RIF. This makes it unnecessary for the function
  mlxsw_sp_rif_dev() to be visible outside of the router module, so in
  patch #8, hide it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
df95ae66cc mlxsw: spectrum_router: Privatize mlxsw_sp_rif_dev()
Now that the external users of mlxsw_sp_rif_dev() have been converted in
the preceding patches, make the function static.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
5374a50f2e mlxsw: Convert does-RIF-have-this-netdev queries to a dedicated helper
In a number of places, a netdevice underlying a RIF is obtained only to
compare it to another pointer. In order to clean up the interface between
the router and the other modules, add a new helper to specifically answer
this question, and convert the relevant uses to this new interface.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
0255f74845 mlxsw: Convert RIF-has-netdevice queries to a dedicated helper
In a number of places, a netdevice underlying a RIF is obtained only to
check if it a NULL pointer. In order to clean up the interface between the
router and the other modules, add a new helper to specifically answer this
question, and convert the relevant uses to this new interface.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
151b89f602 mlxsw: spectrum_router: Reuse work neighbor initialization in work scheduler
After the struct mlxsw_sp_netevent_work.n field initialization is moved
here, the body of code that handles NETEVENT_NEIGH_UPDATE is almost
identical to the one in the helper function. Therefore defer to the helper
instead of inlining the equivalent.

Note that previously, the code took and put a reference of the netdevice.
The new code defers to mlxsw_sp_dev_lower_is_port() to obviate the need for
taking the reference.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
14304e7063 mlxsw: spectrum_router: Use the available router pointer for netevent handling
This code handles NETEVENT_DELAY_PROBE_TIME_UPDATE, which is invoked every
time the delay_probe_time changes. mlxsw router currently only maintains
one timer, so the last delay_probe_time set wins.

Currently, mlxsw uses mlxsw_sp_port_lower_dev_hold() to find a reference to
the router. This is no longer necessary. But as a side effect, this makes
sure that only updates to "interesting netdevices" (ones that have a
physical netdevice lower) are projected.

Retain that side effect by calling mlxsw_sp_port_dev_lower_find_rcu() and
punting if there is none. Then just proceed using the router pointer that's
already at hand in the helper.

Note that previously, the code took and put a reference of the netdevice.
Because the mlxsw_sp pointer is now obtained from the notifier block, the
port pointer (non-) NULL-ness is all that's relevant, and the reference
does not need to be taken anymore.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:30 +01:00
Petr Machata
48dde35ea1 mlxsw: spectrum_router: Pass router to mlxsw_sp_router_schedule_work() directly
Instead of passing a notifier block and deducing the router pointer from
that in the helper, do that in the caller, and pass the result. In the
following patches, the pointer will also be made useful in the caller.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:29 +01:00
Petr Machata
41b2bd208e mlxsw: spectrum_router: Move here inetaddr validator notifiers
The validation logic is already in the router code. Move there the notifier
blocks themselves as well.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:29 +01:00
Petr Machata
50f6c3d57e mlxsw: spectrum_router: mlxsw_sp_router_fini(): Extract a helper variable
Make mlxsw_sp_router_fini() more similar to the _init() function (and more
concise) by extracting the `router' handle to a named variable and using
that throughout. The availability of a dedicated `router' variable will
come in handy in following patches.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:49:29 +01:00
Aaron Conole
e069ba07e6 net: openvswitch: add support for l4 symmetric hashing
Since its introduction, the ovs module execute_hash action allowed
hash algorithms other than the skb->l4_hash to be used.  However,
additional hash algorithms were not implemented.  This means flows
requiring different hash distributions weren't able to use the
kernel datapath.

Now, introduce support for symmetric hashing algorithm as an
alternative hash supported by the ovs module using the flow
dissector.

Output of flow using l4_sym hash:

    recirc_id(0),in_port(3),eth(),eth_type(0x0800),
    ipv4(dst=64.0.0.0/192.0.0.0,proto=6,frag=no), packets:30473425,
    bytes:45902883702, used:0.000s, flags:SP.,
    actions:hash(sym_l4(0)),recirc(0xd)

Some performance testing with no GRO/GSO, two veths, single flow:

    hash(l4(0)):      4.35 GBits/s
    hash(l4_sym(0)):  4.24 GBits/s

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:46:30 +01:00
David S. Miller
6513787733 Merge branch 'taprio-xstats'
Vladimir Oltean says:

====================
Fixes for taprio xstats

1. Taprio classes correspond to TXQs, and thus, class stats are TXQ
   stats not TC stats.
2. Drivers reporting taprio xstats should report xstats for *this*
   taprio, not for a previous one.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12 09:43:31 +01:00