Commit Graph

65015 Commits

Author SHA1 Message Date
Chuck Lever 8a053433de xprtrdma: Do not wake RPC consumer on a failed LocalInv
Throw away any reply where the LocalInv flushes or could not be
posted. The registered memory region is in an unknown state until
the disconnect completes.

rpcrdma_xprt_disconnect() will find and release the MR. No need to
put it back on the MR free list in this case.

The client retransmits pending RPC requests once it reestablishes a
fresh connection, so a replacement reply should be forthcoming on
the next connection instance.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:25:19 -04:00
Chuck Lever e4b52ca013 xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.

I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:25:12 -04:00
Chuck Lever 44438ad9ae xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
Clean up: The comment and the placement of the memory barrier is
confusing. Humans want to read the function statements from head
to tail.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:47 -04:00
Chuck Lever f912af77e2 xprtrdma: Rename frwr_release_mr()
Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:29 -04:00
Chuck Lever 1363e6388c xprtrdma: rpcrdma_mr_pop() already does list_del_init()
The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.

Fixes: 847568942f ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:22 -04:00
Chuck Lever c35ca60d49 xprtrdma: Delete rpcrdma_recv_buffer_put()
Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:12 -04:00
Chuck Lever 35d8b10a25 xprtrdma: Fix cwnd update ordering
After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.

The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().

Fixes: 2ae50ad68c ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:05 -04:00
Chuck Lever 9e3ca33b62 xprtrdma: Improve locking around rpcrdma_rep creation
Defensive clean up: Protect the rb_all_reps list during rep
creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:23:40 -04:00
Chuck Lever 8b5292be68 xprtrdma: Improve commentary around rpcrdma_reps_unmap()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:23:19 -04:00
Chuck Lever eaf86e8cc8 xprtrdma: Improve locking around rpcrdma_rep destruction
Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.

Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:22:29 -04:00
Chuck Lever 5030c9a938 xprtrdma: Put flushed Receives on free list instead of destroying them
Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:49 -04:00
Chuck Lever 15788d1d10 xprtrdma: Do not refresh Receive Queue while it is draining
Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.

Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:44 -04:00
Chuck Lever 32e6b68167 xprtrdma: Avoid Receive Queue wrapping
Commit e340c2d6ef ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.

This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.

Fixes: e340c2d6ef ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:36 -04:00
Pablo Neira Ayuso a655536571 netfilter: nfnetlink: add struct nfnl_info and pass it to callbacks
Add a new structure to reduce callback footprint and to facilite
extensions of the nfnetlink callback interface in the future.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:58:17 +02:00
Pablo Neira Ayuso d59d2f82f9 netfilter: nftables: add nft_pernet() helper function
Consolidate call to net_generic(net, nf_tables_net_id) in this
wrapper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:58:17 +02:00
David S. Miller 1e5e4acb66 mlx5-updates-2021-04-21
devlink external port attribute for SF (Sub-Function) port flavour
 
 This adds the support to instantiate Sub-Functions on external hosts
 E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
 are now able to spawn new Sub-Functions on the Host server CPU.
 
 Parav Pandit Says:
 ==================
 
 This series introduces and uses external attribute for the SF port to
 indicate that a SF port belongs to an external controller.
 
 This is needed to generate unique phys_port_name when PF and SF numbers
 are overlapping between local and external controllers.
 For example two controllers 0 and 1, both of these controller have a SF.
 having PF number 0, SF number 77. Here, phys_port_name has duplicate
 entry which doesn't have controller number in it.
 
 Hence, add controller number optionally when a SF port is for an
 external controller. This extension is similar to existing PF and VF
 eswitch ports of the external controller.
 
 When a SF is for external controller an example view of external SF
 port and config sequence:
 
 On eswitch system:
 $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
 
 $ devlink port show
 pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
 pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
   function:
     hw_addr 00:00:00:00:00:00
 
 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
   function:
     hw_addr 00:00:00:00:00:00 state inactive opstate detached
 
 phys_port_name construction:
 $ cat /sys/class/net/eth1/phys_port_name
 c1pf0sf77
 
 Patch summary:
 First 3 patches prepares the eswitch to handle vports in more generic
 way using xarray to lookup vport from its unique vport number.
 Patch-1 returns maximum eswitch ports only when eswitch is enabled
 Patch-2 prepares eswitch to return eswitch max ports from a struct
 Patch-3 uses xarray for vport and representor lookup
 Patch-4 considers SF for an additioanl range of SF vports
 Patch-5 relies on SF hw table to check SF support
 Patch-6 extends SF devlink port attribute for external flag
 Patch-7 stores the per controller SF allocation attributes
 Patch-8 uses SF function id for filtering events
 Patch-9 uses helper for allocation and free
 Patch-10 splits hw table into per controller table and generic one
 Patch-11 extends sf table for additional range
 
 ==================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCDz80ACgkQSD+KveBX
 +j4LmggAwS9otYoo639Kmow/wlMZ6yyLsH02zVMFLEJ2AE4VbL73i4iiQ67ZWygL
 yQ8HawPAnythx4RsN/M6/WjSKRpdqTC27C9CpdM78zhXb1vnOrlzba7rYngqmo7N
 5fIkGyjsUGHNqq+15SftK7JYbXFTe1b5RdWawXkQoyBlXTTBamyxD7C5NMpoDots
 /e88Bs8Zy5nVPZqPchIId8TZEKKuO/heTz8ks6q6s/t1MGj7QP+ddxVMgNg00NR5
 OpNTr7YYdpHxpfLSUZgdHaptwwKOx+nou8LdJkIKWPs7SHX6HDggyZJjGBOEWtE7
 qG7oSip4olOTM0w9PZrAewLwSYhq7Q==
 =oqBr
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-04-21

devlink external port attribute for SF (Sub-Function) port flavour

This adds the support to instantiate Sub-Functions on external hosts
E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
are now able to spawn new Sub-Functions on the Host server CPU.

Parav Pandit Says:
==================

This series introduces and uses external attribute for the SF port to
indicate that a SF port belongs to an external controller.

This is needed to generate unique phys_port_name when PF and SF numbers
are overlapping between local and external controllers.
For example two controllers 0 and 1, both of these controller have a SF.
having PF number 0, SF number 77. Here, phys_port_name has duplicate
entry which doesn't have controller number in it.

Hence, add controller number optionally when a SF port is for an
external controller. This extension is similar to existing PF and VF
eswitch ports of the external controller.

When a SF is for external controller an example view of external SF
port and config sequence:

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Patch summary:
First 3 patches prepares the eswitch to handle vports in more generic
way using xarray to lookup vport from its unique vport number.
Patch-1 returns maximum eswitch ports only when eswitch is enabled
Patch-2 prepares eswitch to return eswitch max ports from a struct
Patch-3 uses xarray for vport and representor lookup
Patch-4 considers SF for an additioanl range of SF vports
Patch-5 relies on SF hw table to check SF support
Patch-6 extends SF devlink port attribute for external flag
Patch-7 stores the per controller SF allocation attributes
Patch-8 uses SF function id for filtering events
Patch-9 uses helper for allocation and free
Patch-10 splits hw table into per controller table and generic one
Patch-11 extends sf table for additional range

==================

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:31:35 -07:00
Phil Sutter 593268ddf3 netfilter: nf_log_syslog: Unset bridge logger in pernet exit
Without this, a stale pointer remains in pernet loggers after module
unload causing a kernel oops during dereference. Easily reproduced by:

| # modprobe nf_log_syslog
| # rmmod nf_log_syslog
| # cat /proc/net/netfilter/nf_log

Fixes: 77ccee96a6 ("netfilter: nf_log_bridge: merge with nf_log_syslog")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:47 +02:00
Florian Westphal ee177a5441 netfilter: ip6_tables: pass table pointer via nf_hook_ops
Same patch as the ip_tables one: removal of all accesses to ip6_tables
xt_table pointers.  After this patch the struct net xt_table anchors
can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:47 +02:00
Florian Westphal f9006acc8d netfilter: arp_tables: pass table pointer via nf_hook_ops
Same change as previous patch.  Only difference:
no need to handle NULL template_ops parameter, the only caller
(arptable_filter) always passes non-NULL argument.

This removes all remaining accesses to net->ipv4.arptable_filter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal ae68933422 netfilter: ip_tables: pass table pointer via nf_hook_ops
iptable_x modules rely on 'struct net' to contain a pointer to the
table that should be evaluated.

In order to remove these pointers from struct net, pass them via
the 'priv' pointer in a similar fashion as nf_tables passes the
rule data.

To do that, duplicate the nf_hook_info array passed in from the
iptable_x modules, update the ops->priv pointers of the copy to
refer to the table and then change the hookfn implementations to
just pass the 'priv' argument to the traverser.

After this patch, the xt_table pointers can already be removed
from struct net.

However, changes to struct net result in re-compile of the entire
network stack, so do the removal after arptables and ip6tables
have been converted as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal a4aeafa28c netfilter: xt_nat: pass table to hookfn
This changes how ip(6)table nat passes the ruleset/table to the
evaluation loop.

At the moment, it will fetch the table from struct net.

This change stores the table in the hook_ops 'priv' argument
instead.

This requires to duplicate the hook_ops for each netns, so
they can store the (per-net) xt_table structure.

The dupliated nat hook_ops get stored in net_generic data area.
They are free'd in the namespace exit path.

This is a pre-requisite to remove the xt_table/ruleset pointers
from struct net.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal f68772ed67 netfilter: x_tables: remove paranoia tests
No need for these.
There is only one caller, the xtables core, when the table is registered
for the first time with a particular network namespace.

After ->table_init() call, the table is linked into the tables[af] list,
so next call to that function will skip the ->table_init().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal 4d70539919 netfilter: arptables: unregister the tables by name
and again, this time for arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal 6c0717545f netfilter: ip6tables: unregister the tables by name
Same as the previous patch, but for ip6tables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal 20a9df3359 netfilter: iptables: unregister the tables by name
xtables stores the xt_table structs in the struct net.  This isn't
needed anymore, the structures could be passed via the netfilter hook
'private' pointer to the hook functions, which would allow us to remove
those pointers from struct net.

As a first step, reduce the number of accesses to the
net->ipv4.ip6table_{raw,filter,...} pointers.
This allows the tables to get unregistered by name instead of having to
pass the raw address.

The xt_table structure cane looked up by name+address family instead.

This patch is useless as-is (the backends still have the raw pointer
address), but it lowers the bar to remove those.

It also allows to put the 'was table registered in the first place' check
into ip_tables.c rather than have it in each table sub module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal 1ef4d6d1af netfilter: x_tables: add xt_find_table
This will be used to obtain the xt_table struct given address family and
table name.

Followup patches will reduce the number of direct accesses to the xt_table
structures via net->ipv{4,6}.ip(6)table_{nat,mangle,...} pointers, then
remove them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:39 +02:00
Florian Westphal 7716bf090e netfilter: x_tables: remove ipt_unregister_table
Its the same function as ipt_unregister_table_exit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:08 +02:00
Florian Westphal 4c95e0728e netfilter: ebtables: remove the 3 ebtables pointers from struct net
ebtables stores the table internal data (what gets passed to the
ebt_do_table() interpreter) in struct net.

nftables keeps the internal interpreter format in pernet lists
and passes it via the netfilter core infrastructure (priv pointer).

Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup,
then the ops->priv pointer is set to the table that is being registered.

After that, the netfilter core passes this table info to the hookfn.

This allows to remove the pointers from struct net.

Same pattern can be applied to ip/ip6/arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal de8c12110a netfilter: disable defrag once its no longer needed
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.

In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.

Moreover, this won't work if user isn't allowed to remove nf_defrag module.

This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.

ipvs isn't converted here, it will behave as before this patch and
will need module removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Pablo Neira Ayuso e0bb96db96 netfilter: nft_socket: add support for cgroupsv2
Allow to match on the cgroupsv2 id from ancestor level.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal 885e8c6824 netfilter: nat: move nf_xfrm_me_harder to where it is used
remove the export and make it static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
David S. Miller 5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Erik Flodin e6b031d3c3 can: proc: fix rcvlist_* header alignment on 64-bit system
Before this fix, the function and userdata columns weren't aligned:
  device   can_id   can_mask  function  userdata   matches  ident
   vcan0  92345678  9fffffff  0000000000000000  0000000000000000         0  raw
   vcan0     123    00000123  0000000000000000  0000000000000000         0  raw

After the fix they are:
  device   can_id   can_mask      function          userdata       matches  ident
   vcan0  92345678  9fffffff  0000000000000000  0000000000000000         0  raw
   vcan0     123    00000123  0000000000000000  0000000000000000         0  raw

Link: Link: https://lore.kernel.org/r/20210425141440.229653-1-erik@flodin.me
Signed-off-by: Erik Flodin <erik@flodin.me>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-04-25 19:43:00 +02:00
Linus Torvalds 799bac5512 Revert "net/rds: Avoid potential use after free in rds_send_remove_from_sock"
This reverts commit 0c85a7e874.

The games with 'rm' are on (two separate instances) of a local variable,
and make no difference.

Quoting Aditya Pakki:
 "I was the author of the patch and it was the cause of the giant UMN
  revert.

  The patch is garbage and I was unaware of the steps involved in
  retracting it. I *believed* the maintainers would pull it, given it
  was already under Greg's list. The patch does not introduce any bugs
  but is pointless and is stupid. I accept my incompetence and for not
  requesting a revert earlier."

Link: https://lwn.net/Articles/854319/
Requested-by: Aditya Pakki <pakki001@umn.edu>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-24 09:32:35 -07:00
Parav Pandit a1ab3e4554 devlink: Extend SF port attributes to have external attribute
Extended SF port attributes to have optional external flag similar to
PCI PF and VF port attributes.

External atttibute is required to generate unique phys_port_name when PF number
and SF number are overlapping between two controllers similar to SR-IOV
VFs.

When a SF is for external controller an example view of external SF
port and config sequence.

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:53 -07:00
Yonglong Li ca4fb89257 mptcp: add MSG_PEEK support
This patch adds support for MSG_PEEK flag. Packets are not removed
from the receive_queue if MSG_PEEK set in recv() system call.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni 987858e5d0 mptcp: ignore unsupported msg flags
Currently mptcp_sendmsg() fails with EOPNOTSUPP if the
user-space provides some unsupported flag. That is unexpected
and may foul existing applications migrated to MPTCP, which
expect a different behavior.

Change the mentioned function to silently ignore the unsupported
flags except MSG_FASTOPEN. This is the only flags currently not
supported by MPTCP with user-space visible side-effects.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/162
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni d976092ce1 mptcp: implement MSG_TRUNC support
The mentioned flag is currently silenlty ignored. This
change implements the TCP-like behaviour, dropping the
pending data up to the specified length.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sigend-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni cb9d80f494 mptcp: implement dummy MSG_ERRQUEUE support
mptcp_recvmsg() currently silently ignores MSG_ERRQUEUE, returning
input data instead of error cmsg.

This change provides a dummy implementation for MSG_ERRQUEUE - always
returns no data. That is consistent with the current lack of a suitable
IP_RECVERR setsockopt() support.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Mat Martineau 6477dd39e6 mptcp: Retransmit DATA_FIN
With this change, the MPTCP-level retransmission timer is used to resend
DATA_FIN. The retranmit timer is not stopped while waiting for a
MPTCP-level ACK of DATA_FIN, and retransmitted DATA_FINs are sent on all
subflows. The retry interval starts at TCP_RTO_MIN and then doubles on
each attempt, up to TCP_RTO_MAX.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/146
Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:03:02 -07:00
David S. Miller 7679f864a0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-04-23

1) The SPI flow key in struct flowi has no consumers,
   so remove it. From Florian Westphal.

2) Remove stray synchronize_rcu from xfrm_init.
   From Florian Westphal.

3) Use the new exit_pre hook to reset the netlink socket
   on net namespace destruction. From Florian Westphal.

4) Remove an unnecessary get_cpu() in ipcomp, that
   code is always called with BHs off.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:37:47 -07:00
Tonghao Zhang ed744d8193 net: sock: remove the unnecessary check in proto_register
tw_prot_cleanup will check the twsk_prot.

Fixes: 0f5907af39 ("net: Fix potential memory leak in proto_register()")
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:10:03 -07:00
Ilya Maximets 7d742b509d openvswitch: meter: remove rate from the bucket size calculation
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.

The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.

Same change proposed for the userspace implementation.

Fixes: 96fbc13d7e ("openvswitch: Add meter infrastructure")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:08:47 -07:00
Yunjian Wang b9f83ffaa0 SUNRPC: Fix null pointer dereference in svc_rqst_free()
When alloc_pages_node() returns null in svc_rqst_alloc(), the
null rq_scratch_page pointer will be dereferenced when calling
put_page() in svc_rqst_free(). Fix it by adding a null check.

Addresses-Coverity: ("Dereference after null check")
Fixes: 5191955d6f ("SUNRPC: Prepare for xdr_stream-style decoding on the server-side")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-23 10:43:05 -04:00
Lin Ma e2cb6b891a bluetooth: eliminate the potential race condition when removing the HCI controller
There is a possible race condition vulnerability between issuing a HCI
command and removing the cont.  Specifically, functions hci_req_sync()
and hci_dev_do_close() can race each other like below:

thread-A in hci_req_sync()      |   thread-B in hci_dev_do_close()
                                |   hci_req_sync_lock(hdev);
test_bit(HCI_UP, &hdev->flags); |
...                             |   test_and_clear_bit(HCI_UP, &hdev->flags)
hci_req_sync_lock(hdev);        |
                                |
In this commit we alter the sequence in function hci_req_sync(). Hence,
the thread-A cannot issue th.

Signed-off-by: Lin Ma <linma@zju.edu.cn>
Cc: Marcel Holtmann <marcel@holtmann.org>
Fixes: 7c6a329e44 ("[Bluetooth] Fix regression from using default link policy")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-23 13:25:04 +02:00
David Howells f105da1a79 afs: Don't truncate iter during data fetch
Don't truncate the iterator to correspond to the actual data size when
fetching the data from the server - rather, pass the length we want to read
to rxrpc.

This will allow the clear-after-read code in future to simply clear the
remaining iterator capacity rather than having to reinitialise the
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861249201.340223.13035445866976590375.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465825061.1377938.14403904452300909320.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588531418.3465195.10712005940763063144.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118148567.1232039.13380313332292947956.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161044610.2537118.17908520793806837792.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340407907.1303470.6501394859511712746.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539551721.286939.14655713136572200716.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653807790.2770958.14034599989374173734.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789090823.6155.15673999934535049102.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:26 +01:00
Li RongQing e7a1c13008 xsk: Align XDP socket batch size with DPDK
DPDK default burst size is 32, however, kernel xsk sendto
syscall can not handle all 32 at one time, and return with
error.

So make kernel XDP socket batch size larger to avoid
unnecessary syscall fail and context switch which will help
to increase performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1618378752-4191-1-git-send-email-lirongqing@baidu.com
2021-04-23 09:50:35 +02:00
Martin Willi 22b6034323 net, xdp: Update pkt_type if generic XDP changes unicast MAC
If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.

The use case for this is when a XDP program wants to push received
packets up the stack by rewriting the MAC to the NICs MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
2021-04-22 23:18:02 +02:00
Dan Carpenter cb57908653 SUNRPC: fix ternary sign expansion bug in tracing
This code is supposed to pass negative "err" values for tracing but it
passes positive values instead.  The problem is that the
trace_svcsock_tcp_send() function takes a long but "err" is an int and
"sent" is a u32.  The negative is first type promoted to u32 so it
becomes a high positive then it is promoted to long and it stays
positive.

Fix this by casting "err" directly to long.

Fixes: 998024dee1 ("SUNRPC: Add more svcsock tracepoints")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-22 11:02:28 -04:00
Chinmay Agarwal eefb45eef5 neighbour: Prevent Race condition in neighbour subsytem
Following Race Condition was detected:

<CPU A, t0>: Executing: __netif_receive_skb() ->__netif_receive_skb_core()
-> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which
takes a reference on neighbour entry 'n'.
Moves further along, arp_process() and calls neigh_update()->
__neigh_update(). Neighbour entry is unlocked just before a call to
neigh_update_gc_list.

This unlocking paves way for another thread that may take a reference on
the same and mark it dead and remove it from gc_list.

<CPU B, t1> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead. Also n will be
removed from gc_list.
Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t1,
'n' couldn't be destroyed.

<CPU A, t3>- Code hits neigh_update_gc_list, with neighbour entry
set as dead.

<CPU A, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry and we have a destroyed ntry still part of gc_list.

Fixes: eb4e8fac00d1("neighbour: Prevent a dead entry from updating gc_list")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 14:47:43 -07:00
Vladimir Oltean 68f5c12abb net: bridge: fix error in br_multicast_add_port when CONFIG_NET_SWITCHDEV=n
When CONFIG_NET_SWITCHDEV is disabled, the shim for switchdev_port_attr_set
inside br_mc_disabled_update returns -EOPNOTSUPP. This is not caught,
and propagated to the caller of br_multicast_add_port, preventing ports
from joining the bridge.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: ae1ea84b33 ("net: bridge: propagate error code and extack from br_mc_disabled_update")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 13:13:33 -07:00
Bjorn Andersson 47a017f339 net: qrtr: Avoid potential use after free in MHI send
It is possible that the MHI ul_callback will be invoked immediately
following the queueing of the skb for transmission, leading to the
callback decrementing the refcount of the associated sk and freeing the
skb.

As such the dereference of skb and the increment of the sk refcount must
happen before the skb is queued, to avoid the skb to be used after free
and potentially the sk to drop its last refcount..

Fixes: 6e728f3213 ("net: qrtr: Add MHI transport layer")
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 11:01:03 -07:00
Oleksij Rempel 70a7c484c7 net: dsa: fix bridge support for drivers without port_bridge_flags callback
Starting with patch:
a8b659e7ff ("net: dsa: act as passthrough for bridge port flags")

drivers without "port_bridge_flags" callback will fail to join the bridge.
Looking at the code, -EOPNOTSUPP seems to be the proper return value,
which makes at least microchip and atheros switches work again.

Fixes: 5961d6a12c ("net: dsa: inherit the actual bridge port flags at join time")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 10:47:27 -07:00
Stefano Garzarella 8432b81149 vsock/virtio: free queued packets when closing socket
As reported by syzbot [1], there is a memory leak while closing the
socket. We partially solved this issue with commit ac03046ece
("vsock/virtio: free packets during the socket release"), but we
forgot to drain the RX queue when the socket is definitely closed by
the scheduled work.

To avoid future issues, let's use the new virtio_transport_remove_sock()
to drain the RX queue before removing the socket from the af_vsock lists
calling vsock_remove_sock().

[1] https://syzkaller.appspot.com/bug?extid=24452624fc4c571eedd9

Fixes: ac03046ece ("vsock/virtio: free packets during the socket release")
Reported-and-tested-by: syzbot+24452624fc4c571eedd9@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 17:07:21 -07:00
Tobias Waldekranz deff710703 net: dsa: Allow default tag protocol to be overridden from DT
Some combinations of tag protocols and Ethernet controllers are
incompatible, and it is hard for the driver to keep track of these.

Therefore, allow the device tree author (typically the board vendor)
to inform the driver of this fact by selecting an alternate protocol
that is known to work.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:51:20 -07:00
Tobias Waldekranz 21e0b508c8 net: dsa: Only notify CPU ports of changes to the tag protocol
Previously DSA ports were also included, on the assumption that the
protocol used by the CPU port had to the matched throughout the entire
tree.

As there is not yet any consumer in need of this, drop the call.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:51:19 -07:00
David S. Miller 08322284c1 Another set of updates, all over the map:
* set sk_pacing_shift for 802.3->802.11 encap offload
  * some monitor support for 802.11->802.3 decap offload
  * HE (802.11ax) spec updates
  * userspace API for TDLS HE support
  * along with various other small features, cleanups and
    fixups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmB+4y8ACgkQB8qZga/f
 l8SiGw/9Fz3XETnNDYMvyY7ppmWzZ6vofRq307YJiCz1fszEKqwyyzMQOrHA9tg2
 Nasl711egWlVyHTBCN+VCSaTQUjkODsK/5t4XWoxdJ0J3lZkgryVGBJljpl+k4A6
 11qpvwUnO1WCmt0s49V2yU/jWgZ9itHfu9dosu/YIq+NfXUVA7ylKmP3gqfmcCeV
 631z5AnM8/9N8QVMpnk5F2fE57WUXbA+KdVsw0LXMmjXYSsQ9MyTBX/lRDVcaMWV
 7cOtHekkzD0MVfsOoBVvsJl+bybBgEPOfZn2Kt22Rh4JzAch/uUhwRQGzsGxcR3p
 D8W9BABXCU8C5mhP8gcKlOSuH3h7ydKKqrXXNeRO+y5hymOtUSGJxia93m+uQ8qC
 97wootP3cb97/dEzv5cWqw5Pa39uEsny6mQqueD5WcMI9imL98HEo3hrZElbctx8
 s9ZE37WAlZ0zw+cGIsmElZfE2qMqEhjxF3mGFcpXLkk9/Y/1jmypYopkBLJh6KcS
 mIfwk9qWgADbPT5df1A/1388lMkjBRcQGc1SriYxy/olvb70mD8IPPiDSD2kULDt
 Sq2frnOdvjW0Q5DB6jBKzdMudAxY3WP5MlcGDy1iYwEbY6s4lPfQXG48joJpRQFG
 I3zPM6Z+Pimx7vcTd5a+IUyKvDoF+DtxiOu8DGKYT2M5tv3/tpI=
 =b0NQ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2021-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of updates, all over the map:
 * set sk_pacing_shift for 802.3->802.11 encap offload
 * some monitor support for 802.11->802.3 decap offload
 * HE (802.11ax) spec updates
 * userspace API for TDLS HE support
 * along with various other small features, cleanups and
   fixups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:44:04 -07:00
Oleksij Rempel a71acad90a net: dsa: enable selftest support for all switches by default
Most of generic selftest should be able to work with probably all ethernet
controllers. The DSA switches are not exception, so enable it by default at
least for DSA.

This patch was tested with SJA1105 and AR9331.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:08:02 -07:00
Oleksij Rempel 3e1e58d64c net: add generic selftest support
Port some parts of the stmmac selftest and reuse it as basic generic selftest
library. This patch was tested with following combinations:
- iMX6DL FEC -> AT8035
- iMX6DL FEC -> SJA1105Q switch -> KSZ8081
- iMX6DL FEC -> SJA1105Q switch -> KSZ9031
- AR9331 ag71xx -> AR9331 PHY
- AR9331 ag71xx -> AR9331 switch -> AR9331 PHY

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20 16:08:02 -07:00
Ingo Molnar d0d252b8ca Linux 5.12-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmB8qHweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGEXIIAILUbsTJsNsvZIkZ
 uQ6SY6gnsPFkRiSRjY0YsZLUnqjTuiiHeTz4gzkonddwdnAp/9g6OIHIEBaeTqBh
 sTUMU/61Fgtrt/IvkA1yJ3rlawqgwdMe2VdimB+EFhufcSKq+5vpd3MVP4IuGx4E
 J3psoTU4gVltFs5t+1QjvI3XmByN0Qm8FMRXR7iL+zov1QTmGwR3G6Rn4AymG+QT
 pdruKboyZPfsrFGSVx7wd3HpFyQcrclEX9rKmBNZqets9d9JGWnqnEN4vQKmwO86
 4MV29ucdMXH0AMB3kzGdVp0Ji2Ykt5W0K+MUWbFLtcSxnpu1OyBKGsEAMlRbD7ik
 gm0bMSw=
 =qHI0
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-20 10:13:58 +02:00
Jakub Kicinski d1f0a5e1fb ethtool: stats: clarify the initialization to ETHTOOL_STAT_NOT_SET
Ido suggests we add a comment about the init of stats to -1.
This is unlikely to be clear to first time readers.

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 16:23:32 -07:00
Di Zhu c1102e9d49 net: fix a data race when get vlan device
We encountered a crash: in the packet receiving process, we got an
illegal VLAN device address, but the VLAN device address saved in vmcore
is correct. After checking the code, we found a possible data
competition:
CPU 0:                             CPU 1:
    (RCU read lock)                  (RTNL lock)
    vlan_do_receive()		       register_vlan_dev()
      vlan_find_dev()

        ->__vlan_group_get_device()	 ->vlan_group_prealloc_vid()

In vlan_group_prealloc_vid(), We need to make sure that memset()
in kzalloc() is executed before assigning  value to vlan devices array:
=================================
kzalloc()
    ->memset(object, 0, size)

smp_wmb()

vg->vlan_devices_arrays[pidx][vidx] = array;
==================================

Because __vlan_group_get_device() function depends on this order.
otherwise we may get a wrong address from the hardware cache on
another cpu.

So fix it by adding memory barrier instruction to ensure the order
of memory operations.

Signed-off-by: Di Zhu <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 16:08:03 -07:00
Alexander Lobakin 7ad18ff644 gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check
Commit 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that napi_gro_frags() logics
calls for skb_gro_reset_offset() *before* pulling Ethernet header
to the skb linear space.
That said, the introduced check for frag0 address being aligned to 4
always fails for it as Ethernet header is obviously 14 bytes long,
and in case with NET_IP_ALIGN its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized
   more efficively (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me

Fixes: 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 16:03:32 -07:00
David S. Miller 6dd06ec7c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add vlan match and pop actions to the flowtable offload,
   patches from wenxu.

2) Reduce size of the netns_ct structure, which itself is
   embedded in struct net Make netns_ct a read-mostly structure.
   Patches from Florian Westphal.

3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
   collector path, as required by the tc CT action. From Roi Dayan.

4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
   and c-vlan selectors. Fix match of VLAN id due to incorrect
   byteorder. Add a new routine to properly populate flow dissector
   ethertypes.

5) Missing keys in ip{6}_route_me_harder() results in incorrect
   routes. This includes an update for selftest infra. Patches
   from Ido Schimmel.

6) Add counter hardware offload support through FLOW_CLS_STATS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:49:50 -07:00
Du Cheng ed8157f1eb net: sched: tapr: prevent cycle_time == 0 in parse_taprio_schedule
There is a reproducible sequence from the userland that will trigger a WARN_ON()
condition in taprio_get_start_time, which causes kernel to panic if configured
as "panic_on_warn". Catch this condition in parse_taprio_schedule to
prevent this condition.

Reported as bug on syzkaller:
https://syzkaller.appspot.com/bug?extid=d50710fd0873a9c6b40c

Reported-by: syzbot+d50710fd0873a9c6b40c@syzkaller.appspotmail.com
Signed-off-by: Du Cheng <ducheng2@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:29:07 -07:00
Gustavo A. R. Silva c1d9e34e11 ethtool: ioctl: Fix out-of-bounds warning in store_link_ksettings_for_user()
Fix the following out-of-bounds warning:

net/ethtool/ioctl.c:492:2: warning: 'memcpy' offset [49, 84] from the object at 'link_usettings' is out of the bounds of referenced subobject 'base' with type 'struct ethtool_link_settings' at offset 0 [-Warray-bounds]

The problem is that the original code is trying to copy data into a
some struct members adjacent to each other in a single call to
memcpy(). This causes a legitimate compiler warning because memcpy()
overruns the length of &link_usettings.base. Fix this by directly
using &link_usettings and _from_ as destination and source addresses,
instead.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:27:37 -07:00
Taehee Yoo 83c1ca257a mld: remove unnecessary prototypes
Some prototypes are unnecessary, so delete it.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:23:00 -07:00
Ido Schimmel 9e46fb656f nexthop: Restart nexthop dump based on last dumped nexthop identifier
Currently, a multi-part nexthop dump is restarted based on the number of
nexthops that have been dumped so far. This can result in a lot of
nexthops not being dumped when nexthops are simultaneously deleted:

 # ip nexthop | wc -l
 65536
 # ip nexthop flush
 Dump was interrupted and may be inconsistent.
 Flushed 36040 nexthops
 # ip nexthop | wc -l
 29496

Instead, restart the dump based on the nexthop identifier (fixed number)
of the last successfully dumped nexthop:

 # ip nexthop | wc -l
 65536
 # ip nexthop flush
 Dump was interrupted and may be inconsistent.
 Flushed 65536 nexthops
 # ip nexthop | wc -l
 0

Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Tested-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 15:20:34 -07:00
Stefano Garzarella e16edc99d6 vsock/vmci: log once the failed queue pair allocation
VMCI feature is not supported in conjunction with the vSphere Fault
Tolerance (FT) feature.

VMware Tools can repeatedly try to create a vsock connection. If FT is
enabled the kernel logs is flooded with the following messages:

    qp_alloc_hypercall result = -20
    Could not attach to queue pair with -20

"qp_alloc_hypercall result = -20" was hidden by commit e8266c4c33
("VMCI: Stop log spew when qp allocation isn't possible"), but "Could
not attach to queue pair with -20" is still there flooding the log.

Since the error message can be useful in some cases, print it only once.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 14:30:23 -07:00
Johannes Berg 010bfbe768 cfg80211: scan: drop entry from hidden_list on overflow
If we overflow the maximum number of BSS entries and free the
new entry, drop it from any hidden_list that it may have been
added to in the code above or in cfg80211_combine_bsses().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20210416094212.5de7d1676ad7.Ied283b0bc5f504845e7d6ab90626bdfa68bb3dc0@changeid
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 13:25:50 +02:00
Johannes Berg 2f5164447c wireless: fix spelling of A-MSDU in HE capabilities
In the HE capabilities, spell A-MSDU correctly, not "A-MDSU".

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.9e6ff1af1181.If6868bc6902ccd9a95c74c78f716c4b41473ef14@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:50:15 +02:00
Johannes Berg 1f851b8dfd wireless: align HE capabilities A-MPDU Length Exponent Extension
The A-MPDU length exponent extension is defined differently in
802.11ax D6.1, align with that.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.c2a257d3e2df.I3455245d388c52c61dace7e7958dbed7e807cfb6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:50:15 +02:00
Johannes Berg 76cf422133 wireless: align some HE capabilities with the spec
Some names were changed, align that with the spec as of
802.11ax-D6.1.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.b1e5fbab0d8c.I3eb6076cb0714ec6aec6b8f9dee613ce4a05d825@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:50:15 +02:00
Sabrina Dubroca 747b67088f xfrm: ipcomp: remove unnecessary get_cpu()
While testing ipcomp on a realtime kernel, Xiumei reported a "sleeping
in atomic" bug, caused by a memory allocation while preemption is
disabled (ipcomp_decompress -> alloc_page -> ... get_page_from_freelist).

As Sebastian noted [1], this get_cpu() isn't actually needed, since
ipcomp_decompress() is called in napi context anyway, so BH is already
disabled.

This patch replaces get_cpu + per_cpu_ptr with this_cpu_ptr, then
simplifies the error returns, since there isn't any common operation
left.

[1] https://lore.kernel.org/lkml/20190820082810.ixkmi56fp7u7eyn2@linutronix.de/

Cc: Juri Lelli <jlelli@redhat.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:49:29 +02:00
Florian Westphal 6218fe1861 xfrm: avoid synchronize_rcu during netns destruction
Use the new exit_pre hook to NULL the netlink socket.
The net namespace core will do a synchronize_rcu() between the exit_pre
and exit/exit_batch handlers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:25:11 +02:00
Florian Westphal 7baf867fef xfrm: remove stray synchronize_rcu from xfrm_init
This function is called during boot, from ipv4 stack, there is no need
to set the pointer to NULL (static storage duration, so already NULL).

No need for the synchronize_rcu either.  Remove both.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:25:11 +02:00
Florian Westphal b07dd26f07 flow: remove spi key from flowi struct
xfrm session decode ipv4 path (but not ipv6) sets this, but there are no
consumers.  Remove it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-04-19 12:25:11 +02:00
Naftali Goldstein 7dd231eb9c mac80211: drop the connection if firmware crashed while in CSA
Don't bother keeping the link in that case. It is way
too complicated to keep the connection.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Naftali Goldstein <naftali.goldstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.a126c8833398.I677bdac314dd50d90474a90593902c17f9410cc4@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:06:44 +02:00
Emmanuel Grumbach 253907ab8b mac80211: properly drop the connection in case of invalid CSA IE
In case the frequency is invalid, ieee80211_parse_ch_switch_ie
will fail and we may not even reach the check in
ieee80211_sta_process_chanswitch. Drop the connection
in case ieee80211_parse_ch_switch_ie failed, but still
take into account the CSA mode to remember not to send
a deauth frame in case if it is forbidden to.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.34712ef96a0a.I75d7ad7f1d654e8b0aa01cd7189ff00a510512b3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:04:26 +02:00
Emmanuel Grumbach f30386a85f mac80211: make ieee80211_vif_to_wdev work when the vif isn't in the driver
This will allow the low level driver to get the wdev during
the add_interface flow.
In order to do that, remove a few checks from there and do
not return NULL for vifs that were not yet added to the
driver. Note that all the current callers of this helper
function assume that the vif already exists:
 - The callers from the drivers already have a vif pointer.
 Before this change, ieee80211_vif_to_wdev would return NULL
 in some cases, but those callers don't even check they
 get a non-NULL pointer from ieee80211_vif_to_wdev.
 - The callers from net/mac80211/cfg.c assume the vif is
 already added to the driver as well.

So, this change has no impact on existing callers of this
helper function.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.6078d3517095.I1907a45f267a62dab052bcc44428aa7a2005ffc9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:03:13 +02:00
Avraham Stern 73807523f9 nl80211/cfg80211: add a flag to negotiate for LMR feedback in NDP ranging
Add a flag that indicates that the ISTA shall indicate support for
LMR feedback in NDP ranging negotiation.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.eff546283504.I2606161e700ac24d94d0b50c8edcdedd4c0395c2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:02:51 +02:00
Johannes Berg 8de8570489 mac80211: aes_cmac: check crypto_shash_setkey() return value
As crypto_shash_setkey() can fail, we should check the return value.

Addresses-Coverity-ID: 1401813 ("Unchecked return value")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.533ff7acf1d2.I034bafa201c4a6823333f8410aeaa60cca5ee9e0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:01:40 +02:00
Colin Ian King bab7f5ca81 mac80211: minstrel_ht: remove extraneous indentation on if statement
The increment of idx is indented one level too deeply, clean up the
code by removing the extraneous tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210416095137.2033469-1-colin.king@canonical.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:00:48 +02:00
Johannes Berg ca47b46294 mac80211: properly process TXQ management frames
My previous commit to not apply flow control to management frames
that are going over TXQs (which is currently only the case for
iwlwifi, I think) broke things, with iwlwifi firmware crashing on
certain frames. As it turns out, that was due to the frame being
too short: space for the MIC wasn't added at the end of encrypted
management frames.

Clearly, this is due to using the 'frags' queue - this is meant
only for frames that have already been processed for TX, and the
code in ieee80211_tx_dequeue() just returns them. This caused all
management frames to now not get any TX processing.

To fix this, use IEEE80211_TX_INTCFL_NEED_TXPROCESSING (which is
currently used only in other circumstances) to indicate that the
frames need processing, and clear it immediately after so that,
at least in theory, MMPDUs can be fragmented.

Fixes: 73bc9e0af5 ("mac80211: don't apply flow control on management frames")
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20210416134702.ef8486a64293.If0a9025b39c71bb91b11dd6ac45547aba682df34@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 12:00:24 +02:00
Joe Perches 623b988f2d cfg80211: constify ieee80211_get_response_rate return
It's not modified so make it const with the eventual goal of moving
data to text for various static struct ieee80211_rate arrays.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/8b210b5f5972e39eded269b35a1297cf824c4181.camel@perches.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-04-19 11:59:33 +02:00
Pablo Neira Ayuso b72920f6e4 netfilter: nftables: counter hardware offload support
This patch adds the .offload_stats operation to synchronize hardware
stats with the expression data. Update the counter expression to use
this new interface. The hardware stats are retrieved from the netlink
dump path via FLOW_CLS_STATS command to the driver.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:04:49 +02:00
Ido Schimmel 812fa71f0d netfilter: Dissect flow after packet mangling
Netfilter tries to reroute mangled packets as a different route might
need to be used following the mangling. When this happens, netfilter
does not populate the IP protocol, the source port and the destination
port in the flow key. Therefore, FIB rules that match on these fields
are ignored and packets can be misrouted.

Solve this by dissecting the outer flow and populating the flow key
before rerouting the packet. Note that flow dissection only happens when
FIB rules that match on these fields are installed, so in the common
case there should not be a penalty.

Reported-by: Michal Soltys <msoltyspl@yandex.pl>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:04:16 +02:00
Pablo Neira Ayuso 783003f3bb netfilter: nftables_offload: special ethertype handling for VLAN
The nftables offload parser sets FLOW_DISSECTOR_KEY_BASIC .n_proto to the
ethertype field in the ethertype frame. However:

- FLOW_DISSECTOR_KEY_BASIC .n_proto field always stores either IPv4 or IPv6
  ethertypes.
- FLOW_DISSECTOR_KEY_VLAN .vlan_tpid stores either the 802.1q and 802.1ad
  ethertypes. Same as for FLOW_DISSECTOR_KEY_CVLAN.

This function adjusts the flow dissector to handle two scenarios:

1) FLOW_DISSECTOR_KEY_VLAN .vlan_tpid is set to 802.1q or 802.1ad.
   Then, transfer:
   - the .n_proto field to FLOW_DISSECTOR_KEY_VLAN .tpid.
   - the original FLOW_DISSECTOR_KEY_VLAN .tpid to the
     FLOW_DISSECTOR_KEY_CVLAN .tpid
   - the original FLOW_DISSECTOR_KEY_CVLAN .tpid to the .n_proto field.

2) .n_proto is set to 802.1q or 802.1ad. Then, transfer:
   - the .n_proto field to FLOW_DISSECTOR_KEY_VLAN .tpid.
   - the original FLOW_DISSECTOR_KEY_VLAN .tpid to the .n_proto field.

Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:02:21 +02:00
Pablo Neira Ayuso ff4d90a89d netfilter: nftables_offload: VLAN id needs host byteorder in flow dissector
The flow dissector representation expects the VLAN id in host byteorder.
Add the NFT_OFFLOAD_F_NETWORK2HOST flag to swap the bytes from nft_cmp.

Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:02:21 +02:00
Pablo Neira Ayuso 14c20643ef netfilter: nft_payload: fix C-VLAN offload support
- add another struct flow_dissector_key_vlan for C-VLAN
- update layer 3 dependency to allow to match on IPv4/IPv6

Fixes: 89d8fd44ab ("netfilter: nft_payload: add C-VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-18 22:02:21 +02:00
Jakub Kicinski 8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Geliang Tang 442279154c mptcp: use mptcp_for_each_subflow in mptcp_close
This patch used the macro helper mptcp_for_each_subflow() instead of
list_for_each_entry() in mptcp_close.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang d96a838a7c mptcp: add tracepoint in subflow_check_data_avail
This patch added a tracepoint in subflow_check_data_avail() to show the
mapping status.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang ed66bfb4ce mptcp: add tracepoint in ack_update_msk
This patch added a tracepoint in ack_update_msk() to track the
incoming data_ack and window/snd_una updates.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang 0918e34b85 mptcp: add tracepoint in get_mapping_status
This patch added a tracepoint in the mapping status function
get_mapping_status() to dump every mpext field.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang e10a989209 mptcp: add tracepoint in mptcp_subflow_get_send
This patch added a tracepoint in the packet scheduler function
mptcp_subflow_get_send().

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang 43f1140b96 mptcp: export mptcp_subflow_active
This patch moved the static function mptcp_subflow_active to protocol.h
as an inline one.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Geliang Tang e4b6135134 mptcp: fix format specifiers for unsigned int
Some of the sequence numbers are printed as the negative ones in the debug
log:

[   46.250932] MPTCP: DSS
[   46.250940] MPTCP: data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1
[   46.250948] MPTCP: data_ack=2344892449471675613
[   46.251012] MPTCP: msk=000000006e157e3f status=10
[   46.251023] MPTCP: msk=000000006e157e3f snd_data_fin_enable=0 pending=0 snd_nxt=2344892449471700189 write_seq=2344892449471700189
[   46.251343] MPTCP: msk=00000000ec44a129 ssk=00000000f7abd481 sending dfrag at seq=-1658937016627538668 len=100 already sent=0
[   46.251360] MPTCP: data_seq=16787807057082012948 subflow_seq=1 data_len=100 dsn64=1

This patch used the format specifier %u instead of %d for the unsigned int
values to fix it.

Fixes: d9ca1de8c0 ("mptcp: move page frag allocation in mptcp_sendmsg()")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Nico Pache 3fcc8a25e3 kunit: mptcp: adhere to KUNIT formatting standard
Drop 'S' from end of CONFIG_MPTCP_KUNIT_TESTS in order to adhere to the
KUNIT *_KUNIT_TEST config name format.

Fixes: a00a582203 (mptcp: move crypto test to KUNIT)
Reviewed-by: David Gow <davidgow@google.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:10:40 -07:00
Gustavo A. R. Silva 1e3d976dbb flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target()
Fix the following out-of-bounds warning:

net/core/flow_dissector.c:835:3: warning: 'memcpy' offset [33, 48] from the object at 'flow_keys' is out of the bounds of referenced subobject 'ipv6_src' with type '__u32[4]' {aka 'unsigned int[4]'} at offset 16 [-Warray-bounds]

The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy().  So, the compiler legitimately complains about it. As these
are just a couple of members, fix this by copying each one of them in
separate calls to memcpy().

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:02:27 -07:00
Florian Westphal f2764bd4f6 netlink: don't call ->netlink_bind with table lock held
When I added support to allow generic netlink multicast groups to be
restricted to subscribers with CAP_NET_ADMIN I was unaware that a
genl_bind implementation already existed in the past.

It was reverted due to ABBA deadlock:

1. ->netlink_bind gets called with the table lock held.
2. genetlink bind callback is invoked, it grabs the genl lock.

But when a new genl subsystem is (un)registered, these two locks are
taken in reverse order.

One solution would be to revert again and add a comment in genl
referring 1e82a62fec, "genetlink: remove genl_bind").

This would need a second change in mptcp to not expose the raw token
value anymore, e.g.  by hashing the token with a secret key so userspace
can still associate subflow events with the correct mptcp connection.

However, Paolo Abeni reminded me to double-check why the netlink table is
locked in the first place.

I can't find one.  netlink_bind() is already called without this lock
when userspace joins a group via NETLINK_ADD_MEMBERSHIP setsockopt.
Same holds for the netlink_unbind operation.

Digging through the history, commit f773608026
("netlink: access nlk groups safely in netlink bind and getname")
expanded the lock scope.

commit 3a20773bee ("net: netlink: cap max groups which will be considered in netlink_bind()")
... removed the nlk->ngroups access that the lock scope
extension was all about.

Reduce the lock scope again and always call ->netlink_bind without
the table lock.

The Fixes tag should be vs. the patch mentioned in the link below,
but that one got squash-merged into the patch that came earlier in the
series.

Fixes: 4d54cc3211 ("mptcp: avoid lock_fast usage in accept path")
Link: https://lore.kernel.org/mptcp/20210213000001.379332-8-mathew.j.martineau@linux.intel.com/T/#u
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 17:01:04 -07:00
Jakub Kicinski a8b06e9d40 ethtool: add interface to read RMON stats
Most devices maintain RMON (RFC 2819) stats - particularly
the "histogram" of packets received by size. Unlike other
RFCs which duplicate IEEE stats, the short/oversized frame
counters in RMON don't seem to match IEEE stats 1-to-1 either,
so expose those, too. Do not expose basic packet, CRC errors
etc - those are already otherwise covered.

Because standard defines packet ranges only up to 1518, and
everything above that should theoretically be "oversized"
- devices often create their own ranges.

Going beyond what the RFC defines - expose the "histogram"
in the Tx direction (assume for now that the ranges will
be the same).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:20 -07:00
Jakub Kicinski bfad2b979d ethtool: add interface to read standard MAC Ctrl stats
Number of devices maintains the standard-based MAC control
counters for control frames. Add a API for those.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:20 -07:00
Jakub Kicinski ca2244547e ethtool: add interface to read standard MAC stats
Most of the MAC statistics are included in
struct rtnl_link_stats64, but some fields
are aggregated. Besides it's good to expose
these clearly hardware stats separately.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:20 -07:00
Jakub Kicinski f09ea6fb12 ethtool: add a new command for reading standard stats
Add an interface for reading standard stats, including
stats which don't have a corresponding control interface.

Start with IEEE 802.3 PHY stats. There seems to be only
one stat to expose there.

Define API to not require user space changes when new
stats or groups are added. Groups are based on bitset,
stats have a string set associated.

v1: wrap stats in a nest

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:59:20 -07:00
Gustavo A. R. Silva e5272ad4aa sctp: Fix out-of-bounds warning in sctp_process_asconf_param()
Fix the following out-of-bounds warning:

net/sctp/sm_make_chunk.c:3150:4: warning: 'memcpy' offset [17, 28] from the object at 'addr' is out of the bounds of referenced subobject 'v4' with type 'struct sockaddr_in' at offset 0 [-Warray-bounds]

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 16:55:15 -07:00
Taehee Yoo aa8caa767e mld: fix suspicious RCU usage in __ipv6_dev_mc_dec()
__ipv6_dev_mc_dec() internally uses sleepable functions so that caller
must not acquire atomic locks. But caller, which is addrconf_verify_rtnl()
acquires rcu_read_lock_bh().
So this warning occurs in the __ipv6_dev_mc_dec().

Test commands:
    ip netns add A
    ip link add veth0 type veth peer name veth1
    ip link set veth1 netns A
    ip link set veth0 up
    ip netns exec A ip link set veth1 up
    ip a a 2001:db8::1/64 dev veth0 valid_lft 2 preferred_lft 1

Splat looks like:
============================
WARNING: suspicious RCU usage
5.12.0-rc6+ #515 Not tainted
-----------------------------
kernel/sched/core.c:8294 Illegal context switch in RCU-bh read-side
critical section!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by kworker/4:0/1997:
 #0: ffff88810bd72d48 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at:
process_one_work+0x761/0x1440
 #1: ffff888105c8fe00 ((addr_chk_work).work){+.+.}-{0:0}, at:
process_one_work+0x795/0x1440
 #2: ffffffffb9279fb0 (rtnl_mutex){+.+.}-{3:3}, at:
addrconf_verify_work+0xa/0x20
 #3: ffffffffb8e30860 (rcu_read_lock_bh){....}-{1:2}, at:
addrconf_verify_rtnl+0x23/0xc60

stack backtrace:
CPU: 4 PID: 1997 Comm: kworker/4:0 Not tainted 5.12.0-rc6+ #515
Workqueue: ipv6_addrconf addrconf_verify_work
Call Trace:
 dump_stack+0xa4/0xe5
 ___might_sleep+0x27d/0x2b0
 __mutex_lock+0xc8/0x13f0
 ? lock_downgrade+0x690/0x690
 ? __ipv6_dev_mc_dec+0x49/0x2a0
 ? mark_held_locks+0xb7/0x120
 ? mutex_lock_io_nested+0x1270/0x1270
 ? lockdep_hardirqs_on_prepare+0x12c/0x3e0
 ? _raw_spin_unlock_irqrestore+0x47/0x50
 ? trace_hardirqs_on+0x41/0x120
 ? __wake_up_common_lock+0xc9/0x100
 ? __wake_up_common+0x620/0x620
 ? memset+0x1f/0x40
 ? netlink_broadcast_filtered+0x2c4/0xa70
 ? __ipv6_dev_mc_dec+0x49/0x2a0
 __ipv6_dev_mc_dec+0x49/0x2a0
 ? netlink_broadcast_filtered+0x2f6/0xa70
 addrconf_leave_solict.part.64+0xad/0xf0
 ? addrconf_join_solict.part.63+0xf0/0xf0
 ? nlmsg_notify+0x63/0x1b0
 __ipv6_ifa_notify+0x22c/0x9c0
 ? inet6_fill_ifaddr+0xbe0/0xbe0
 ? lockdep_hardirqs_on_prepare+0x12c/0x3e0
 ? __local_bh_enable_ip+0xa5/0xf0
 ? ipv6_del_addr+0x347/0x870
 ipv6_del_addr+0x3b1/0x870
 ? addrconf_ifdown+0xfe0/0xfe0
 ? rcu_read_lock_any_held.part.27+0x20/0x20
 addrconf_verify_rtnl+0x8a9/0xc60
 addrconf_verify_work+0xf/0x20
 process_one_work+0x84c/0x1440

In order to avoid this problem, it uses rcu_read_unlock_bh() for
a short time. RCU is used for avoiding freeing
ifp(struct *inet6_ifaddr) while ifp is being used. But this will
not be released even if rcu_read_unlock_bh() is used.
Because before rcu_read_unlock_bh(), it uses in6_ifa_hold(ifp).
So this is safe.

Fixes: 63ed8de4be ("mld: add mc_lock for protecting per-interface mld data")
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:40:59 -07:00
Florian Westphal aa1fbd94e5 mptcp: sockopt: add TCP_CONGESTION and TCP_INFO
TCP_CONGESTION is set for all subflows.
The mptcp socket gains icsk_ca_ops too so it can be used to keep the
authoritative state that should be set on new/future subflows.

TCP_INFO will return first subflow only.
The out-of-tree kernel has a MPTCP_INFO getsockopt, this could be added
later on.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal a03c99b253 mptcp: setsockopt: SO_DEBUG and no-op options
Handle SO_DEBUG and set it on all subflows.
Ignore those values not implemented on TCP sockets.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 6f0d719808 mptcp: setsockopt: add SO_INCOMING_CPU
Replicate to all subflows.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 36704413db mptcp: setsockopt: add SO_MARK support
Value is synced to all subflows.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 268b123874 mptcp: setsockopt: support SO_LINGER
Similar to PRIORITY/KEEPALIVE: needs to be mirrored to all subflows.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 5d0a6bc82d mptcp: setsockopt: handle receive/send buffer and device bind
Similar to previous patch: needs to be mirrored to all subflows.

Device bind is simpler: it is only done on the initial (listener) sk.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 1b3e7ede13 mptcp: setsockopt: handle SO_KEEPALIVE and SO_PRIORITY
start with something simple: both take an integer value, both
need to be mirrored to all subflows.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal df00b087da mptcp: tag sequence_seq with socket state
Paolo Abeni suggested to avoid re-syncing new subflows because
they inherit options from listener. In case options were set on
listener but are not set on mptcp-socket there is no need to
do any synchronisation for new subflows.

This change sets sockopt_seq of new mptcp sockets to the seq of
the mptcp listener sock.

Subflow sequence is set to the embedded tcp listener sk.
Add a comment explaing why sk_state is involved in sockopt_seq
generation.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Florian Westphal 7896248983 mptcp: add skeleton to sync msk socket options to subflows
Handle following cases:
1. setsockopt is called with multiple subflows.
   Change might have to be mirrored to all of them.
   This is done directly in process context/setsockopt call.
2. Outgoing subflow is created after one or several setsockopt()
   calls have been made.  Old setsockopt changes should be
   synced to the new socket.
3. Incoming subflow, after setsockopt call(s).

Cases 2 and 3 are handled right after the join list is spliced to the conn
list.

Not all sockopt values can be just be copied by value, some require
helper calls.  Those can acquire socket lock (which can sleep).

If the join->conn list splicing is done from preemptible context,
synchronization can be done right away, otherwise its deferred to work
queue.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:10 -07:00
Paolo Abeni d9e4c12918 mptcp: only admit explicitly supported sockopt
Unrolling mcast state at msk dismantel time is bug prone, as
syzkaller reported:

======================================================
WARNING: possible circular locking dependency detected
5.11.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor905/8822 is trying to acquire lock:
ffffffff8d678fe8 (rtnl_mutex){+.+.}-{3:3}, at: ipv6_sock_mc_close+0xd7/0x110 net/ipv6/mcast.c:323

but task is already holding lock:
ffff888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1600 [inline]
ffff888024390120 (sk_lock-AF_INET6){+.+.}-{0:0}, at: mptcp6_release+0x57/0x130 net/mptcp/protocol.c:3507

which lock already depends on the new lock.

Instead we can simply forbid any mcast-related setsockopt.
Let's do the same with all other non supported sockopts.

Fixes: 717e79c867 ("mptcp: Add setsockopt()/getsockopt() socket operations")
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:09 -07:00
Paolo Abeni 0abdde82b1 mptcp: move sockopt function into a new file
The MPTCP sockopt implementation is going to be much
more big and complex soon. Let's move it to a different
source file.

No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:09 -07:00
Matthieu Baerts bd005f5386 mptcp: revert "mptcp: forbit mcast-related sockopt on MPTCP sockets"
This change reverts commit 86581852d7 ("mptcp: forbit mcast-related sockopt on MPTCP sockets").

As announced in the cover letter of the mentioned patch above, the
following commits introduce a larger MPTCP sockopt implementation
refactor.

This time, we switch from a blocklist to an allowlist. This is safer for
the future where new sockoptions could be added while not being fully
supported with MPTCP sockets and thus causing unstabilities.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:23:09 -07:00
Vladimir Oltean 2c4eca3ef7 net: bridge: switchdev: include local flag in FDB notifications
As explained in bugfix commit 6ab4c3117a ("net: bridge: don't notify
switchdev for local FDB addresses") as well as in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/

the switchdev notifiers for FDB entries managed to have a zero-day bug,
which was that drivers would not know what to do with local FDB entries,
because they were not told that they are local. The bug fix was to
simply not notify them of those addresses.

Let us now add the 'is_local' bit to bridge FDB entries, and make all
drivers ignore these entries by their own choice.

Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Tobias Waldekranz e5b4b8988b net: bridge: switchdev: refactor br_switchdev_fdb_notify
Instead of having to add more and more arguments to
br_switchdev_fdb_call_notifiers, get rid of it and build the info
struct directly in br_switchdev_fdb_notify.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 15:15:45 -07:00
Eric Dumazet e7ad33fa7b scm: fix a typo in put_cmsg()
We need to store cmlen instead of len in cm->cmsg_len.

Fixes: 38ebcf5096 ("scm: optimize put_cmsg()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16 11:41:07 -07:00
Jakub Kicinski be85dbfeb3 ethtool: add FEC statistics
Similarly to pause statistics add stats for FEC.

The IEEE standard mandates two sets of counters:
 - 30.5.1.1.17 aFECCorrectedBlocks
 - 30.5.1.1.18 aFECUncorrectableBlocks
where block is a block of bits FEC operates on.
Each of these counters is defined per lane (PCS instance).

Multiple vendors provide number of corrected _bits_ rather
than/as well as blocks.

This set adds the 2 standard-based block counters and a extra
one for corrected bits.

Counters are exposed to user space via netlink in new attributes.
Each attribute carries an array of u64s, first element is
the total count, and the following ones are a per-lane break down.

Much like with pause stats the operation will not fail when driver
does not implement the get_fec_stats callback (nor can the driver
fail the operation by returning an error). If stats can't be
reported the relevant attributes will be empty.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-15 17:08:29 -07:00
Jakub Kicinski 3d7cc109ec ethtool: fec_prepare_data() - jump to error handling
Refactor fec_prepare_data() a little bit to skip the body
of the function and exit on error. Currently the code
depends on the fact that we only have one call which
may fail between ethnl_ops_begin() and ethnl_ops_complete()
and simply saves the error code. This will get hairy with
the stats also being queried.

No functional changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-15 17:08:29 -07:00
Jakub Kicinski c5797f8a64 ethtool: move ethtool_stats_init
We'll need it for FEC stats as well.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-15 17:08:29 -07:00
Eric Dumazet 38ebcf5096 scm: optimize put_cmsg()
Calling two copy_to_user() for very small regions has very high overhead.

Switch to inlined unsafe_put_user() to save one stac/clac sequence,
and avoid copy_to_user().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-15 17:03:33 -07:00
Eric Dumazet 94f633ea8a net/packet: remove data races in fanout operations
af_packet fanout uses RCU rules to ensure f->arr elements
are not dismantled before RCU grace period.

However, it lacks rcu accessors to make sure KCSAN and other tools
wont detect data races. Stupid compilers could also play games.

Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 14:34:38 -07:00
Florian Fainelli ae1ea84b33 net: bridge: propagate error code and extack from br_mc_disabled_update
Some Ethernet switches might only be able to support disabling multicast
snooping globally, which is an issue for example when several bridges
span the same physical device and request contradictory settings.

Propagate the return value of br_mc_disabled_update() such that this
limitation is transmitted correctly to user-space.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 14:32:05 -07:00
Paolo Abeni 17c3df7078 skbuff: revert "skbuff: remove some unnecessary operation in skb_segment_list()"
the commit 1ddc3229ad ("skbuff: remove some unnecessary operation
in skb_segment_list()") introduces an issue very similar to the
one already fixed by commit 53475c5dd8 ("net: fix use-after-free when
UDP GRO with shared fraglist").

If the GSO skb goes though skb_clone() and pskb_expand_head() before
entering skb_segment_list(), the latter  will unshare the frag_list
skbs and will release the old list. With the reverted commit in place,
when skb_segment_list() completes, skb->next points to the just
released list, and later on the kernel will hit UaF.

Note that since commit e0e3070a9b ("udp: properly complete L4 GRO
over UDP tunnel packet") the critical scenario can be reproduced also
receiving UDP over vxlan traffic with:

NIC (NETIF_F_GRO_FRAGLIST enabled) -> vxlan -> UDP sink

Attaching a packet socket to the NIC will cause skb_clone() and the
tunnel decapsulation will call pskb_expand_head().

Fixes: 1ddc3229ad ("skbuff: remove some unnecessary operation in skb_segment_list()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 13:54:08 -07:00
David S. Miller 8c1186be3f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-04-14

Not much this time:

1) Simplification of some variable calculations in esp4 and esp6.
   From Jiapeng Chong and Junlin Yang.

2) Fix a clang Wformat warning in esp6 and ah6.
   From Arnd Bergmann.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 13:15:12 -07:00
Jakub Kicinski 16756d3e77 ethtool: pause: make sure we init driver stats
The intention was for pause statistics to not be reported
when driver does not have the relevant callback (only
report an empty netlink nest). What happens currently
we report all 0s instead. Make sure statistics are
initialized to "not set" (which is -1) so the dumping
code skips them.

Fixes: 9a27a33027 ("ethtool: add standard pause stats")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-14 13:03:06 -07:00
Chuck Lever 8727f78855 svcrdma: Pass a useful error code to the send_err tracepoint
Capture error codes in @ret, which is passed to the send_err
tracepoint, so that they can be logged when something goes awry.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-14 12:09:27 -04:00
Chuck Lever c7731d5e05 svcrdma: Rename goto labels in svc_rdma_sendto()
Clean up: Make the goto labels consistent with other similar
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-14 12:09:27 -04:00
Chuck Lever 351461f332 svcrdma: Don't leak send_ctxt on Send errors
Address a rare send_ctxt leak in the svc_rdma_sendto() error paths.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-14 12:09:27 -04:00
Chris Dion 09252177d5 SUNRPC: Handle major timeout in xprt_adjust_timeout()
Currently if a major timeout value is reached, but the minor value has
not been reached, an ETIMEOUT will not be sent back to the caller.
This can occur if the v4 server is not responding to requests and
retrans is configured larger than the default of two.

For example, A TCP mount with a configured timeout value of 50 and a
retransmission count of 3 to a v4 server which is not responding:

1. Initial value and increment set to 5s, maxval set to 20s, retries at 3
2. Major timeout is set to 20s, minor timeout set to 5s initially
3. xport_adjust_timeout() is called after 5s, retry with 10s timeout,
   minor timeout is bumped to 10s
4. And again after another 10s, 15s total time with minor timeout set
   to 15s
5. After 20s total time xport_adjust_timeout is called as major timeout is
   reached, but skipped because the minor timeout is not reached
       - After this time the cpu spins continually calling
       	 xport_adjust_timeout() and returning 0 for 10 seconds.
	 As seen on perf sched:
   	 39243.913182 [0005]  mount.nfs[3794] 4607.938      0.017   9746.863
6. This continues until the 15s minor timeout condition is reached (in
   this case for 10 seconds). After which the ETIMEOUT is processed
   back to the caller, the cpu spinning stops, and normal operations
   continue

Fixes: 7de62bc09f ("SUNRPC dont update timeout value on connection reset")
Signed-off-by: Chris Dion <Christopher.Dion@dell.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14 09:36:30 -04:00
Chuck Lever 6cf23783f7 SUNRPC: Remove trace_xprt_transmit_queued
This tracepoint can crash when dereferencing snd_task because
when some transports connect, they put a cookie in that field
instead of a pointer to an rpc_task.

BUG: KASAN: use-after-free in trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
Read of size 2 at addr ffff8881a83bd3a0 by task git/331872

CPU: 11 PID: 331872 Comm: git Tainted: G S                5.12.0-rc2-00007-g3ab6e585a7f9 #1453
Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Call Trace:
 dump_stack+0x9c/0xcf
 print_address_description.constprop.0+0x18/0x239
 kasan_report+0x174/0x1b0
 trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
 xprt_prepare_transmit+0x8e/0xc1 [sunrpc]
 call_transmit+0x4d/0xc6 [sunrpc]

Fixes: 9ce07ae5eb ("SUNRPC: Replace dprintk() call site in xprt_prepare_transmit")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14 09:36:30 -04:00
Chuck Lever e936a5970e SUNRPC: Add tracepoint that fires when an RPC is retransmitted
A separate tracepoint can be left enabled all the time to capture
rare but important retransmission events. So for example:

kworker/u26:3-568   [009]   156.967933: xprt_retransmit:      task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2

Or, for example, enable all nfs and nfs4 tracepoints, and set up a
trigger to disable tracing when xprt_retransmit fires to capture
everything that leads up to it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14 09:36:29 -04:00
Chuck Lever 7638e0bfae SUNRPC: Move fault injection call sites
I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some provides, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.

The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.

Fixes: 4a06825839 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-14 09:36:29 -04:00
Eric Dumazet 38ec4944b5 gro: ensure frag0 meets IP header alignment
After commit 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Guenter Roeck reported one failure in his tests using sh architecture.

After much debugging, we have been able to spot silent unaligned accesses
in inet_gro_receive()

The issue at hand is that upper networking stacks assume their header
is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
bytes before the Ethernet header to make that happen.

This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
if the fragment is not properly aligned.

Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN
as 0, this extra check will be a NOP for them.

Note that if frag0 is not used, GRO will call pskb_may_pull()
as many times as needed to pull network and transport headers.

Fixes: 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Fixes: 78a478d0ef ("gro: Inline skb_gro_header and cache frag0 virtual address")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 15:09:31 -07:00
Or Cohen b166a20b07 net/sctp: fix race condition in sctp_destroy_sock
If sctp_destroy_sock is called without sock_net(sk)->sctp.addr_wq_lock
held and sp->do_auto_asconf is true, then an element is removed
from the auto_asconf_splist without any proper locking.

This can happen in the following functions:
1. In sctp_accept, if sctp_sock_migrate fails.
2. In inet_create or inet6_create, if there is a bpf program
   attached to BPF_CGROUP_INET_SOCK_CREATE which denies
   creation of the sctp socket.

The bug is fixed by acquiring addr_wq_lock in sctp_destroy_sock
instead of sctp_close.

This addresses CVE-2021-23133.

Reported-by: Or Cohen <orcohen@paloaltonetworks.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Fixes: 6102365876 ("bpf: Add new cgroup attach type to enable sock modifications")
Signed-off-by: Or Cohen <orcohen@paloaltonetworks.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:59:46 -07:00
Jonathon Reinhart 97684f0970 net: Make tcp_allowed_congestion_control readonly in non-init netns
Currently, tcp_allowed_congestion_control is global and writable;
writing to it in any net namespace will leak into all other net
namespaces.

tcp_available_congestion_control and tcp_allowed_congestion_control are
the only sysctls in ipv4_net_table (the per-netns sysctl table) with a
NULL data pointer; their handlers (proc_tcp_available_congestion_control
and proc_allowed_congestion_control) have no other way of referencing a
struct net. Thus, they operate globally.

Because ipv4_net_table does not use designated initializers, there is no
easy way to fix up this one "bad" table entry. However, the data pointer
updating logic shouldn't be applied to NULL pointers anyway, so we
instead force these entries to be read-only.

These sysctls used to exist in ipv4_table (init-net only), but they were
moved to the per-net ipv4_net_table, presumably without realizing that
tcp_allowed_congestion_control was writable and thus introduced a leak.

Because the intent of that commit was only to know (i.e. read) "which
congestion algorithms are available or allowed", this read-only solution
should be sufficient.

The logic added in recent commit
31c4d2f160eb: ("net: Ensure net namespace isolation of sysctls")
does not and cannot check for NULL data pointers, because
other table entries (e.g. /proc/sys/net/netfilter/nf_log/) have
.data=NULL but use other methods (.extra2) to access the struct net.

Fixes: 9cb8e048e5 ("net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:42:51 -07:00
Andreas Roeseler 314332023b icmp: ICMPV6: pass RFC 8335 reply messages to ping_rcv
The current icmp_rcv function drops all unknown ICMP types, including
ICMP_EXT_ECHOREPLY (type 43). In order to parse Extended Echo Reply messages, we have
to pass these packets to the ping_rcv function, which does not do any
other filtering and passes the packet to the designated socket.

Pass incoming RFC 8335 ICMP Extended Echo Reply packets to the ping_rcv
handler instead of discarding the packet.

Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:38:01 -07:00
Michael Walle 83216e3988 of: net: pass the dst buffer to of_get_mac_address()
of_get_mac_address() returns a "const void*" pointer to a MAC address.
Lately, support to fetch the MAC address by an NVMEM provider was added.
But this will only work with platform devices. It will not work with
PCI devices (e.g. of an integrated root complex) and esp. not with DSA
ports.

There is an of_* variant of the nvmem binding which works without
devices. The returned data of a nvmem_cell_read() has to be freed after
use. On the other hand the return of_get_mac_address() points to some
static data without a lifetime. The trick for now, was to allocate a
device resource managed buffer which is then returned. This will only
work if we have an actual device.

Change it, so that the caller of of_get_mac_address() has to supply a
buffer where the MAC address is written to. Unfortunately, this will
touch all drivers which use the of_get_mac_address().

Usually the code looks like:

  const char *addr;
  addr = of_get_mac_address(np);
  if (!IS_ERR(addr))
    ether_addr_copy(ndev->dev_addr, addr);

This can then be simply rewritten as:

  of_get_mac_address(np, ndev->dev_addr);

Sometimes is_valid_ether_addr() is used to test the MAC address.
of_get_mac_address() already makes sure, it just returns a valid MAC
address. Thus we can just test its return code. But we have to be
careful if there are still other sources for the MAC address before the
of_get_mac_address(). In this case we have to keep the
is_valid_ether_addr() call.

The following coccinelle patch was used to convert common cases to the
new style. Afterwards, I've manually gone over the drivers and fixed the
return code variable: either used a new one or if one was already
available use that. Mansour Moufid, thanks for that coccinelle patch!

<spml>
@a@
identifier x;
expression y, z;
@@
- x = of_get_mac_address(y);
+ x = of_get_mac_address(y, z);
  <...
- ether_addr_copy(z, x);
  ...>

@@
identifier a.x;
@@
- if (<+... x ...+>) {}

@@
identifier a.x;
@@
  if (<+... x ...+>) {
      ...
  }
- else {}

@@
identifier a.x;
expression e;
@@
- if (<+... x ...+>@e)
-     {}
- else
+ if (!(e))
      {...}

@@
expression x, y, z;
@@
- x = of_get_mac_address(y, z);
+ of_get_mac_address(y, z);
  ... when != x
</spml>

All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
compile-time tested.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:35:02 -07:00
Hristo Venev 941ea91e87 net: ip6_tunnel: Unregister catch-all devices
Similarly to the sit case, we need to remove the tunnels with no
addresses that have been moved to another network namespace.

Fixes: 0bd8762824 ("ip6tnl: add x-netns support")
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:31:52 -07:00
Hristo Venev 610f8c0fc8 net: sit: Unregister catch-all devices
A sit interface created without a local or a remote address is linked
into the `sit_net::tunnels_wc` list of its original namespace. When
deleting a network namespace, delete the devices that have been moved.

The following script triggers a null pointer dereference if devices
linked in a deleted `sit_net` remain:

    for i in `seq 1 30`; do
        ip netns add ns-test
        ip netns exec ns-test ip link add dev veth0 type veth peer veth1
        ip netns exec ns-test ip link add dev sit$i type sit dev veth0
        ip netns exec ns-test ip link set dev sit$i netns $$
        ip netns del ns-test
    done
    for i in `seq 1 30`; do
        ip link del dev sit$i
    done

Fixes: 5e6700b3bf ("sit: add support of x-netns")
Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 14:31:52 -07:00
Roi Dayan 78ed0a9bc6 netfilter: flowtable: Add FLOW_OFFLOAD_XMIT_UNSPEC xmit type
It could be xmit type was not set and would default to FLOW_OFFLOAD_XMIT_NEIGH
and in this type the gc expect to have a route info.
Fix that by adding FLOW_OFFLOAD_XMIT_UNSPEC which defaults to 0.

Fixes: 8b9229d158 ("netfilter: flowtable: dst_check() from garbage collector path")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:47:00 +02:00
Florian Westphal 9b1a4d0f91 netfilter: conntrack: convert sysctls to u8
log_invalid sysctl allows values of 0 to 255 inclusive so we no longer
need a range check: the min/max values can be removed.

This also removes all member variables that were moved to net_generic
data in previous patches.

This reduces size of netns_ct struct by one cache line.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal c53bd0e966 netfilter: conntrack: move ct counter to net_generic data
Its only needed from slowpath (sysctl, ctnetlink, gc worker) and
when a new conntrack object is allocated.

Furthermore, each write dirties the otherwise read-mostly pernet
data in struct net.ct, which are accessed from packet path.

Move it to the net_generic data.  This makes struct netns_ct
read-mostly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal f6f2e580d5 netfilter: conntrack: move expect counter to net_generic data
Creation of a new conntrack entry isn't a frequent operation (compared
to 'ct entry already exists').  Creation of a new entry that is also an
expected (related) connection even less so.

Place this counter in net_generic data.

A followup patch will also move the conntrack count -- this will make
netns_ct a read-mostly structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00
Florian Westphal 67f28216ca netfilter: conntrack: move autoassign_helper sysctl to net_generic data
While at it, make it an u8, no need to use an integer for a boolean.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-13 13:10:39 +02:00