linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-30 14:19:16 +00:00

Author	SHA1	Message	Date
Shalom Toledo	552ec3d9d2	selftests: devlink_lib: Check devlink info command is supported Sanity check for devlink info command. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:15 -08:00
Shalom Toledo	6697b51ed3	selftests: mlxsw: Add shared buffer configuration test Test physical ports' shared buffer configuration options using random values related to a specific configuration option. There are 3 configuration options: pool, TC bind and portpool. Each sub-test, test a different configuration option and random the related values as the follow: * For pools, pool's size will be randomized. * For TC bind, pool number and threshold will be randomized. * For portpools, threshold will be randomized. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Danielle Ratson	1cbe65e09b	selftests: mlxsw: Use busywait helper in rtnetlink test Rtnetlink test uses offload indication checks. Use a busywait helper and wait until the offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Danielle Ratson	05ef614c55	selftests: mlxsw: Use busywait helper in vxlan test Vxlan test uses offload indication checks. Use a busywait helper and wait until the offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Danielle Ratson	0c22f993c9	selftests: mlxsw: Use busywait helper in blackhole routes test Blackhole routes test uses offload indication checks. Use busywait helper and wait until the routes offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Ido Schimmel	5d66773f41	selftests: devlink_trap_l3_drops: Avoid race condition The test checks that packets are trapped when they should egress a router interface (RIF) that has become disabled. This is a temporary state in a RIF's deletion sequence. Currently, the test deletes the RIF by flushing all the IP addresses configured on the associated netdev (br0). However, this is racy, as this also flushes all the routes pointing to the netdev and if the routes are deleted from the device before the RIF is disabled, then no packets will try to egress the disabled RIF and the trap will not be triggered. Instead, trigger the deletion of the RIF by unlinking the mlxsw port from the bridge that is backing the RIF. Unlike before, this will not cause the kernel to delete the routes pointing to the bridge. Note that due to current mlxsw locking scheme the RIF is always deleted first, but this is going to change. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Jiri Pirko	ab2b8ab253	selftests: add a mirror test to mlxsw tc flower restrictions Include test of forbidding to have multiple mirror actions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Jiri Pirko	c84e903f62	selftests: add egress redirect test to mlxsw tc flower restrictions Include test of forbidding to have redirect rule on egress-bound block. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Petr Machata	3de611b507	selftests: mlxsw: Add a RED selftest This tests that below the queue minimum length, there is no dropping / marking, and above max, everything is dropped / marked. The test is structured as a core file with topology and test code, and three wrappers: one for RED used as a root Qdisc, and two for testing (W)RED under PRIO and ETS. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
Petr Machata	4113b04823	selftests: forwarding: lib.sh: Add start_tcp_traffic Extract a helper __start_traffic() configurable by protocol type. Allow passing through extra mausezahn arguments. Add a wrapper, start_tcp_traffic(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:10:14 -08:00
David S. Miller	f4979b41f3	Merge branch 'hinic-BugFixes' Luo bin says: ==================== hinic: BugFixes the bug fixed in patch #2 has been present since the first commit. the bugs fixed in patch #1 and patch #3 have been present since the following commits: patch #1: `352f58b0d9` ("net-next/hinic: Set Rxq irq to specific cpu for NUMA") patch #3: `421e952628` ("hinic: add rss support") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:08:01 -08:00
Luo bin	386d4716fd	hinic: fix a bug of rss configuration should use real receive queue number to configure hw rss indirect table rather than maximal queue number Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:08:01 -08:00
Luo bin	d2ed69ce9e	hinic: fix a bug of setting hw_ioctxt a reserved field is used to signify prime physical function index in the latest firmware version, so we must assign a value to it correctly Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:08:01 -08:00
Luo bin	0bff777bd0	hinic: fix a irq affinity bug can not use a local variable as an input parameter of irq_set_affinity_hint Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-27 11:08:01 -08:00
Linus Torvalds	e46bfaba59	A pair of docs-build fixes. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl5U9RsACgkQF0NaE2wM fljetAgAkIzf4eaENcrU8LUOjR7p66a6IIBLA5WAlSc2CvmfHrI6fJgxUWHZKxZp 0NhzoNiYo0zPkrPpjabpWLwyiIAIDR3RA7B+dOuxCW0AieLuPV6ltg3ytXukeGvo ZOVXEVgorG6Kx4oNvCKqrUVPICU+SErSRmREVggLWi4iM1cQzUxcMIBJEw8SgHEV mDT5igcfSqWWxU8rF6zGY0zV+l+pN12gf7EfmKnJ6E0oCWNZ6VHbmyHaVxIS76C7 9aiNW5+ATeCPEYvXseAaOSCbJWOU0wlxrskpHSN2yBgGHTGipEEFZF5rAj8QbeYH 9UPqvmgKFLZqQ3aCqAK3jltm8ojyRA== =rH5d -----END PGP SIGNATURE----- Merge tag 'docs-5.6-fixes' of git://git.lwn.net/linux Pull documentation fixes from Jonathan Corbet: "A pair of docs-build fixes" * tag 'docs-5.6-fixes' of git://git.lwn.net/linux: docs: Fix empty parallelism argument docs: remove MPX from the x86 toc	2020-02-27 11:07:13 -08:00
Linus Torvalds	ed5fa55918	audit/stable-5.6 PR 20200226 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl5XHiYUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOn4w/9EWM7O3ZNr1BJb/p6Z1DxAHi1e11s UC5G6njvfihvddS1dhT00eSUX/Bk9WrCLNX5mzXbi5SD8iPP1bDk2L/1xFuR/Dji DHOruYsCqBI6t/rmEhwJiAaMWWQnGzHVfL+Fh7cOBhk2GNdfLUSs+BPXrRmchtH5 OlOogvpMCigK4pvYFN1AXNpvHnBYTvWrmC32KIjj9WcyGuY9Uz0TI+AOnKRJAp/Q YKV7vyw6ZpEt2/nT9lQHTWrMrsXLIybJLRF+BX9pVsz3bzi3u5FqROF1FntIOv1R OF5cI7ly9ixY2VMyxh/0k2X5aB/6BqyB2kmk58x0Wbqxcwi+rOqnMtm5chwpjQ+u bWN+mcB8XGe2Rj+/jjLjoiZReX6CnMuXZe+JNNYMJ+gV6x/Y9aCfg/G5K2ZvXdE2 Cc1mkhGACivLDlYxgpoELR5UC7KEQ2rlYenP6nYiNEsqyMCYPw6g0Rac4dvjppYv YYOp0xyZuK9U+aCGg6Tk2UoOgHRSsG7HUK3yn3Zy8ol5A5NtpOS+8wyyT1e3B/iX xjQfaGx6P66r9qhs6ikVMbOcx9ZZxJXERiA6pNjlGcX4HyZdOokuP946NSYJZ4TK EUA9f4KcKIyBrDMFne2r1O8fNLybkU+2/HkFdak8eMTVtJx0l9oPEAZV/6TOReBv Smf0PXFLTv/oRzc= =DyVs -----END PGP SIGNATURE----- Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: "Two fixes for problems found by syzbot: - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling. - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks" * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: always check the netlink payload length in audit_receive_msg() audit: fix error handling in audit_data_to_entry()	2020-02-27 11:01:22 -08:00
David S. Miller	2b99e54b30	Merge branch 'VLANs-DSA-switches-and-multiple-bridges' Russell King says: ==================== VLANs, DSA switches and multiple bridges This is a repost of the previously posted RFC back in December, which did not get fully reviewed. I've dropped the RFC tag this time as no one really found anything too problematical in the RFC posting. I've been trying to configure DSA for VLANs and not having much success. The setup is quite simple: - The main network is untagged - The wifi network is a vlan tagged with id $VN running over the main network. I have an Armada 388 Clearfog with a PCIe wifi card which I'm trying to setup to provide wifi access to the vlan $VN network, while the switch is also part of the main network. However, I'm encountering problems: 1) vlan support in DSA has a different behaviour from the Linux software bridge implementation. # bridge vlan port vlan ids lan1 1 PVID Egress Untagged ... shows the default setup - the bridge ports are all configured for vlan 1, untagged egress, and vlan 1 as the port vid. Issuing: # ip li set dev br0 type bridge vlan_filtering 1 with no other vlan configuration commands on a Linux software bridge continues to allow untagged traffic to flow across the bridge. This difference in behaviour is because the MV88E6xxx VTU is completely empty - because net/dsa ignores all vlan settings for a port if br_vlan_enabled(dp->bridge_dev) is false - this reflects the vlan filtering state of the bridge, not whether the bridge is vlan aware. What this means is that attempting to configure the bridge port vlans before enabling vlan filtering works for Linux software bridges, but fails for DSA bridges. 2) Assuming the above is sorted, we move on to the next issue, which is altogether more weird. Let's take a setup where we have a DSA bridge with lan1..6 in a bridge device, br0, with vlan filtering enabled. lan1 is the upstream port, lan2 is a downstream port that also wants to see traffic on vlan id $VN. Both lan1 and lan2 are configured for that: # bridge vlan add vid $VN dev lan1 # bridge vlan add vid $VN dev lan2 # ip li set br0 type bridge vlan_filtering 1 Untagged traffic can now pass between all the six lan ports, and vlan $VN between lan1 and lan2 only. The MV88E6xxx 8021q_mode debugfs file shows all lan ports are in mode "secure" - this is important! /sys/class/net/br0/bridge/vlan_filtering contains 1. tcpdumping from another machine on lan4 shows that no $VN traffic reaches it. Everything seems to be working correctly... In order to further bridge vlan $VN traffic to hostapd's wifi interface, things get a little more complex - we can't add hostapd's wifi interface to br0 directly, because hostapd will bring up the wifi interface and leak the main, untagged traffic onto the wifi. (hostapd does have vlan support, but only as a dynamic per-client thing, and there's no hooks I can see to allow script-based config of the network setup before hostapd up's the wifi interface.) So, what I tried was: # ip li add link br0 name br0.$VN type vlan id $VN # bridge vlan add vid $VN dev br0 self # ip li set dev br0.$VN up So far so good, we get a vlan interface on top of the bridge, and tcpdumping it shows we get traffic. The 8021q_mode file has not changed state. Everything still seems to be correct. # bridge addbr br1 Still nothing has changed. # bridge addif br1 br0.$VN And now the 8021q_mode debugfs file shows that all ports are now in "disabled" mode, but /sys/class/net/br0/bridge/vlan_filtering still contains '1'. In other words, br0 still thinks vlan filtering is enabled, but the hardware has had vlan filtering disabled. Adding some stack traces to an appropriate point indicates that this is because __switchdev_handle_port_attr_set() recurses down through the tree of interfaces, skipping over the vlan interface, applying br1's configuration to br0's ports. This surely can not be right - surely __switchdev_handle_port_attr_set() and similar should stop recursing down through another master bridge device? There are probably other network device classes that switchdev shouldn't recurse down too. I've considered whether switchdev is the right level to do it, and I think it is - as we want the check/set callbacks to be called for the top level device even if it is a master bridge device, but we don't want to recurse through a lower master bridge device. v2: dropped patch 3, since that has an outstanding issue, and my question on it has not been answered. Otherwise, these are the same patches. Maybe we can move forward with just these two? v3: include DSA ports in patch 2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:58:33 -08:00
Russell King	933b442508	net: dsa: mv88e6xxx: fix duplicate vlan warning When setting VLANs on DSA switches, the VLAN is added to both the port concerned as well as the CPU port by dsa_slave_vlan_add(), as well as any DSA ports. If multiple ports are configured with the same VLAN ID, this triggers a warning on the CPU and DSA ports. Avoid this warning for CPU and DSA ports. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:58:33 -08:00
Russell King	07c6f9805f	net: switchdev: do not propagate bridge updates across bridges When configuring a tree of independent bridges, propagating changes from the upper bridge across a bridge master to the lower bridge ports brings surprises. For example, a lower bridge may have vlan filtering enabled. It may have a vlan interface attached to the bridge master, which may then be incorporated into another bridge. As soon as the lower bridge vlan interface is attached to the upper bridge, the lower bridge has vlan filtering disabled. This occurs because switchdev recursively applies its changes to all lower devices no matter what. Reviewed-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:58:33 -08:00
Karsten Graul	a2f2ef4a54	net/smc: check for valid ib_client_data In smc_ib_remove_dev() check if the provided ib device was actually initialized for SMC before. Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com Fixes: `a4cf0443c4` ("smc: introduce SMC as an IB-client") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:56:25 -08:00
Aaro Koskinen	474a31e13a	net: stmmac: fix notifier registration We cannot register the same netdev notifier multiple times when probing stmmac devices. Register the notifier only once in module init, and also make debugfs creation/deletion safe against simultaneous notifier call. Fixes: `481a7d154c` ("stmmac: debugfs entry name is not be changed when udev rename device name.") Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:55:14 -08:00
Antoine Tenart	c87a9d6fc6	net: phy: mscc: fix firmware paths The firmware paths for the VSC8584 PHYs not not contain the leading 'microchip/' directory, as used in linux-firmware, resulting in an error when probing the driver. This patch fixes it. Fixes: `a5afc16780` ("net: phy: mscc: add support for VSC8584 PHY") Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:53:09 -08:00
Dan Carpenter	9baeea5071	net: qrtr: Fix error pointer vs NULL bugs The callers only expect NULL pointers, so returning an error pointer will lead to an Oops. Fixes: `0c2204a4ad` ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:52:31 -08:00
Antoine Tenart	1ac7b090ec	net: phy: mscc: add missing shift for media operation mode selection This patch adds a missing shift for the media operation mode selection. This does not fix the driver as the current operation mode (copper) has a value of 0, but this wouldn't work for other modes. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:51:42 -08:00
Paolo Abeni	dc24f8b4ec	mptcp: add dummy icsk_sync_mss() syzbot noted that the master MPTCP socket lacks the icsk_sync_mss callback, and was able to trigger a null pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0 Oops: 0010 [#1] PREEMPT SMP KASAN CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246 RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600 RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140 RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0 R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba FS: 0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888 netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989 smack_netlabel security/smack/smack_lsm.c:2425 [inline] smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716 security_inode_setsecurity+0xb2/0x140 security/security.c:1364 __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197 vfs_setxattr fs/xattr.c:224 [inline] setxattr+0x335/0x430 fs/xattr.c:451 __do_sys_fsetxattr fs/xattr.c:506 [inline] __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495 __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x440199 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199 RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8 R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20 R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: CR2: 0000000000000000 Address the issue adding a dummy icsk_sync_mss callback. To properly sync the subflows mss and options list we need some additional infrastructure, which will land to net-next. Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com Fixes: `2303f994b3` ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:49:50 -08:00
Arthur Kiyanovski	92040c6daa	net: ena: fix broken interface between ENA driver and FW In this commit we revert the part of commit `1a63443afd` ("net/amazon: Ensure that driver version is aligned to the linux kernel"), which breaks the interface between the ENA driver and FW. We also replace the use of DRIVER_VERSION with DRIVER_GENERATION when we bring back the deleted constants that are used in interface with ENA device FW. This commit does not change the driver version reported to the user via ethtool, which remains the kernel version. Fixes: `1a63443afd` ("net/amazon: Ensure that driver version is aligned to the linux kernel") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:47:58 -08:00
David S. Miller	621135a0f9	Merge branch 'mptcp-update-mptcp-ack-sequence-outside-of-recv-path' Florian Westphal says: ==================== mptcp: update mptcp ack sequence outside of recv path This series moves mptcp-level ack sequence update outside of the recvmsg path. Current approach has two problems: 1. There is delay between arrival of new data and the time we can ack this data. 2. If userspace doesn't call recv for some time, mptcp ack_seq is not updated at all, even if this data is queued in the subflow socket receive queue. Move skbs from the subflow socket receive queue to the mptcp-level receive queue, updating the mptcp-level ack sequence and have recv take skbs from the mptcp-level receive queue. The first place where we will attempt to update the mptcp level acks is from the subflows' data_ready callback, even before we make userspace aware of new data. Because of possible deadlock (we need to take the mptcp socket lock while already holding the subflow sockets lock), we may still need to defer the mptcp-level ack update. In such case, this work will be either done from work queue or recv path, depending on which runs sooner. In order to avoid pointless scheduling of the work queue, work will be queued from the mptcp sockets lock release callback. This allows to detect when the socket owner did drain the subflow socket receive queue. Please see individual patches for more information. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Paolo Abeni	14c441b564	mptcp: defer work schedule until mptcp lock is released Don't schedule the work queue right away, instead defer this to the lock release callback. This has the advantage that it will give recv path a chance to complete -- this might have moved all pending packets from the subflow to the mptcp receive queue, which allows to avoid the schedule_work(). Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Florian Westphal	2e52213c79	mptcp: avoid work queue scheduling if possible We can't lock_sock() the mptcp socket from the subflow data_ready callback, it would result in ABBA deadlock with the subflow socket lock. We can however grab the spinlock: if that succeeds and the mptcp socket is not owned at the moment, we can process the new skbs right away without deferring this to the work queue. This avoids the schedule_work and hence the small delay until the work item is processed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Florian Westphal	bfae9dae44	mptcp: remove mptcp_read_actor Only used to discard stale data from the subflow, so move it where needed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Florian Westphal	600911ff5f	mptcp: add rmem queue accounting If userspace never drains the receive buffers we must stop draining the subflow socket(s) at some point. This adds the needed rmem accouting for this. If the threshold is reached, we stop draining the subflows. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Florian Westphal	6771bfd9ee	mptcp: update mptcp ack sequence from work queue If userspace is not reading data, all the mptcp-level acks contain the ack_seq from the last time userspace read data rather than the most recent in-sequence value. This causes pointless retransmissions for data that is already queued. The reason for this is that all the mptcp protocol level processing happens at mptcp_recv time. This adds work queue to move skbs from the subflow sockets receive queue on the mptcp socket receive queue (which was not used so far). This allows us to announce the correct mptcp ack sequence in a timely fashion, even when the application does not call recv() on the mptcp socket for some time. We still wake userspace tasks waiting for POLLIN immediately: If the mptcp level receive queue is empty (because the work queue is still pending) it can be filled from in-sequence subflow sockets at recv time without a need to wait for the worker. The skb_orphan when moving skbs from subflow to mptcp level is needed, because the destructor (sock_rfree) relies on skb->sk (ssk!) lock being taken. A followup patch will add needed rmem accouting for the moved skbs. Other problem: In case application behaves as expected, and calls recv() as soon as mptcp socket becomes readable, the work queue will only waste cpu cycles. This will also be addressed in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Paolo Abeni	8099201715	mptcp: add work queue skeleton Will be extended with functionality in followup patches. Initial user is moving skbs from subflows receive queue to the mptcp-level receive queue. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
Florian Westphal	101f6f851e	mptcp: add and use mptcp_data_ready helper allows us to schedule the work queue to drain the ssk receive queue in a followup patch. This is needed to avoid sending all-to-pessimistic mptcp-level acknowledgements. At this time, the ack_seq is what was last read by userspace instead of the highest in-sequence number queued for reading. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:46:26 -08:00
David S. Miller	5cd129dd5e	Merge branch 'mlxsw-Small-driver-update' Jiri Pirko says: ==================== mlxsw: Small driver update This patchset contains couple of patches not related to each other. They are small optimization and extension changes to the driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:44:42 -08:00
Petr Machata	3b909c552a	mlxsw: spectrum: Add mlxsw_sp_span_ops.buffsize_get for Spectrum-3 The buffer factor on Spectrum-3 is larger than on Spectrum-2. Add a new callback and use it for mlxsw_sp->span_ops on Spectrum-3. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:44:42 -08:00
Ido Schimmel	b401ff8541	mlxsw: spectrum: Initialize advertised speeds to supported speeds During port initialization the driver instructs the device to only advertise speeds that can be supported by the port's current width. Since the device now returns the supported speeds based on the port's current width, the driver no longer needs to compute the speeds that can be advertised. Simplify port initialization by setting the advertised speeds to the queried supported speeds. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:44:42 -08:00
Petr Machata	8a29581eb0	mlxsw: spectrum: Move the ECN-marked packet counter to ethtool Spectrum-1 and Spectrum-2 do not have a per-TC counter of number of packets marked by ECN. The value reported currently is the total number of marked packets. Showing this value at individual TC Qdiscs is misleading. Move the counter to ethtool instead. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:44:42 -08:00
Jiri Pirko	648e53cac7	mlxsw: spectrum_switchdev: Optimize SFN records processing Currently, only one SFN query is done from repetitive work at a time, processing 64 entries. Another work iteration is scheduled in 100ms, that means that the max rate of learned FDB entries is limited to 6400/s. That is slow. Fix this by doing 2 optimizations: 1) Run 10 SFN queries at a time. 2) In case the SFN is not drained, schedule work with 0 delay to allow to continue processing rest of the records. On a testing setup with 500K entries the time to process decreased from 870secs to 10secs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Tested-by: Alex Kushnarov <alexanderk@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:44:42 -08:00
Sudheesh Mavila	4f31c532ad	net: phy: corrected the return value for genphy_check_and_restart_aneg and genphy_c45_check_and_restart_aneg When auto-negotiation is not required, return value should be zero. Changes v1->v2: - improved comments and code as Andrew Lunn and Heiner Kallweit suggestion - fixed issue in genphy_c45_check_and_restart_aneg as Russell King suggestion. Fixes: `2a10ab043a` ("net: phy: add genphy_check_and_restart_aneg()") Fixes: `1af9f16840` ("net: phy: add genphy_c45_check_and_restart_aneg()") Signed-off-by: Sudheesh Mavila <sudheesh.mavila@amd.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:41:42 -08:00
Randy Dunlap	c535f92032	af_llc: fix if-statement empty body warning When debugging via dprintk() is not enabled, make the dprintk() macro be an empty do-while loop, as is done in <linux/sunrpc/debug.h>. This fixes a gcc warning when -Wextra is set: ../net/llc/af_llc.c:974:51: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] I have verified that there is not object code change (with gcc 7.5.0). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:38:13 -08:00
yangerkun	f596c87005	slip: not call free_netdev before rtnl_unlock in slip_open As the description before netdev_run_todo, we cannot call free_netdev before rtnl_unlock, fix it by reorder the code. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:37:06 -08:00
David S. Miller	165b94ffcf	mlx5-updates-2020-02-25 The following series provides some misc updates to mlx5 driver: 1) From Maxim, Refactoring for mlx5e netdev channels recreation flow. - Add error handling - Add context to the preactivate hook - Use preactivate hook with context where it can be used and subsequently unify channel recreation flow everywhere. - Fix XPS cpumask to not reset upon channel recreation. 2) From Tariq: - Use indirect calls wrapper on RX. - Check LRO capability bit 3) Multiple small cleanups -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl5VxJAACgkQSD+KveBX +j5q6gf+NPVz9uaWlegeAD0J/AkrycIOlceaBhwQ7zDJ1u2qy8Aan6QuprFLKwkW hHDfyqxMXdsHnQJAfOiUU1HGgK2v092BWoArdHMJwift333TQkRzJNXCM9WBKAOU PpyOgvRESLWOjcgOcm3AYv/FKjiW5akm7rlW+jYy3JswjY4a7rqJvVKUnXXPP9B1 5oiis1OIQVz9y2GTEOYwBx3euZ87TCqzsgF+qNkKtFil74gOnVHNx4kqFTLE3b0Z awq+dvKOUlCIobAX69BmKDj6ieAKNO7l0mfMYnmbVk8nNt33F0z94MBIw2ztCcPq AN9hDZdQitnQFH9okdVCf4rKxaQoxA== =lixf -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2020-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-02-25 The following series provides some misc updates to mlx5 driver: 1) From Maxim, Refactoring for mlx5e netdev channels recreation flow. - Add error handling - Add context to the preactivate hook - Use preactivate hook with context where it can be used and subsequently unify channel recreation flow everywhere. - Fix XPS cpumask to not reset upon channel recreation. 2) From Tariq: - Use indirect calls wrapper on RX. - Check LRO capability bit 3) Multiple small cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:30:43 -08:00
David S. Miller	06baf4be20	Merge branch 'net-smc-improve-peer-ID-in-CLC-decline' Hans Wippel says: ==================== net/smc: improve peer ID in CLC decline The following two patches improve the peer ID in CLC decline messages if RoCE devices are present in the host but no suitable device is found for a connection. The first patch reworks the peer ID initialization. The second patch contains the actual changes of the CLC decline messages. Changes v1 -> v2: * make smc_ib_is_valid_local_systemid() static in first patch * changed if in smc_clc_send_decline() to remove curly braces Changes RFC -> v1: * split the patch into two parts * removed zero assignment to global variable (thanks Leon) Thanks to Leon Romanovsky and Karsten Graul for the feedback! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:27:06 -08:00
Hans Wippel	a082ec897f	net/smc: improve peer ID in CLC decline for SMC-R According to RFC 7609, all CLC messages contain a peer ID that consists of a unique instance ID and the MAC address of one of the host's RoCE devices. But if a SMC-R connection cannot be established, e.g., because no matching pnet table entry is found, the current implementation uses a zero value in the CLC decline message although the host's peer ID is set to a proper value. If no RoCE and no ISM device is usable for a connection, there is no LGR and the LGR check in smc_clc_send_decline() prevents that the peer ID is copied into the CLC decline message for both SMC-D and SMC-R. So, this patch modifies the check to also accept the case of no LGR. Also, only a valid peer ID is copied into the decline message. Signed-off-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:27:06 -08:00
Hans Wippel	366bb249b5	net/smc: rework peer ID handling This patch initializes the peer ID to a random instance ID and a zero MAC address. If a RoCE device is in the host, the MAC address part of the peer ID is overwritten with the respective address. Also, a function for checking if the peer ID is valid is added. A peer ID is considered valid if the MAC address part contains a non-zero MAC address. Signed-off-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:27:06 -08:00
Arjun Roy	0b7f41f687	tcp-zerocopy: Update returned getsockopt() optlen. TCP receive zerocopy currently does not update the returned optlen for getsockopt() if the user passed in a larger than expected value. Thus, userspace cannot properly determine if all the fields are set in the passed-in struct. This patch sets the optlen for this case before returning, in keeping with the expected operation of getsockopt(). Fixes: `c8856c0514` ("tcp-zerocopy: Return inq along with tcp receive zerocopy.") Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:24:22 -08:00
Eric Dumazet	b6f6118901	ipv6: restrict IPV6_ADDRFORM operation IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one. While this operation sounds illogical, we have to support it. One of the things it does for TCP socket is to switch sk->sk_prot to tcp_prot. We now have other layers playing with sk->sk_prot, so we should make sure to not interfere with them. This patch makes sure sk_prot is the default pointer for TCP IPv6 socket. syzbot reported : BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD a0113067 P4D a0113067 PUD a8771067 PMD 0 Oops: 0010 [#1] PREEMPT SMP KASAN CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246 RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40 RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5 R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098 R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000 FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427 __sock_release net/socket.c:605 [inline] sock_close+0xe1/0x260 net/socket.c:1283 __fput+0x2e4/0x740 fs/file_table.c:280 ____fput+0x15/0x20 fs/file_table.c:313 task_work_run+0x176/0x1b0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_usermode_loop arch/x86/entry/common.c:164 [inline] prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195 syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278 do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x45c429 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429 RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004 RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c Modules linked in: CR2: 0000000000000000 ---[ end trace 82567b5207e87bae ]--- RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246 RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40 RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5 R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098 R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000 FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:20:58 -08:00
Ursula Braun	51e3dfa890	net/smc: fix cleanup for linkgroup setup failures If an SMC connection to a certain peer is setup the first time, a new linkgroup is created. In case of setup failures, such a linkgroup is unusable and should disappear. As a first step the linkgroup is removed from the linkgroup list in smc_lgr_forget(). There are 2 problems: smc_listen_decline() might be called before linkgroup creation resulting in a crash due to calling smc_lgr_forget() with parameter NULL. If a setup failure occurs after linkgroup creation, the connection is never unregistered from the linkgroup, preventing linkgroup freeing. This patch introduces an enhanced smc_lgr_cleanup_early() function which * contains a linkgroup check for early smc_listen_decline() invocations * invokes smc_conn_free() to guarantee unregistering of the connection. * schedules fast linkgroup removal of the unusable linkgroup And the unused function smcd_conn_free() is removed from smc_core.h. Fixes: `3b2dec2603` ("net/smc: restructure client and server code in af_smc") Fixes: `2a0674fffb` ("net/smc: improve abnormal termination of link groups") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:18:07 -08:00
David S. Miller	ebb4a4bf76	Merge branch 'net-fix-sysfs-permssions-when-device-changes-network' Christian Brauner says: ==================== net: fix sysfs permssions when device changes network /* v7 / This is v7 with a build warning fixup that slipped past me the last time. It removes to unused variables in sysfs_group_change_owner(). I observed no warning when building just now. / v6 / This is v6 with two small fixups. I missed adapting the commit message to reflect the renamed helper for changing the owner of sysfs files and I also forgot to make the new dpm helper static inline. / v5 / This is v5 with a small fixup requested by Rafael. / v4 / This is v4 with more documentation and other fixes that Greg requested. / v3 */ This is v3 with explicit uid and gid parameters added to functions that change sysfs object ownership as Greg requested. (I've tagged this with net-next since it's triggered by a bug for network device files but it also touches driver core aspects so it's not clear-cut. I can of course split this series into separate patchsets.) We have been struggling with a bug surrounding the ownership of network device sysfs files when moving network devices between network namespaces owned by different user namespaces reported by multiple users. Currently, when moving network devices between network namespaces the ownership of the corresponding sysfs entries is not changed. This leads to problems when tools try to operate on the corresponding sysfs files. I also causes a bug when creating a network device in a network namespaces owned by a user namespace and moving that network device back to the host network namespaces. Because when a network device is created in a network namespaces it will be owned by the root user of the user namespace and all its associated sysfs files will also be owned by the root user of the corresponding user namespace. If such a network device has to be moved back to the host network namespace the permissions will still be set to the root user of the owning user namespaces of the originating network namespace. This means unprivileged users can e.g. re-trigger uevents for such incorrectly owned devices on the host or in other network namespaces. They can also modify the settings of the device itself through sysfs when they wouldn't be able to do the same through netlink. Both of these things are unwanted. For example, quite a few workloads will create network devices in the host network namespace. Other tools will then proceed to move such devices between network namespaces owner by other user namespaces. While the ownership of the device itself is updated in net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs entry for the device is not. Below you'll find that moving a network device (here a veth device) from a network namespace into another network namespaces owned by a different user namespace with a different id mapping. As you can see the permissions are wrong even though it is owned by the userns root user after it has been moved and can be interacted with through netlink: drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent Constrast this with creating a device of the same type in the network namespace directly. In this case the device's sysfs permissions will be correctly updated. (Please also note, that in a lot of workloads this strategy of creating the network device directly in the network device to workaround this issue can not be used. Either because the network device is dedicated after it has been created or because it used by a process that is heavily sandboxed and couldn't create network devices itself.): drwxr-xr-x 5 root root 0 Jan 25 18:12 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len -r--r--r-- 1 root root 4096 Jan 25 18:12 address -r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port -r--r--r-- 1 root root 4096 Jan 25 18:12 dormant -r--r--r-- 1 root root 4096 Jan 25 18:12 duplex -rw-r--r-- 1 root root 4096 Jan 25 18:12 flags -rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias -r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex -r--r--r-- 1 root root 4096 Jan 25 18:12 iflink -r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu -r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 root root 4096 Jan 25 18:12 operstate -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 root root 0 Jan 25 18:12 power -rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 root root 0 Jan 25 18:12 queues -r--r--r-- 1 root root 4096 Jan 25 18:12 speed drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 root root 4096 Jan 25 18:12 type -rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent Now, when creating a network device in a network namespace owned by a user namespace and moving it to the host the permissions will be set to the id that the user namespace root user has been mapped to on the host leading to all sorts of permission issues mentioned above: 458752 drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 . drwxr-xr-x 9 root root 0 Jan 25 18:08 .. -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent Fix this by changing the basic sysfs files associated with network devices when moving them between network namespaces. To this end we add some infrastructure to sysfs. The patchset takes care to only do this when the owning user namespaces changes and the kids differ. So there's only a performance overhead, when the owning user namespace of the network namespace is different __and__ the kid mappings for the root user are different for the two user namespaces: Assume we have a netdev eth0 which we create in netns1 owned by userns1. userns1 has an id mapping of 0 100000 100000. Now we move eth0 into netns2 which is owned by userns2 which also defines an id mapping of 0 100000 100000. In this case sysfs doesn't need updating. The patch will handle this case and not do any needless work. Now assume eth0 is moved into netns3 which is owned by userns3 which defines an id mapping of 0 123456 65536. In this case the root user in each namespace corresponds to different kid and sysfs needs updating. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-02-26 20:07:26 -08:00

... 3 4 5 6 7 ...

902069 commits