linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-21 18:11:39 +00:00

Author	SHA1	Message	Date
Geliang Tang	61c131f5d4	selftests: mptcp: add mptcp_lib_get_counter To avoid duplicated code in different MPTCP selftests, we can add and use helpers defined in mptcp_lib.sh. The helper get_counter() in mptcp_join.sh and get_mib_counter() in mptcp_connect.sh have the same functionality, export get_counter() into mptcp_lib.sh and rename it as mptcp_lib_get_counter(). Use this new helper instead of get_counter() and get_mib_counter(). Use this helper in test_prio() in userspace_pm.sh too instead of open-coding. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-11-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:18 -08:00
Geliang Tang	b850f2c7dd	selftests: mptcp: add mptcp_lib_is_v6 To avoid duplicated code in different MPTCP selftests, we can add and use helpers defined in mptcp_lib.sh. is_v6() helper is defined in mptcp_connect.sh, mptcp_join.sh and mptcp_sockopt.sh, so export it into mptcp_lib.sh and rename it as mptcp_lib_is_v6(). Use this new helper in all scripts. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-10-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:18 -08:00
Geliang Tang	bdbef0a6ff	selftests: mptcp: add mptcp_lib_kill_wait To avoid duplicated code in different MPTCP selftests, we can add and use helpers defined in mptcp_lib.sh. Export kill_wait() helper in userspace_pm.sh into mptcp_lib.sh and rename it as mptcp_lib_kill_wait(). It can be used to instead of kill_wait() in mptcp_join.sh. Use the new helper in both scripts. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-9-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:18 -08:00
Geliang Tang	b9fb176081	selftests: mptcp: userspace pm send RM_ADDR for ID 0 This patch adds a selftest for userspace PM to remove id 0 address. Use userspace_pm_add_addr() helper to add an id 10 address, then use userspace_pm_rm_addr() helper to remove id 0 address. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-8-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:18 -08:00
Geliang Tang	e3b47e460b	selftests: mptcp: userspace pm remove initial subflow This patch adds a selftest for userspace PM to remove the initial subflow. Use userspace_pm_add_sf() to add a subflow, and pass initial IP address to userspace_pm_rm_sf() to remove the initial subflow. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-7-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:18 -08:00
Geliang Tang	b3ac570aae	mptcp: userspace pm rename remove_err to out The value of 'err' will not be only '-EINVAL', but can be '0' in some cases. So it's better to rename the label 'remove_err' to 'out' to avoid confusions. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-6-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Geliang Tang	b2e2248f36	selftests: mptcp: userspace pm create id 0 subflow This patch adds a selftest to create id 0 subflow. Pass id 0 to the helper userspace_pm_add_sf() to create id 0 subflow. chk_mptcp_info shows one subflow but chk_subflows_total shows two subflows in each namespace. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-5-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Geliang Tang	757c828ce9	selftests: mptcp: update userspace pm test helpers This patch adds a new argument namespace to userspace_pm_add_addr() and userspace_pm_add_sf() to make these two helper more versatile. Add two more versatile helpers for userspace pm remove subflow or address: userspace_pm_rm_addr() and userspace_pm_rm_sf(). The original test helpers userspace_pm_rm_sf_addr_ns1() and userspace_pm_rm_sf_addr_ns2() can be replaced by these new helpers. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-4-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Geliang Tang	8077541288	selftests: mptcp: add chk_subflows_total helper This patch adds a new helper chk_subflows_total(), in it use the newly added counter mptcpi_subflows_total to get the "correct" amount of subflows, including the initial one. To be compatible with old 'ss' or kernel versions not supporting this counter, get the total subflows by listing TCP connections that are MPTCP subflows: ss -ti state state established state syn-sent state syn-recv \| grep -c tcp-ulp-mptcp. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-3-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Geliang Tang	06848c0f34	selftests: mptcp: add evts_get_info helper This patch adds a new helper get_info_value(), using 'sed' command to parse the value of the given item name in the line with the given keyword, to make chk_mptcp_info() and pedit_action_pkts() more readable. Also add another helper evts_get_info() to use get_info_value() to parse the output of 'pm_nl_ctl events' command, to make all the userspace pm selftests more readable, both in mptcp_join.sh and userspace_pm.sh. Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-2-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Geliang Tang	6ebf6f90ab	mptcp: add mptcpi_subflows_total counter If the initial subflow has been removed, we cannot know without checking other counters, e.g. ss -ti <filter> \| grep -c tcp-ulp-mptcp or getsockopt(SOL_MPTCP, MPTCP_FULL_INFO, ...) (or others except MPTCP_INFO of course) and then check mptcp_subflow_data->num_subflows to get the total amount of subflows. This patch adds a new counter mptcpi_subflows_total in mptcpi_flags to store the total amount of subflows, including the initial one. A new helper __mptcp_has_initial_subflow() is added to check whether the initial subflow has been removed or not. With this helper, we can then compute the total amount of subflows from mptcp_info by doing something like: mptcpi_subflows_total = mptcpi_subflows + __mptcp_has_initial_subflow(msk). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/428 Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231128-send-net-next-2023107-v4-1-8d6b94150f6b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:06:17 -08:00
Jakub Kicinski	3d6d754904	Merge branch 'mlxsw-support-cff-flood-mode' Petr Machata says: ==================== mlxsw: Support CFF flood mode The registers to configure to initialize a flood table differ between the controlled and CFF flood modes. In therefore needs to be an op. Add it, hook up the current init to the existing families, and invoke the op. PGT is an in-HW table that maps addresses to sets of ports. Then when some HW process needs a set of ports as an argument, instead of embedding the actual set in the dynamic configuration, what gets configured is the address referencing the set. The HW then works with the appropriate PGT entry. Among other allocations, the PGT currently contains two large blocks for bridge flooding: one for 802.1q and one for 802.1d. Within each of these blocks are three tables, for unknown-unicast, multicast and broadcast flooding: . . . \| 802.1q \| 802.1d \| . . . \| UC \| MC \| BC \| UC \| MC \| BC \| \______ _____/ \_____ ______/ v v FID flood vectors Thus each FID (which corresponds to an 802.1d bridge or one VLAN in an 802.1q bridge) uses three flood vectors spread across a fairly large region of PGT. This way of organizing the flood table (called "controlled") is not very flexible. E.g. to decrease a bridge scale and store more IP MC vectors, one would need to completely rewrite the bridge PGT blocks, or resort to hacks such as storing individual MC flood vectors into unused part of the bridge table. In order to address these shortcomings, Spectrum-2 and above support what is called CFF flood mode, for Compressed FID Flooding. In CFF flood mode, each FID has a little table of its own, with three entries adjacent to each other, one for unknown-UC, one for MC, one for BC. This allows for a much more fine-grained approach to PGT management, where bits of it are allocated on demand. . . . \| FID \| FID \| FID \| FID \| FID \| . . . \|U\|M\|B\|U\|M\|B\|U\|M\|B\|U\|M\|B\|U\|M\|B\| \_____________ _____________/ v FID flood vectors Besides the FID table organization, the CFF flood mode also impacts Router Subport (RSP) table. This table contains flood vectors for rFIDs, which are FIDs that reference front panel ports or LAGs. The RSP table contains two entries per front panel port and LAG, one for unknown-UC traffic, and one for everything else. Currently, the FW allocates and manages the table in its own part of PGT. rFIDs are marked with flood_rsp bit and managed specially. In CFF mode, rFIDs are managed as all other FIDs. The driver therefore has to allocate and maintain the flood vectors. Like with bridge FIDs, this is more work, but increases flexibility of the system. The FW currently supports both the controlled and CFF flood modes. To shed complexity, in the future it should only support CFF flood mode. Hence this patchset, which adds CFF flood mode support to mlxsw. Since mlxsw needs to maintain both the controlled mode as well as CFF mode support, we will keep the layout as compatible as possible. The bridge tables will stay in the same overall shape, just their inner organization will change from flood mode -> FID to FID -> flood mode. Likewise will RSP be kept as a contiguous block of PGT memory, as was the case when the FW maintained it. - The way FIDs get configured under the CFF flood mode differs from the currently used controlled mode. The simple approach of having several globally visible arrays for spectrum.c to statically choose from no longer works. Patch #1 thus privatizes all FID initialization and finalization logic, and exposes it as ops instead. - Patch #2 renames the ops that are specific to the controlled mode, to make room in the namespace for the CFF variants. Patch #3 extracts a helper to compute flood table base out of mlxsw_sp_fid_flood_table_mid(). - The op fid_setup configured fid_offset, i.e. the number of this FID within its family. For rFIDs in CFF mode, to determine this number, the driver will need to do fallible queries. Thus in patch #4, make the FID setup operation fallible as well. - Flood mode initialization routine differs between the controlled and CFF flood modes. The controlled mode needs to configure flood table layout, which the CFF mode does not need to do. In patch #5, move mlxsw_sp_fid_flood_table_init() up so that the following patch can make use of it. In patch #6, add an op to be invoked per table (if defined). - The current way of determining PGT allocation size depends on the number of FIDs and number of flood tables. RFIDs however have PGT footprint depending not on number of FIDs, but on number of ports and LAGs, because which ports an rFID should flood to does not depend on the FID itself, but on the port or LAG that it references. Therefore in patch #7, add FID family ops for determining PGT allocation size. - As elaborated above, layout of PGT will differ between controlled and CFF flood modes. In CFF mode, it will further differ between rFIDs and other FIDs (as described at previous patch). The way to pack the SFMR register to configure a FID will likewise differ from controlled to CFF. Thus in patches #8 and #9 add FID family ops to determine PGT base address for a FID and to pack SFMR. - Patches #10 and #11 add more bits for RSP support. In patch #10, add a new traffic type enumerator, for non-UC traffic. This is a combination of BC and MC traffic, but the way that mlxsw maps these mnemonic names to actual traffic type configurations requires that we have a new name to describe this class of traffic. Patch #11 then adds hooks necessary for RSP table maintenance. As ports come and go, and join and leave LAGs, it is necessary to update flood vectors that the rFIDs use. These new hooks will make that possible. - Patches #12, #13 and #14 introduce flood profiles. These have been implicit so far, but the way that CFF flood mode works with profile IDs requires that we make them explicit. Thus in patch #12, introduce flood profile objects as a set of flood tables that FID families then refer to. The FID code currently only uses a single flood profile. In patch #13, add a flood profile ID to flood profile objects. In patch #14, when in CFF mode, configure SFFP according to the existing flood profiles (or the one that exists as of that point). - Patches #15 and #16 add code to implement, respectively, bridge FIDs and RSP FIDs in CFF mode. - In patch #17, toggle flood_mode_prefer_cff on Spectrum-2 and above, which makes the newly-added code live. ==================== Link: https://lore.kernel.org/r/cover.1701183891.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:28 -08:00
Petr Machata	69f289e9c7	mlxsw: spectrum: Use CFF mode where available Mark all Spectrum>2 systems as preferring CFF flood mode if supported by the firmware. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/8a3d2ad96b943f7e3f53f998bd333a14e19cd641.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:25 -08:00
Petr Machata	72a4cedb37	mlxsw: spectrum_fid: Add support for rFID family in CFF flood mode In this patch, add the artifacts for the rFID family that works in CFF flood mode. The same that was said about PGT organization and lookup in bridge FID families applies for the rFID family as well. The main difference lies in the fact that in the controlled flood mode, the FW was taking care of maintaining the PGT tables for rFIDs. In CFF mode, the responsibility shifts to the driver. All rFIDs are based off either a front panel port, or a LAG port. For those based off ports, we need to maintain at worst one PGT block for each port, for those based off LAGs, one PGT block per LAG. This reflects in the pgt_size callback, which determines the PGT footprint based on number of ports and the LAG capacity. A number of FIDs may end up using the same PGT base. Unlike with bridges, where membership of a port in a given FID is highly dynamic, an rFID based of a port will just always need to flood to that port. Both the port and the LAG subtables need to be actively maintained. To that end, the CFF rFID family implements fid_port_init and fid_port_fini callbacks, which toggle the necessary bits. Both FID-MID translation and SFMR packing then point into either the port or the LAG subtable, to the block that corresponds to a given port or a given LAG, depending on what port the RIF bound to the rFID uses. As in the previous patch, the way CFF flood mode organizes PGT accesses allows for much more smarts and dynamism. As in the previous patch, we rather aim to keep things simple and static. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/962deb4367585d38250e80c685a34735c0c7f3ad.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:25 -08:00
Petr Machata	db3e541b59	mlxsw: spectrum_fid: Add a family for bridge FIDs in CFF flood mode In this patch, add the artifacts for 802.1d and 802.1q FID families that work in CFF flood mode. In CFF flood mode, the way flood vectors are looked up changes: there's a per-FID PGT base, to which a small offset is added depending on type of traffic. Thus each FID occupies a small contiguous block of PGT memory, whereas in the controlled flood mode, flood vectors for a given FID were spread across the PGT. The term "flood table" as used by the spectrum_fid module, borrows from controlled flood mode way of organizing the PGT table. There flood tables were actual tables, contiguous in the PGT. In the CFF flood mode, they are more abstract: a flood table becomes a collection of e.g. all first rows of the per-FID PGT blocks. Nonetheless we retain the nomenclature. FIDs are still configured through the SFMR register, but there are different fields to set under CFF mode: PGT base and profile. Thus register packing gets a dedicated op overload as well. The new organization of PGT makes it possible to treat the PGT as a block of an ordinary memory, allocate and deallocate on demand, and achieve better flexibility. Here instead, we aim to keep the code as close as possible to the previous controlled flood mode, support for which we need to retain for Spectrum-1 and older FW versions anyway. Thus the PGT footprint of the individual families is the same as before, just the internal organization of the per-family PGT region differs. Hence the pgt_size callback is reused between the controlled and CFF flood modes. Since the dummy family has no flood tables in either the CTL mode or in CFF mode, the existing one can be reused for the CFF family array. Users should not notice any changes between the controlled and CFF flood modes. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/ca40b8163e6d6a21f63ef299619acee953cf9519.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:25 -08:00
Petr Machata	d79b70dbb7	mlxsw: spectrum_fid: Initialize flood profiles in CFF mode In CFF flood mode, the way flood vectors are looked up changes: there's a per-FID PGT base, to which a small offset is added depending on type of traffic. Thus each FID occupies a small contiguous block of PGT memory, whereas in the controlled flood mode, flood vectors for a given FID were spread across the PGT. Each FID is associated with one of a handful of profiles. The profile and the traffic type are then used as keys to look up the PGT offset. This offset is then added to the per-FID PGT base. The profile / type / offset mapping needs to be configured by the driver, and is only relevant in CFF flood mode. In this patch, add the SFFP initialization code. Only initialize the one profile currently explicitly used. As follow-up patch add more profiles, this code will pick them up and initialize as well. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/2c4733ed72d439444218969c032acad22cd4ed88.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:25 -08:00
Petr Machata	af1e696fdf	mlxsw: spectrum_fid: Add profile_id to flood profile In the CFF mode, flood profiles are identified by a unique numerical identifier. This is used for configuration of FIDs and for configuration of traffic-type to PGT offset rules. In both cases, the numerical identifier serves as a handle for the flood profile. Add the identifier to the flood profile structure. There is currently only one flood profile in use explicitly, the one used for all bridging. Eventually three will be necessary in total: one for bridges, one for rFIDs, one for NVE underlay. A total of four profiles are supported by the HW. Start allocating at 1, because 0 is currently used for underlay NVE flood. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/19ea9c35ba8b522fa5f7eb6fd7bc1b68f0f66b41.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:25 -08:00
Petr Machata	5e6146e34b	mlxsw: spectrum_fid: Add an object to keep flood profiles A flood profile is a mapping from traffic type to an offset at which a flood vector should be looked up. In mlxsw so far, a flood profile was somewhat implicitly represented by flood table array. When the CFF flood mode will be introduced, the flood profile will become more explicit: each will get a number and the profile ID / traffic-type / offset mapping will actually need to be initialized in the hardware. Therefore it is going to be handy to have a structure that keeps all the components that compose a flood profile. Add this structure, currently with just the flood table array bits. In the FID families that flood at all, reference the flood profile instead of just the table array. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/15e113de114d3f41ce3fd2a14a2fa6a1b1d7e8f2.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	315702e09b	mlxsw: spectrum_fid: Add hooks for RSP table maintenance In the CFF flood mode, the driver has to allocate a table within PGT, which holds flood vectors for router subport FIDs. For LAGs, these flood vectors have to obviously be maintained dynamically as port membership in a LAG changes. But even for physical ports, the flood vectors have to be kept valid, and may not contain enabled bits corresponding to non-existent ports. It is therefore not possible to precompute the port part of the RSP table, it has to be maintained as ports come and go due to splits. To support the RSP table maintenance, add to FID ops two new ops: fid_port_init and fid_port_fini, for when a port comes to existence, or joins a lag, and vice versa. Invoke these ops from mlxsw_sp_port_fids_init() and mlxsw_sp_port_fids_fini(), which are called when port is added and removed, respectively. Also add two new hooks for LAG maintenance, mlxsw_sp_fid_port_join_lag() / _leave_lag() which transitively call into the same ops. Later patches will actually add the op implementations themselves, this just adds the scaffolding. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/234398a23540317abb25f74f920a5c8121faecf0.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	a59316ffd9	mlxsw: spectrum_fid: Add a not-UC packet type In CFF flood mode, the rFID family will allocate two tables. One for unknown UC traffic, one for everything else. Add a traffic type for the everything else traffic. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/8fb968b2d1cc37137cd0110c98cdeb625b03ca99.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	f6454316c8	mlxsw: spectrum_fid: Add an op for packing SFMR The way SFMR is packed differs between the controlled and CFF flood modes. Add an op to dispatch it dynamically. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/f12fe7879a7086ee86343ee4db02c859f78f0534.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	e917a78959	mlxsw: spectrum_fid: Add an op to get PGT address of a FID In the CFF flood mode, the way to determine a PGT address where a given FID / flood table resides is different from the controlled flood mode, which mlxsw currently uses. Furthermore, this will differ between rFID family and bridge families. The operation therefore needs to be dynamically dispatched. To that end, add an op to FID-family ops. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Link: https://lore.kernel.org/r/00e8f6ad79009a9a77a5c95d596ea9574776dc95.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	1686b8d902	mlxsw: spectrum_fid: Add an op to get PGT allocation size In the CFF flood mode, the PGT allocation size of RFID family will not depend on number of FIDs, but rather number of ports and LAGs. Therefore introduce a FID family operation to calculate the PGT allocation size. The way that size is calculated in the CFF mode depends on calling fallible functions. Thus express the op as returning an int, with the size returned via a pointer argument. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/1174651b7160fcedbef50010ae4b68201112fe6f.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	80638da22e	mlxsw: spectrum_fid: Add an op for flood table initialization In controlled flood mode, for each bridge FID family (i.e., 802.1Q and 802.1D) and packet type (i.e., UUC/MC/BC), the hardware needs to be told which PGT address to use as the base address for the flood table and how to determine the offset from the base for each FID. The above is not needed in CFF mode where each FID has its own flood table instead of the FID family itself. Therefore, create a new FID family operation for the above configuration and only implement it for the 802.1Q and 802.1D families in controlled flood mode. No functional changes intended. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/06f71415eec75811585ec597e1dd101b6dff77e7.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:24 -08:00
Petr Machata	1d0791168e	mlxsw: spectrum_fid: Move mlxsw_sp_fid_flood_table_init() up Move the function to the point where it will need to be to be visible for the 802.1d ops. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/aef09e26b0c2dd077531e665d7135b300bdaf0a8.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:23 -08:00
Petr Machata	17eda112b0	mlxsw: spectrum_fid: Make mlxsw_sp_fid_ops.setup return an int This operation will be fallible for rFIDs in CFF mode, which will be introduced in follow-up patches. Have it return an int, and handle the failures in the caller. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/75f1b85c0cb86bea5501fcc8657042f221a78b32.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:23 -08:00
Petr Machata	82ff7a196d	mlxsw: spectrum_fid: Split a helper out of mlxsw_sp_fid_flood_table_mid() In future patches, for CFF flood mode support, we will need a way to determine a PGT base dynamically, as an op. Therefore, for symmetry, split out a helper, mlxsw_sp_fid_pgt_base_ctl(), that determines a PGT base in the controlled mode as well. Now that the helper is available, use it in mlxsw_sp_fid_flood_table_init() which currently invokes the FID->MID helper to that end. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/fd41c66a1df4df6499d3da34f40e7b9efa15bc3e.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:23 -08:00
Petr Machata	ab68bd743a	mlxsw: spectrum_fid: Rename FID ops, families, arrays Currently, mlxsw always uses a "controlled" flood mode on all Nvidia Spectrum generations. The following patches will however introduce a possibility to run a "CFF" (for Compressed FID Flooding) mode on newer machines, if the FW supports it. To reflect that, label all FID ops, FID families and FID family arrays with a _ctl suffix. This will make it clearer what is what when the CFF families are introduced in later patches. Keep the dummy family intact. Since the dummy family has no flood tables in either CTL or CFF mode, there are no flood-mode-specific callbacks. Additionally, add a remark at two fields that they are only relevant when flood mode is not CFF. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/96b6da5439bb662fa86e795bbcec9dc3ccfa59fd.1701183892.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:23 -08:00
Petr Machata	01de00f439	mlxsw: spectrum_fid: Privatize FID families Currently, mlxsw always uses a "controlled" flood mode on all Nvidia Spectrum generations. The following patches will however introduce a possibility to run a "CFF" (for Compressed FID Flooding) mode on newer machines, if the FW supports it. Several operations will differ between how they need to be done in controlled mode vs. CFF mode. Thus the per-FID-family ops will differ between controlled and CFF, thus the FID family array as such will differ depending on whether the mode negotiated with FW is controlled or CFF. The simple approach of having several globally visible arrays for spectrum.c to statically choose from no longer works. Instead privatize all FID initialization and finalization logic, and expose it as ops instead. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/d3fa390d97cf3dbd2f7a28741be69b311e2059e4.1701183891.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:03:23 -08:00
Christian Marangi	7edce370d8	net: phy: aquantia: drop wrong endianness conversion for addr and CRC On further testing on BE target with kernel test robot, it was notice that the endianness conversion for addr and CRC in fw_load_memory was wrong. Drop the cpu_to_le32 conversion for addr load as it's not needed. Use get_unaligned_le32 instead of get_unaligned for FW data word load to correctly convert data in the correct order to follow system endian. Also drop the cpu_to_be32 for CRC calculation as it's wrong and would cause different CRC on BE system. The loaded word is swapped internally and MAILBOX calculates the CRC on the swapped word. To correctly calculate the CRC to be later matched with the one from MAILBOX, use an u8 struct and swap the word there to keep the same order on both LE and BE for crc_ccitt_false function. Also add additional comments on how the CRC verification for the loaded section works. CRC is calculated as we load the section and verified with the MAILBOX only after the entire section is loaded to skip additional slowdown by loop the section data again. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311210414.sEJZjlcD-lkp@intel.com/ Fixes: `e93984ebc1` ("net: phy: aquantia: add firmware load support") Tested-by: Robert Marko <robimarko@gmail.com> # ipq8072 LE device Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Link: https://lore.kernel.org/r/20231128135928.9841-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 20:01:18 -08:00
Dave Ertman	9f74a3dfcf	ice: Fix VF Reset paths when interface in a failed over aggregate There is an error when an interface has the following conditions: - PF is in an aggregate (bond) - PF has VFs created on it - bond is in a state where it is failed-over to the secondary interface - A VF reset is issued on one or more of those VFs The issue is generated by the originating PF trying to rebuild or reconfigure the VF resources. Since the bond is failed over to the secondary interface the queue contexts are in a modified state. To fix this issue, have the originating interface reclaim its resources prior to the tear-down and rebuild or reconfigure. Then after the process is complete, move the resources back to the currently active interface. There are multiple paths that can be used depending on what triggered the event, so create a helper function to move the queues and use paired calls to the helper (back to origin, process, then move back to active interface) under the same lag_mutex lock. Fixes: `1e0f9881ef` ("ice: Flesh out implementation of support for SRIOV on bonded interface") Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20231127212340.1137657-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:55:49 -08:00
Vincent Whitchurch	cb2f01b856	net: phy: adin: allow control of Fast Link Down Add support to allow Fast Link Down (aka "Enhanced link detection") to be controlled via the ETHTOOL_PHY_FAST_LINK_DOWN tunable. These PHYs have this feature enabled by default. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Nuno Sa <nuno.sa@analog.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231127-adin-fld-v1-1-797f6423fd48@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:49:38 -08:00
Heiner Kallweit	127532cd0f	r8169: improve handling task scheduling If we know that the task is going to be a no-op, don't even schedule it. And remove the check for netif_running() in the worker function, the check for flag RTL_FLAG_TASK_ENABLED is sufficient. Note that we can't remove the check for flag RTL_FLAG_TASK_ENABLED in the worker function because we have no guarantee when it will be executed. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/c65873a3-7394-4107-99a7-83f20030779c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:46:22 -08:00
Jakub Kicinski	300fbb247e	wireless fixes: - debugfs had a deadlock (removal vs. use of files), fixes going through wireless ACKed by Greg - support for HT STAs on 320 MHz channels, even if it's not clear that should ever happen (that's 6 GHz), best not to WARN() - fix for the previous CQM fix that broke most cases - various wiphy locking fixes - various small driver fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmVnUrkACgkQ10qiO8sP aACE4A/9EdhyexRlNKdF3yLh0Q4acuR8IlaLBNgGiQDFxx17djE04p2PTpwkxOyn 0EPy8dKhhWwoAkbvZ+6ToNFa+Jv9w+C5xVls2osGRuYGSVNlxCNy+8tWSNVT73jp rrapEkYHzuz9wBZiJpKwC+mK9uH9gAQgWyYUaTTeOnO/+m3HVhZdU6y71j8gQlm6 9YFSI3r/VWlq1JpThn8WGULJXOMICWN0Sp4XRhEzHPLjK7MiCLNrrQSyV+uMHHUI PmRQN8QW6Oomjbcih3YNn+Geps7xUJFEG4mZvM6GUBXugnIq9t9xmzEbEJyBRIpl MGTwrIAyxtvBeHMpB7U/R3RSEyHfSW6fXHgN2S8ZEFTu6v70gP3TK8lL1gWc5mR4 GogNaGQ0G6LdiHjM7XeGUv/SfzD6HJWa9aR1JbRwapkvxFfM7BEc14gxip0nFDLG z9sixiztGmzBsxmDc+h7M6YLlvWe4oqueUyjA7/p42NhYq41Zy/BYK5oUgsuL4ah Lsb00O6Mz9m6iPtJlREe81Qsclkjg8xMY2YcCYELXP3v50KwU3THBAOYbh5hvgrL T2gibkR0zFAGz1EDcPyQq/SBeUtASPVHyLGUMae6DOcVnOFdAhDTbX0B/TvcnRSq e1gvmNywL7wFVZLD/RYZtWW9g6NUcwK+CU59PRBh2OqVb0ow8Ls= =X6t2 -----END PGP SIGNATURE----- Merge tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== wireless fixes: - debugfs had a deadlock (removal vs. use of files), fixes going through wireless ACKed by Greg - support for HT STAs on 320 MHz channels, even if it's not clear that should ever happen (that's 6 GHz), best not to WARN() - fix for the previous CQM fix that broke most cases - various wiphy locking fixes - various small driver fixes * tag 'wireless-2023-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: use wiphy locked debugfs for sdata/link wifi: mac80211: use wiphy locked debugfs helpers for agg_status wifi: cfg80211: add locked debugfs wrappers debugfs: add API to allow debugfs operations cancellation debugfs: annotate debugfs handlers vs. removal with lockdep debugfs: fix automount d_fsdata usage wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_cap wifi: avoid offset calculation on NULL pointer wifi: cfg80211: hold wiphy mutex for send_interface wifi: cfg80211: lock wiphy mutex for rfkill poll wifi: cfg80211: fix CQM for non-range use wifi: mac80211: do not pass AP_VLAN vif pointer to drivers during flush wifi: iwlwifi: mvm: fix an error code in iwl_mvm_mld_add_sta() wifi: mt76: mt7925: fix typo in mt7925_init_he_caps wifi: mt76: mt7921: fix 6GHz disabled by the missing default CLC config ==================== Link: https://lore.kernel.org/r/20231129150809.31083-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:43:34 -08:00
Jakub Kicinski	0d47fa5cc9	bpf-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWfJyQAKCRDbK58LschI g+AOAQCJhC0Tt+2db4H+xQVNaYiLouI7DZPOhNivvACmCyxoiwEA3EOeDmIYd5E3 BI702APCZ76pxt6Zg0bkcg9d7kLDaQk= =zmzL -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-11-30 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 10 files changed, 66 insertions(+), 15 deletions(-). The main changes are: 1) Fix AF_UNIX splat from use after free in BPF sockmap, from John Fastabend. 2) Fix a syzkaller splat in netdevsim by properly handling offloaded programs (and not device-bound ones), from Stanislav Fomichev. 3) Fix bpf_mem_cache_alloc_flags() to initialize the allocation hint, from Hou Tao. 4) Fix netkit by rejecting IFLA_NETKIT_PEER_INFO in changelink, from Daniel Borkmann. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Add af_unix test with both sockets in map bpf, sockmap: af_unix stream sockets need to hold ref for pair sock netkit: Reject IFLA_NETKIT_PEER_INFO in netkit_change_link bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags() netdevsim: Don't accept device bound programs ==================== Link: https://lore.kernel.org/r/20231129234916.16128-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:40:04 -08:00
Jakub Kicinski	ee7546390a	Merge branch 'create-a-binding-for-the-marvell-mv88e6xxx-dsa-switches' Linus Walleij says: ==================== Create a binding for the Marvell MV88E6xxx DSA switches The Marvell switches are lacking DT bindings. I need proper schema checking to add LED support to the Marvell switch. Just how it is, it can't go on like this. Some Device Tree fixes are included in the series, these remove the major and most annoying warnings fallout noise: some warnings remain, and these are of more serious nature, such as missing phy-mode. They can be applied individually, or to the networking tree with the rest of the patches. Thanks to Andrew Lunn, Vladimir Oltean and Russell King for excellent review and feedback! This latest version employs special compatibles in the odd ABI device trees. v8: https://lore.kernel.org/r/20231114-marvell-88e6152-wan-led-v8-0-50688741691b@linaro.org v7: https://lore.kernel.org/r/20231024-marvell-88e6152-wan-led-v7-0-2869347697d1@linaro.org v6: https://lore.kernel.org/r/20231024-marvell-88e6152-wan-led-v6-0-993ab0949344@linaro.org v5: https://lore.kernel.org/r/20231023-marvell-88e6152-wan-led-v5-0-0e82952015a7@linaro.org v4: https://lore.kernel.org/r/20231018-marvell-88e6152-wan-led-v4-0-3ee0c67383be@linaro.org v3: https://lore.kernel.org/r/20231016-marvell-88e6152-wan-led-v3-0-38cd449dfb15@linaro.org v2: https://lore.kernel.org/r/20231014-marvell-88e6152-wan-led-v2-0-7fca08b68849@linaro.org v1: https://lore.kernel.org/r/20231013-marvell-88e6152-wan-led-v1-0-0712ba99857c@linaro.org ==================== Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-0-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:23 -08:00
Linus Walleij	017ca9c9f3	dt-bindings: marvell: Add Marvell MV88E6060 DSA schema The Marvell MV88E6060 is one of the oldest DSA switches from Marvell, and it has DT bindings used in the wild. Let's define them properly. It is different enough from the rest of the MV88E6xxx switches that it deserves its own binding. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-5-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:21 -08:00
Linus Walleij	43915b2f4b	dt-bindings: marvell: Rewrite MV88E6xxx in schema This is an attempt to rewrite the Marvell MV88E6xxx switch bindings in YAML schema. The current text binding says: WARNING: This binding is currently unstable. Do not program it into a FLASH never to be changed again. Once this binding is stable, this warning will be removed. Well that never happened before we switched to YAML markup, we can't have it like this, what about fixing the mess? Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-4-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:21 -08:00
Linus Walleij	f45c197465	dt-bindings: net: ethernet-switch: Accept special variants Accept special node naming variants for Marvell switches with special node names as ABI. This is maybe not the prettiest but it avoids special-casing the Marvell MV88E6xxx bindings by copying a lot of generic binding code down into that one binding just to special-case these unfixable nodes. Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-3-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:21 -08:00
Linus Walleij	a6e44f3028	dt-bindings: net: mvusb: Fix up DSA example When adding a proper schema for the Marvell mx88e6xxx switch, the scripts start complaining about this embedded example: dtschema/dtc warnings/errors: net/marvell,mvusb.example.dtb: switch@0: ports: '#address-cells' is a required property from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml# net/marvell,mvusb.example.dtb: switch@0: ports: '#size-cells' is a required property from schema $id: http://devicetree.org/schemas/net/dsa/marvell,mv88e6xxx.yaml# Fix this up by extending the example with those properties in the ports node. While we are at it, rename "ports" to "ethernet-ports" and rename "switch" to "ethernet-switch" as this is recommended practice. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-2-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:20 -08:00
Linus Walleij	fbb7033b76	dt-bindings: net: dsa: Require ports or ethernet-ports Bindings using dsa.yaml#/$defs/ethernet-ports specify that a DSA switch node need to have a ports or ethernet-ports subnode, and that is actually required, so add requirements using oneOf. Suggested-by: Rob Herring <robh@kernel.org> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20231127-marvell-88e6152-wan-led-v9-1-272934e04681@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 19:37:20 -08:00
Jakub Kicinski	6afb936f73	Merge branch 'tools-ynl-fixes-for-the-page-pool-sample-and-the-generation-process' Jakub Kicinski says: ==================== tools: ynl: fixes for the page-pool sample and the generation process Minor fixes to the new sample and the Makefiles. ==================== Link: https://lore.kernel.org/r/20231129193622.2912353-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 16:07:02 -08:00
Jakub Kicinski	a115b9279f	tools: ynl: don't skip regeneration from make targets Commit `2b7ac0c87d` ("tools: ynl-gen: don't touch the output file if content is the same") is working too well. It was added so that ynl-regen -f doesn't make us rebuild half of the kernel, if there are no actual changes in any generated code. When ynl-gen-c is called by make, however, we're better off trusting make's tracking and overwrite the file. Otherwise if output is identical we won't update file timestamps and make will retry code gen on every invocation. Link: https://lore.kernel.org/r/20231129193622.2912353-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 16:07:00 -08:00
Jakub Kicinski	9cf9b57082	tools: ynl: order building samples after generated code Parallel builds of ynl: make -C tools/net/ynl/ -j 4 don't work correctly right now. samples get handled before generated, so build of samples does not notice that protos.a has changed. Order samples to be last. Link: https://lore.kernel.org/r/20231129193622.2912353-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 16:07:00 -08:00
Jakub Kicinski	929003723f	tools: ynl: make sure we use local headers for page-pool Building samples generates the following warning: In file included from page-pool.c:11: generated/netdev-user.h:21:45: warning: ‘enum netdev_xdp_rx_metadata’ declared inside parameter list will not be visible outside of this definition or declaration 21 \| const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value); \| ^~~~~~~~~~~~~~~~~~~~~~ Our magic way of including uAPI headers assumes the sample name matches the family name. We need to copy the flags over. Fixes: `637567e4a3` ("tools: ynl: add sample for getting page-pool information") Link: https://lore.kernel.org/r/20231129193622.2912353-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 16:07:00 -08:00
Jakub Kicinski	ee1eb9de81	tools: ynl: fix build of the page-pool sample The name of the "destroyed" field in the reply was not changed in the sample after we started calling it "detach_time". page-pool.c: In function ‘main’: page-pool.c:84:33: error: ‘struct <anonymous>’ has no member named ‘destroyed’ 84 \| if (pp->_present.destroyed) \| ^ Fixes: `637567e4a3` ("tools: ynl: add sample for getting page-pool information") Link: https://lore.kernel.org/r/20231129193622.2912353-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-11-29 16:07:00 -08:00
John Fastabend	51354f700d	bpf, sockmap: Add af_unix test with both sockets in map This adds a test where both pairs of a af_unix paired socket are put into a BPF map. This ensures that when we tear down the af_unix pair we don't have any issues on sockmap side with ordering and reference counting. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20231129012557.95371-3-john.fastabend@gmail.com	2023-11-30 00:25:25 +01:00
John Fastabend	8866730aed	bpf, sockmap: af_unix stream sockets need to hold ref for pair sock AF_UNIX stream sockets are a paired socket. So sending on one of the pairs will lookup the paired socket as part of the send operation. It is possible however to put just one of the pairs in a BPF map. This currently increments the refcnt on the sock in the sockmap to ensure it is not free'd by the stack before sockmap cleans up its state and stops any skbs being sent/recv'd to that socket. But we missed a case. If the peer socket is closed it will be free'd by the stack. However, the paired socket can still be referenced from BPF sockmap side because we hold a reference there. Then if we are sending traffic through BPF sockmap to that socket it will try to dereference the free'd pair in its send logic creating a use after free. And following splat: [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0 [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954 [...] [59.905468] Call Trace: [59.905787] <TASK> [59.906066] dump_stack_lvl+0x130/0x1d0 [59.908877] print_report+0x16f/0x740 [59.910629] kasan_report+0x118/0x160 [59.912576] sk_wake_async+0x31/0x1b0 [59.913554] sock_def_readable+0x156/0x2a0 [59.914060] unix_stream_sendmsg+0x3f9/0x12a0 [59.916398] sock_sendmsg+0x20e/0x250 [59.916854] skb_send_sock+0x236/0xac0 [59.920527] sk_psock_backlog+0x287/0xaa0 To fix let BPF sockmap hold a refcnt on both the socket in the sockmap and its paired socket. It wasn't obvious how to contain the fix to bpf_unix logic. The primarily problem with keeping this logic in bpf_unix was: In the sock close() we could handle the deref by having a close handler. But, when we are destroying the psock through a map delete operation we wouldn't have gotten any signal thorugh the proto struct other than it being replaced. If we do the deref from the proto replace its too early because we need to deref the sk_pair after the backlog worker has been stopped. Given all this it seems best to just cache it at the end of the psock and eat 8B for the af_unix and vsock users. Notice dgram sockets are OK because they handle locking already. Fixes: `94531cfcbe` ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20231129012557.95371-2-john.fastabend@gmail.com	2023-11-30 00:25:16 +01:00
Alexei Starovoitov	b5145153a7	Merge branch 'xsk-tx-metadata' Stanislav Fomichev says: ==================== xsk: TX metadata This series implements initial TX metadata (offloads) for AF_XDP. See patch #2 for the main implementation and mlx5/stmmac ones for the example on how to consume the metadata on the device side. Starting with two types of offloads: - request TX timestamp (and write it back into the metadata area) - request TX checksum offload Changes since v5: - preserve xsk_tx_metadata flags across tx and completion by moving them out of 'request' union (Jesper) - fix xdp_metadata checksum failure in big endian (Alexei) - add SPDX to xdp-rx-metadata.rst (Simon) v5: https://lore.kernel.org/bpf/20231102225837.1141915-1-sdf@google.com/ Performance (mlx5): I've implemented a small xskgen tool to try to saturate single tx queue: https://github.com/fomichev/xskgen/tree/master Here are the performance numbers with some analysis. 1. Baseline. Running with commit `eb62e6aef9` ("Merge branch 'bpf: Support bpf_get_func_ip helper in uprobes'"), nothing from this series: - with 1400 bytes of payload: 98 gbps, 8 mpps ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps - with 200 bytes of payload: 49 gbps, 23 mpps ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps 2. Adding single commit that supports reserving tx_metadata_len changes nothing numbers-wise. - baseline for 1400 ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps - baseline for 200 ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps 3. Adding -M flag causes xskgen to reserve the metadata and fill it, but doesn't set XDP_TX_METADATA descriptor option. - new baseline for 1400 (with only filling the metadata) ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps - new baseline for 200 (with only filling the metadata) ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps (the numbers go sligtly up here, not really sure why, maybe some cache-related side-effects? 4. Next, I'm running the same test but with the commit that adds actual general infra to parse XDP_TX_METADATA (but no driver support). Essentially applying "xsk: add TX timestamp and TX checksum offload support" from this series. Numbers are the same. - fill metadata for 1400 ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps - fill metadata for 200 ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps - request metadata for 1400 ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps - request metadata for 200 ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps 5. Now, for the most interesting part, I'm adding mlx5 driver support. The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but _only_ when I enable the metadata. This looks like a side effect of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push. Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)' to not affect the existing non-metadata use-cases. Since this is not regressing existing workloads, I'm not spending any time trying to optimize it more (and leaving it up to mlx owners to purse if they see any good way to do it). - same baseline ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps - fill metadata for 1400 ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps - fill metadata for 200 ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps - request metadata for 1400 ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps - request metadata for 200 ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1 sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://lore.kernel.org/r/20231127190319.1190813-1-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-11-29 14:59:41 -08:00
Stanislav Fomichev	60523115c1	selftests/bpf: Add TX side to xdp_hw_metadata When we get a packet on port 9091, we swap src/dst and send it out. At this point we also request the timestamp and checksum offloads. Checksum offload is verified by looking at the tcpdump on the other side. The tool prints pseudo-header csum and the final one it expects. The final checksum actually matches the incoming packets checksum because we only flip the src/dst and don't change the payload. Some other related changes: - switched to zerocopy mode by default; new flag can be used to force old behavior - request fixed tx_metadata_len headroom - some other small fixes (umem size, fill idx+i, etc) mvbz3:~# ./xdp_hw_metadata eth3 ... xsk_ring_cons__peek: 1 0x19546f8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100 rx_hash: 0x80B7EA8B with RSS type:0x2A rx_timestamp: 1697580171852147395 (sec:1697580171.8521) HW RX-time: 1697580171852147395 (sec:1697580171.8521), delta to User RX-time sec:0.2797 (279673.082 usec) XDP RX-time: 1697580172131699047 (sec:1697580172.1317), delta to User RX-time sec:0.0001 (121.430 usec) 0x19546f8: ping-pong with csum=3b8e (want d862) csum_start=54 csum_offset=6 0x19546f8: complete tx idx=0 addr=8 tx_timestamp: 1697580172056756493 (sec:1697580172.0568) HW TX-complete-time: 1697580172056756493 (sec:1697580172.0568), delta to User TX-complete-time sec:0.0852 (85175.537 usec) XDP RX-time: 1697580172131699047 (sec:1697580172.1317), delta to User TX-complete-time sec:0.0102 (10232.983 usec) HW RX-time: 1697580171852147395 (sec:1697580171.8521), delta to HW TX-complete-time sec:0.2046 (204609.098 usec) 0x19546f8: complete rx idx=128 addr=80100 mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091 mvbz4:~# tcpdump -vvx -i eth3 udp tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e 0x0030: 7864 70 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e 0x0030: 7864 70 Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231127190319.1190813-14-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-11-29 14:59:41 -08:00

1 2 3 4 5 ...

1234814 commits