Commit graph

295 commits

Author SHA1 Message Date
Linus Torvalds
5b7c4cabbb Networking changes for 6.3.
Core
 ----
 
  - Add dedicated kmem_cache for typical/small skb->head, avoid having
    to access struct page at kfree time, and improve memory use.
 
  - Introduce sysctl to set default RPS configuration for new netdevs.
 
  - Define Netlink protocol specification format which can be used
    to describe messages used by each family and auto-generate parsers.
    Add tools for generating kernel data structures and uAPI headers.
 
  - Expose all net/core sysctls inside netns.
 
  - Remove 4s sleep in netpoll if carrier is instantly detected on boot.
 
  - Add configurable limit of MDB entries per port, and port-vlan.
 
  - Continue populating drop reasons throughout the stack.
 
  - Retire a handful of legacy Qdiscs and classifiers.
 
 Protocols
 ---------
 
  - Support IPv4 big TCP (TSO frames larger than 64kB).
 
  - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
    on socket by socket basis.
 
  - Track and report in procfs number of MPTCP sockets used.
 
  - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP
    path manager.
 
  - IPv6: don't check net.ipv6.route.max_size and rely on garbage
    collection to free memory (similarly to IPv4).
 
  - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
 
  - ICMP: add per-rate limit counters.
 
  - Add support for user scanning requests in ieee802154.
 
  - Remove static WEP support.
 
  - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
    reporting.
 
  - WiFi 7 EHT channel puncturing support (client & AP).
 
 BPF
 ---
 
  - Add a rbtree data structure following the "next-gen data structure"
    precedent set by recently added linked list, that is, by using
    kfunc + kptr instead of adding a new BPF map type.
 
  - Expose XDP hints via kfuncs with initial support for RX hash and
    timestamp metadata.
 
  - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key
    to better support decap on GRE tunnel devices not operating
    in collect metadata.
 
  - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
 
  - Remove the need for trace_printk_lock for bpf_trace_printk
    and bpf_trace_vprintk helpers.
 
  - Extend libbpf's bpf_tracing.h support for tracing arguments of
    kprobes/uprobes and syscall as a special case.
 
  - Significantly reduce the search time for module symbols
    by livepatch and BPF.
 
  - Enable cpumasks to be used as kptrs, which is useful for tracing
    programs tracking which tasks end up running on which CPUs in
    different time intervals.
 
  - Add support for BPF trampoline on s390x and riscv64.
 
  - Add capability to export the XDP features supported by the NIC.
 
  - Add __bpf_kfunc tag for marking kernel functions as kfuncs.
 
  - Add cgroup.memory=nobpf kernel parameter option to disable BPF
    memory accounting for container environments.
 
 Netfilter
 ---------
 
  - Remove the CLUSTERIP target. It has been marked as obsolete
    for years, and we still have WARN splats wrt. races of
    the out-of-band /proc interface installed by this target.
 
  - Add 'destroy' commands to nf_tables. They are identical to
    the existing 'delete' commands, but do not return an error if
    the referenced object (set, chain, rule...) did not exist.
 
 Driver API
 ----------
 
  - Improve cpumask_local_spread() locality to help NICs set the right
    IRQ affinity on AMD platforms.
 
  - Separate C22 and C45 MDIO bus transactions more clearly.
 
  - Introduce new DCB table to control DSCP rewrite on egress.
 
  - Support configuration of Physical Layer Collision Avoidance (PLCA)
    Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
    shared medium Ethernet.
 
  - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
    preemption of low priority frames by high priority frames.
 
  - Add support for controlling MACSec offload using netlink SET.
 
  - Rework devlink instance refcounts to allow registration and
    de-registration under the instance lock. Split the code into multiple
    files, drop some of the unnecessarily granular locks and factor out
    common parts of netlink operation handling.
 
  - Add TX frame aggregation parameters (for USB drivers).
 
  - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
    messages with notifications for debug.
 
  - Allow offloading of UDP NEW connections via act_ct.
 
  - Add support for per action HW stats in TC.
 
  - Support hardware miss to TC action (continue processing in SW from
    a specific point in the action chain).
 
  - Warn if old Wireless Extension user space interface is used with
    modern cfg80211/mac80211 drivers. Do not support Wireless Extensions
    for Wi-Fi 7 devices at all. Everyone should switch to using nl80211
    interface instead.
 
  - Improve the CAN bit timing configuration. Use extack to return error
    messages directly to user space, update the SJW handling, including
    the definition of a new default value that will benefit CAN-FD
    controllers, by increasing their oscillator tolerance.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - nVidia BlueField-3 support (control traffic driver)
    - Ethernet support for imx93 SoCs
    - Motorcomm yt8531 gigabit Ethernet PHY
    - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
    - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
    - Amlogic gxl MDIO mux
 
  - WiFi:
    - RealTek RTL8188EU (rtl8xxxu)
    - Qualcomm Wi-Fi 7 devices (ath12k)
 
  - CAN:
    - Renesas R-Car V4H
 
 Drivers
 -------
 
  - Bluetooth:
    - Set Per Platform Antenna Gain (PPAG) for Intel controllers.
 
  - Ethernet NICs:
    - Intel (1G, igc):
      - support TSN / Qbv / packet scheduling features of i226 model
    - Intel (100G, ice):
      - use GNSS subsystem instead of TTY
      - multi-buffer XDP support
      - extend support for GPIO pins to E823 devices
    - nVidia/Mellanox:
      - update the shared buffer configuration on PFC commands
      - implement PTP adjphase function for HW offset control
      - TC support for Geneve and GRE with VF tunnel offload
      - more efficient crypto key management method
      - multi-port eswitch support
    - Netronome/Corigine:
      - add DCB IEEE support
      - support IPsec offloading for NFP3800
    - Freescale/NXP (enetc):
      - enetc: support XDP_REDIRECT for XDP non-linear buffers
      - enetc: improve reconfig, avoid link flap and waiting for idle
      - enetc: support MAC Merge layer
    - Other NICs:
      - sfc/ef100: add basic devlink support for ef100
      - ionic: rx_push mode operation (writing descriptors via MMIO)
      - bnxt: use the auxiliary bus abstraction for RDMA
      - r8169: disable ASPM and reset bus in case of tx timeout
      - cpsw: support QSGMII mode for J721e CPSW9G
      - cpts: support pulse-per-second output
      - ngbe: add an mdio bus driver
      - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
      - r8152: handle devices with FW with NCM support
      - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
      - virtio-net: support multi buffer XDP
      - virtio/vsock: replace virtio_vsock_pkt with sk_buff
      - tsnep: XDP support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add support for latency TLV (in FW control messages)
    - Microchip (sparx5):
      - separate explicit and implicit traffic forwarding rules, make
        the implicit rules always active
      - add support for egress DSCP rewrite
      - IS0 VCAP support (Ingress Classification)
      - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.)
      - ES2 VCAP support (Egress Access Control)
      - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - add MAB (port auth) offload support
      - enable PTP receive for mv88e6390
    - NXP (ocelot):
      - support MAC Merge layer
      - support for the the vsc7512 internal copper phys
    - Microchip:
      - lan9303: convert to PHYLINK
      - lan966x: support TC flower filter statistics
      - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
      - lan937x: support Credit Based Shaper configuration
      - ksz9477: support Energy Efficient Ethernet
    - other:
      - qca8k: convert to regmap read/write API, use bulk operations
      - rswitch: Improve TX timestamp accuracy
 
  - Intel WiFi (iwlwifi):
    - EHT (Wi-Fi 7) rate reporting
    - STEP equalizer support: transfer some STEP (connection to radio
      on platforms with integrated wifi) related parameters from the
      BIOS to the firmware.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - IPQ5018 support
    - Fine Timing Measurement (FTM) responder role support
    - channel 177 support
 
  - MediaTek WiFi (mt76):
    - per-PHY LED support
    - mt7996: EHT (Wi-Fi 7) support
    - Wireless Ethernet Dispatch (WED) reset support
    - switch to using page pool allocator
 
  - RealTek WiFi (rtw89):
    - support new version of Bluetooth co-existance
 
  - Mobile:
    - rmnet: support TX aggregation.
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S
 IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK
 nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ
 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el
 n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW
 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl
 leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4
 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP
 n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC
 xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc
 edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1
 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI=
 =xXhC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Add dedicated kmem_cache for typical/small skb->head, avoid having
     to access struct page at kfree time, and improve memory use.

   - Introduce sysctl to set default RPS configuration for new netdevs.

   - Define Netlink protocol specification format which can be used to
     describe messages used by each family and auto-generate parsers.
     Add tools for generating kernel data structures and uAPI headers.

   - Expose all net/core sysctls inside netns.

   - Remove 4s sleep in netpoll if carrier is instantly detected on
     boot.

   - Add configurable limit of MDB entries per port, and port-vlan.

   - Continue populating drop reasons throughout the stack.

   - Retire a handful of legacy Qdiscs and classifiers.

  Protocols:

   - Support IPv4 big TCP (TSO frames larger than 64kB).

   - Add IP_LOCAL_PORT_RANGE socket option, to control local port range
     on socket by socket basis.

   - Track and report in procfs number of MPTCP sockets used.

   - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
     manager.

   - IPv6: don't check net.ipv6.route.max_size and rely on garbage
     collection to free memory (similarly to IPv4).

   - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).

   - ICMP: add per-rate limit counters.

   - Add support for user scanning requests in ieee802154.

   - Remove static WEP support.

   - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
     reporting.

   - WiFi 7 EHT channel puncturing support (client & AP).

  BPF:

   - Add a rbtree data structure following the "next-gen data structure"
     precedent set by recently added linked list, that is, by using
     kfunc + kptr instead of adding a new BPF map type.

   - Expose XDP hints via kfuncs with initial support for RX hash and
     timestamp metadata.

   - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
     better support decap on GRE tunnel devices not operating in collect
     metadata.

   - Improve x86 JIT's codegen for PROBE_MEM runtime error checks.

   - Remove the need for trace_printk_lock for bpf_trace_printk and
     bpf_trace_vprintk helpers.

   - Extend libbpf's bpf_tracing.h support for tracing arguments of
     kprobes/uprobes and syscall as a special case.

   - Significantly reduce the search time for module symbols by
     livepatch and BPF.

   - Enable cpumasks to be used as kptrs, which is useful for tracing
     programs tracking which tasks end up running on which CPUs in
     different time intervals.

   - Add support for BPF trampoline on s390x and riscv64.

   - Add capability to export the XDP features supported by the NIC.

   - Add __bpf_kfunc tag for marking kernel functions as kfuncs.

   - Add cgroup.memory=nobpf kernel parameter option to disable BPF
     memory accounting for container environments.

  Netfilter:

   - Remove the CLUSTERIP target. It has been marked as obsolete for
     years, and we still have WARN splats wrt races of the out-of-band
     /proc interface installed by this target.

   - Add 'destroy' commands to nf_tables. They are identical to the
     existing 'delete' commands, but do not return an error if the
     referenced object (set, chain, rule...) did not exist.

  Driver API:

   - Improve cpumask_local_spread() locality to help NICs set the right
     IRQ affinity on AMD platforms.

   - Separate C22 and C45 MDIO bus transactions more clearly.

   - Introduce new DCB table to control DSCP rewrite on egress.

   - Support configuration of Physical Layer Collision Avoidance (PLCA)
     Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
     shared medium Ethernet.

   - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
     preemption of low priority frames by high priority frames.

   - Add support for controlling MACSec offload using netlink SET.

   - Rework devlink instance refcounts to allow registration and
     de-registration under the instance lock. Split the code into
     multiple files, drop some of the unnecessarily granular locks and
     factor out common parts of netlink operation handling.

   - Add TX frame aggregation parameters (for USB drivers).

   - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
     messages with notifications for debug.

   - Allow offloading of UDP NEW connections via act_ct.

   - Add support for per action HW stats in TC.

   - Support hardware miss to TC action (continue processing in SW from
     a specific point in the action chain).

   - Warn if old Wireless Extension user space interface is used with
     modern cfg80211/mac80211 drivers. Do not support Wireless
     Extensions for Wi-Fi 7 devices at all. Everyone should switch to
     using nl80211 interface instead.

   - Improve the CAN bit timing configuration. Use extack to return
     error messages directly to user space, update the SJW handling,
     including the definition of a new default value that will benefit
     CAN-FD controllers, by increasing their oscillator tolerance.

  New hardware / drivers:

   - Ethernet:
      - nVidia BlueField-3 support (control traffic driver)
      - Ethernet support for imx93 SoCs
      - Motorcomm yt8531 gigabit Ethernet PHY
      - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
      - Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
      - Amlogic gxl MDIO mux

   - WiFi:
      - RealTek RTL8188EU (rtl8xxxu)
      - Qualcomm Wi-Fi 7 devices (ath12k)

   - CAN:
      - Renesas R-Car V4H

  Drivers:

   - Bluetooth:
      - Set Per Platform Antenna Gain (PPAG) for Intel controllers.

   - Ethernet NICs:
      - Intel (1G, igc):
         - support TSN / Qbv / packet scheduling features of i226 model
      - Intel (100G, ice):
         - use GNSS subsystem instead of TTY
         - multi-buffer XDP support
         - extend support for GPIO pins to E823 devices
      - nVidia/Mellanox:
         - update the shared buffer configuration on PFC commands
         - implement PTP adjphase function for HW offset control
         - TC support for Geneve and GRE with VF tunnel offload
         - more efficient crypto key management method
         - multi-port eswitch support
      - Netronome/Corigine:
         - add DCB IEEE support
         - support IPsec offloading for NFP3800
      - Freescale/NXP (enetc):
         - support XDP_REDIRECT for XDP non-linear buffers
         - improve reconfig, avoid link flap and waiting for idle
         - support MAC Merge layer
      - Other NICs:
         - sfc/ef100: add basic devlink support for ef100
         - ionic: rx_push mode operation (writing descriptors via MMIO)
         - bnxt: use the auxiliary bus abstraction for RDMA
         - r8169: disable ASPM and reset bus in case of tx timeout
         - cpsw: support QSGMII mode for J721e CPSW9G
         - cpts: support pulse-per-second output
         - ngbe: add an mdio bus driver
         - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
         - r8152: handle devices with FW with NCM support
         - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
         - virtio-net: support multi buffer XDP
         - virtio/vsock: replace virtio_vsock_pkt with sk_buff
         - tsnep: XDP support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add support for latency TLV (in FW control messages)
      - Microchip (sparx5):
         - separate explicit and implicit traffic forwarding rules, make
           the implicit rules always active
         - add support for egress DSCP rewrite
         - IS0 VCAP support (Ingress Classification)
         - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
           etc.)
         - ES2 VCAP support (Egress Access Control)
         - support for Per-Stream Filtering and Policing (802.1Q,
           8.6.5.1)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - add MAB (port auth) offload support
         - enable PTP receive for mv88e6390
      - NXP (ocelot):
         - support MAC Merge layer
         - support for the the vsc7512 internal copper phys
      - Microchip:
         - lan9303: convert to PHYLINK
         - lan966x: support TC flower filter statistics
         - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
         - lan937x: support Credit Based Shaper configuration
         - ksz9477: support Energy Efficient Ethernet
      - other:
         - qca8k: convert to regmap read/write API, use bulk operations
         - rswitch: Improve TX timestamp accuracy

   - Intel WiFi (iwlwifi):
      - EHT (Wi-Fi 7) rate reporting
      - STEP equalizer support: transfer some STEP (connection to radio
        on platforms with integrated wifi) related parameters from the
        BIOS to the firmware.

   - Qualcomm 802.11ax WiFi (ath11k):
      - IPQ5018 support
      - Fine Timing Measurement (FTM) responder role support
      - channel 177 support

   - MediaTek WiFi (mt76):
      - per-PHY LED support
      - mt7996: EHT (Wi-Fi 7) support
      - Wireless Ethernet Dispatch (WED) reset support
      - switch to using page pool allocator

   - RealTek WiFi (rtw89):
      - support new version of Bluetooth co-existance

   - Mobile:
      - rmnet: support TX aggregation"

* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
  page_pool: add a comment explaining the fragment counter usage
  net: ethtool: fix __ethtool_dev_mm_supported() implementation
  ethtool: pse-pd: Fix double word in comments
  xsk: add linux/vmalloc.h to xsk.c
  sefltests: netdevsim: wait for devlink instance after netns removal
  selftest: fib_tests: Always cleanup before exit
  net/mlx5e: Align IPsec ASO result memory to be as required by hardware
  net/mlx5e: TC, Set CT miss to the specific ct action instance
  net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
  net/mlx5: Refactor tc miss handling to a single function
  net/mlx5: Kconfig: Make tc offload depend on tc skb extension
  net/sched: flower: Support hardware miss to tc action
  net/sched: flower: Move filter handle initialization earlier
  net/sched: cls_api: Support hardware miss to tc action
  net/sched: Rename user cookie and act cookie
  sfc: fix builds without CONFIG_RTC_LIB
  sfc: clean up some inconsistent indentings
  net/mlx4_en: Introduce flexible array to silence overflow warning
  net: lan966x: Fix possible deadlock inside PTP
  net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
  ...
2023-02-21 18:24:12 -08:00
Dave Hansen
74e19ef0ff uaccess: Add speculation barrier to copy_from_user()
The results of "access_ok()" can be mis-speculated.  The result is that
you can end speculatively:

	if (access_ok(from, size))
		// Right here

even for bad from/size combinations.  On first glance, it would be ideal
to just add a speculation barrier to "access_ok()" so that its results
can never be mis-speculated.

But there are lots of system calls just doing access_ok() via
"copy_to_user()" and friends (example: fstat() and friends).  Those are
generally not problematic because they do not _consume_ data from
userspace other than the pointer.  They are also very quick and common
system calls that should not be needlessly slowed down.

"copy_from_user()" on the other hand uses a user-controller pointer and
is frequently followed up with code that might affect caches.  Take
something like this:

	if (!copy_from_user(&kernelvar, uptr, size))
		do_something_with(kernelvar);

If userspace passes in an evil 'uptr' that *actually* points to a kernel
addresses, and then do_something_with() has cache (or other)
side-effects, it could allow userspace to infer kernel data values.

Add a barrier to the common copy_from_user() code to prevent
mis-speculated values which happen after the copy.

Also add a stub for architectures that do not define barrier_nospec().
This makes the macro usable in generic code.

Since the barrier is now usable in generic code, the x86 #ifdef in the
BPF code can also go away.

Reported-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>   # BPF bits
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-21 14:45:22 -08:00
Yafang Shao
bf39650824 bpf: allow to disable bpf prog memory accounting
We can simply disable the bpf prog memory accouting by not setting the
GFP_ACCOUNT.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20230210154734.4416-5-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-10 18:59:57 -08:00
Stanislav Fomichev
3d76a4d3d4 bpf: XDP metadata RX kfuncs
Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not
supported by the target device, the default implementation is called instead.
The verifier, at load time, replaces a call to the generic kfunc with a call
to the per-device one. Per-device kfunc pointers are stored in separate
struct xdp_metadata_ops.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:11 -08:00
Stanislav Fomichev
2b3486bc2d bpf: Introduce device-bound XDP programs
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.

netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Stanislav Fomichev
9d03ebc71a bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Linus Torvalds
609d3bc623 Including fixes from bpf, netfilter and can.
Current release - regressions:
 
  - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func
 
  - rxrpc:
   - fix security setting propagation
   - fix null-deref in rxrpc_unuse_local()
   - fix switched parameters in peer tracing
 
 Current release - new code bugs:
 
  - rxrpc:
    - fix I/O thread startup getting skipped
    - fix locking issues in rxrpc_put_peer_locked()
    - fix I/O thread stop
    - fix uninitialised variable in rxperf server
    - fix the return value of rxrpc_new_incoming_call()
 
  - microchip: vcap: fix initialization of value and mask
 
  - nfp: fix unaligned io read of capabilities word
 
 Previous releases - regressions:
 
  - stop in-kernel socket users from corrupting socket's task_frag
 
  - stream: purge sk_error_queue in sk_stream_kill_queues()
 
  - openvswitch: fix flow lookup to use unmasked key
 
  - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()
 
  - devlink:
    - hold region lock when flushing snapshots
    - protect devlink dump by the instance lock
 
 Previous releases - always broken:
 
  - bpf:
    - prevent leak of lsm program after failed attach
    - resolve fext program type when checking map compatibility
 
  - skbuff: account for tail adjustment during pull operations
 
  - macsec: fix net device access prior to holding a lock
 
  - bonding: switch back when high prio link up
 
  - netfilter: flowtable: really fix NAT IPv6 offload
 
  - enetc: avoid buffer leaks on xdp_do_redirect() failure
 
  - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()
 
  - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOiGa4ACgkQMUZtbf5S
 IrvetBAAg/AjgG51gboLsuGjgRSwAi5T6ijgVR+pW+kMuoOdaamOF+h/zC1ox/H9
 QrWvTBipy+EqSD8bM4Xz0FNgidch8X4iWYhKGZuBht/4NP5FOzPUG2mNlUy5ANGq
 QZcCw6CUsir8HTb+IJpFEIq0JMwzKCm3WyAkYjEj4iuft0Y93cAgjkMVwoX0RERO
 o/pslC5dsozCLJxEglpw1aJq7aoroNuRSGSXl95nv8fU3UxmUXajnA3HNscXImdV
 6uqSIuyPIaGocpCBPRKUQd0sctkTY4cm8wmxxMCDVsBRVusoaq5eg1VRvxJm9Rxj
 gvDvHvfhnEuSigFF5A+paBp4c+i3C8g/UTBJTtptdAC+Y2tt4UT3Q5aaazYUOAqd
 W4TSJ3bk5zhkhpRF9clb0fNQaM1HOT4rkDEEGTfVN62dtHfPKpNwYufQKaYHdVj1
 RJ3ooH6c7TMVaRs6ZgEWNYToKZj94SIfPhfEhuqWXdNMDBkUMp2BXFFOp9fZDWju
 PsMQrRD7n6+XXpNvScYtnJDORqfIL9yHGZE9kxZA5QSDl9cnPA3SUbNruQPlXHrl
 w0yQlYuG3gcciua4dXaLfz1iN4rPdenuYhVBHhztEwDKl+b61CVQYlOHGkXPVURp
 oft74qCCFbva+Hf/7jENQotjT1tLfxAGdUARuFeDBueJgDRAPsw=
 =goV5
 -----END PGP SIGNATURE-----

Merge tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter and can.

  Current release - regressions:

   - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func

   - rxrpc:
      - fix security setting propagation
      - fix null-deref in rxrpc_unuse_local()
      - fix switched parameters in peer tracing

  Current release - new code bugs:

   - rxrpc:
      - fix I/O thread startup getting skipped
      - fix locking issues in rxrpc_put_peer_locked()
      - fix I/O thread stop
      - fix uninitialised variable in rxperf server
      - fix the return value of rxrpc_new_incoming_call()

   - microchip: vcap: fix initialization of value and mask

   - nfp: fix unaligned io read of capabilities word

  Previous releases - regressions:

   - stop in-kernel socket users from corrupting socket's task_frag

   - stream: purge sk_error_queue in sk_stream_kill_queues()

   - openvswitch: fix flow lookup to use unmasked key

   - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

   - devlink:
      - hold region lock when flushing snapshots
      - protect devlink dump by the instance lock

  Previous releases - always broken:

   - bpf:
      - prevent leak of lsm program after failed attach
      - resolve fext program type when checking map compatibility

   - skbuff: account for tail adjustment during pull operations

   - macsec: fix net device access prior to holding a lock

   - bonding: switch back when high prio link up

   - netfilter: flowtable: really fix NAT IPv6 offload

   - enetc: avoid buffer leaks on xdp_do_redirect() failure

   - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

   - dsa: microchip: remove IRQF_TRIGGER_FALLING in
     request_threaded_irq"

* tag 'net-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
  net: fec: check the return value of build_skb()
  net: simplify sk_page_frag
  Treewide: Stop corrupting socket's task_frag
  net: Introduce sk_use_task_frag in struct sock.
  mctp: Remove device type check at unregister
  net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq
  can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len
  can: flexcan: avoid unbalanced pm_runtime_enable warning
  Documentation: devlink: add missing toc entry for etas_es58x devlink doc
  mctp: serial: Fix starting value for frame check sequence
  nfp: fix unaligned io read of capabilities word
  net: stream: purge sk_error_queue in sk_stream_kill_queues()
  myri10ge: Fix an error handling path in myri10ge_probe()
  net: microchip: vcap: Fix initialization of value and mask
  rxrpc: Fix the return value of rxrpc_new_incoming_call()
  rxrpc: rxperf: Fix uninitialised variable
  rxrpc: Fix I/O thread stop
  rxrpc: Fix switched parameters in peer tracing
  rxrpc: Fix locking issues in rxrpc_put_peer_locked()
  rxrpc: Fix I/O thread startup getting skipped
  ...
2022-12-21 08:41:32 -08:00
Linus Torvalds
4f292c4de4 New Feature:
* Randomize the per-cpu entry areas
 Cleanups:
 * Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
   coding it
 * Move to "native" set_memory_rox() helper
 * Clean up pmd_get_atomic() and i386-PAE
 * Remove some unused page table size macros
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
 krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
 cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
 Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
 MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
 UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
 9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
 +PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
 x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
 VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
 D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
 dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
 =wwVF
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "New Feature:

   - Randomize the per-cpu entry areas

  Cleanups:

   - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it

   - Move to "native" set_memory_rox() helper

   - Clean up pmd_get_atomic() and i386-PAE

   - Remove some unused page table size macros"

* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/mm: Ensure forced page table splitting
  x86/kasan: Populate shadow for shared chunk of the CPU entry area
  x86/kasan: Add helpers to align shadow addresses up and down
  x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
  x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
  x86/mm: Recompute physical address for every page of per-CPU CEA mapping
  x86/mm: Rename __change_page_attr_set_clr(.checkalias)
  x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
  x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
  x86/mm: Add a few comments
  x86/mm: Fix CR3_ADDR_MASK
  x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
  mm: Convert __HAVE_ARCH_P..P_GET to the new style
  mm: Remove pointless barrier() after pmdp_get_lockless()
  x86/mm/pae: Get rid of set_64bit()
  x86_64: Remove pointless set_64bit() usage
  x86/mm/pae: Be consistent with pXXp_get_and_clear()
  x86/mm/pae: Use WRITE_ONCE()
  x86/mm/pae: Don't (ab)use atomic64
  mm/gup: Fix the lockless PMD access
  ...
2022-12-17 14:06:53 -06:00
Peter Zijlstra
d48567c9a0 mm: Introduce set_memory_rox()
Because endlessly repeating:

	set_memory_ro()
	set_memory_x()

is getting tedious.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
2022-12-15 10:37:26 -08:00
Toke Høiland-Jørgensen
1c123c567f bpf: Resolve fext program type when checking map compatibility
The bpf_prog_map_compatible() check makes sure that BPF program types are
not mixed inside BPF map types that can contain programs (tail call maps,
cpumaps and devmaps). It does this by setting the fields of the map->owner
struct to the values of the first program being checked against, and
rejecting any subsequent programs if the values don't match.

One of the values being set in the map owner struct is the program type,
and since the code did not resolve the prog type for fext programs, the map
owner type would be set to PROG_TYPE_EXT and subsequent loading of programs
of the target type into the map would fail.

This bug is seen in particular for XDP programs that are loaded as
PROG_TYPE_EXT using libxdp; these cannot insert programs into devmaps and
cpumaps because the check fails as described above.

Fix the bug by resolving the fext program type to its target program type
as elsewhere in the verifier.

v3:
- Add Yonghong's ACK

Fixes: f45d5b6ce2 ("bpf: generalise tail call map compatibility check")
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221214230254.790066-1-toke@redhat.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-14 21:30:27 -08:00
Linus Torvalds
7e68dd7d07 Networking changes for 6.2.
Core
 ----
  - Allow live renaming when an interface is up
 
  - Add retpoline wrappers for tc, improving considerably the
    performances of complex queue discipline configurations.
 
  - Add inet drop monitor support.
 
  - A few GRO performance improvements.
 
  - Add infrastructure for atomic dev stats, addressing long standing
    data races.
 
  - De-duplicate common code between OVS and conntrack offloading
    infrastructure.
 
  - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
 
  - Netfilter: introduce packet parser for tunneled packets
 
  - Replace IPVS timer-based estimators with kthreads to scale up
    the workload with the number of available CPUs.
 
  - Add the helper support for connection-tracking OVS offload.
 
 BPF
 ---
  - Support for user defined BPF objects: the use case is to allocate
    own objects, build own object hierarchies and use the building
    blocks to build own data structures flexibly, for example, linked
    lists in BPF.
 
  - Make cgroup local storage available to non-cgroup attached BPF
    programs.
 
  - Avoid unnecessary deadlock detection and failures wrt BPF task
    storage helpers.
 
  - A relevant bunch of BPF verifier fixes and improvements.
 
  - Veristat tool improvements to support custom filtering, sorting,
    and replay of results.
 
  - Add LLVM disassembler as default library for dumping JITed code.
 
  - Lots of new BPF documentation for various BPF maps.
 
  - Add bpf_rcu_read_{,un}lock() support for sleepable programs.
 
  - Add RCU grace period chaining to BPF to wait for the completion
    of access from both sleepable and non-sleepable BPF programs.
 
  - Add support storing struct task_struct objects as kptrs in maps.
 
  - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
    values.
 
  - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
 
 Protocols
 ---------
  - TCP: implement Protective Load Balancing across switch links.
 
  - TCP: allow dynamically disabling TCP-MD5 static key, reverting
    back to fast[er]-path.
 
  - UDP: Introduce optional per-netns hash lookup table.
 
  - IPv6: simplify and cleanup sockets disposal.
 
  - Netlink: support different type policies for each generic
    netlink operation.
 
  - MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
 
  - MPTCP: add netlink notification support for listener sockets
    events.
 
  - SCTP: add VRF support, allowing sctp sockets binding to VRF
    devices.
 
  - Add bridging MAC Authentication Bypass (MAB) support.
 
  - Extensions for Ethernet VPN bridging implementation to better
    support multicast scenarios.
 
  - More work for Wi-Fi 7 support, comprising conversion of all
    the existing drivers to internal TX queue usage.
 
  - IPSec: introduce a new offload type (packet offload) allowing
    complete header processing and crypto offloading.
 
  - IPSec: extended ack support for more descriptive XFRM error
    reporting.
 
  - RXRPC: increase SACK table size and move processing into a
    per-local endpoint kernel thread, reducing considerably the
    required locking.
 
  - IEEE 802154: synchronous send frame and extended filtering
    support, initial support for scanning available 15.4 networks.
 
  - Tun: bump the link speed from 10Mbps to 10Gbps.
 
  - Tun/VirtioNet: implement UDP segmentation offload support.
 
 Driver API
 ----------
 
  - PHY/SFP: improve power level switching between standard
    level 1 and the higher power levels.
 
  - New API for netdev <-> devlink_port linkage.
 
  - PTP: convert existing drivers to new frequency adjustment
    implementation.
 
  - DSA: add support for rx offloading.
 
  - Autoload DSA tagging driver when dynamically changing protocol.
 
  - Add new PCP and APPTRUST attributes to Data Center Bridging.
 
  - Add configuration support for 800Gbps link speed.
 
  - Add devlink port function attribute to enable/disable RoCE and
    migratable.
 
  - Extend devlink-rate to support strict prioriry and weighted fair
    queuing.
 
  - Add devlink support to directly reading from region memory.
 
  - New device tree helper to fetch MAC address from nvmem.
 
  - New big TCP helper to simplify temporary header stripping.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Marvel Octeon CNF95N and CN10KB Ethernet Switches.
    - Marvel Prestera AC5X Ethernet Switch.
    - WangXun 10 Gigabit NIC.
    - Motorcomm yt8521 Gigabit Ethernet.
    - Microchip ksz9563 Gigabit Ethernet Switch.
    - Microsoft Azure Network Adapter.
    - Linux Automation 10Base-T1L adapter.
 
  - PHY:
    - Aquantia AQR112 and AQR412.
    - Motorcomm YT8531S.
 
  - PTP:
    - Orolia ART-CARD.
 
  - WiFi:
    - MediaTek Wi-Fi 7 (802.11be) devices.
    - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
      devices.
 
  - Bluetooth:
    - Broadcom BCM4377/4378/4387 Bluetooth chipsets.
    - Realtek RTL8852BE and RTL8723DS.
    - Cypress.CYW4373A0 WiFi + Bluetooth combo device.
 
 Drivers
 -------
  - CAN:
    - gs_usb: bus error reporting support.
    - kvaser_usb: listen only and bus error reporting support.
 
  - Ethernet NICs:
    - Intel (100G):
      - extend action skbedit to RX queue mapping.
      - implement devlink-rate support.
      - support direct read from memory.
    - nVidia/Mellanox (mlx5):
      - SW steering improvements, increasing rules update rate.
      - Support for enhanced events compression.
      - extend H/W offload packet manipulation capabilities.
      - implement IPSec packet offload mode.
    - nVidia/Mellanox (mlx4):
      - better big TCP support.
    - Netronome Ethernet NICs (nfp):
      - IPsec offload support.
      - add support for multicast filter.
    - Broadcom:
      - RSS and PTP support improvements.
    - AMD/SolarFlare:
      - netlink extened ack improvements.
      - add basic flower matches to offload, and related stats.
    - Virtual NICs:
      - ibmvnic: introduce affinity hint support.
    - small / embedded:
      - FreeScale fec: add initial XDP support.
      - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
      - TI am65-cpsw: add suspend/resume support.
      - Mediatek MT7986: add RX wireless wthernet dispatch support.
      - Realtek 8169: enable GRO software interrupt coalescing per
        default.
 
  - Ethernet high-speed switches:
    - Microchip (sparx5):
      - add support for Sparx5 TC/flower H/W offload via VCAP.
    - Mellanox mlxsw:
      - add 802.1X and MAC Authentication Bypass offload support.
      - add ip6gre support.
 
  - Embedded Ethernet switches:
    - Mediatek (mtk_eth_soc):
      - improve PCS implementation, add DSA untag support.
      - enable flow offload support.
    - Renesas:
      - add rswitch R-Car Gen4 gPTP support.
    - Microchip (lan966x):
      - add full XDP support.
      - add TC H/W offload via VCAP.
      - enable PTP on bridge interfaces.
    - Microchip (ksz8):
      - add MTU support for KSZ8 series.
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - support configuring channel dwell time during scan.
 
  - MediaTek WiFi (mt76):
    - enable Wireless Ethernet Dispatch (WED) offload support.
    - add ack signal support.
    - enable coredump support.
    - remain_on_channel support.
 
  - Intel WiFi (iwlwifi):
    - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
    - 320 MHz channels support.
 
  - RealTek WiFi (rtw89):
    - new dynamic header firmware format support.
    - wake-over-WLAN support.
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
 ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
 kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
 DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
 H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
 oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
 Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
 yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
 ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
 V8gnt1EL+0cc
 =CbJC
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict prioriry and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvel Octeon CNF95N and CN10KB Ethernet Switches
      - Marvel Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress.CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extened ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX wireless wthernet dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...
2022-12-13 15:47:48 -08:00
Kumar Kartikeya Dwivedi
958cf2e273 bpf: Introduce bpf_obj_new
Introduce type safe memory allocator bpf_obj_new for BPF programs. The
kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
to kfuncs still requires having them in prototype, unlike BPF helpers
which always take 5 arguments and have them checked using bpf_func_proto
in verifier, ignoring unset argument types.

Introduce __ign suffix to ignore a specific kfunc argument during type
checks, then use this to introduce support for passing type metadata to
the bpf_obj_new_impl kfunc.

The user passes BTF ID of the type it wants to allocates in program BTF,
the verifier then rewrites the first argument as the size of this type,
after performing some sanity checks (to ensure it exists and it is a
struct type).

The second argument is also fixed up and passed by the verifier. This is
the btf_struct_meta for the type being allocated. It would be needed
mostly for the offset array which is required for zero initializing
special fields while leaving the rest of storage in unitialized state.

It would also be needed in the next patch to perform proper destruction
of the object's special fields.

Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
using the any context BPF memory allocator introduced recently. To this
end, a global instance of the BPF memory allocator is initialized on
boot to be used for this purpose. This 'bpf_global_ma' serves all
allocations for bpf_obj_new. In the future, bpf_obj_new variants will
allow specifying a custom allocator.

Note that now that bpf_obj_new can be used to allocate objects that can
be linked to BPF linked list (when future linked list helpers are
available), we need to also free the elements using bpf_mem_free.
However, since the draining of elements is done outside the
bpf_spin_lock, we need to do migrate_disable around the call since
bpf_list_head_free can be called from map free path where migration is
enabled. Otherwise, when called from BPF programs migration is already
disabled.

A convenience macro is included in the bpf_experimental.h header to hide
over the ugly details of the implementation, leading to user code
looking similar to a language level extension which allocates and
constructs fields of a user type.

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
};

void prog(void) {
	struct foo *f;

	f = bpf_obj_new(typeof(*f));
	if (!f)
		return;
	...
}

A key piece of this story is still missing, i.e. the free function,
which will come in the next patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Jason A. Donenfeld
8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Jakub Kicinski
94adb5e29e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20 17:49:10 -07:00
Jakub Kicinski
3566a79c9e bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY08JQQAKCRDbK58LschI
 g0M0AQCWGrJcnQFut1qwR9efZUadwxtKGAgpaA/8Smd8+v7c8AD/SeHQuGfkFiD6
 rx18hv1mTfG0HuPnFQy6YZQ98vmznwE=
 =DaeS
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-18

We've added 33 non-merge commits during the last 14 day(s) which contain
a total of 31 files changed, 874 insertions(+), 538 deletions(-).

The main changes are:

1) Add RCU grace period chaining to BPF to wait for the completion
   of access from both sleepable and non-sleepable BPF programs,
   from Hou Tao & Paul E. McKenney.

2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
   values. In the wild we have seen OS vendors doing buggy backports
   where helper call numbers mismatched. This is an attempt to make
   backports more foolproof, from Andrii Nakryiko.

3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions,
   from Roberto Sassu.

4) Fix libbpf's BTF dumper for structs with padding-only fields,
   from Eduard Zingerman.

5) Fix various libbpf bugs which have been found from fuzzing with
   malformed BPF object files, from Shung-Hsi Yu.

6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT,
   from Jie Meng.

7) Fix various ASAN bugs in both libbpf and selftests when running
   the BPF selftest suite on arm64, from Xu Kuohai.

8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest
   and use in-skeleton link pointer to remove an explicit bpf_link__destroy(),
   from Jiri Olsa.

9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying
   on symlinked iptables which got upgraded to iptables-nft,
   from Martin KaFai Lau.

10) Minor BPF selftest improvements all over the place, from various others.

* tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits)
  bpf/docs: Update README for most recent vmtest.sh
  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
  bpf: Use rcu_trace_implies_rcu_gp() in local storage map
  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
  rcu-tasks: Provide rcu_trace_implies_rcu_gp()
  selftests/bpf: Use sys_pidfd_open() helper when possible
  libbpf: Fix null-pointer dereference in find_prog_by_sec_insn()
  libbpf: Deal with section with no data gracefully
  libbpf: Use elf_getshdrnum() instead of e_shnum
  selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c
  selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow
  selftest/bpf: Fix memory leak in kprobe_multi_test
  selftests/bpf: Fix memory leak caused by not destroying skeleton
  libbpf: Fix memory leak in parse_usdt_arg()
  libbpf: Fix use-after-free in btf_dump_name_dups
  selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test
  selftests/bpf: Alphabetize DENYLISTs
  selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id()
  libbpf: Introduce bpf_link_get_fd_by_id_opts()
  libbpf: Introduce bpf_btf_get_fd_by_id_opts()
  ...
====================

Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18 18:56:43 -07:00
Hou Tao
4835f9ee98 bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
To support both sleepable and normal uprobe bpf program, the freeing of
trace program array chains a RCU-tasks-trace grace period and a normal
RCU grace period one after the other.

With the introduction of rcu_trace_implies_rcu_gp(),
__bpf_prog_array_free_sleepable_cb() can check whether or not a normal
RCU grace period has also passed after a RCU-tasks-trace grace period
has passed. If it is true, it is safe to invoke kfree() directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Jason A. Donenfeld
a251c17aa5 treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:58 -06:00
Jason A. Donenfeld
81895a65ec treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

-       RAND = get_random_u32();
        ... when != RAND
-       RAND %= (E);
+       RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

        ((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
        value = int(literal, 16)
elif literal[0] in '123456789':
        value = int(literal, 10)
if value is None:
        print("I don't know how to handle %s" % (literal))
        cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
        print("Skipping 0x%x for cleanup elsewhere" % (value))
        cocci.include_match(False)
elif value & (value + 1) != 0:
        print("Skipping 0x%x because it's not a power of two minus one" % (value))
        cocci.include_match(False)
elif literal.startswith('0x'):
        coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
        coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

 {
-       T VAR;
-       VAR = (E);
-       return VAR;
+       return E;
 }

@drop_var@
type T;
identifier VAR;
@@

 {
-       T VAR;
        ... when != VAR
 }

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:55 -06:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Alexander Potapenko
a6a7aaba7f bpf: kmsan: initialize BPF registers with zeroes
When executing BPF programs, certain registers may get passed
uninitialized to helper functions.  E.g.  when performing a JMP_CALL,
registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
many of them are actually used.

Passing uninitialized values as function parameters is technically
undefined behavior, so we work around it by always initializing the
registers.

Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:25 -07:00
Song Liu
19c02415da bpf: use bpf_prog_pack for bpf_dispatcher
Allocate bpf_dispatcher with bpf_prog_pack_alloc so that bpf_dispatcher
can share pages with bpf programs.

arch_prepare_bpf_dispatcher() is updated to provide a RW buffer as working
area for arch code to write to.

This also fixes CPA W^X warnning like:

CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ...

Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220926184739.3512547-2-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-26 20:40:43 -07:00
Jakub Kicinski
60ad1100d5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/.gitignore
  sort the net-next version and use it

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-01 12:58:02 -07:00
Kuniyuki Iwashima
0947ae1121 bpf: Fix a data-race around bpf_jit_limit.
While reading bpf_jit_limit, it can be changed concurrently via sysctl,
WRITE_ONCE() in __do_proc_doulongvec_minmax(). The size of bpf_jit_limit
is long, so we need to add a paired READ_ONCE() to avoid load-tearing.

Fixes: ede95a63b5 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220823215804.2177-1-kuniyu@amazon.com
2022-08-24 00:27:14 +02:00
Jesper Dangaard Brouer
c8996c98f7 bpf: Add BPF-helper for accessing CLOCK_TAI
Commit 3dc6ffae2d ("timekeeping: Introduce fast accessor to clock tai")
introduced a fast and NMI-safe accessor for CLOCK_TAI. Especially in time
sensitive networks (TSN), where all nodes are synchronized by Precision Time
Protocol (PTP), it's helpful to have the possibility to generate timestamps
based on CLOCK_TAI instead of CLOCK_MONOTONIC. With a BPF helper for TAI in
place, it becomes very convenient to correlate activity across different
machines in the network.

Use cases for such a BPF helper include functionalities such as Tx launch
time (e.g. ETF and TAPRIO Qdiscs) and timestamping.

Note: CLOCK_TAI is nothing new per se, only the NMI-safe variant of it is.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
[Kurt: Wrote changelog and renamed helper]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20220809060803.5773-2-kurt@linutronix.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-09 09:47:13 -07:00
Jakub Kicinski
b3fce974d4 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2022-07-22

We've added 73 non-merge commits during the last 12 day(s) which contain
a total of 88 files changed, 3458 insertions(+), 860 deletions(-).

The main changes are:

1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai.

2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel
   syscalls through kprobe mechanism, from Andrii Nakryiko.

3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel
   function, from Song Liu & Jiri Olsa.

4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change
   entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi.

5) Add a ksym BPF iterator to allow for more flexible and efficient interactions
   with kernel symbols, from Alan Maguire.

6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter.

7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov.

8) libbpf support for writing custom perf event readers, from Jon Doron.

9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar.

10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski.

11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin.

12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev.

13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui.

14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong.

15) Make non-prealloced BPF map allocations low priority to play better with
    memcg limits, from Yafang Shao.

16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao.

17) Various smaller cleanups and improvements all over the place.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits)
  bpf: Simplify bpf_prog_pack_[size|mask]
  bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)
  bpf, x64: Allow to use caller address from stack
  ftrace: Allow IPMODIFY and DIRECT ops on the same function
  ftrace: Add modify_ftrace_direct_multi_nolock
  bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test
  bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF
  selftests/bpf: Fix test_verifier failed test in unprivileged mode
  selftests/bpf: Add negative tests for new nf_conntrack kfuncs
  selftests/bpf: Add tests for new nf_conntrack kfuncs
  selftests/bpf: Add verifier tests for trusted kfunc args
  net: netfilter: Add kfuncs to set and change CT status
  net: netfilter: Add kfuncs to set and change CT timeout
  net: netfilter: Add kfuncs to allocate and insert CT
  net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup
  bpf: Add documentation for kfuncs
  bpf: Add support for forcing kfunc args to be trusted
  bpf: Switch to new kfunc flags infrastructure
  tools/resolve_btfids: Add support for 8-byte BTF sets
  bpf: Introduce 8-byte BTF set
  ...
====================

Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-22 16:55:44 -07:00
Song Liu
ea2babac63 bpf: Simplify bpf_prog_pack_[size|mask]
Simplify the logic that selects bpf_prog_pack_size, and always use
(PMD_SIZE * num_possible_nodes()). This is a good tradeoff, as most of
the performance benefit observed is from less direct map fragmentation [0].

Also, module_alloc(4MB) may not allocate 4MB aligned memory. Therefore,
we cannot use (ptr & bpf_prog_pack_mask) to find the correct address of
bpf_prog_pack. Fix this by checking the header address falls in the range
of pack->ptr and (pack->ptr + bpf_prog_pack_size).

  [0] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20220713204950.3015201-1-song@kernel.org
2022-07-22 22:08:27 +02:00
Jakub Kicinski
816cd16883 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  310731e2f1 ("net: Fix data-races around sysctl_mem.")
  e70f3c7012 ("Revert "net: set SK_MEM_QUANTUM to 4096"")
https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/

net/ipv4/fib_semantics.c
  747c143072 ("ip: fix dflt addr selection for connected nexthop")
  d62607c3fe ("net: rename reference+tracking helpers")

net/tls/tls.h
include/net/tls.h
  3d8c51b25a ("net/tls: Check for errors in tls_device_init")
  5879031423 ("tls: create an internal header")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-14 15:27:35 -07:00
Song Liu
1d5f82d9dd bpf, x86: fix freeing of not-finalized bpf_prog_pack
syzbot reported a few issues with bpf_prog_pack [1], [2]. This only happens
with multiple subprogs. In jit_subprogs(), we first call bpf_int_jit_compile()
on each sub program. And then, we call it on each sub program again. jit_data
is not freed in the first call of bpf_int_jit_compile(). Similarly we don't
call bpf_jit_binary_pack_finalize() in the first call of bpf_int_jit_compile().

If bpf_int_jit_compile() failed for one sub program, we will call
bpf_jit_binary_pack_finalize() for this sub program. However, we don't have a
chance to call it for other sub programs. Then we will hit "goto out_free" in
jit_subprogs(), and call bpf_jit_free on some subprograms that haven't got
bpf_jit_binary_pack_finalize() yet.

At this point, bpf_jit_binary_pack_free() is called and the whole 2MB page is
freed erroneously.

Fix this with a custom bpf_jit_free() for x86_64, which calls
bpf_jit_binary_pack_finalize() if necessary. Also, with custom
bpf_jit_free(), bpf_prog_aux->use_bpf_prog_pack is not needed any more,
remove it.

Fixes: 1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
[1] https://syzkaller.appspot.com/bug?extid=2f649ec6d2eea1495a8f
[2] https://syzkaller.appspot.com/bug?extid=87f65c75f4a72db05445
Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20220706002612.4013790-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-12 17:33:22 -07:00
Eric Dumazet
0326195f52 bpf: Make sure mac_header was set before using it
Classic BPF has a way to load bytes starting from the mac header.

Some skbs do not have a mac header, and skb_mac_header()
in this case is returning a pointer that 65535 bytes after
skb->head.

Existing range check in bpf_internal_load_pointer_neg_helper()
was properly kicking and no illegal access was happening.

New sanity check in skb_mac_header() is firing, so we need
to avoid it.

WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 skb_mac_header include/linux/skbuff.h:2785 [inline]
WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74
Modules linked in:
CPU: 1 PID: 28990 Comm: syz-executor.0 Not tainted 5.19.0-rc4-syzkaller-00865-g4874fb9484be #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022
RIP: 0010:skb_mac_header include/linux/skbuff.h:2785 [inline]
RIP: 0010:bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74
Code: ff ff 45 31 f6 e9 5a ff ff ff e8 aa 27 40 00 e9 3b ff ff ff e8 90 27 40 00 e9 df fe ff ff e8 86 27 40 00 eb 9e e8 2f 2c f3 ff <0f> 0b eb b1 e8 96 27 40 00 e9 79 fe ff ff 90 41 57 41 56 41 55 41
RSP: 0018:ffffc9000309f668 EFLAGS: 00010216
RAX: 0000000000000118 RBX: ffffffffffeff00c RCX: ffffc9000e417000
RDX: 0000000000040000 RSI: ffffffff81873f21 RDI: 0000000000000003
RBP: ffff8880842878c0 R08: 0000000000000003 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000001 R12: 0000000000000004
R13: ffff88803ac56c00 R14: 000000000000ffff R15: dffffc0000000000
FS: 00007f5c88a16700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdaa9f6c058 CR3: 000000003a82c000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
____bpf_skb_load_helper_32 net/core/filter.c:276 [inline]
bpf_skb_load_helper_32+0x191/0x220 net/core/filter.c:264

Fixes: f9aefd6b2a ("net: warn if mac header was not set")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220707123900.945305-1-edumazet@google.com
2022-07-07 20:13:13 +02:00
Stanislav Fomichev
c0e19f2c9a bpf: minimize number of allocated lsm slots per program
Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.

struct cgroup_bpf {
	struct bpf_prog_array *    effective[33];        /*     0   264 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	struct hlist_head          progs[33];            /*   264   264 */
	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
	u8                         flags[33];            /*   528    33 */

	/* XXX 7 bytes hole, try to pack */

	struct list_head           storages;             /*   568    16 */
	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
	struct bpf_prog_array *    inactive;             /*   584     8 */
	struct percpu_ref          refcnt;               /*   592    16 */
	struct work_struct         release_work;         /*   608    72 */

	/* size: 680, cachelines: 11, members: 7 */
	/* sum members: 673, holes: 1, sum holes: 7 */
	/* last cacheline: 40 bytes */
};

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:52 -07:00
Stanislav Fomichev
69fd337a97 bpf: per-cgroup lsm flavor
Allow attaching to lsm hooks in the cgroup context.

Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.

For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.

Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.

Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.

Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-29 13:21:51 -07:00
Tony Ambardar
95acd8817e bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail
calls for only x86-64. Change the logic to instead rely on a new weak
function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable
JIT backend can override.

Update the x86-64 eBPF JIT to reflect this.

Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
[jakub: drop MIPS bits and tweak patch subject]
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
2022-06-21 18:52:04 +02:00
Delyan Kratunov
8c7dcb84e3 bpf: implement sleepable uprobes by chaining gps
uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.

Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).

Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.

Users who iterate under tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations).

The downside is that the tasks_trace latency affects all perf_event-attached
bpf programs (and not just uprobe ones). This is deemed safe given the
possible attach rates for kprobe/uprobe/tp programs.

Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
an rcu read section, in addition to the overarching tasks_trace section.

Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-06-16 19:27:29 -07:00
Pu Lehui
cc1685546d bpf: Correct the comment about insn_to_jit_off
The insn_to_jit_off passed to bpf_prog_fill_jited_linfo should be the
first byte of the next instruction, or the byte off to the end of the
current instruction.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220530092815.1112406-4-pulehui@huawei.com
2022-06-02 16:26:06 -07:00
Menglong Dong
caff1fa411 bpf: Fix probe read error in ___bpf_prog_run()
I think there is something wrong with BPF_PROBE_MEM in ___bpf_prog_run()
in big-endian machine. Let's make a test and see what will happen if we
want to load a 'u16' with BPF_PROBE_MEM.

Let's make the src value '0x0001', the value of dest register will become
0x0001000000000000, as the value will be loaded to the first 2 byte of
DST with following code:

  bpf_probe_read_kernel(&DST, SIZE, (const void *)(long) (SRC + insn->off));

Obviously, the value in DST is not correct. In fact, we can compare
BPF_PROBE_MEM with LDX_MEM_H:

  DST = *(SIZE *)(unsigned long) (SRC + insn->off);

If the memory load is done by LDX_MEM_H, the value in DST will be 0x1 now.

And I think this error results in the test case 'test_bpf_sk_storage_map'
failing:

  test_bpf_sk_storage_map:PASS:bpf_iter_bpf_sk_storage_map__open_and_load 0 nsec
  test_bpf_sk_storage_map:PASS:socket 0 nsec
  test_bpf_sk_storage_map:PASS:map_update 0 nsec
  test_bpf_sk_storage_map:PASS:socket 0 nsec
  test_bpf_sk_storage_map:PASS:map_update 0 nsec
  test_bpf_sk_storage_map:PASS:socket 0 nsec
  test_bpf_sk_storage_map:PASS:map_update 0 nsec
  test_bpf_sk_storage_map:PASS:attach_iter 0 nsec
  test_bpf_sk_storage_map:PASS:create_iter 0 nsec
  test_bpf_sk_storage_map:PASS:read 0 nsec
  test_bpf_sk_storage_map:FAIL:ipv6_sk_count got 0 expected 3
  $10/26 bpf_iter/bpf_sk_storage_map:FAIL

The code of the test case is simply, it will load sk->sk_family to the
register with BPF_PROBE_MEM and check if it is AF_INET6. With this patch,
now the test case 'bpf_iter' can pass:

  $10  bpf_iter:OK

Fixes: 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20220524021228.533216-1-imagedong@tencent.com
2022-05-28 01:09:18 +02:00
Song Liu
fe736565ef bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
Introduce bpf_arch_text_invalidate and use it to fill unused part of the
bpf_prog_pack with illegal instructions when a BPF program is freed.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220520235758.1858153-4-song@kernel.org
2022-05-23 23:08:11 +02:00
Song Liu
d88bb5eed0 bpf: Fill new bpf_prog_pack with illegal instructions
bpf_prog_pack enables sharing huge pages among multiple BPF programs.
These pages are marked as executable before the JIT engine fill it with
BPF programs. To make these pages safe, fill the hole bpf_prog_pack with
illegal instructions before making it executable.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220520235758.1858153-2-song@kernel.org
2022-05-23 23:07:29 +02:00
Alexei Starovoitov
4b6313cf99 bpf: Fix combination of jit blinding and pointers to bpf subprogs.
The combination of jit blinding and pointers to bpf subprogs causes:
[   36.989548] BUG: unable to handle page fault for address: 0000000100000001
[   36.990342] #PF: supervisor instruction fetch in kernel mode
[   36.990968] #PF: error_code(0x0010) - not-present page
[   36.994859] RIP: 0010:0x100000001
[   36.995209] Code: Unable to access opcode bytes at RIP 0xffffffd7.
[   37.004091] Call Trace:
[   37.004351]  <TASK>
[   37.004576]  ? bpf_loop+0x4d/0x70
[   37.004932]  ? bpf_prog_3899083f75e4c5de_F+0xe3/0x13b

The jit blinding logic didn't recognize that ld_imm64 with an address
of bpf subprogram is a special instruction and proceeded to randomize it.
By itself it wouldn't have been an issue, but jit_subprogs() logic
relies on two step process to JIT all subprogs and then JIT them
again when addresses of all subprogs are known.
Blinding process in the first JIT phase caused second JIT to miss
adjustment of special ld_imm64.

Fix this issue by ignoring special ld_imm64 instructions that don't have
user controlled constants and shouldn't be blinded.

Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220513011025.13344-1-alexei.starovoitov@gmail.com
2022-05-13 15:13:48 +02:00
Feng Zhou
07343110b2 bpf: add bpf_map_lookup_percpu_elem for percpu map
Add new ebpf helpers bpf_map_lookup_percpu_elem.

The implementation method is relatively simple, refer to the implementation
method of map_lookup_elem of percpu map, increase the parameters of cpu, and
obtain it according to the specified cpu.

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20220511093854.411-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-05-11 18:16:54 -07:00
Song Liu
e581094167 bpf: Fix bpf_prog_pack when PMU_SIZE is not defined
PMD_SIZE is not available in some special config, e.g. ARCH=arm with
CONFIG_MMU=n. Use bpf_prog_pack of PAGE_SIZE in these cases.

Fixes: ef078600ee ("bpf: Select proper size for bpf_prog_pack")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220321180009.1944482-3-song@kernel.org
2022-03-21 13:53:45 -07:00
Song Liu
96805674e5 bpf: Fix bpf_prog_pack for multi-node setup
module_alloc requires num_online_nodes * PMD_SIZE to allocate huge pages.
bpf_prog_pack uses pack of size num_online_nodes * PMD_SIZE.
OTOH, module_alloc returns addresses that are PMD_SIZE aligned (instead of
num_online_nodes * PMD_SIZE aligned). Therefore, PMD_MASK should be used
to calculate pack_ptr in bpf_prog_pack_free().

Fixes: ef078600ee ("bpf: Select proper size for bpf_prog_pack")
Reported-by: syzbot+c946805b5ce6ab87df0b@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220321180009.1944482-2-song@kernel.org
2022-03-21 13:53:45 -07:00
Song Liu
ef078600ee bpf: Select proper size for bpf_prog_pack
Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
cases. Specifically, for NUMA systems, __vmalloc_node_range requires
PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
does not support huge pages (i.e., with cmdline option nohugevmalloc), it
is better to use PAGE_SIZE packs.

Add logic to select proper size for bpf_prog_pack. This solution is not
ideal, as it makes assumption about the behavior of module_alloc and
__vmalloc_node_range. However, it appears to be the easiest solution as
it doesn't require changes in module_alloc and vmalloc code.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220311201135.3573610-1-song@kernel.org
2022-03-20 19:07:14 -07:00
Hou Tao
d2a3b7c5be bpf: Fix net.core.bpf_jit_harden race
It is the bpf_jit_harden counterpart to commit 60b58afc96 ("bpf: fix
net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
for each subprog if there are subprogs in bpf program and constant
blinding may increase the length of program, so when running
"./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
jit_subprogs may fail because constant blinding increases the length
of subprog instructions during extra passs.

So cache the value of bpf_jit_blinding_enabled() during program
allocation, and use the cached value during constant blinding, subprog
JITing and args tracking of tail call.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220309123321.2400262-4-houtao1@huawei.com
2022-03-16 15:13:36 -07:00
Song Liu
676b2daaba bpf, x86: Set header->size properly before freeing it
On do_jit failure path, the header is freed by bpf_jit_binary_pack_free.
While bpf_jit_binary_pack_free doesn't require proper ro_header->size,
bpf_prog_pack_free still uses it. Set header->size in bpf_int_jit_compile
before calling bpf_jit_binary_pack_free.

Fixes: 1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220302175126.247459-3-song@kernel.org
2022-03-02 13:24:37 -08:00
Song Liu
d24d2a2b0a bpf: bpf_prog_pack: Set proper size before freeing ro_header
bpf_prog_pack_free() uses header->size to decide whether the header
should be freed with module_memfree() or the bpf_prog_pack logic.
However, in kvmalloc() failure path of bpf_jit_binary_pack_alloc(),
header->size is not set yet. As a result, bpf_prog_pack_free() may treat
a slice of a pack as a standalone kvmalloc'd header and call
module_memfree() on the whole pack. This in turn causes use-after-free by
other users of the pack.

Fix this by setting ro_header->size before freeing ro_header.

Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
Reported-by: syzbot+ecb1e7e51c52f68f7481@syzkaller.appspotmail.com
Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220217183001.1876034-1-song@kernel.org
2022-02-17 13:15:36 -08:00
Song Liu
4cc0991abd bpf: Fix bpf_prog_pack build for ppc64_defconfig
bpf_prog_pack causes build error with powerpc ppc64_defconfig:

kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope
  830 |         unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
      |                       ^~~~~~

This is because the marco expands as:

unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \
     + ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))];

where __pte_index_size is a global variable.

Fix it by turning bitmap into a 0-length array.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org
2022-02-10 18:59:32 -08:00
Song Liu
c1b13a9451 bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
Fix build with CONFIG_TRANSPARENT_HUGEPAGE=n with BPF_PROG_PACK_SIZE as
PAGE_SIZE.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220208220509.4180389-3-song@kernel.org
2022-02-08 18:37:10 -08:00
Song Liu
33c9805860 bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]
This is the jit binary allocator built on top of bpf_prog_pack.

bpf_prog_pack allocates RO memory, which cannot be used directly by the
JIT engine. Therefore, a temporary rw buffer is allocated for the JIT
engine. Once JIT is done, bpf_jit_binary_pack_finalize is used to copy
the program to the RO memory.

bpf_jit_binary_pack_alloc reserves 16 bytes of extra space for illegal
instructions, which is small than the 128 bytes space reserved by
bpf_jit_binary_alloc. This change is necessary for bpf_jit_binary_hdr
to find the correct header. Also, flag use_bpf_prog_pack is added to
differentiate a program allocated by bpf_jit_binary_pack_alloc.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-9-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
57631054fa bpf: Introduce bpf_prog_pack allocator
Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could add significant
pressure to instruction TLB. High iTLB pressure usually causes slow down
for the whole system, which includes visible performance degradation for
production workloads.

Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
page. The memory is then allocated in 64 byte chunks.

Memory allocated by bpf_prog_pack allocator is RO protected after initial
allocation. To write to it, the user (jit engine) need to use text poke
API.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-8-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
ebc1415d9b bpf: Introduce bpf_arch_text_copy
This will be used to copy JITed text to RO protected module memory. On
x86, bpf_arch_text_copy is implemented with text_poke_copy.

bpf_arch_text_copy returns pointer to dst on success, and ERR_PTR(errno)
on errors.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-7-song@kernel.org
2022-02-07 18:13:01 -08:00