Commit graph

35686 commits

Author SHA1 Message Date
Linus Torvalds
3f6ec19f2d Time and timer updates:
- Instead of new drivers remove tango, sirf, u300 and atlas drivers
  - Add suspend/resume support for microchip pit64b
  - The usual fixes, improvements and cleanups here and there
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmAqjecTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodVTD/9UuKlOifRBNd4ECR+yF65MzLfjqNHU
 j76E3Dzuf4QCbXTjmQAsadaqQ+w9l8Ie7OVT51XlmTczEBkJiV3FOGLVJeRun6R4
 7OLF3VOYyBUtMoRdHzyXaYOTbsOK9gZitucDeLCQhvKhDCkVnKKFXNJR+TTkSYM3
 xDBLwuI7uuHWHyh0+W+3SI1pTiEA4yMe5ZbqqoJbjGQapr3Eao+nyjd1aa3ERb2f
 PtS7UVQ69QowRqq6DQyZk0yKit8J3HZnHfCPH/T6eXsxGnui36GiUnTGCMhLMZpD
 Xvl/5cjqQuKjgt2093t8nGiumOGBOfrb8uvc/qMW777DzFe/VJtXrC/7pySVLAhK
 oc9Swj0iX/WPARzlpyOk3lfpDMzv6qyjMNJIXcnav2lrknITp+TMORKWOADB03UV
 sswlN7YFTrNe7d7uxEdybKkNX6bUwgOzo2m69A1IdSXwKPzYkZQmku6Y7GnzYErZ
 aiJiZl858VB9g24ROKLt/uQTarzYCS0sjcdnDgO1KSR7zKHZ4iUpd3zucd3mlmUo
 fGTMIbCqL/gzl4Zcl6njvzMVfJeMzOeDiQ41wCyYnOsXlIKmWNi1rONdGZcyDYvN
 bOiGVUMKicEZwBlSZQQ1GdR9eGf8/Ix5cTlynZepN3dkHUDc7SU+OvQZdg4Yd/RY
 BsquFDHnt4gl+A==
 =xZQt
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Time and timer updates:

   - Instead of new drivers remove tango, sirf, u300 and atlas drivers

   - Add suspend/resume support for microchip pit64b

   - The usual fixes, improvements and cleanups here and there"

* tag 'timers-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timens: Delete no-op time_ns_init()
  alarmtimer: Update kerneldoc
  clocksource/drivers/timer-microchip-pit64b: Add clocksource suspend/resume
  clocksource/drivers/prima: Remove sirf prima driver
  clocksource/drivers/atlas: Remove sirf atlas driver
  clocksource/drivers/tango: Remove tango driver
  clocksource/drivers/u300: Remove the u300 driver
  dt-bindings: timer: nuvoton: Clarify that interrupt of timer 0 should be specified
  clocksource/drivers/davinci: Move pr_fmt() before the includes
  clocksource/drivers/efm32: Drop unused timer code
2021-02-21 11:55:43 -08:00
Linus Torvalds
b5183bc94b Updates for the irq subsystem:
- The usual new irq chip driver (Realtek RTL83xx)
   - Removal of sirfsoc and tango irq chip drivers
   - Conversion of the sun6i chip support to hierarchical irq domains
   - The usual fixes, improvements and cleanups all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmAqjRITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYob0ND/9ghLC0XF7gSrIFBxNXczQcckd0Ev8y
 gt7EJMxy9+UG/krEmLk4p5FkkKaUQy4v8PMqhDpI0pR7Fdm2oaOCgPRHsNdVYr5J
 g7FawE99qPiRXaR29rpfX6qIBRP3rJ4/MS90YlcXGvKbgPtqCXStEWrIyu1/1Q8A
 3BdfKL3K15/TeWeu9075yHX8V4Ovss6Birc0/XNtQWiHNk84GAZVIQPjfdHEu83j
 Kv4QxlrfSDjCvB016YKL04kjrwzUXxksvGnyjFXNXFZwg0bVtLTujSdkXajUBQjv
 1ZpcEM+gBPdeOU99ELYeaQ9B0Di2mNjya5TBttmKdKW/nQ01BkWUouoqyBRqjpvd
 +2uXaad/D7iKUYXqXtSim8OpwBLb5THQdG4qn1vgYZOUxWu9DKdtZEO3hw8edgjN
 zJEQFI8yBeDtcAjKTqm0C2fomlWPhDM0A97EHmV8LiQpz0PyY9q8Oogg51t1RggH
 7w9+YjjrGkaPifZXX+YogJT/CbFanK6lOUIFg/qro1QcicCsEbWrTaHUOwdpWpHr
 B4FKZIfrlnrY7yMC7aClPQSJr3iUXBUjfWhRKvYiwOLTabfUWz+HRFA++R8vytbL
 RvEZnhuvi3eyRn91yiapQM/f69yC1yslTZHy4/Ww9kvtOoawQpmegnZ3dLMhwBN1
 aFgQi4U9Z7G2eA==
 =rM8s
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates for the irq subsystem:

   - The usual new irq chip driver (Realtek RTL83xx)

   - Removal of sirfsoc and tango irq chip drivers

   - Conversion of the sun6i chip support to hierarchical irq domains

   - The usual fixes, improvements and cleanups all over the place"

* tag 'irq-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/imx: IMX_INTMUX should not default to y, unconditionally
  irqchip/loongson-pch-msi: Use bitmap_zalloc() to allocate bitmap
  irqchip/csky-mpintc: Prevent selection on unsupported platforms
  irqchip: Add support for Realtek RTL838x/RTL839x interrupt controller
  dt-bindings: interrupt-controller: Add Realtek RTL838x/RTL839x support
  irqchip/ls-extirq: add IRQCHIP_SKIP_SET_WAKE to the irqchip flags
  genirq: Use new tasklet API for resend_tasklet
  dt-bindings: qcom,pdc: Add compatible for SM8350
  dt-bindings: qcom,pdc: Add compatible for SM8250
  irqchip/sun6i-r: Add wakeup support
  irqchip/sun6i-r: Use a stacked irqchip driver
  dt-bindings: irq: sun6i-r: Add a compatible for the H3
  dt-bindings: irq: sun6i-r: Split the binding from sun7i-nmi
  irqchip/gic-v3: Fix typos in PMR/RPR SCR_EL3.FIQ handling explanation
  irqchip: Remove sirfsoc driver
  irqchip: Remove sigma tango driver
2021-02-21 11:53:06 -08:00
Linus Torvalds
582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds
24880bef41 Remove oprofile and dcookies support
The "oprofile" user-space tools don't use the kernel OPROFILE support any more,
 and haven't in a long time. User-space has been converted to the perf
 interfaces.
 
 The dcookies stuff is only used by the oprofile code. Now that oprofile's
 support is getting removed from the kernel, there is no need for dcookies as
 well.
 
 Remove kernel's old oprofile and dcookies support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJgJMEVAAoJENK5HDyugRIcL8YP/jkmXH5CZT80ntcqrJGWKcG7
 lWbach7uNeQteht7B1ZPKvojxizTkmfrN2sClX0B2hbGkc5TiWUQ2ZSnvnfWDZ8+
 z2qQcEB11G/ReL2vvRk1fJlWdAOyUfrPee/44AkemnLRv+Niw/8PqnGd87yDQGsK
 qy5E1XXfbjUq6Y/uMiLOX3+21I6w6o2Q6I3NNXC93s0wS3awqnft8n0XBC7iAPBj
 eowRJxpdRU2Vcuj8UOzzOI7gQlwdjwYImyLPbRy/V8NawC8a+FHrPrf5/GCYlVzl
 7TGFBsDQSmzvrBChUfoGz1Rq/VZ1a357p5rhRqemfUrdkjW+vyzelnD8I1W/hb2o
 SmBXoPoyl3+UkFHNyJI0mI7obaV+2PzyXMV0JIQUj+IiX/mfeFv0nF4XfZD2IkRt
 6xhaYj775Zrx32iBdGZIvvLg5Gh9ZkZmR5vJ7Fi/EIZFe6Z+bZnPKUROnAgS/o0z
 +UkSygOhgo/1XbqrzZVk1iweWeu+EUMbY4YQv2qVnFhpvsq4ieThcUGQpWcxGjjH
 WP8O0n1yq1slsnpUtxhiTsm46ENajx9zZp6Iv6Ws+NM0RUqjND8BdF1co9WGD3LS
 cnZMFBs4Bg/V1HICL/D4s6L7t1ofrEXIgJH1y3iF0HeECq03mU4CgA/qly9Aebqg
 UxPF3oNlVOPlds9FzsU2
 =I2Ac
 -----END PGP SIGNATURE-----

Merge tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux

Pull oprofile and dcookies removal from Viresh Kumar:
 "Remove oprofile and dcookies support

  The 'oprofile' user-space tools don't use the kernel OPROFILE support
  any more, and haven't in a long time. User-space has been converted to
  the perf interfaces.

  The dcookies stuff is only used by the oprofile code. Now that
  oprofile's support is getting removed from the kernel, there is no
  need for dcookies as well.

  Remove kernel's old oprofile and dcookies support"

* tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux:
  fs: Remove dcookies support
  drivers: Remove CONFIG_OPROFILE support
  arch: xtensa: Remove CONFIG_OPROFILE support
  arch: x86: Remove CONFIG_OPROFILE support
  arch: sparc: Remove CONFIG_OPROFILE support
  arch: sh: Remove CONFIG_OPROFILE support
  arch: s390: Remove CONFIG_OPROFILE support
  arch: powerpc: Remove oprofile
  arch: powerpc: Stop building and using oprofile
  arch: parisc: Remove CONFIG_OPROFILE support
  arch: mips: Remove CONFIG_OPROFILE support
  arch: microblaze: Remove CONFIG_OPROFILE support
  arch: ia64: Remove rest of perfmon support
  arch: ia64: Remove CONFIG_OPROFILE support
  arch: hexagon: Don't select HAVE_OPROFILE
  arch: arc: Remove CONFIG_OPROFILE support
  arch: arm: Remove CONFIG_OPROFILE support
  arch: alpha: Remove CONFIG_OPROFILE support
2021-02-21 10:40:34 -08:00
Linus Torvalds
591fd30eee Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ELF compat updates from Al Viro:
 "Sanitizing ELF compat support, especially for triarch architectures:

   - X32 handling cleaned up

   - MIPS64 uses compat_binfmt_elf.c both for O32 and N32 now

   - Kconfig side of things regularized

  Eventually I hope to have compat_binfmt_elf.c killed, with both native
  and compat built from fs/binfmt_elf.c, with -DELF_BITS={64,32} passed
  by kbuild, but that's a separate story - not included here"

* 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get rid of COMPAT_ELF_EXEC_PAGESIZE
  compat_binfmt_elf: don't bother with undef of ELF_ARCH
  Kconfig: regularize selection of CONFIG_BINFMT_ELF
  mips compat: switch to compat_binfmt_elf.c
  mips: don't bother with ELF_CORE_EFLAGS
  mips compat: don't bother with ELF_ET_DYN_BASE
  mips: KVM_GUEST makes no sense for 64bit builds...
  mips: kill unused definitions in binfmt_elf[on]32.c
  mips binfmt_elf*32.c: use elfcore-compat.h
  x32: make X32, !IA32_EMULATION setups able to execute x32 binaries
  [amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly
  elf_prstatus: collect the common part (everything before pr_reg) into a struct
  binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID
2021-02-21 09:29:23 -08:00
Linus Torvalds
02f9fc286e Power management updates for 5.12-rc1
- Add new power capping facility called DTPM (Dynamic Thermal Power
    Management), based on the existing power capping framework, to
    allow aggregate power constraints to be applied to sets of devices
    in a distributed manner, along with a CPU backend driver based on
    the Energy Model (Daniel Lezcano, Dan Carpenter, Colin Ian King).
 
  - Add AlderLake Mobile support to the Intel RAPL power capping
    driver and make it use the topology interface when laying out the
    system topology (Zhang Rui, Yunfeng Ye).
 
  - Drop the cpufreq tango driver belonging to a platform that is not
    supported any more (Arnd Bergmann).
 
  - Drop the redundant CPUFREQ_STICKY and CPUFREQ_PM_NO_WARN cpufreq
    driver flags (Viresh Kumar).
 
  - Update cpufreq drivers:
 
    * Fix max CPU frequency discovery in the intel_pstate driver and
      make janitorial changes in it (Chen Yu, Rafael Wysocki, Nigel
      Christian).
 
    * Fix resource leaks in the brcmstb-avs-cpufreq driver (Christophe
      JAILLET).
 
    * Make the tegra20 driver use the resource-managed API (Dmitry
      Osipenko).
 
    * Enable boost support in the qcom-hw driver (Shawn Guo).
 
  - Update the operating performance points (OPP) framework:
 
    * Clean up the OPP core (Dmitry Osipenko, Viresh Kumar).
 
    * Extend the OPP API by adding new helpers to it (Dmitry Osipenko,
      Viresh Kumar).
 
    * Allow required OPPs to be used for devfreq devices and update
      the devfreq governor code accordingly (Saravana Kannan).
 
    * Prepare the framework for introducing new dev_pm_opp_set_opp()
      helper (Viresh Kumar).
 
    * Drop dev_pm_opp_set_bw() and update related drivers (Viresh
      Kumar).
 
    * Allow lazy linking of required-OPPs (Viresh Kumar).
 
  - Simplify and clean up devfreq somewhat (Lukasz Luba, Yang Li,
    Pierre Kuo).
 
  - Update the generic power domains (genpd) framework:
 
    * Use device's next wakeup to determine domain idle state (Lina
      Iyer).
 
    * Improve initialization and debug (Dmitry Osipenko).
 
    * Simplify computations (Abaci Team).
 
  - Make janitorial changes in the core code handling system sleep
    and PM-runtime (Bhaskar Chowdhury, Bjorn Helgaas, Rikard Falkeborn,
    Zqiang).
 
  - Update the MAINTAINERS entry for the exynos cpuidle driver and
    drop DEBUG definition from intel_idle (Krzysztof Kozlowski, Tom
    Rix).
 
  - Extend the PM clock layer to cover clocks that must sleep (Nicolas
    Pitre).
 
  - Update the cpupower utility:
 
    * Update cpupower command, add support for AMD family 0x19 and clean
      up the code to remove many of the family checks to make future
      family updates easier (Nathan Fontenot, Robert Richter).
 
    * Add Makefile dependencies for install targets to allow building
      cpupower in parallel rather than serially (Ivan Babrou).
 
  - Make janitorial changes in power management Kconfig (Lukasz Luba).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmAquvMSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx7MEQAIFx7WGu0TquvYYbX1Op4oxHaN5wLZba
 XZjh5g4dSn+fKF+WXO4+Ze8VwFx2E6p2650OYJ9A3H83yqjWz5x0CoYm++LTWkWJ
 CRPhyAL2JzrqDL2oJ/7XdK41cz/DT1p2B5cdOVeI4OaLhTzLa4TAO3EwCA/Eyayb
 qumbp6vt3G3zfSDLMA8Wa/HDNaWcN0/NDdnhlj9zlT27COyFJJUARa1jGA/5+2BK
 k/LP7homFIPf8vTGMkL/JuHU7Fsqa2vCFEzB22xyD8GE59dHoXYTlKA2pOr/2lJM
 VV5h0FDS6lFbkp6AQDHwh5tsGusT7dREFXebBUWtmETZmB9iQZXAo4k+MnUbYGSt
 stYFdDJpQkK42icF7uhJE1xuZkQ16xBm02pBvlJWMSyYyHCHhUH83VxhA11szA5/
 glLHfhhdbAa1BKmFHuTEZiwCAssY+YmoAvUpgLW04csJ4s4+my7VCgVe6jOILv2H
 0PdakYam/UEXPoDR4bAdJePZQvwbeUQUtmZ/oYb7Ab2ztfFcpVtZ5T0QeKazXkiZ
 BDtJ+XQJAhZOmLL4u2owGdevvjQU+68QZF1NfMoOTW2K7bcok3pVjj5uKRkfir6F
 R74mzHNNKOWBO9NwByinrMNJqHqPxfbREvg92Xi7GGDtt+rD7/K3K4Au+9W9MxwL
 AvUhPGadUZT1
 =9xhj
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add a new power capping facility allowing aggregate power
  constraints to be applied to sets of devices in a distributed manner,
  add a new CPU ID to the RAPL power capping driver and improve it, drop
  a cpufreq driver belonging to a platform that is not supported any
  more, drop two redundant cpufreq driver flags, update cpufreq drivers
  (intel_pstate, brcmstb-avs, qcom-hw), update the operating performance
  points (OPP) framework (code cleanups, new helpers, devfreq-related
  modifications), clean up devfreq, extend the PM clock layer, update
  the cpupower utility and make assorted janitorial changes.

  Specifics:

   - Add new power capping facility called DTPM (Dynamic Thermal Power
     Management), based on the existing power capping framework, to
     allow aggregate power constraints to be applied to sets of devices
     in a distributed manner, along with a CPU backend driver based on
     the Energy Model (Daniel Lezcano, Dan Carpenter, Colin Ian King).

   - Add AlderLake Mobile support to the Intel RAPL power capping driver
     and make it use the topology interface when laying out the system
     topology (Zhang Rui, Yunfeng Ye).

   - Drop the cpufreq tango driver belonging to a platform that is not
     supported any more (Arnd Bergmann).

   - Drop the redundant CPUFREQ_STICKY and CPUFREQ_PM_NO_WARN cpufreq
     driver flags (Viresh Kumar).

   - Update cpufreq drivers:

      * Fix max CPU frequency discovery in the intel_pstate driver and
        make janitorial changes in it (Chen Yu, Rafael Wysocki, Nigel
        Christian).

      * Fix resource leaks in the brcmstb-avs-cpufreq driver (Christophe
        JAILLET).

      * Make the tegra20 driver use the resource-managed API (Dmitry
        Osipenko).

      * Enable boost support in the qcom-hw driver (Shawn Guo).

   - Update the operating performance points (OPP) framework:

      * Clean up the OPP core (Dmitry Osipenko, Viresh Kumar).

      * Extend the OPP API by adding new helpers to it (Dmitry Osipenko,
        Viresh Kumar).

      * Allow required OPPs to be used for devfreq devices and update
        the devfreq governor code accordingly (Saravana Kannan).

      * Prepare the framework for introducing new dev_pm_opp_set_opp()
        helper (Viresh Kumar).

      * Drop dev_pm_opp_set_bw() and update related drivers (Viresh
        Kumar).

      * Allow lazy linking of required-OPPs (Viresh Kumar).

   - Simplify and clean up devfreq somewhat (Lukasz Luba, Yang Li,
     Pierre Kuo).

   - Update the generic power domains (genpd) framework:

      * Use device's next wakeup to determine domain idle state (Lina
        Iyer).

      * Improve initialization and debug (Dmitry Osipenko).

      * Simplify computations (Abaci Team).

   - Make janitorial changes in the core code handling system sleep and
     PM-runtime (Bhaskar Chowdhury, Bjorn Helgaas, Rikard Falkeborn,
     Zqiang).

   - Update the MAINTAINERS entry for the exynos cpuidle driver and drop
     DEBUG definition from intel_idle (Krzysztof Kozlowski, Tom Rix).

   - Extend the PM clock layer to cover clocks that must sleep (Nicolas
     Pitre).

   - Update the cpupower utility:

      * Update cpupower command, add support for AMD family 0x19 and
        clean up the code to remove many of the family checks to make
        future family updates easier (Nathan Fontenot, Robert Richter).

      * Add Makefile dependencies for install targets to allow building
        cpupower in parallel rather than serially (Ivan Babrou).

   - Make janitorial changes in power management Kconfig (Lukasz Luba)"

* tag 'pm-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
  MAINTAINERS: cpuidle: exynos: include header in file pattern
  powercap: intel_rapl: Use topology interface in rapl_init_domains()
  powercap: intel_rapl: Use topology interface in rapl_add_package()
  PM: sleep: Constify static struct attribute_group
  PM: Kconfig: remove unneeded "default n" options
  PM: EM: update Kconfig description and drop "default n" option
  cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN
  cpufreq: Remove CPUFREQ_STICKY flag
  PM / devfreq: Add required OPPs support to passive governor
  PM / devfreq: Cache OPP table reference in devfreq
  OPP: Add function to look up required OPP's for a given OPP
  PM / devfreq: rk3399_dmc: Remove unneeded semicolon
  opp: Replace ENOTSUPP with EOPNOTSUPP
  opp: Fix "foo * bar" should be "foo *bar"
  opp: Don't ignore clk_get() errors other than -ENOENT
  opp: Update bandwidth requirements based on scaling up/down
  opp: Allow lazy-linking of required-opps
  opp: Remove dev_pm_opp_set_bw()
  devfreq: tegra30: Migrate to dev_pm_opp_set_opp()
  drm: msm: Migrate to dev_pm_opp_set_opp()
  ...
2021-02-20 21:42:18 -08:00
Linus Torvalds
51e6d17809 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Here is what we have this merge window:

   1) Support SW steering for mlx5 Connect-X6Dx, from Yevgeny Kliteynik.

   2) Add RSS multi group support to octeontx2-pf driver, from Geetha
      Sowjanya.

   3) Add support for KS8851 PHY. From Marek Vasut.

   4) Add support for GarfieldPeak bluetooth controller from Kiran K.

   5) Add support for half-duplex tcan4x5x can controllers.

   6) Add batch skb rx processing to bcrm63xx_enet, from Sieng Piaw
      Liew.

   7) Rework RX port offload infrastructure, particularly wrt, UDP
      tunneling, from Jakub Kicinski.

   8) Add BCM72116 PHY support, from Florian Fainelli.

   9) Remove Dsa specific notifiers, they are unnecessary. From Vladimir
      Oltean.

  10) Add support for picosecond rx delay in dwmac-meson8b chips. From
      Martin Blumenstingl.

  11) Support TSO on xfrm interfaces from Eyal Birger.

  12) Add support for MP_PRIO to mptcp stack, from Geliang Tang.

  13) Support BCM4908 integrated switch, from Rafał Miłecki.

  14) Support for directly accessing kernel module variables via module
      BTF info, from Andrii Naryiko.

  15) Add DASH (esktop and mobile Architecture for System Hardware)
      support to r8169 driver, from Heiner Kallweit.

  16) Add rx vlan filtering to dpaa2-eth, from Ionut-robert Aron.

  17) Add support for 100 base0x SFP devices, from Bjarni Jonasson.

  18) Support link aggregation in DSA, from Tobias Waldekranz.

  19) Support for bitwidse atomics in bpf, from Brendan Jackman.

  20) SmartEEE support in at803x driver, from Russell King.

  21) Add support for flow based tunneling to GTP, from Pravin B Shelar.

  22) Allow arbitrary number of interconnrcts in ipa, from Alex Elder.

  23) TLS RX offload for bonding, from Tariq Toukan.

  24) RX decap offklload support in mac80211, from Felix Fietkou.

  25) devlink health saupport in octeontx2-af, from George Cherian.

  26) Add TTL attr to SCM_TIMESTAMP_OPT_STATS, from Yousuk Seung

  27) Delegated actionss support in mptcp, from Paolo Abeni.

  28) Support receive timestamping when doin zerocopy tcp receive. From
      Arjun Ray.

  29) HTB offload support for mlx5, from Maxim Mikityanskiy.

  30) UDP GRO forwarding, from Maxim Mikityanskiy.

  31) TAPRIO offloading in dsa hellcreek driver, from Kurt Kanzenbach.

  32) Weighted random twos choice algorithm for ipvs, from Darby Payne.

  33) Fix netdev registration deadlock, from Johannes Berg.

  34) Various conversions to new tasklet api, from EmilRenner Berthing.

  35) Bulk skb allocations in veth, from Lorenzo Bianconi.

  36) New ethtool interface for lane setting, from Danielle Ratson.

  37) Offload failiure notifications for routes, from Amit Cohen.

  38) BCM4908 support, from Rafał Miłecki.

  39) Support several new iwlwifi chips, from Ihab Zhaika.

  40) Flow drector support for ipv6 in i40e, from Przemyslaw Patynowski.

  41) Support for mhi prrotocols, from Loic Poulain.

  42) Optimize bpf program stats.

  43) Implement RFC6056, for better port randomization, from Eric
      Dumazet.

  44) hsr tag offloading support from George McCollister.

  45) Netpoll support in qede, from Bhaskar Upadhaya.

  46) 2005/400g speed support in bonding 3ad mode, from Nikolay
      Aleksandrov.

  47) Netlink event support in mptcp, from Florian Westphal.

  48) Better skbuff caching, from Alexander Lobakin.

  49) MRP (Media Redundancy Protocol) offloading in DSA and a few
      drivers, from Horatiu Vultur.

  50) mqprio saupport in mvneta, from Maxime Chevallier.

  51) Remove of_phy_attach, no longer needed, from Florian Fainelli"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1766 commits)
  octeontx2-pf: Fix otx2_get_fecparam()
  cteontx2-pf: cn10k: Prevent harmless double shift bugs
  net: stmmac: Add PCI bus info to ethtool driver query output
  ptp: ptp_clockmatrix: clean-up - parenthesis around a == b are unnecessary
  ptp: ptp_clockmatrix: Simplify code - remove unnecessary `err` variable.
  ptp: ptp_clockmatrix: Coding style - tighten vertical spacing.
  ptp: ptp_clockmatrix: Clean-up dev_*() messages.
  ptp: ptp_clockmatrix: Remove unused header declarations.
  ptp: ptp_clockmatrix: Add alignment of 1 PPS to idtcm_perout_enable.
  ptp: ptp_clockmatrix: Add wait_for_sys_apll_dpll_lock.
  net: stmmac: dwmac-sun8i: Add a shutdown callback
  net: stmmac: dwmac-sun8i: Minor probe function cleanup
  net: stmmac: dwmac-sun8i: Use reset_control_reset
  net: stmmac: dwmac-sun8i: Remove unnecessary PHY power check
  net: stmmac: dwmac-sun8i: Return void from PHY unpower
  r8169: use macro pm_ptr
  net: mdio: Remove of_phy_attach()
  net: mscc: ocelot: select PACKING in the Kconfig
  net: re-solve some conflicts after net -> net-next merge
  net: dsa: tag_rtl4_a: Support also egress tags
  ...
2021-02-20 17:45:32 -08:00
Christoph Hellwig
ca10d0f8e5 swiotlb: clean up swiotlb_tbl_unmap_single
Remove a layer of pointless indentation, replace a hard to follow
ternary expression with a plain if/else.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jianxiong Gao <jxgao@google.com>
Tested-by: Jianxiong Gao <jxgao@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-02-20 10:13:51 -05:00
Christoph Hellwig
c32a77fd18 swiotlb: factor out a nr_slots helper
Factor out a helper to find the number of slots for a given size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jianxiong Gao <jxgao@google.com>
Tested-by: Jianxiong Gao <jxgao@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-02-20 10:13:46 -05:00
Christoph Hellwig
c7fbeca757 swiotlb: factor out an io_tlb_offset helper
Replace the very genericly named OFFSET macro with a little inline
helper that hardcodes the alignment to the only value ever passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jianxiong Gao <jxgao@google.com>
Tested-by: Jianxiong Gao <jxgao@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-02-20 10:13:40 -05:00
Christoph Hellwig
b5d7ccb7aa swiotlb: add a IO_TLB_SIZE define
Add a new IO_TLB_SIZE define instead open coding it using
IO_TLB_SHIFT all over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jianxiong Gao <jxgao@google.com>
Tested-by: Jianxiong Gao <jxgao@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2021-02-20 10:13:32 -05:00
Masami Hiramatsu
c85c9a2c6e kprobes: Fix to delay the kprobes jump optimization
Commit 36dadef23f ("kprobes: Init kprobes in early_initcall")
moved the kprobe setup in early_initcall(), which includes kprobe
jump optimization.
The kprobes jump optimizer involves synchronize_rcu_tasks() which
depends on the ksoftirqd and rcu_spawn_tasks_*(). However, since
those are setup in core_initcall(), kprobes jump optimizer can not
run at the early_initcall().

To avoid this issue, make the kprobe optimization disabled in the
early_initcall() and enables it in subsys_initcall().

Note that non-optimized kprobes is still available after
early_initcall(). Only jump optimization is delayed.

Link: https://lkml.kernel.org/r/161365856280.719838.12423085451287256713.stgit@devnote2

Fixes: 36dadef23f ("kprobes: Init kprobes in early_initcall")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: RCU <rcu@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Y . Ts'o" <tytso@mit.edu>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: stable@vger.kernel.org
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-19 14:57:12 -05:00
Yue Hu
e209cb51bf cpufreq: schedutil: Remove update_lock comment from struct sugov_policy definition
Currently, update_lock is also used in sugov_update_single_freq().

The comment is not helpful anymore.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-19 16:14:16 +01:00
Yue Hu
71f1309f4f cpufreq: schedutil: Remove needless sg_policy parameter from ignore_dl_rate_limit()
Since sg_policy is a member of struct sugov_cpu.

Also remove the local variable in sugov_update_single_common() to
make the code more clean.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Minor subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-19 16:11:19 +01:00
Frederic Weisbecker
4ae7dc97f7 entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point upon resuming to guest mode. This
way we can avoid to do it from rcu_user_enter() with the last resort
self-IPI hack that enforces rescheduling.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-6-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Frederic Weisbecker
47b8ff194c entry: Explicitly flush pending rcuog wakeup before last rescheduling point
Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point on resuming to user mode. This
way we can avoid to do it from rcu_user_enter() with the last resort
self-IPI hack that enforces rescheduling.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-5-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Frederic Weisbecker
f8bb5cae96 rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsites is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that want a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.

Fixes: 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Frederic Weisbecker
43789ef3f7 rcu/nocb: Perform deferred wake up before last idle's need_resched() check
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this with splitting the rcuog wakeup handling from rcu_idle_enter()
and place it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.

Fixes: 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org
2021-02-17 14:12:43 +01:00
Frederic Weisbecker
54b7429eff rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently whether initiated by idle, user or guest. Prepare
with pulling that control up to rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Juri Lelli
e0ee463c93 sched/features: Distinguish between NORMAL and DEADLINE hrtick
The HRTICK feature has traditionally been servicing configurations that
need precise preemptions point for NORMAL tasks. More recently, the
feature has been extended to also service DEADLINE tasks with stringent
runtime enforcement needs (e.g., runtime < 1ms with HZ=1000).

Enabling HRTICK sched feature currently enables the additional timer and
task tick for both classes, which might introduced undesired overhead
for no additional benefit if one needed it only for one of the cases.

Separate HRTICK sched feature in two (and leave the traditional case
name unmodified) so that it can be selectively enabled when needed.

With:

  $ echo HRTICK > /sys/kernel/debug/sched_features

the NORMAL/fair hrtick gets enabled.

With:

  $ echo HRTICK_DL > /sys/kernel/debug/sched_features

the DEADLINE hrtick gets enabled.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210208073554.14629-3-juri.lelli@redhat.com
2021-02-17 14:12:42 +01:00
Juri Lelli
156ec6f42b sched/features: Fix hrtick reprogramming
Hung tasks and RCU stall cases were reported on systems which were not
100% busy. Investigation of such unexpected cases (no sign of potential
starvation caused by tasks hogging the system) pointed out that the
periodic sched tick timer wasn't serviced anymore after a certain point
and that caused all machinery that depends on it (timers, RCU, etc.) to
stop working as well. This issues was however only reproducible if
HRTICK was enabled.

Looking at core dumps it was found that the rbtree of the hrtimer base
used also for the hrtick was corrupted (i.e. next as seen from the base
root and actual leftmost obtained by traversing the tree are different).
Same base is also used for periodic tick hrtimer, which might get "lost"
if the rbtree gets corrupted.

Much alike what described in commit 1f71addd34 ("tick/sched: Do not
mess with an enqueued hrtimer") there is a race window between
hrtimer_set_expires() in hrtick_start and hrtimer_start_expires() in
__hrtick_restart() in which the former might be operating on an already
queued hrtick hrtimer, which might lead to corruption of the base.

Use hrtick_start() (which removes the timer before enqueuing it back) to
ensure hrtick hrtimer reprogramming is entirely guarded by the base
lock, so that no race conditions can occur.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com
2021-02-17 14:12:42 +01:00
Dietmar Eggemann
de40f33e78 sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
dl_add_task_root_domain() is called during sched domain rebuild:

  rebuild_sched_domains_locked()
    partition_and_rebuild_sched_domains()
      rebuild_root_domains()
         for all top_cpuset descendants:
           update_tasks_root_domain()
             for all tasks of cpuset:
               dl_add_task_root_domain()

Change it so that only the task pi lock is taken to check if the task
has a SCHED_DEADLINE (DL) policy. In case that p is a DL task take the
rq lock as well to be able to safely de-reference root domain's DL
bandwidth structure.

Most of the tasks will have another policy (namely SCHED_NORMAL) and
can now bail without taking the rq lock.

One thing to note here: Even in case that there aren't any DL user
tasks, a slow frequency switching system with cpufreq gov schedutil has
a DL task (sugov) per frequency domain running which participates in DL
bandwidth management.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggemann@arm.com
2021-02-17 14:12:42 +01:00
Sven Schnelle
b0d6d47896 uprobes: (Re)add missing get_uprobe() in __find_uprobe()
commit c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
accidentally removed the refcount increase. Add it again.

Fixes: c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210209150711.36778-1-svens@linux.ibm.com
2021-02-17 14:12:42 +01:00
Sebastian Andrzej Siewior
f9d34595ae smp: Process pending softirqs in flush_smp_call_function_from_idle()
send_call_function_single_ipi() may wake an idle CPU without sending an
IPI. The woken up CPU will process the SMP-functions in
flush_smp_call_function_from_idle(). Any raised softirq from within the
SMP-function call will not be processed.
Should the CPU have no tasks assigned, then it will go back to idle with
pending softirqs and the NOHZ will rightfully complain.

Process pending softirqs on return from flush_smp_call_function_queue().

Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210123201027.3262800-2-bigeasy@linutronix.de
2021-02-17 14:12:42 +01:00
Peter Zijlstra
ef72661e28 sched: Harden PREEMPT_DYNAMIC
Use the new EXPORT_STATIC_CALL_TRAMP() / static_call_mod() to unexport
the static_call_key for the PREEMPT_DYNAMIC calls such that modules
can no longer update these calls.

Having modules change/hi-jack the preemption calls would be horrible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-17 14:12:42 +01:00
Josh Poimboeuf
73f44fe19d static_call: Allow module use without exposing static_call_key
When exporting static_call_key; with EXPORT_STATIC_CALL*(), the module
can use static_call_update() to change the function called.  This is
not desirable in general.

Not exporting static_call_key however also disallows usage of
static_call(), since objtool needs the key to construct the
static_call_site.

Solve this by allowing objtool to create the static_call_site using
the trampoline address when it builds a module and cannot find the
static_call_key symbol. The module loader will then try and map the
trampole back to a key before it constructs the normal sites list.

Doing this requires a trampoline -> key associsation, so add another
magic section that keeps those.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210127231837.ifddpn7rhwdaepiu@treble
2021-02-17 14:12:42 +01:00
Peter Zijlstra
e59e10f8ef sched: Add /debug/sched_preempt
Add a debugfs file to muck about with the preempt mode at runtime.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/YAsGiUYf6NyaTplX@hirez.programming.kicks-ass.net
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel)
826bfeb37b preempt/dynamic: Support dynamic preempt with preempt= boot option
Support the preempt= boot option and patch the static call sites
accordingly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-9-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel)
40607ee97e preempt/dynamic: Provide irqentry_exit_cond_resched() static call
Provide static call to control IRQ preemption (called in CONFIG_PREEMPT)
so that we can override its behaviour when preempt= is overriden.

Since the default behaviour is full preemption, its call is
initialized to provide IRQ preemption when preempt= isn't passed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-8-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel)
2c9a98d3bc preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
Provide static calls to control preempt_schedule[_notrace]()
(called in CONFIG_PREEMPT) so that we can override their behaviour when
preempt= is overriden.

Since the default behaviour is full preemption, both their calls are
initialized to the arch provided wrapper, if any.

[fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
           dependent on x86 with __preempt_schedule_func]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-7-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Peter Zijlstra (Intel)
b965f1ddb4 preempt/dynamic: Provide cond_resched() and might_resched() static calls
Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) to that we
can override their behaviour when preempt= is overriden.

Since the default behaviour is full preemption, both their calls are
ignored when preempt= isn't passed.

  [fweisbec: branch might_resched() directly to __cond_resched(), only
             define static calls when PREEMPT_DYNAMIC]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org
2021-02-17 14:12:42 +01:00
Michal Hocko
6ef869e064 preempt: Introduce CONFIG_PREEMPT_DYNAMIC
Preemption mode selection is currently hardcoded on Kconfig choices.
Introduce a dedicated option to tune preemption flavour at boot time,

This will be only available on architectures efficiently supporting
static calls in order not to tempt with the feature against additional
overhead that might be prohibitive or undesirable.

CONFIG_PREEMPT_DYNAMIC is automatically selected by CONFIG_PREEMPT if
the architecture provides the necessary support (CONFIG_STATIC_CALL_INLINE,
CONFIG_GENERIC_ENTRY, and provide with __preempt_schedule_function() /
__preempt_schedule_notrace_function()).

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[peterz: relax requirement to HAVE_STATIC_CALL]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-5-frederic@kernel.org
2021-02-17 14:12:24 +01:00
Peter Zijlstra
3f2a8fc4b1 static_call/x86: Add __static_call_return0()
Provide a stub function that return 0 and wire up the static call site
patching to replace the CALL with a single 5 byte instruction that
clears %RAX, the return value register.

The function can be cast to any function pointer type that has a
single %RAX return (including pointers). Also provide a version that
returns an int for convenience. We are clearing the entire %RAX register
in any case, whether the return value is 32 or 64 bits, since %RAX is
always a scratch register anyway.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-2-frederic@kernel.org
2021-02-17 14:08:43 +01:00
Dietmar Eggemann
c541bb7835 sched/core: Update task_prio() function header
The description of the RT offset and the values for 'normal' tasks needs
update. Moreover there are DL tasks now.
task_prio() has to stay like it is to guarantee compatibility with the
/proc/<pid>/stat priority field:

  # cat /proc/<pid>/stat | awk '{ print $18; }'

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggemann@arm.com
2021-02-17 14:08:30 +01:00
Dietmar Eggemann
9d061ba6bc sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO
The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

Commit fe443ef2ac ("[POWERPC] spusched: Dynamic timeslicing for
SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

Commit a4ec24b48d ("sched: tidy up SCHED_RR") removed it from the task
scheduler in v2.6.24.

Commit 3ee237dddc ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

With:

  MAX_USER_PRIO = USER_PRIO(MAX_PRIO)

                = MAX_PRIO - MAX_RT_PRIO

       MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

  MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO

  MAX_USER_PRIO = NICE_WIDTH

MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
{*_}USER_PRIO defines.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggemann@arm.com
2021-02-17 14:08:17 +01:00
Dietmar Eggemann
ae18ad281e sched: Remove MAX_USER_RT_PRIO
Commit d46523ea32 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
was introduced due to a a small time period in which the realtime patch
set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
instead.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggemann@arm.com
2021-02-17 14:08:11 +01:00
Dietmar Eggemann
71e5f6644f sched/topology: Fix sched_domain_topology_level alloc in sched_init_numa()
Commit "sched/topology: Make sched_init_numa() use a set for the
deduplicating sort" allocates 'i + nr_levels (level)' instead of
'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
	...
        for_each_sd_topology(tl)
	  ...
          sd = *per_cpu_ptr(sdd->sd, j); <--
	  ...
      }

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
2021-02-17 14:08:05 +01:00
Peter Zijlstra
5a7987253e rbtree, rtmutex: Use rb_add_cached()
Reduce rbtree boiler plate by using the new helpers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
2021-02-17 14:07:57 +01:00
Peter Zijlstra
a905e84e64 rbtree, uprobes: Use rbtree helpers
Reduce rbtree boilerplate by using the new helpers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
2021-02-17 14:07:52 +01:00
Peter Zijlstra
a3b8986455 rbtree, perf: Use new rbtree helpers
Reduce rbtree boiler plate by using the new helpers.

One noteworthy change is unification of the various (partial) compare
functions. We construct a subtree match by forcing the sub-order to
always match, see __group_cmp().

Due to 'const' we had to touch cgroup_id().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
2021-02-17 14:07:48 +01:00
Peter Zijlstra
8ecca39483 rbtree, sched/deadline: Use rb_add_cached()
Reduce rbtree boiler plate by using the new helpers.

Make rb_add_cached() / rb_erase_cached() return a pointer to the
leftmost node to aid in updating additional state.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
2021-02-17 14:07:44 +01:00
Peter Zijlstra
bf9be9a163 rbtree, sched/fair: Use rb_add_cached()
Reduce rbtree boiler plate by using the new helper function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
2021-02-17 14:07:39 +01:00
Mel Gorman
9fe1f127b9 sched/fair: Merge select_idle_core/cpu()
Both select_idle_core() and select_idle_cpu() do a loop over the same
cpumask. Observe that by clearing the already visited CPUs, we can
fold the iteration and iterate a core at a time.

All we need to do is remember any non-idle CPU we encountered while
scanning for an idle core. This way we'll only iterate every CPU once.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210127135203.19633-5-mgorman@techsingularity.net
2021-02-17 14:07:25 +01:00
Mel Gorman
6cd56ef1df sched/fair: Remove select_idle_smt()
In order to make the next patch more readable, and to quantify the
actual effectiveness of this pass, start by removing it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210125085909.4600-4-mgorman@techsingularity.net
2021-02-17 14:06:59 +01:00
Ingo Molnar
ed3cd45f8c Linux 5.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmAppPgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGeXYH/imZPBd4A1jIMehN
 5HV2A53Z+MXmmaMuGj9X1KV6vsf55/xB+IhOoFdtRAIsO8c2yYSCO8i4+4R0XfYA
 +/YFJeq672rojQnmh6XbpR8dugaAV7CUHy6n7KDsyvtT6EOCpwFSwkOb4X3tBRX6
 TlYgm2d/xgV/wRHSgLVugK0MdFCLMAnyb7mkPfar9QrMgG1BiDKLq07xmwnS23On
 TkqpJ9yZ/rJpUrrUqQYPShSO/FmA+fSfWs0CDv7EIrJ40LUScD6PZxSHWTIHtjLk
 E4jFda6wuqLRVWsBwaBzUIdD0zk7X5quHRzEpbC5ga16SK6yrWvE5YJJXCguIEuZ
 f3FMRYs=
 =CAjn
 -----END PGP SIGNATURE-----

Merge tag 'v5.11' into sched/core, to pick up fixes & refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-17 14:04:39 +01:00
David S. Miller
d489ded1a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-16 17:51:13 -08:00
David S. Miller
b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Adds support of pointers to types with known size among global function
   args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
   tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Chris Wilson
bfe3911a91 kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.

Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.

References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
2021-02-16 09:59:41 +01:00
Sasha Levin
88a686728b kbuild: simplify access to the kernel's version
Instead of storing the version in a single integer and having various
kernel (and userspace) code how it's constructed, export individual
(major, patchlevel, sublevel) components and simplify kernel code that
uses it.

This should also make it easier on userspace.

Signed-off-by: Sasha Levin <sashal@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-02-16 12:01:45 +09:00
Ilya Leoshkevich
45159b2763 bpf: Clear subreg_def for global function return values
test_global_func4 fails on s390 as reported by Yauheni in [1].

The immediate problem is that the zext code includes the instruction,
whose result needs to be zero-extended, into the zero-extension
patchlet, and if this instruction happens to be a branch, then its
delta is not adjusted. As a result, the verifier rejects the program
later.

However, according to [2], as far as the verifier's algorithm is
concerned and as specified by the insn_no_def() function, branching
insns do not define anything. This includes call insns, even though
one might argue that they define %r0.

This means that the real problem is that zero extension kicks in at
all. This happens because clear_caller_saved_regs() sets BPF_REG_0's
subreg_def after global function calls. This can be fixed in many
ways; this patch mimics what helper function call handling already
does.

  [1] https://lore.kernel.org/bpf/20200903140542.156624-1-yauheni.kaliuta@redhat.com/
  [2] https://lore.kernel.org/bpf/CAADnVQ+2RPKcftZw8d+B1UwB35cpBhpF5u3OocNh90D9pETPwg@mail.gmail.com/

Fixes: 51c39bb1d5 ("bpf: Introduce function-by-function verification")
Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210212040408.90109-1-iii@linux.ibm.com
2021-02-15 23:39:35 +01:00
Rafael J. Wysocki
a9a939cb34 Merge branches 'powercap' and 'pm-misc'
* powercap:
  powercap: intel_rapl: Use topology interface in rapl_init_domains()
  powercap: intel_rapl: Use topology interface in rapl_add_package()
  powercap/intel_rapl: add support for AlderLake Mobile
  powercap/drivers/dtpm: Fix size of object being allocated
  powercap/drivers/dtpm: Fix an IS_ERR() vs NULL check
  powercap/drivers/dtpm: Fix some missing unlock bugs
  powercap/drivers/dtpm: Fix a double shift bug
  powercap/drivers/dtpm: Fix __udivdi3 and __aeabi_uldivmod unresolved symbols
  powercap/drivers/dtpm: Add CPU energy model based support
  powercap/drivers/dtpm: Add API for dynamic thermal power management
  Documentation/powercap/dtpm: Add documentation for dtpm
  units: Add Watt units

* pm-misc:
  PM: Kconfig: remove unneeded "default n" options
  PM: EM: update Kconfig description and drop "default n" option
2021-02-15 18:50:01 +01:00
Rafael J. Wysocki
6621cd2db5 Merge branches 'pm-sleep', 'pm-core', 'pm-domains' and 'pm-clk'
* pm-sleep:
  PM: sleep: Constify static struct attribute_group
  PM: sleep: Use dev_printk() when possible
  PM: sleep: No need to check PF_WQ_WORKER in thaw_kernel_threads()

* pm-core:
  PM: runtime: Fix typos and grammar
  PM: runtime: Fix resposible -> responsible in runtime.c

* pm-domains:
  PM: domains: Simplify the calculation of variables
  PM: domains: Add "performance" column to debug summary
  PM: domains: Make of_genpd_add_subdomain() return -EPROBE_DEFER
  PM: domains: Make set_performance_state() callback optional
  PM: domains: use device's next wakeup to determine domain idle state
  PM: domains: inform PM domain of a device's next wakeup

* pm-clk:
  PM: clk: make PM clock layer compatible with clocks that must sleep
2021-02-15 17:01:11 +01:00
Linus Torvalds
ac30d8ce28 Merge branch 'for-5.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Two cgroup fixes:

   - fix a NULL deref when trying to poll PSI in the root cgroup

   - fix confusing controller parsing corner case when mounting cgroup
     v1 hierarchies

  And doc / maintainer file updates"

* 'for-5.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: update PSI file description in docs
  cgroup: fix psi monitor for root cgroup
  MAINTAINERS: Update my email address
  MAINTAINERS: Remove stale URLs for cpuset
  cgroup-v1: add disabled controller check in cgroup1_parse_param()
2021-02-13 12:25:42 -08:00
Christoph Hellwig
6d4e9a8efe driver core: lift dma_default_coherent into common code
Lift the dma_default_coherent variable from the mips architecture code
to the driver core.  This allows an architecture to sdefault all device
to be DMA coherent at run time, even if the kernel is build with support
for DMA noncoherent device.  By allowing device_initialize to set the
->dma_coherent field to this default the amount of arch hooks required
for this behavior can be greatly reduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2021-02-13 09:51:45 +01:00
Dmitrii Banshchikov
e5069b9c23 bpf: Support pointers in global func args
Add an ability to pass a pointer to a type with known size in arguments
of a global function. Such pointers may be used to overcome the limit on
the maximum number of arguments, avoid expensive and tricky workarounds
and to have multiple output arguments.

A referenced type may contain pointers but indirect access through them
isn't supported.

The implementation consists of two parts.  If a global function has an
argument that is a pointer to a type with known size then:

  1) In btf_check_func_arg_match(): check that the corresponding
register points to NULL or to a valid memory region that is large enough
to contain the expected argument's type.

  2) In btf_prepare_func_args(): set the corresponding register type to
PTR_TO_MEM_OR_NULL and its size to the size of the expected type.

Only global functions are supported because allowance of pointers for
static functions might break validation. Consider the following
scenario. A static function has a pointer argument. A caller passes
pointer to its stack memory. Because the callee can change referenced
memory verifier cannot longer assume any particular slot type of the
caller's stack memory hence the slot type is changed to SLOT_MISC.  If
there is an operation that relies on slot type other than SLOT_MISC then
verifier won't be able to infer safety of the operation.

When verifier sees a static function that has a pointer argument
different from PTR_TO_CTX then it skips arguments check and continues
with "inline" validation with more information available. The operation
that relies on the particular slot type now succeeds.

Because global functions were not allowed to have pointer arguments
different from PTR_TO_CTX it's not possible to break existing and valid
code.

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210212205642.620788-4-me@ubique.spb.ru
2021-02-12 17:37:23 -08:00
Dmitrii Banshchikov
4ddb74165a bpf: Extract nullable reg type conversion into a helper function
Extract conversion from a register's nullable type to a type with a
value. The helper will be used in mark_ptr_not_null_reg().

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210212205642.620788-3-me@ubique.spb.ru
2021-02-12 17:37:23 -08:00
Dmitrii Banshchikov
feb4adfad5 bpf: Rename bpf_reg_state variables
Using "reg" for an array of bpf_reg_state and "reg[i + 1]" for an
individual bpf_reg_state is error-prone and verbose. Use "regs" for the
former and "reg" for the latter as other code nearby does.

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210212205642.620788-2-me@ubique.spb.ru
2021-02-12 17:37:23 -08:00
Daniel Borkmann
9b00f1b788 bpf: Fix truncation handling for mod32 dst reg wrt zero
Recently noticed that when mod32 with a known src reg of 0 is performed,
then the dst register is 32-bit truncated in verifier:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = 0
  1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b7) r1 = -1
  2: R0_w=inv0 R1_w=inv-1 R10=fp0
  2: (b4) w2 = -1
  3: R0_w=inv0 R1_w=inv-1 R2_w=inv4294967295 R10=fp0
  3: (9c) w1 %= w0
  4: R0_w=inv0 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
  4: (b7) r0 = 1
  5: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
  5: (1d) if r1 == r2 goto pc+1
   R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
  6: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
  6: (b7) r0 = 2
  7: R0_w=inv2 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0
  7: (95) exit
  7: R0=inv1 R1=inv(id=0,umin_value=4294967295,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2=inv4294967295 R10=fp0
  7: (95) exit

However, as a runtime result, we get 2 instead of 1, meaning the dst
register does not contain (u32)-1 in this case. The reason is fairly
straight forward given the 0 test leaves the dst register as-is:

  # ./bpftool p d x i 23
   0: (b7) r0 = 0
   1: (b7) r1 = -1
   2: (b4) w2 = -1
   3: (16) if w0 == 0x0 goto pc+1
   4: (9c) w1 %= w0
   5: (b7) r0 = 1
   6: (1d) if r1 == r2 goto pc+1
   7: (b7) r0 = 2
   8: (95) exit

This was originally not an issue given the dst register was marked as
completely unknown (aka 64 bit unknown). However, after 468f6eafa6
("bpf: fix 32-bit ALU op verification") the verifier casts the register
output to 32 bit, and hence it becomes 32 bit unknown. Note that for
the case where the src register is unknown, the dst register is marked
64 bit unknown. After the fix, the register is truncated by the runtime
and the test passes:

  # ./bpftool p d x i 23
   0: (b7) r0 = 0
   1: (b7) r1 = -1
   2: (b4) w2 = -1
   3: (16) if w0 == 0x0 goto pc+2
   4: (9c) w1 %= w0
   5: (05) goto pc+1
   6: (bc) w1 = w1
   7: (b7) r0 = 1
   8: (1d) if r1 == r2 goto pc+1
   9: (b7) r0 = 2
  10: (95) exit

Semantics also match with {R,W}x mod{64,32} 0 -> {R,W}x. Invalid div
has always been {R,W}x div{64,32} 0 -> 0. Rewrites are as follows:

  mod32:                            mod64:

  (16) if w0 == 0x0 goto pc+2       (15) if r0 == 0x0 goto pc+1
  (9c) w1 %= w0                     (9f) r1 %= r0
  (05) goto pc+1
  (bc) w1 = w1

Fixes: 468f6eafa6 ("bpf: fix 32-bit ALU op verification")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-02-13 00:53:12 +01:00
Jun'ichi Nomura
7d4553b69f bpf, devmap: Use GFP_KERNEL for xdp bulk queue allocation
The devmap bulk queue is allocated with GFP_ATOMIC and the allocation
may fail if there is no available space in existing percpu pool.

Since commit 75ccae62cb ("xdp: Move devmap bulk queue into struct net_device")
moved the bulk queue allocation to NETDEV_REGISTER callback, whose context
is allowed to sleep, use GFP_KERNEL instead of GFP_ATOMIC to let percpu
allocator extend the pool when needed and avoid possible failure of netdev
registration.

As the required alignment is natural, we can simply use alloc_percpu().

Fixes: 75ccae62cb ("xdp: Move devmap bulk queue into struct net_device")
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210209082451.GA44021@jeru.linux.bs1.fc.nec.co.jp
2021-02-13 00:11:26 +01:00
Yonghong Song
17d8beda27 bpf: Fix an unitialized value in bpf_iter
Commit 15d83c4d7c ("bpf: Allow loading of a bpf_iter program")
cached btf_id in struct bpf_iter_target_info so later on
if it can be checked cheaply compared to checking registered names.

syzbot found a bug that uninitialized value may occur to
bpf_iter_target_info->btf_id. This is because we allocated
bpf_iter_target_info structure with kmalloc and never initialized
field btf_id afterwards. This uninitialized btf_id is typically
compared to a u32 bpf program func proto btf_id, and the chance
of being equal is extremely slim.

This patch fixed the issue by using kzalloc which will also
prevent future likely instances due to adding new fields.

Fixes: 15d83c4d7c ("bpf: Allow loading of a bpf_iter program")
Reported-by: syzbot+580f4f2a272e452d55cb@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210212005926.2875002-1-yhs@fb.com
2021-02-12 13:33:50 -08:00
Song Liu
3d06f34aa8 bpf: Allow bpf_d_path in bpf_iter program
task_file and task_vma iter programs have access to file->f_path. Enable
bpf_d_path to print paths of these file.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-3-songliubraving@fb.com
2021-02-12 12:56:53 -08:00
Song Liu
3a7b35b899 bpf: Introduce task_vma bpf_iter
Introduce task_vma bpf_iter to print memory information of a process. It
can be used to print customized information similar to /proc/<pid>/maps.

Current /proc/<pid>/maps and /proc/<pid>/smaps provide information of
vma's of a process. However, these information are not flexible enough to
cover all use cases. For example, if a vma cover mixed 2MB pages and 4kB
pages (x86_64), there is no easy way to tell which address ranges are
backed by 2MB pages. task_vma solves the problem by enabling the user to
generate customize information based on the vma (and vma->vm_mm,
vma->vm_file, etc.).

To access the vma safely in the BPF program, task_vma iterator holds
target mmap_lock while calling the BPF program. If the mmap_lock is
contended, task_vma unlocks mmap_lock between iterations to unblock the
writer(s). This lock contention avoidance mechanism is similar to the one
used in show_smaps_rollup().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com
2021-02-12 12:56:53 -08:00
Linus Torvalds
e77a6817d4 tracing: Fix buffer overflow in trace event filter
It was reported that if an trace event was larger than a page
 and was filtered, that it caused memory corruption. The reason
 is that filtered events first go into a buffer to test the filter
 before being written into the ring buffer. Unfortunately,
 this write did not check the size.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYCaSHBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqOpAQCUSlZdBxLzs87zeHgXbkMudWvCYSbA
 mndzddqtxPXlXwEAsRnO8BERyZnasEdXnJ98JJwQaFFYH0dBCA2pTU2onQc=
 =NokV
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix buffer overflow in trace event filter.

  It was reported that if an trace event was larger than a page and was
  filtered, that it caused memory corruption. The reason is that
  filtered events first go into a buffer to test the filter before being
  written into the ring buffer. Unfortunately, this write did not check
  the size"

* tag 'trace-v5.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Check length before giving out the filter buffer
2021-02-12 11:16:17 -08:00
John Ogness
13791c80b0 printk: avoid prb_first_valid_seq() where possible
If message sizes average larger than expected (more than 32
characters), the data_ring will wrap before the desc_ring. Once the
data_ring wraps, it will start invalidating descriptors. These
invalid descriptors hang around until they are eventually recycled
when the desc_ring wraps. Readers do not care about invalid
descriptors, but they still need to iterate past them. If the
average message size is much larger than 32 characters, then there
will be many invalid descriptors preceding the valid descriptors.

The function prb_first_valid_seq() always begins at the oldest
descriptor and searches for the first valid descriptor. This can
be rather expensive for the above scenario. And, in fact, because
of its heavy usage in /dev/kmsg, there have been reports of long
delays and even RCU stalls.

For code that does not need to search from the oldest record,
replace prb_first_valid_seq() usage with prb_read_valid_*()
functions, which provide a start sequence number to search from.

Fixes: 896fbe20b4 ("printk: use the lockless ringbuffer")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: J. Avila <elavila@google.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210211173152.1629-1-john.ogness@linutronix.de
2021-02-12 17:54:59 +01:00
Steven Rostedt (VMware)
99e22ce73c tracing: Make hash-ptr option default
Since the original behavior of the trace events is to hash the %p pointers,
make that the default, and have developers have to enable the option in
order to have them unhashed.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-12 11:52:59 -05:00
Paolo Bonzini
8c6e67bec3 KVM/arm64 updates for Linux 5.12
- Make the nVHE EL2 object relocatable, resulting in much more
   maintainable code
 - Handle concurrent translation faults hitting the same page
   in a more elegant way
 - Support for the standard TRNG hypervisor call
 - A bunch of small PMU/Debug fixes
 - Allow the disabling of symbol export from assembly code
 - Simplification of the early init hypercall handling
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmAmjqEPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDoUEQAIrJ7YF4v4gz06a0HG9+b6fbmykHyxlG7jfm
 trvctfaiKzOybKoY5odPpNFzhbYOOdXXqYipyTHGwBYtGSy9G/9SjMKSUrfln2Ni
 lr1wBqapr9TE+SVKoR8pWWuZxGGbHVa7brNuMbMsMi1wwAsM2/n70H9PXrdq3QiK
 Ge1DWLso2oEfhtTwqNKa4dwB2MHjBhBFhhq+Nq5pslm6mmxJaYqz7pyBmw/C+2cc
 oU/6kpAa1yPAauptWXtYXJYOMHihxgEa1IdK3Gl0hUyFyu96xVkwH/KFsj+bRs23
 QGGCSdy4313hzaoGaSOTK22R98Aeg0wI9a6tcCBvVVjTAztnlu1FPtUZr8e/F7uc
 +r8xVJUJFiywt3Zktf/D7YDK9LuMMqFnj0BkI4U9nIBY59XZRNhENsBCmjru5lnL
 iXa5cuta03H4emfssIChLpgn0XHFas6t5dFXBPGbXyw0qsQchTw98iQX9LVxefUK
 rOUGPIN4nE9ESRIZe0SPlAVeCtNP8cLH7+0YG9MJ1QeDVYaUsnvy9Ln/ox+514mR
 5y2KJ6y7xnLB136SKCzPDDloYtz7BDiJq6a/RPiXKGheKoxy+N+BSe58yWCqFZYE
 Fx/cGUr7oSg39U7gCboog6BDp5e2CXBfbRllg6P47bZFfdPNwzNEzHvk49VltMxx
 Rl2W05bk
 =6EwV
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 5.12

- Make the nVHE EL2 object relocatable, resulting in much more
  maintainable code
- Handle concurrent translation faults hitting the same page
  in a more elegant way
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Allow the disabling of symbol export from assembly code
- Simplification of the early init hypercall handling
2021-02-12 11:23:44 -05:00
Rikard Falkeborn
1556057413 PM: sleep: Constify static struct attribute_group
The only usage of suspend_attr_group is to put its address in an
array of pointers to const attribute_group structs.

Make it const to allow the compiler to put it into read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-12 16:46:59 +01:00
Lukasz Luba
c4cc3141b6 PM: Kconfig: remove unneeded "default n" options
Remove "default n" options. If the "default" line is removed, it
defaults to 'n'.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-12 16:43:23 +01:00
Lukasz Luba
3af2f0aa2e PM: EM: update Kconfig description and drop "default n" option
Energy Model supports now other devices like GPUs, DSPs, not only CPUs.
Thus, update the description in the config option. Remove also unneeded
"default n". If the "default" line is removed, it defaults to 'n'.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-12 16:43:23 +01:00
Ingo Molnar
a3251c1a36 Merge branch 'x86/paravirt' into x86/entry
Merge in the recent paravirt changes to resolve conflicts caused
by objtool annotations.

Conflicts:
	arch/x86/xen/xen-asm.S

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 13:36:43 +01:00
Ingo Molnar
85e853c5ec Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Documentation updates.

- Miscellaneous fixes.

- kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
  addresses to more easily locate bugs.  This has a couple of RCU-related commits,
  but is mostly MM.  Was pulled in with akpm's agreement.

- Per-callback-batch tracking of numbers of callbacks,
  which enables better debugging information and smarter
  reactions to large numbers of callbacks.

- The first round of changes to allow CPUs to be runtime switched from and to
  callback-offloaded state.

- CONFIG_PREEMPT_RT-related changes.

- RCU CPU stall warning updates.
- Addition of polling grace-period APIs for SRCU.

- Torture-test and torture-test scripting updates, including a "torture everything"
  script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
  Plus does an allmodconfig build.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 12:56:55 +01:00
Ingo Molnar
c11878fd50 Merge branch 'for-mingo-kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core
Pull KCSAN updates from Paul E. McKenney:

 "Kernel concurrency sanitizer (KCSAN) updates from Marco Elver."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 12:55:33 +01:00
Ingo Molnar
62137364e3 Merge branch 'linus' into locking/core, to pick up upstream fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-02-12 12:54:58 +01:00
Ilya Leoshkevich
b2e37a7114 bpf: Fix subreg optimization for BPF_FETCH
All 32-bit variants of BPF_FETCH (add, and, or, xor, xchg, cmpxchg)
define a 32-bit subreg and thus have zext_dst set. Their encoding,
however, uses dst_reg field as a base register, which causes
opt_subreg_zext_lo32_rnd_hi32() to zero-extend said base register
instead of the one the insn really defines (r0 or src_reg).

Fix by properly choosing a register being defined, similar to how
check_atomic() already does that.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210210204502.83429-1-iii@linux.ibm.com
2021-02-11 22:03:19 -08:00
Alexei Starovoitov
1336c66247 bpf: Clear per_cpu pointers during bpf_prog_realloc
bpf_prog_realloc copies contents of struct bpf_prog.
The pointers have to be cleared before freeing old struct.

Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
Fixes: 700d4796ef ("bpf: Optimize program stats")
Fixes: ca06f55b90 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-02-11 19:35:00 -08:00
Florent Revest
c5dbb89fc2 bpf: Expose bpf_get_socket_cookie to tracing programs
This needs a new helper that:
- can work in a sleepable context (using sock_gen_cookie)
- takes a struct sock pointer and checks that it's not NULL

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210111406.785541-2-revest@chromium.org
2021-02-11 17:44:41 -08:00
Masami Hiramatsu
a345a6718b tracing: Add ptr-hash option to show the hashed pointer value
Add tracefs/options/hash-ptr option to show hashed pointer
value by %p in event printk format string.

For the security reason, normal printk will show the hashed
pointer value (encrypted by random number) with %p to printk
buffer to hide the real address. But the tracefs/trace always
shows real address for debug. To bridge those outputs, add an
option to switch the output format. Ftrace users can use it
to find the hashed value corresponding to the real address
in trace log.

Link: https://lkml.kernel.org/r/160277372504.29307.14909828808982012211.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-11 16:31:57 -05:00
Masami Hiramatsu
efbbdaa22b tracing: Show real address for trace event arguments
To help debugging kernel, show real address for trace event arguments
in tracefs/trace{,pipe} instead of hashed pointer value.

Since ftrace human-readable format uses vsprintf(), all %p are
translated to hash values instead of pointer address.

However, when debugging the kernel, raw address value gives a
hint when comparing with the memory mapping in the kernel.
(Those are sometimes used with crash log, which is not hashed too)
So converting %p with %px when calling trace_seq_printf().

Moreover, this is not improving the security because the tracefs
can be used only by root user and the raw address values are readable
from tracefs/percpu/cpu*/trace_pipe_raw file.

Link: https://lkml.kernel.org/r/160277370703.29307.5134475491761971203.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-11 16:31:57 -05:00
Steven Rostedt (VMware)
b220c049d5 tracing: Check length before giving out the filter buffer
When filters are used by trace events, a page is allocated on each CPU and
used to copy the trace event fields to this page before writing to the ring
buffer. The reason to use the filter and not write directly into the ring
buffer is because a filter may discard the event and there's more overhead
on discarding from the ring buffer than the extra copy.

The problem here is that there is no check against the size being allocated
when using this page. If an event asks for more than a page size while being
filtered, it will get only a page, leading to the caller writing more that
what was allocated.

Check the length of the request, and if it is more than PAGE_SIZE minus the
header default back to allocating from the ring buffer directly. The ring
buffer may reject the event if its too big anyway, but it wont overflow.

Link: https://lore.kernel.org/ath10k/1612839593-2308-1-git-send-email-wgong@codeaurora.org/

Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Reported-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-11 14:23:37 -05:00
Alexei Starovoitov
638e4b825d bpf: Allows per-cpu maps and map-in-map in sleepable programs
Since sleepable programs are now executing under migrate_disable
the per-cpu maps are safe to use.
The map-in-map were ok to use in sleepable from the time sleepable
progs were introduced.

Note that non-preallocated maps are still not safe, since there is
no rcu_read_lock yet in sleepable programs and dynamically allocated
map elements are relying on rcu protection. The sleepable programs
have rcu_read_lock_trace instead. That limitation will be addresses
in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-9-alexei.starovoitov@gmail.com
2021-02-11 16:19:26 +01:00
Alexei Starovoitov
9ed9e9ba23 bpf: Count the number of times recursion was prevented
Add per-program counter for number of times recursion prevention mechanism
was triggered and expose it via show_fdinfo and bpf_prog_info.
Teach bpftool to print it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-7-alexei.starovoitov@gmail.com
2021-02-11 16:19:20 +01:00
Alexei Starovoitov
ca06f55b90 bpf: Add per-program recursion prevention mechanism
Since both sleepable and non-sleepable programs execute under migrate_disable
add recursion prevention mechanism to both types of programs when they're
executed via bpf trampoline.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-5-alexei.starovoitov@gmail.com
2021-02-11 16:19:13 +01:00
Alexei Starovoitov
f2dd3b3946 bpf: Compute program stats for sleepable programs
Since sleepable programs don't migrate from the cpu the excution stats can be
computed for them as well. Reuse the same infrastructure for both sleepable and
non-sleepable programs.

run_cnt     -> the number of times the program was executed.
run_time_ns -> the program execution time in nanoseconds including the
               off-cpu time when the program was sleeping.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-4-alexei.starovoitov@gmail.com
2021-02-11 16:19:06 +01:00
Alexei Starovoitov
031d6e02dd bpf: Run sleepable programs with migration disabled
In older non-RT kernels migrate_disable() was the same as preempt_disable().
Since commit 74d862b682 ("sched: Make migrate_disable/enable() independent of RT")
migrate_disable() is real and doesn't prevent sleeping.

Running sleepable programs with migration disabled allows to add support for
program stats and per-cpu maps later.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-3-alexei.starovoitov@gmail.com
2021-02-11 16:18:55 +01:00
Alexei Starovoitov
700d4796ef bpf: Optimize program stats
Move bpf_prog_stats from prog->aux into prog to avoid one extra load
in critical path of program execution.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-2-alexei.starovoitov@gmail.com
2021-02-11 16:17:50 +01:00
Waiman Long
d8d0da4eee locking/arch: Move qrwlock.h include after qspinlock.h
include/asm-generic/qrwlock.h was trying to get arch_spin_is_locked via
asm-generic/qspinlock.h.  However, this does not work because architectures
might be using queued rwlocks but not queued spinlocks (csky), or because they
might be defining their own queued_* macros before including asm/qspinlock.h.

To fix this, ensure that asm/spinlock.h always includes qrwlock.h after
defining arch_spin_is_locked (either directly for csky, or via
asm/qspinlock.h for other architectures).  The only inclusion elsewhere
is in kernel/locking/qrwlock.c.  That one is really unnecessary because
the file is only compiled in SMP configurations (config QUEUED_RWLOCKS
depends on SMP) and in that case linux/spinlock.h already includes
asm/qrwlock.h if needed, via asm/spinlock.h.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Waiman Long <longman@redhat.com>
Fixes: 26128cb6c7 ("locking/rwlocks: Add contention detection for rwlocks")
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Ben Gardon <bgardon@google.com>
[Add arch/sparc and kernel/locking parts per discussion with Waiman. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-11 07:59:54 -05:00
Daniel Thompson
f11e2bc682 kgdb: Remove kgdb_schedule_breakpoint()
To the very best of my knowledge there has never been any in-tree
code that calls this function. It exists largely to support an
out-of-tree driver that provides kgdb-over-ethernet using the
netpoll API.

kgdboe has been out-of-tree for more than 10 years and I don't
recall any serious attempt to upstream it at any point in the last
five. At this stage it looks better to stop carrying this code in
the kernel and integrate the code into the out-of-tree driver
instead.

The long term trajectory for the kernel looks likely to include
effort to remove or reduce the use of tasklets (something that has
also been true for the last 10 years). Thus the main real reason
for this patch is to make explicit that the in-tree kgdb features
do not require tasklets.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210210142525.2876648-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
2021-02-11 10:51:56 +00:00
Marco Elver
6df8fb8330 bpf_lru_list: Read double-checked variable once without lock
For double-checked locking in bpf_common_lru_push_free(), node->type is
read outside the critical section and then re-checked under the lock.
However, concurrent writes to node->type result in data races.

For example, the following concurrent access was observed by KCSAN:

  write to 0xffff88801521bc22 of 1 bytes by task 10038 on cpu 1:
   __bpf_lru_node_move_in        kernel/bpf/bpf_lru_list.c:91
   __local_list_flush            kernel/bpf/bpf_lru_list.c:298
   ...
  read to 0xffff88801521bc22 of 1 bytes by task 10043 on cpu 0:
   bpf_common_lru_push_free      kernel/bpf/bpf_lru_list.c:507
   bpf_lru_push_free             kernel/bpf/bpf_lru_list.c:555
   ...

Fix the data races where node->type is read outside the critical section
(for double-checked locking) by marking the access with READ_ONCE() as
well as ensuring the variable is only accessed once.

Fixes: 3a08c2fd76 ("bpf: LRU List")
Reported-by: syzbot+3536db46dfa58c573458@syzkaller.appspotmail.com
Reported-by: syzbot+516acdb03d3e27d91bcd@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210209112701.3341724-1-elver@google.com
2021-02-10 15:54:26 -08:00
Thomas Gleixner
db1cc7aede softirq: Move do_softirq_own_stack() to generic asm header
To avoid include recursion hell move the do_softirq_own_stack() related
content into a generic asm header and include it from all places in arch/
which need the prototype.

This allows architectures to provide an inline implementation of
do_softirq_own_stack() without introducing a lot of #ifdeffery all over the
place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210210002513.289960691@linutronix.de
2021-02-10 23:34:16 +01:00
David S. Miller
dc9d87581d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-02-10 13:30:12 -08:00
Lakshmi Ramasubramanian
f31e3386a4 ima: Free IMA measurement buffer after kexec syscall
IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in ima_add_kexec_buffer() function.  This buffer is not freed before
completing the kexec system call resulting in memory leak.

Add ima_buffer field in "struct kimage" to store the virtual address
of the buffer allocated for the IMA measurement list.
Free the memory allocated for the IMA measurement list in
kimage_file_post_load_cleanup() function.

Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Fixes: 7b8589cc29 ("ima: on soft reboot, save the measurement list")
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2021-02-10 15:49:38 -05:00
wanghongzhe
a381b70a1c seccomp: Improve performace by optimizing rmb()
According to Kees's suggest, we started with the patch that just replaces
rmb() with smp_rmb() and did a performance test with UnixBench. The
results showed the overhead about 2.53% in rmb() test compared to the
smp_rmb() one, in a x86-64 kernel with CONFIG_SMP enabled running inside a
qemu-kvm vm. The test is a "syscall" testcase in UnixBench, which executes
5 syscalls in a loop during a certain timeout (100 second in our test) and
counts the total number of executions of this 5-syscall sequence. We set
a seccomp filter with all allow rule for all used syscalls in this test
(which will go bitmap path) to make sure the rmb() will be executed. The
details for the test:

with rmb():
/txm # ./syscall_allow_min 100
COUNT|35861159|1|lps
/txm # ./syscall_allow_min 100
COUNT|35545501|1|lps
/txm # ./syscall_allow_min 100
COUNT|35664495|1|lps

with smp_rmb():
/txm # ./syscall_allow_min 100
COUNT|36552771|1|lps
/txm # ./syscall_allow_min 100
COUNT|36491247|1|lps
/txm # ./syscall_allow_min 100
COUNT|36504746|1|lps

For a x86-64 kernel with CONFIG_SMP enabled, the smp_rmb() is just a
compiler barrier() which have no impact in runtime, while rmb() is a
lfence which will prevent all memory access operations (not just load
according the recently claim by Intel) behind itself. We can also figure
it out in disassembly:

with rmb():
0000000000001430 <__seccomp_filter>:
    1430:   41 57                   push   %r15
    1432:   41 56                   push   %r14
    1434:   41 55                   push   %r13
    1436:   41 54                   push   %r12
    1438:   55                      push   %rbp
    1439:   53                      push   %rbx
    143a:   48 81 ec 90 00 00 00    sub    $0x90,%rsp
    1441:   89 7c 24 10             mov    %edi,0x10(%rsp)
    1445:   89 54 24 14             mov    %edx,0x14(%rsp)
    1449:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
    1450:   00 00
    1452:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
    1459:   00
    145a:   31 c0                   xor    %eax,%eax
*   145c:   0f ae e8                lfence
    145f:   48 85 f6                test   %rsi,%rsi
    1462:   49 89 f4                mov    %rsi,%r12
    1465:   0f 84 42 03 00 00       je     17ad <__seccomp_filter+0x37d>
    146b:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
    1472:   00 00
    1474:   48 8b 98 80 07 00 00    mov    0x780(%rax),%rbx
    147b:   48 85 db                test   %rbx,%rbx

with smp_rmb();
0000000000001430 <__seccomp_filter>:
    1430:   41 57                   push   %r15
    1432:   41 56                   push   %r14
    1434:   41 55                   push   %r13
    1436:   41 54                   push   %r12
    1438:   55                      push   %rbp
    1439:   53                      push   %rbx
    143a:   48 81 ec 90 00 00 00    sub    $0x90,%rsp
    1441:   89 7c 24 10             mov    %edi,0x10(%rsp)
    1445:   89 54 24 14             mov    %edx,0x14(%rsp)
    1449:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
    1450:   00 00
    1452:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
    1459:   00
    145a:   31 c0                   xor    %eax,%eax
    145c:   48 85 f6                test   %rsi,%rsi
    145f:   49 89 f4                mov    %rsi,%r12
    1462:   0f 84 42 03 00 00       je     17aa <__seccomp_filter+0x37a>
    1468:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
    146f:   00 00
    1471:   48 8b 98 80 07 00 00    mov    0x780(%rax),%rbx
    1478:   48 85 db                test   %rbx,%rbx

Signed-off-by: wanghongzhe <wanghongzhe@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1612496049-32507-1-git-send-email-wanghongzhe@huawei.com
2021-02-10 12:40:11 -08:00
Linus Torvalds
6016bf19b3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Another pile of networing fixes:

   1) ath9k build error fix from Arnd Bergmann

   2) dma memory leak fix in mediatec driver from Lorenzo Bianconi.

   3) bpf int3 kprobe fix from Alexei Starovoitov.

   4) bpf stackmap integer overflow fix from Bui Quang Minh.

   5) Add usb device ids for Cinterion MV31 to qmi_qwwan driver, from
      Christoph Schemmel.

   6) Don't update deleted entry in xt_recent netfilter module, from
      Jazsef Kadlecsik.

   7) Use after free in nftables, fix from Pablo Neira Ayuso.

   8) Header checksum fix in flowtable from Sven Auhagen.

   9) Validate user controlled length in qrtr code, from Sabyrzhan
      Tasbolatov.

  10) Fix race in xen/netback, from Juergen Gross,

  11) New device ID in cxgb4, from Raju Rangoju.

  12) Fix ring locking in rxrpc release call, from David Howells.

  13) Don't return LAPB error codes from x25_open(), from Xie He.

  14) Missing error returns in gsi_channel_setup() from Alex Elder.

  15) Get skb_copy_and_csum_datagram working properly with odd segment
      sizes, from Willem de Bruijn.

  16) Missing RFS/RSS table init in enetc driver, from Vladimir Oltean.

  17) Do teardown on probe failure in DSA, from Vladimir Oltean.

  18) Fix compilation failures of txtimestamp selftest, from Vadim
      Fedorenko.

  19) Limit rx per-napi gro queue size to fix latency regression, from
      Eric Dumazet.

  20) dpaa_eth xdp fixes from Camelia Groza.

  21) Missing txq mode update when switching CBS off, in stmmac driver,
      from Mohammad Athari Bin Ismail.

  22) Failover pending logic fix in ibmvnic driver, from Sukadev
      Bhattiprolu.

  23) Null deref fix in vmw_vsock, from Norbert Slusarek.

  24) Missing verdict update in xdp paths of ena driver, from Shay
      Agroskin.

  25) seq_file iteration fix in sctp from Neil Brown.

  26) bpf 32-bit src register truncation fix on div/mod, from Daniel
      Borkmann.

  27) Fix jmp32 pruning in bpf verifier, from Daniel Borkmann.

  28) Fix locking in vsock_shutdown(), from Stefano Garzarella.

  29) Various missing index bound checks in hns3 driver, from Yufeng Mo.

  30) Flush ports on .phylink_mac_link_down() in dsa felix driver, from
      Vladimir Oltean.

  31) Don't mix up stp and mrp port states in bridge layer, from Horatiu
      Vultur.

  32) Fix locking during netif_tx_disable(), from Edwin Peer"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  bpf: Fix 32 bit src register truncation on div/mod
  bpf: Fix verifier jmp32 pruning decision logic
  bpf: Fix verifier jsgt branch analysis on max bound
  vsock: fix locking in vsock_shutdown()
  net: hns3: add a check for index in hclge_get_rss_key()
  net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
  net: hns3: add a check for queue_id in hclge_reset_vf_queue()
  net: dsa: felix: implement port flushing on .phylink_mac_link_down
  switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
  bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
  net: watchdog: hold device global xmit lock during tx disable
  netfilter: nftables: relax check for stateful expressions in set definition
  netfilter: conntrack: skip identical origin tuple in same zone only
  vsock/virtio: update credit only if socket is not closed
  net: fix iteration for sctp transport seq_files
  net: ena: Update XDP verdict upon failure
  net/vmw_vsock: improve locking in vsock_connect_timeout()
  net/vmw_vsock: fix NULL pointer dereference
  ibmvnic: Clear failover_pending if unable to schedule
  net: stmmac: set TxQ mode back to DCB after disabling CBS
  ...
2021-02-10 11:33:39 -08:00
Andrei Matei
01f810ace9 bpf: Allow variable-offset stack access
Before this patch, variable offset access to the stack was dissalowed
for regular instructions, but was allowed for "indirect" accesses (i.e.
helpers). This patch removes the restriction, allowing reading and
writing to the stack through stack pointers with variable offsets. This
makes stack-allocated buffers more usable in programs, and brings stack
pointers closer to other types of pointers.

The motivation is being able to use stack-allocated buffers for data
manipulation. When the stack size limit is sufficient, allocating
buffers on the stack is simpler than per-cpu arrays, or other
alternatives.

In unpriviledged programs, variable-offset reads and writes are
disallowed (they were already disallowed for the indirect access case)
because the speculative execution checking code doesn't support them.
Additionally, when writing through a variable-offset stack pointer, if
any pointers are in the accessible range, there's possilibities of later
leaking pointers because the write cannot be tracked precisely.

Writes with variable offset mark the whole range as initialized, even
though we don't know which stack slots are actually written. This is in
order to not reject future reads to these slots. Note that this doesn't
affect writes done through helpers; like before, helpers need the whole
stack range to be initialized to begin with.
All the stack slots are in range are considered scalars after the write;
variable-offset register spills are not tracked.

For reads, all the stack slots in the variable range needs to be
initialized (but see above about what writes do), otherwise the read is
rejected. All register spilled in stack slots that might be read are
marked as having been read, however reads through such pointers don't do
register filling; the target register will always be either a scalar or
a constant zero.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
2021-02-10 10:44:19 -08:00
Dan Carpenter
1e80d9cb57 module: potential uninitialized return in module_kallsyms_on_each_symbol()
Smatch complains that:

	kernel/module.c:4472 module_kallsyms_on_each_symbol()
        error: uninitialized symbol 'ret'.

This warning looks like it could be correct if the &modules list is
empty.

Fixes: 013c1667cf ("kallsyms: refactor {,module_}kallsyms_on_each_symbol")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2021-02-10 16:57:04 +01:00
Sebastian Andrzej Siewior
0f319d49a4 locking/mutex: Kill mutex_trylock_recursive()
There are not users of mutex_trylock_recursive() in tree as of
v5.11-rc7.

Remove it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210210085248.219210-2-bigeasy@linutronix.de
2021-02-10 14:44:40 +01:00
Peter Zijlstra
c8cc7e8531 lockdep: Noinstr annotate warn_bogus_irq_restore()
vmlinux.o: warning: objtool: lock_is_held_type()+0x107: call to warn_bogus_irq_restore() leaves .noinstr.text section

As per the general rule that WARNs are allowed to violate noinstr to
get out, annotate it away.

Fixes: 997acaf6b4 ("lockdep: report broken irq restoration")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lkml.kernel.org/r/YCKyYg53mMp4E7YI@hirez.programming.kicks-ass.net
2021-02-10 14:44:39 +01:00
Muchun Song
8a8109f303 printk: fix deadlock when kernel panic
printk_safe_flush_on_panic() caused the following deadlock on our
server:

CPU0:                                         CPU1:
panic                                         rcu_dump_cpu_stacks
  kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
    register_nmi_handler(crash_nmi_callback)      printk_safe_flush
                                                    __printk_safe_flush
                                                      raw_spin_lock_irqsave(&read_lock)
    // send NMI to other processors
    apic_send_IPI_allbutself(NMI_VECTOR)
                                                        // NMI interrupt, dead loop
                                                        crash_nmi_callback
  printk_safe_flush_on_panic
    printk_safe_flush
      __printk_safe_flush
        // deadlock
        raw_spin_lock_irqsave(&read_lock)

DEADLOCK: read_lock is taken on CPU1 and will never get released.

It happens when panic() stops a CPU by NMI while it has been in
the middle of printk_safe_flush().

Handle the lock the same way as logbuf_lock. The printk_safe buffers
are flushed only when both locks can be safely taken. It can avoid
the deadlock _in this particular case_ at expense of losing contents
of printk_safe buffers.

Note: It would actually be safe to re-init the locks when all CPUs were
      stopped by NMI. But it would require passing this information
      from arch-specific code. It is not worth the complexity.
      Especially because logbuf_lock and printk_safe buffers have been
      obsoleted by the lockless ring buffer.

Fixes: cf9b1106c8 ("printk/nmi: flush NMI messages on the system panic")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: <stable@vger.kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210210034823.64867-1-songmuchun@bytedance.com
2021-02-10 13:57:06 +01:00
Daniel Borkmann
e88b2c6e5a bpf: Fix 32 bit src register truncation on div/mod
While reviewing a different fix, John and I noticed an oddity in one of the
BPF program dumps that stood out, for example:

  # bpftool p d x i 13
   0: (b7) r0 = 808464450
   1: (b4) w4 = 808464432
   2: (bc) w0 = w0
   3: (15) if r0 == 0x0 goto pc+1
   4: (9c) w4 %= w0
  [...]

In line 2 we noticed that the mov32 would 32 bit truncate the original src
register for the div/mod operation. While for the two operations the dst
register is typically marked unknown e.g. from adjust_scalar_min_max_vals()
the src register is not, and thus verifier keeps tracking original bounds,
simplified:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = -1
  1: R0_w=invP-1 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b7) r1 = -1
  2: R0_w=invP-1 R1_w=invP-1 R10=fp0
  2: (3c) w0 /= w1
  3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP-1 R10=fp0
  3: (77) r1 >>= 32
  4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP4294967295 R10=fp0
  4: (bf) r0 = r1
  5: R0_w=invP4294967295 R1_w=invP4294967295 R10=fp0
  5: (95) exit
  processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

Runtime result of r0 at exit is 0 instead of expected -1. Remove the
verifier mov32 src rewrite in div/mod and replace it with a jmp32 test
instead. After the fix, we result in the following code generation when
having dividend r1 and divisor r6:

  div, 64 bit:                             div, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (55) if r6 != 0x0 goto pc+2           2: (56) if w6 != 0x0 goto pc+2
   3: (ac) w1 ^= w1                         3: (ac) w1 ^= w1
   4: (05) goto pc+1                        4: (05) goto pc+1
   5: (3f) r1 /= r6                         5: (3c) w1 /= w6
   6: (b7) r0 = 0                           6: (b7) r0 = 0
   7: (95) exit                             7: (95) exit

  mod, 64 bit:                             mod, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (15) if r6 == 0x0 goto pc+1           2: (16) if w6 == 0x0 goto pc+1
   3: (9f) r1 %= r6                         3: (9c) w1 %= w6
   4: (b7) r0 = 0                           4: (b7) r0 = 0
   5: (95) exit                             5: (95) exit

x86 in particular can throw a 'divide error' exception for div
instruction not only for divisor being zero, but also for the case
when the quotient is too large for the designated register. For the
edx:eax and rdx:rax dividend pair it is not an issue in x86 BPF JIT
since we always zero edx (rdx). Hence really the only protection
needed is against divisor being zero.

Fixes: 68fda450a7 ("bpf: fix 32-bit divide by zero")
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-02-10 01:32:40 +01:00
Daniel Borkmann
fd675184fc bpf: Fix verifier jmp32 pruning decision logic
Anatoly has been fuzzing with kBdysch harness and reported a hang in
one of the outcomes:

  func#0 @0
  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = 808464450
  1: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b4) w4 = 808464432
  2: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP808464432 R10=fp0
  2: (9c) w4 %= w0
  3: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
  3: (66) if w4 s> 0x30303030 goto pc+0
   R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  4: (7f) r0 >>= r0
  5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
  5: (9c) w4 %= w0
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
  9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  9: (95) exit
  propagating r0

  from 6 to 7: safe
  4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
  4: (7f) r0 >>= r0
  5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
  5: (9c) w4 %= w0
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  propagating r0
  7: safe
  propagating r0

  from 6 to 7: safe
  processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1

The underlying program was xlated as follows:

  # bpftool p d x i 10
   0: (b7) r0 = 808464450
   1: (b4) w4 = 808464432
   2: (bc) w0 = w0
   3: (15) if r0 == 0x0 goto pc+1
   4: (9c) w4 %= w0
   5: (66) if w4 s> 0x30303030 goto pc+0
   6: (7f) r0 >>= r0
   7: (bc) w0 = w0
   8: (15) if r0 == 0x0 goto pc+1
   9: (9c) w4 %= w0
  10: (66) if w0 s> 0x3030 goto pc+0
  11: (d6) if w0 s<= 0x303030 goto pc+1
  12: (05) goto pc-1
  13: (95) exit

The verifier rewrote original instructions it recognized as dead code with
'goto pc-1', but reality differs from verifier simulation in that we are
actually able to trigger a hang due to hitting the 'goto pc-1' instructions.

Taking a closer look at the verifier analysis, the reason is that it misjudges
its pruning decision at the first 'from 6 to 7: safe' occasion. What happens
is that while both old/cur registers are marked as precise, they get misjudged
for the jmp32 case as range_within() yields true, meaning that the prior
verification path with a wider register bound could be verified successfully
and therefore the current path with a narrower register bound is deemed safe
as well whereas in reality it's not. R0 old/cur path's bounds compare as
follows:

  old: smin_value=0x8000000000000000,smax_value=0x7fffffffffffffff,umin_value=0x0,umax_value=0xffffffffffffffff,var_off=(0x0; 0xffffffffffffffff)
  cur: smin_value=0x8000000000000000,smax_value=0x7fffffff7fffffff,umin_value=0x0,umax_value=0xffffffff7fffffff,var_off=(0x0; 0xffffffff7fffffff)

  old: s32_min_value=0x80000000,s32_max_value=0x00003030,u32_min_value=0x00000000,u32_max_value=0xffffffff
  cur: s32_min_value=0x00003031,s32_max_value=0x7fffffff,u32_min_value=0x00003031,u32_max_value=0x7fffffff

The 64 bit bounds generally look okay and while the information that got
propagated from 32 to 64 bit looks correct as well, it's not precise enough
for judging a conditional jmp32. Given the latter only operates on subregisters
we also need to take these into account as well for a range_within() probe
in order to be able to prune paths. Extending the range_within() constraint
to both bounds will be able to tell us that the old signed 32 bit bounds are
not wider than the cur signed 32 bit bounds.

With the fix in place, the program will now verify the 'goto' branch case as
it should have been:

  [...]
  6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  6: (66) if w0 s> 0x3030 goto pc+0
   R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
  9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
  9: (95) exit

  7: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=12337,u32_min_value=12337,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  7: (d6) if w0 s<= 0x303030 goto pc+1
   R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  8: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
  8: (30) r0 = *(u8 *)skb[808464432]
  BPF_LD_[ABS|IND] uses reserved fields
  processed 11 insns (limit 1000000) max_states_per_insn 1 total_states 1 peak_states 1 mark_read 1

The bug is quite subtle in the sense that when verifier would determine that
a given branch is dead code, it would (here: wrongly) remove these instructions
from the program and hard-wire the taken branch for privileged programs instead
of the 'goto pc-1' rewrites which will cause hard to debug problems.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-02-10 01:31:46 +01:00