Commit Graph

778 Commits

Author SHA1 Message Date
Linus Torvalds bc6cb4d5bc Locking changes for v6.5:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
 
   The cmpxchg128() family of functions is basically & functionally
   the same as cmpxchg_double(), but with a saner interface: instead
   of a 6-parameter horror that forced u128 - u64/u64-halves layout
   details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128 types.
 
 - Restructure the generated atomic headers, and add
   kerneldoc comments for all of the generic atomic{,64,_long}_t
   operations. Generated definitions are much cleaner now,
   and come with documentation.
 
 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering
   when taking multiple locks of the same type. This gets rid of
   one use of lockdep_set_novalidate_class() in the bcache code.
 
 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
   variable shadowing generating garbage code on Clang on certain
   ARM builds.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
 zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
 QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
 8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
 sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
 wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
 YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
 r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
 giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
 7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
 o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
 x9Nt+Tp0Ze4=
 =DsYj
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()

   The cmpxchg128() family of functions is basically & functionally the
   same as cmpxchg_double(), but with a saner interface.

   Instead of a 6-parameter horror that forced u128 - u64/u64-halves
   layout details on the interface and exposed users to complexity,
   fragility & bugs, use a natural 3-parameter interface with u128
   types.

 - Restructure the generated atomic headers, and add kerneldoc comments
   for all of the generic atomic{,64,_long}_t operations.

   The generated definitions are much cleaner now, and come with
   documentation.

 - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
   taking multiple locks of the same type.

   This gets rid of one use of lockdep_set_novalidate_class() in the
   bcache code.

 - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
   shadowing generating garbage code on Clang on certain ARM builds.

* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
  percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
  locking/atomic: treewide: delete arch_atomic_*() kerneldoc
  locking/atomic: docs: Add atomic operations to the driver basic API documentation
  locking/atomic: scripts: generate kerneldoc comments
  docs: scripts: kernel-doc: accept bitwise negation like ~@var
  locking/atomic: scripts: simplify raw_atomic*() definitions
  locking/atomic: scripts: simplify raw_atomic_long*() definitions
  locking/atomic: scripts: split pfx/name/sfx/order
  locking/atomic: scripts: restructure fallback ifdeffery
  locking/atomic: scripts: build raw_atomic_long*() directly
  locking/atomic: treewide: use raw_atomic*_<op>()
  locking/atomic: scripts: add trivial raw_atomic*_<op>()
  locking/atomic: scripts: factor out order template generation
  locking/atomic: scripts: remove leftover "${mult}"
  locking/atomic: scripts: remove bogus order parameter
  locking/atomic: xtensa: add preprocessor symbols
  locking/atomic: x86: add preprocessor symbols
  locking/atomic: sparc: add preprocessor symbols
  locking/atomic: sh: add preprocessor symbols
  ...
2023-06-27 14:14:30 -07:00
Linus Torvalds 4baa098a14 - Remove the local symbols prefix of the get/put_user() exception
handling symbols so that tools do not get confused by the presence of
   code belonging to the wrong symbol/not belonging to any symbol
 
 - Improve csum_partial()'s performance
 
 - Some improvements to the kcpuid tool
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSawNoACgkQEsHwGGHe
 VUppXw//YezVoWUUUeTedZl8nRbotwXUlATjsIGcRGe2rZQ/7Ud/NUagWiLmKcpy
 fAEt+Rd0MbukCNPmTjcw04NN9djs2avVXJS3CCsGNDv/Q6AsBpMcOD4dESxbWIgh
 NpkNvO3bKRKxtaoJukmxiiIBlMFzXXtKg/fgzB8FeYZDhGMfS7wBlcDeJIdmWWxO
 T5hykFoc/47e8SPG+K/VLT8hoQCg4KPpi3aSN6n+eq8nnlosABr95JKvgqeq1mXf
 UPdITYzKHDiny0ZqL2nqsx1MGh24CLc3QCxi5qMDE27NVFokRdfyCiK3DVZvgrNo
 IA5BsiKJ0Ddeo2F1Weu+rBI7Hhf+OBZlw7WmWpqQ3rEbeEJ4L1iWeeHwrBNzyuZq
 ftb7OScukusaGAMamhhnErR2GwdP3SBDnnUtsue3qqPK1acYPdFfCJCqXvYsCczQ
 Pn6eKE2Vlp/3febce7QtZtcz7qlv60UZvj3OpYbECIKcD1/8BWEidquSgPASxs9e
 WH+MvDlV/tgwzLVAG0Zp5x7DE/VzDPIKtMMRzgx1clSSPyRwzW0jhp+C4/xPsDCT
 2lLHZu/ay7O2A1kiH6m0/ULAm/gUzRNsKCNRlP/HVVXl7+U6lZeZR3D14QOPl8n8
 F1W/seOCLxnxx8dVF/hHmirDQuwSjF9vRewmWvvOUgzmYBid8j0=
 =6U2J
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - Remove the local symbols prefix of the get/put_user() exception
   handling symbols so that tools do not get confused by the presence of
   code belonging to the wrong symbol/not belonging to any symbol

 - Improve csum_partial()'s performance

 - Some improvements to the kcpuid tool

* tag 'x86_misc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Make get/put_user() exception handling a visible symbol
  x86/csum: Fix clang -Wuninitialized in csum_partial()
  x86/csum: Improve performance of `csum_partial`
  tools/x86/kcpuid: Add .gitignore
  tools/x86/kcpuid: Dump the correct CPUID function in error
2023-06-27 12:25:42 -07:00
Linus Torvalds 19300488c9 - Address -Wmissing-prototype warnings
- Remove repeated 'the' in comments
  - Remove unused current_untag_mask()
  - Document urgent tip branch timing
  - Clean up MSR kernel-doc notation
  - Clean up paravirt_ops doc
  - Update Srivatsa S. Bhat's maintained areas
  - Remove unused extern declaration acpi_copy_wakeup_routine()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ6xIACgkQaDWVMHDJ
 krB9aQ/+NjB4CiWLbrnOYj9QYG6p1GE7lfu2dzIDdmcNuiai8htopXys54Igy3Rq
 BbIoW4E0SGK5E2OD7nLe4fBA/LpsYZTwDhGUu3SiovxLOoC5qkF0Q+6aVypPJE5o
 q7kn0Eo9IDL1dO0EbJptFDJRjk3K5caEoyXJRelarjIfPRbDEhUFaybVRykMZN9I
 4AOxrlb9WFggT4gUE4+N0kWyEqdgI9/aguavmasaG4lBHZ5JAHNQPNIa8bkVSAPL
 wULAzsrGp96V3tVxdjDCzD9aumk4xlJq7gk+v7mfx013dg7Cjs074Xoi2Y+TmaC7
 fdIZiGPJIkNToW+nENVO7BYtACSQhXeVTGxLQO/HNTDc//ZWiIUoJT2U4qu/6e6F
 aAIGoLwv68H4BghS2qx6Gz+BTIfl35mcPUb75MQhu+D84QZoZWrdamCYhsvHeZzc
 uC3nojrb6PBOth9nJsRae+j1zpRe/DT2LvHSWPJgK6EygOAi05ZfYUll/6sb0vze
 IXkUrVV1BvDDVpY9/HnE8RpDCDolP0/ezK9zsw48arZtkc+Qmw2WlD/2D98E+pSb
 MJPelbVmpzWTaoR4jDzXJCXkWe7CQJ5uPQj5azAE9l7YvnxgCQP5xnm5sLU9eyLu
 RsOwRzss0+3z44x5rJi9nSxQJ0LHfTAzW8/ZmNSZGHzi0ClszK0=
 =N82i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Dave Hansen:
 "As usual, these are all over the map. The biggest cluster is work from
  Arnd to eliminate -Wmissing-prototype warnings:

   - Address -Wmissing-prototype warnings

   - Remove repeated 'the' in comments

   - Remove unused current_untag_mask()

   - Document urgent tip branch timing

   - Clean up MSR kernel-doc notation

   - Clean up paravirt_ops doc

   - Update Srivatsa S. Bhat's maintained areas

   - Remove unused extern declaration acpi_copy_wakeup_routine()"

* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
  Documentation: virt: Clean up paravirt_ops doc
  x86/mm: Remove unused current_untag_mask()
  x86/mm: Remove repeated word in comments
  x86/lib/msr: Clean up kernel-doc notation
  x86/platform: Avoid missing-prototype warnings for OLPC
  x86/mm: Add early_memremap_pgprot_adjust() prototype
  x86/usercopy: Include arch_wb_cache_pmem() declaration
  x86/vdso: Include vdso/processor.h
  x86/mce: Add copy_mc_fragile_handle_tail() prototype
  x86/fbdev: Include asm/fb.h as needed
  x86/hibernate: Declare global functions in suspend.h
  x86/entry: Add do_SYSENTER_32() prototype
  x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
  x86/mm: Include asm/numa.h for set_highmem_pages_init()
  x86: Avoid missing-prototype warnings for doublefault code
  x86/fpu: Include asm/fpu/regset.h
  x86: Add dummy prototype for mk_early_pgtbl_32()
  x86/pci: Mark local functions as 'static'
  x86/ftrace: Move prepare_ftrace_return prototype to header
  ...
2023-06-26 16:43:54 -07:00
Linus Torvalds 941d77c773 - Compute the purposeful misalignment of zen_untrain_ret automatically
and assert __x86_return_thunk's alignment so that future changes to
   the symbol macros do not accidentally break them.
 
 - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
   pointless
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ1wgACgkQEsHwGGHe
 VUrXlRAAhIonFM1suIHo6w085jY5YA1XnsziJr/bT3e16FdHrF1i3RBEX4ml0m3O
 ADwa9dMsC9UJIa+/TKRNFfQvfRcLE/rsUKlS1Rluf/IRIxuSt/Oa4bFQHGXFRwnV
 eSlnWTNiaWrRs/vJEYAnMOe98oRyElHWa9kZ7K5FC+Ksfn/WO1U1RQ2NWg2A2wkN
 8MHJiS41w2piOrLU/nfUoI7+esHgHNlib222LoptDGHuaY8V2kBugFooxAEnTwS3
 PCzWUqCTgahs393vbx6JimoIqgJDa7bVdUMB0kOUHxtpbBiNdYYVy6e7UKnV1yjB
 qP3v9jQW4+xIyRmlFiErJXEZx7DjAIP5nulGRrUMzRfWEGF8mdRZ+ugGqFMHCeC8
 vXI+Ixp2vvsfhG3N/algsJUdkjlpt3hBpElRZCfR08M253KAbAmUNMOr4sx4RPi5
 ymC+pLIHd1K0G9jiZaFnOMaY71gAzWizwxwjFKLQMo44q+lpNJvsVO00cr+9RBYj
 LQL2APkONVEzHPMYR/LrXCslYaW//DrfLdRQjNbzUTonxFadkTO2Eu8J90B/5SFZ
 CqC1NYKMQPVFeg4XuGWCgZEH+jokCGhl8vvmXClAMcOEOZt0/s4H89EKFkmziyon
 L1ZrA/U72gWV8EwD7GLtuFJmnV4Ayl/hlek2j0qNKaj6UUgTFg8=
 =LcUq
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Compute the purposeful misalignment of zen_untrain_ret automatically
   and assert __x86_return_thunk's alignment so that future changes to
   the symbol macros do not accidentally break them.

 - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
   pointless

* tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/retbleed: Add __x86_return_thunk alignment checks
  x86/cpu: Remove X86_FEATURE_NAMES
  x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
2023-06-26 15:42:34 -07:00
Linus Torvalds 8c69e7afe9 - Up until now the Fast Short Rep Mov optimizations implied the presence
of the ERMS CPUID flag. AMD decoupled them with a BIOS setting so decouple
   that dependency in the kernel code too
 
 - Teach the alternatives machinery to handle relocations
 
 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in
 
 - Other fixes, cleanups and optimizations to the patching code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZi2AACgkQEsHwGGHe
 VUqhGw/9EC/m5HTFBlCy9PS5Qy6pPLzmHR5Tuy4meqlnB1gN+5wzfxdYEwHm46hH
 SR6WqR12yVaCMIzh66y8nTJyMbIykaBbfFJb3WesdDrBIYUZ9f+7O+Xd0JS6Jykd
 2HBHOyaVS1/W75+y6w9JhTExBH5xieCpJVIYyAvifbn/pB8XmuTTwJ1Z3EJ8DzkK
 AN16i46bUiKNBdTYZUMhtKL4vHVfqLYMskgWe6IG7DmRLOwikR0uRVhuVqP/bmUj
 U128cUacGJT2AYbZarTAKmOa42nDj3TpJqRp1qit3y6Cun4vxKH+1A91UPd7IHTa
 M5H1bNSgfXMm8rU+JgfvXKqrCTckGn2OqlCkJfPV3RBeP9IcQBBF0vE3dnM/X2We
 dwbXeDfJvc+1s4/M41MOhyahTUbW+4iRK5UCZEt1mprTbtzHTlN7RROo7QLpFsWx
 T0Jqvsd1raAutPTgTjU7ToQwDpSQNnn4Y/KoEdpvOCXR8wU7Wo5/+Qa4tEkIY3W6
 mUFpJcgFC9QEKLuaNAofPIhMuZ/vzRVtpK7wbLn4KR5JZA8AxznenMFVg8YPWRFI
 4oga0kMFJ7t6z/CXHtrxFaLQ9e7WAUSRU6gPiz8As1F/K9N0JWMUfjuTJcgjUsF8
 bwdCNinwG8y3rrPUCrqbO5N766ZkLYd6NksKlmIyUvtCcS0ksbg=
 =mH38
 -----END PGP SIGNATURE-----

Merge tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 instruction alternatives updates from Borislav Petkov:

 - Up until now the Fast Short Rep Mov optimizations implied the
   presence of the ERMS CPUID flag. AMD decoupled them with a BIOS
   setting so decouple that dependency in the kernel code too

 - Teach the alternatives machinery to handle relocations

 - Make debug_alternative accept flags in order to see only that set of
   patching done one is interested in

 - Other fixes, cleanups and optimizations to the patching code

* tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternative: PAUSE is not a NOP
  x86/alternatives: Add cond_resched() to text_poke_bp_batch()
  x86/nospec: Shorten RESET_CALL_DEPTH
  x86/alternatives: Add longer 64-bit NOPs
  x86/alternatives: Fix section mismatch warnings
  x86/alternative: Optimize returns patching
  x86/alternative: Complicate optimize_nops() some more
  x86/alternative: Rewrite optimize_nops() some
  x86/lib/memmove: Decouple ERMS from FSRM
  x86/alternative: Support relocations in alternatives
  x86/alternative: Make debug-alternative selective
2023-06-26 15:14:55 -07:00
Randy Dunlap b26d3d054d x86/lib/msr: Clean up kernel-doc notation
Convert x86/lib/msr.c comments to kernel-doc notation to
eliminate kernel-doc warnings:

  msr.c:30: warning: This comment starts with '/**', but isn't \
  a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
  ...

Fixes: 22085a66c2 ("x86: Add another set of MSR accessor functions")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/oe-kbuild-all/202304120048.v4uqUq9Q-lkp@intel.com/
2023-06-06 15:19:50 +02:00
Peter Zijlstra 6d12c8d308 percpu: Wire up cmpxchg128
In order to replace cmpxchg_double() with the newly minted
cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230531132323.654945124@infradead.org
2023-06-05 09:36:37 +02:00
Nadav Amit 5516c89d58 x86/lib: Make get/put_user() exception handling a visible symbol
The .L-prefixed exception handling symbols of get_user() and put_user()
do get discarded from the symbol table of the final kernel image.

This confuses tools which parse that symbol table and try to map the
chunk of code to a symbol. And, in general, from toolchain perspective,
it is a good practice to have all code belong to a symbol, and the
correct one at that.

  ( Currently, objdump displays that exception handling chunk as part
    of the previous symbol which is a "fallback" of sorts and not
    correct. )

While at it, rename them to something more descriptive.

  [ bp: Rewrite commit message, rename symbols. ]

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230525184244.2311-1-namit@vmware.com
2023-06-02 10:51:46 +02:00
Nathan Chancellor 2fe1e67e69 x86/csum: Fix clang -Wuninitialized in csum_partial()
Clang warns:

  arch/x86/lib/csum-partial_64.c:74:20: error: variable 'result' is uninitialized when used here [-Werror,-Wuninitialized]
                  return csum_tail(result, temp64, odd);
                                   ^~~~~~
  arch/x86/lib/csum-partial_64.c:48:22: note: initialize the variable 'result' to silence this warning
          unsigned odd, result;
                              ^
                               = 0
  1 error generated.

The only initialization and uses of result in csum_partial() were moved
into csum_tail() but result is still being passed by value to
csum_tail() (clang's -Wuninitialized does not do interprocedural
analysis to realize that result is always assigned in csum_tail()
however). Sink the declaration of result into csum_tail() to clear up
the warning.

Closes: https://lore.kernel.org/202305262039.3HUYjWJk-lkp@intel.com/
Fixes: 688eb8191b ("x86/csum: Improve performance of `csum_partial`")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230526-csum_partial-wuninitialized-v1-1-ebc0108dcec1%40kernel.org
2023-05-29 06:52:32 -07:00
Linus Torvalds 47ee3f1dd9 x86: re-introduce support for ERMS copies for user space accesses
I tried to streamline our user memory copy code fairly aggressively in
commit adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory
copies"), in order to then be able to clean up the code and inline the
modern FSRM case in commit 577e6a7fd5 ("x86: inline the 'rep movs' in
user copies for the FSRM case").

We had reports [1] of that causing regressions earlier with blogbench,
but that turned out to be a horrible benchmark for that case, and not a
sufficient reason for re-instating "rep movsb" on older machines.

However, now Eric Dumazet reported [2] a regression in performance that
seems to be a rather more real benchmark, where due to the removal of
"rep movs" a TCP stream over a 100Gbps network no longer reaches line
speed.

And it turns out that with the simplified the calling convention for the
non-FSRM case in commit 427fda2c8a ("x86: improve on the non-rep
'copy_user' function"), re-introducing the ERMS case is actually fairly
simple.

Of course, that "fairly simple" is glossing over several missteps due to
having to fight our assembler alternative code.  This code really wanted
to rewrite a conditional branch to have two different targets, but that
made objtool sufficiently unhappy that this instead just ended up doing
a choice between "jump to the unrolled loop, or use 'rep movsb'
directly".

Let's see if somebody finds a case where the kernel memory copies also
care (see commit 68674f94ffc9: "x86: don't use REP_GOOD or ERMS for
small memory copies").  But Eric does argue that the user copies are
special because networking tries to copy up to 32KB at a time, if
order-3 pages allocations are possible.

In-kernel memory copies are typically small, unless they are the special
"copy pages at a time" kind that still use "rep movs".

Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
Fixes: adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-26 12:34:20 -07:00
Noah Goldstein 688eb8191b x86/csum: Improve performance of `csum_partial`
1) Add special case for len == 40 as that is the hottest value. The
   nets a ~8-9% latency improvement and a ~30% throughput improvement
   in the len == 40 case.

2) Use multiple accumulators in the 64-byte loop. This dramatically
   improves ILP and results in up to a 40% latency/throughput
   improvement (better for more iterations).

Results from benchmarking on Icelake. Times measured with rdtsc()
 len   lat_new   lat_old      r    tput_new  tput_old      r
   8      3.58      3.47  1.032        3.58      3.51  1.021
  16      4.14      4.02  1.028        3.96      3.78  1.046
  24      4.99      5.03  0.992        4.23      4.03  1.050
  32      5.09      5.08  1.001        4.68      4.47  1.048
  40      5.57      6.08  0.916        3.05      4.43  0.690
  48      6.65      6.63  1.003        4.97      4.69  1.059
  56      7.74      7.72  1.003        5.22      4.95  1.055
  64      6.65      7.22  0.921        6.38      6.42  0.994
  96      9.43      9.96  0.946        7.46      7.54  0.990
 128      9.39     12.15  0.773        8.90      8.79  1.012
 200     12.65     18.08  0.699       11.63     11.60  1.002
 272     15.82     23.37  0.677       14.43     14.35  1.005
 440     24.12     36.43  0.662       21.57     22.69  0.951
 952     46.20     74.01  0.624       42.98     53.12  0.809
1024     47.12     78.24  0.602       46.36     58.83  0.788
1552     72.01    117.30  0.614       71.92     96.78  0.743
2048     93.07    153.25  0.607       93.28    137.20  0.680
2600    114.73    194.30  0.590      114.28    179.32  0.637
3608    156.34    268.41  0.582      154.97    254.02  0.610
4096    175.01    304.03  0.576      175.89    292.08  0.602

There is no such thing as a free lunch, however, and the special case
for len == 40 does add overhead to the len != 40 cases. This seems to
amount to be ~5% throughput and slightly less in terms of latency.

Testing:
Part of this change is a new kunit test. The tests check all
alignment X length pairs in [0, 64) X [0, 512).
There are three cases.
    1) Precomputed random inputs/seed. The expected results where
       generated use the generic implementation (which is assumed to be
       non-buggy).
    2) An input of all 1s. The goal of this test is to catch any case
       a carry is missing.
    3) An input that never carries. The goal of this test si to catch
       any case of incorrectly carrying.

More exhaustive tests that test all alignment X length pairs in
[0, 8192) X [0, 8192] on random data are also available here:
https://github.com/goldsteinn/csum-reproduction

The reposity also has the code for reproducing the above benchmark
numbers.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
2023-05-25 10:55:18 -07:00
Arnd Bergmann 3b939ba0c2 x86/usercopy: Include arch_wb_cache_pmem() declaration
arch_wb_cache_pmem() is declared in a global header but defined by
the architecture. On x86, the implementation needs to include the header
to avoid this warning:

arch/x86/lib/usercopy_64.c:39:6: error: no previous prototype for 'arch_wb_cache_pmem' [-Werror=missing-prototypes]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-18-arnd%40kernel.org
2023-05-18 11:56:18 -07:00
Borislav Petkov (AMD) f220125b99 x86/retbleed: Add __x86_return_thunk alignment checks
Add a linker assertion and compute the 0xcc padding dynamically so that
__x86_return_thunk is always cacheline-aligned. Leave the SYM_START()
macro in as the untraining doesn't need ENDBR annotations anyway.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Link: https://lore.kernel.org/r/20230515140726.28689-1-bp@alien8.de
2023-05-17 12:14:21 +02:00
Borislav Petkov (AMD) 9a48d60467 x86/retbleed: Fix return thunk alignment
SYM_FUNC_START_LOCAL_NOALIGN() adds an endbr leading to this layout
(leaving only the last 2 bytes of the address):

  3bff <zen_untrain_ret>:
  3bff:       f3 0f 1e fa             endbr64
  3c03:       f6                      test   $0xcc,%bl

  3c04 <__x86_return_thunk>:
  3c04:       c3                      ret
  3c05:       cc                      int3
  3c06:       0f ae e8                lfence

However, "the RET at __x86_return_thunk must be on a 64 byte boundary,
for alignment within the BTB."

Use SYM_START instead.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-12 17:19:53 -05:00
Borislav Petkov (AMD) 14e4ec9c3e x86/lib/memmove: Decouple ERMS from FSRM
Up until now it was perceived that FSRM is an improvement to ERMS and
thus it was made dependent on latter.

However, there are AMD BIOSes out there which allow for disabling of
either features and thus preventing kernels from booting due to the CMP
disappearing and thus breaking the logic in the memmove() function.

Similar observation happens on some VM migration scenarios.

Patch the proper sequences depending on which feature is enabled.

Reported-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y/yK0dyzI0MMdTie@zn.tnic
2023-05-10 14:51:56 +02:00
Linus Torvalds 2aff7c706c Objtool changes for v6.4:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did
    this inconsistently follow this new, common convention, and fix all the fallout
    that objtool can now detect statically.
 
  - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity,
    split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it.
 
  - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code.
 
  - Generate ORC data for __pfx code
 
  - Add more __noreturn annotations to various kernel startup/shutdown/panic functions.
 
  - Misc improvements & fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM
 pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul
 TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd
 LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP
 c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT
 /PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a
 1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1
 0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ
 SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p
 vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo
 x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi
 fRho4CqzrtM=
 =NwEV
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Mark arch_cpu_idle_dead() __noreturn, make all architectures &
   drivers that did this inconsistently follow this new, common
   convention, and fix all the fallout that objtool can now detect
   statically

 - Fix/improve the ORC unwinder becoming unreliable due to
   UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
   and UNWIND_HINT_UNDEFINED to resolve it

 - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code

 - Generate ORC data for __pfx code

 - Add more __noreturn annotations to various kernel startup/shutdown
   and panic functions

 - Misc improvements & fixes

* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/hyperv: Mark hv_ghcb_terminate() as noreturn
  scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
  x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
  btrfs: Mark btrfs_assertfail() __noreturn
  objtool: Include weak functions in global_noreturns check
  cpu: Mark nmi_panic_self_stop() __noreturn
  cpu: Mark panic_smp_self_stop() __noreturn
  arm64/cpu: Mark cpu_park_loop() and friends __noreturn
  x86/head: Mark *_start_kernel() __noreturn
  init: Mark start_kernel() __noreturn
  init: Mark [arch_call_]rest_init() __noreturn
  objtool: Generate ORC data for __pfx code
  x86/linkage: Fix padding for typed functions
  objtool: Separate prefix code from stack validation code
  objtool: Remove superfluous dead_end_function() check
  objtool: Add symbol iteration helpers
  objtool: Add WARN_INSN()
  scripts/objdump-func: Support multiple functions
  context_tracking: Fix KCSAN noinstr violation
  objtool: Add stackleak instrumentation to uaccess safe list
  ...
2023-04-28 14:02:54 -07:00
Linus Torvalds 22b8cc3e78 Add support for new Linear Address Masking CPU feature. This is similar
to ARM's Top Byte Ignore and allows userspace to store metadata in some
 bits of pointers without masking it out before use.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ
 krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt
 5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W
 cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE
 pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l
 Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i
 j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn
 +5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+
 +YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK
 A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd
 m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc
 ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI=
 =qitk
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
 "Add support for the new Linear Address Masking CPU feature.

  This is similar to ARM's Top Byte Ignore and allows userspace to store
  metadata in some bits of pointers without masking it out before use"

* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
  x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
  selftests/x86/lam: Add test cases for LAM vs thread creation
  selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
  selftests/x86/lam: Add inherit test cases for linear-address masking
  selftests/x86/lam: Add io_uring test cases for linear-address masking
  selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
  selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
  x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
  iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
  mm: Expose untagging mask in /proc/$PID/status
  x86/mm: Provide arch_prctl() interface for LAM
  x86/mm: Reduce untagged_addr() overhead for systems without LAM
  x86/uaccess: Provide untagged_addr() and remove tags before address check
  mm: Introduce untagged_addr_remote()
  x86/mm: Handle LAM on context switch
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86: Allow atomic MM_CONTEXT flags setting
  x86/mm: Rework address range check in get_user() and put_user()
2023-04-28 09:43:49 -07:00
Linus Torvalds 682f7bbad2 - Unify duplicated __pa() and __va() definitions
- Simplify sysctl tables registration
 
 - Remove unused symbols
 
 - Correct function name in comment
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRKjI4ACgkQEsHwGGHe
 VUq85Q/9FbGNdHD2uX2KcpeWkEjYVlrhDudsG8e1JAMCjfSD0CCNhG7yU+6Jabfs
 LszYLwkNeHVpLUUOOtAnObqXpdcv2vML7/j6Cgg5aqdMDv3RwIgTti5tSkHr7s1A
 ejH0Qo/oYYt2OsJYkl+KuGhcaBmdpqEOIeOtV98vBtqgkRDCwdJhhMZeF0qgZ1kN
 r3bFdwy0KIiyI+EBYDXEsew/nI9oEuzoNgaOVIZCeOtHjtbgdl/kc7JgfDd0838D
 nsoNk1R8PVSl6RY30my7TKbFl7epWibinnD9M8NcyYpbLlfZKI7L60ZtQZ5Q49pz
 z+LtXTgeS/fjaFuM8LKkekGprpNiDClgygNini3QsmSb3kfb4ymxJLKbVuXziOLZ
 eYAE+xexCNUYXhmeamvPWjRP9cUgQc3TQD0IQFv/FO8M0gXBA4jTauyRrs+NNmVI
 G7W7T90x1XUu4fZDM/QZ2cn5qtdcRMZm4NcV0WY5OU/ZrrMmMNyGvDfrwLhFOSXi
 nOqzlJ9GNRVjhHsQhCG16B2y3guWmPGXyCvn6Ruuv7RQcm7oK4Rmq6bHuuqcAyaI
 R5z2pRib3AzPNgHUfMgDWuCa7D9jBimVJI/dG0bXG8DCnzaBXfYJn2ruvwvQlVLC
 4WqwdyUxR7k+vf1l0kQ5voGCLbXOcLFBfGP+7RRnEzlyCut2t74=
 =I3Mj
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Unify duplicated __pa() and __va() definitions

 - Simplify sysctl tables registration

 - Remove unused symbols

 - Correct function name in comment

* tag 'x86_cleanups_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Centralize __pa()/__va() definitions
  x86: Simplify one-level sysctl registration for itmt_kern_table
  x86: Simplify one-level sysctl registration for abi_table2
  x86/platform/intel-mid: Remove unused definitions from intel-mid.h
  x86/uaccess: Remove memcpy_page_flushcache()
  x86/entry: Change stale function name in comment to error_return()
2023-04-28 09:22:30 -07:00
Linus Torvalds 034ff37d34 x86: rewrite '__copy_user_nocache' function
I didn't really want to do this, but as part of all the other changes to
the user copy loops, I've been looking at this horror.

I tried to clean it up multiple times, but every time I just found more
problems, and the way it's written, it's just too hard to fix them.

For example, the code is written to do quad-word alignment, and will use
regular byte accesses to get to that point.  That's fairly simple, but
it means that any initial 8-byte alignment will be done with cached
copies.

However, the code then is very careful to do any 4-byte _tail_ accesses
using an uncached 4-byte write, and that was claimed to be relevant in
commit a82eee7424 ("x86/uaccess/64: Handle the caching of 4-byte
nocache copies properly in __copy_user_nocache()").

So if you do a 4-byte copy using that function, it carefully uses a
4-byte 'movnti' for the destination.  But if you were to do a 12-byte
copy that is 4-byte aligned, it would _not_ do a 4-byte 'movnti'
followed by a 8-byte 'movnti' to keep it all uncached.

Instead, it would align the destination to 8 bytes using a
byte-at-a-time loop, and then do a 8-byte 'movnti' for the final 8
bytes.

The main caller that cares is __copy_user_flushcache(), which knows
about this insanity, and has odd cases for it all.  But I just can't
deal with looking at this kind of "it does one case right, and another
related case entirely wrong".

And the code really wasn't fixable without hard drugs, which I try to
avoid.

So instead, rewrite it in a form that hopefully not only gets this
right, but is a bit more maintainable.  Knock wood.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-20 18:53:49 -07:00
Linus Torvalds e1f2750edc x86: remove 'zerorest' argument from __copy_user_nocache()
Every caller passes in zero, meaning they don't want any partial copy to
zero the remainder of the destination buffer.

Which is just as well, because the implementation of that function
didn't actually even look at that argument, and wasn't even aware it
existed, although some misleading comments did mention it still.

The 'zerorest' thing is a historical artifact of how "copy_from_user()"
worked, in that it would zero the rest of the kernel buffer that it
copied into.

That zeroing still exists, but it's long since been moved to generic
code, and the raw architecture-specific code doesn't do it.  See
_copy_from_user() in lib/usercopy.c for this all.

However, while __copy_user_nocache() shares some history and superficial
other similarities with copy_from_user(), it is in many ways also very
different.

In particular, while the code makes it *look* similar to the generic
user copy functions that can copy both to and from user space, and take
faults on both reads and writes as a result, __copy_user_nocache() does
no such thing at all.

__copy_user_nocache() always copies to kernel space, and will never take
a page fault on the destination.  What *can* happen, though, is that the
non-temporal stores take a machine check because one of the use cases is
for writing to stable memory, and any memory errors would then take
synchronous faults.

So __copy_user_nocache() does look a lot like copy_from_user(), but has
faulting behavior that is more akin to our old copy_in_user() (which no
longer exists, but copied from user space to user space and could fault
on both source and destination).

And it very much does not have the "zero the end of the destination
buffer", since a problem with the destination buffer is very possibly
the very source of the partial copy.

So this whole thing was just a confusing historical artifact from having
shared some code with a completely different function with completely
different use cases.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-19 19:09:52 -07:00
Linus Torvalds 427fda2c8a x86: improve on the non-rep 'copy_user' function
The old 'copy_user_generic_unrolled' function was oddly implemented for
largely historical reasons: it had been largely based on the uncached
copy case, which has some other concerns.

For example, the __copy_user_nocache() function uses 'movnti' for the
destination stores, and those want the destination to be aligned.  In
contrast, the regular copy function doesn't really care, and trying to
align things only complicates matters.

Also, like the clear_user function, the copy function had some odd
handling of the repeat counts, complicating the exception handling for
no really good reason.  So as with clear_user, just write it to keep all
the byte counts in the %rcx register, exactly like the 'rep movs'
functionality that this replaces.

Unlike a real 'rep movs', we do allow for this to trash a few temporary
registers to not have to unnecessarily save/restore registers on the
stack.

And like the clearing case, rename this to what it now clearly is:
'rep_movs_alternative', and make it one coherent function, so that it
shows up as such in profiles (instead of the odd split between
"copy_user_generic_unrolled" and "copy_user_short_string", the latter of
which was not about strings at all, and which was shared with the
uncached case).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds 8c9b6a88b7 x86: improve on the non-rep 'clear_user' function
The old version was oddly written to have the repeat count in multiple
registers.  So instead of taking advantage of %rax being zero, it had
some sub-counts in it.  All just for a "single word clearing" loop,
which isn't even efficient to begin with.

So get rid of those games, and just keep all the state in the same
registers we got it in (and that we should return things in).  That not
only makes this act much more like 'rep stos' (which this function is
replacing), but makes it much easier to actually do the obvious loop
unrolling.

Also rename the function from the now nonsensical 'clear_user_original'
to what it now clearly is: 'rep_stos_alternative'.

End result: if we don't have a fast 'rep stosb', at least we can have a
fast fallback for it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds 577e6a7fd5 x86: inline the 'rep movs' in user copies for the FSRM case
This does the same thing for the user copies as commit 0db7058e8e
("x86/clear_user: Make it faster") did for clear_user().  In other
words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set,
avoiding the function call entirely.

In order to do that, it makes the calling convention for the out-of-line
case ("copy_user_generic_unrolled") match the 'rep movs' calling
convention, although it does also end up clobbering a number of
additional registers.

Also, to simplify code sharing in the low-level assembly with the
__copy_user_nocache() function (that uses the normal C calling
convention), we end up with a kind of mixed return value for the
low-level asm code: it will return the result in both %rcx (to work as
an alternative for the 'rep movs' case), _and_ in %rax (for the nocache
case).

We could avoid this by wrapping __copy_user_nocache() callers in an
inline asm, but since the cost is just an extra register copy, it's
probably not worth it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds 3639a53558 x86: move stac/clac from user copy routines into callers
This is preparatory work for inlining the 'rep movs' case, but also a
cleanup.  The __copy_user_nocache() function was mis-used by the rdma
code to do uncached kernel copies that don't actually want user copies
at all, and as a result doesn't want the stac/clac either.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds d2c95f9d68 x86: don't use REP_GOOD or ERMS for user memory clearing
The modern target to use is FSRS (Fast Short REP STOS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).

Note! This changes the conditional for the inlining from FSRM ("fast
short rep movs") to FSRS ("fast short rep stos").

We'll have a separate fixup for AMD microarchitectures that have a good
'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM
was there first).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds adfcf4231b x86: don't use REP_GOOD or ERMS for user memory copies
The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds 20f3337d35 x86: don't use REP_GOOD or ERMS for small memory clearing
The modern target to use is FSRS (Fast Short REP STOS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Linus Torvalds 68674f94ff x86: don't use REP_GOOD or ERMS for small memory copies
The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
copying and clearing).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-18 17:05:28 -07:00
Josh Poimboeuf fb799447ae x86,objtool: Split UNWIND_HINT_EMPTY in two
Mark reported that the ORC unwinder incorrectly marks an unwind as
reliable when the unwind terminates prematurely in the dark corners of
return_to_handler() due to lack of information about the next frame.

The problem is UNWIND_HINT_EMPTY is used in two different situations:

  1) The end of the kernel stack unwind before hitting user entry, boot
     code, or fork entry

  2) A blind spot in ORC coverage where the unwinder has to bail due to
     lack of information about the next frame

The ORC unwinder has no way to tell the difference between the two.
When it encounters an undefined stack state with 'end=1', it blindly
marks the stack reliable, which can break the livepatch consistency
model.

Fix it by splitting UNWIND_HINT_EMPTY into UNWIND_HINT_UNDEFINED and
UNWIND_HINT_END_OF_STACK.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/fd6212c8b450d3564b855e1cb48404d6277b4d9f.1677683419.git.jpoimboe@kernel.org
2023-03-23 23:18:58 +01:00
Kirill A. Shutemov b19b74bc99 x86/mm: Rework address range check in get_user() and put_user()
The functions get_user() and put_user() check that the target address
range resides in the user space portion of the virtual address space.
In order to perform this check, the functions compare the end of the
range against TASK_SIZE_MAX.

For kernels compiled with CONFIG_X86_5LEVEL, this process requires some
additional trickery using ALTERNATIVE, as TASK_SIZE_MAX depends on the
paging mode in use.

Linus suggested that this check could be simplified for 64-bit kernels.
It is sufficient to check bit 63 of the address to ensure that the range
belongs to user space. Additionally, the use of branches can be avoided
by setting the target address to all ones if bit 63 is set.

There's no need to check the end of the access range as there's huge
gap between end of userspace range and start of the kernel range. The
gap consists of canonical hole and unused ranges on both kernel and
userspace sides.

If an address with bit 63 set is passed down, it will trigger a #GP
exception. _ASM_EXTABLE_UA() complains about this. Replace it with
plain _ASM_EXTABLE() as it is expected behaviour now.

The updated get_user() and put_user() checks are also compatible with
Linear Address Masking, which allows user space to encode metadata in
the upper bits of pointers and eliminates the need to untag the address
before handling it.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230312112612.31869-2-kirill.shutemov%40linux.intel.com
2023-03-16 13:08:38 -07:00
Ira Weiny b9da86e3aa x86/uaccess: Remove memcpy_page_flushcache()
Commit 21b56c8477 ("iov_iter: get rid of separate bvec and xarray
callbacks") removed the calls to memcpy_page_flushcache().

In addition, memcpy_page_flushcache() uses the deprecated
kmap_atomic().

Remove the unused x86 memcpy_page_flushcache() implementation and
also get rid of one more kmap_atomic() user.

[ dhansen: tweak changelog ]

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221230-kmap-x86-v1-1-15f1ecccab50%40intel.com
2023-03-16 08:27:57 -07:00
Linus Torvalds 877934769e - Cache the AMD debug registers in per-CPU variables to avoid MSR writes
where possible, when supporting a debug registers swap feature for
   SEV-ES guests
 
 - Add support for AMD's version of eIBRS called Automatic IBRS which is
   a set-and-forget control of indirect branch restriction speculation
   resources on privilege change
 
 - Add support for a new x86 instruction - LKGS - Load kernel GS which is
   part of the FRED infrastructure
 
 - Reset SPEC_CTRL upon init to accomodate use cases like kexec which
   rediscover
 
 - Other smaller fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmP1RDIACgkQEsHwGGHe
 VUohBw//ZB9ZRqsrKdm6D9YaP2x4Zb+kqKqo6rjYeWaYqyPyCwDujPwh+pb3Oq1t
 aj62muDv1t/wEJc8mKNkfXkjEEtBVAOcpb5YIpKreoEvNKyevol83Ih0u5iJcTRE
 E5qf8HDS8b/JZrcazJJLl6WQmQNH5RiKSu5bbCpRhoeOcyo5pRYR5MztK9vNmAQk
 GMdwHsUSU+jN8uiE4HnpaOb/luhgFindRwZVTpdjJegQWLABS8cl3CKeTv4+PW45
 isvv37XnQP248wsptIEVRHeG6g3g/HtvwRx7DikUw06QwUyUK7H9hJssOoSP8TL9
 u4psRwfWnJ1OxU6klL+s0Ii+pjQ97wXmK/oqK7QkdUwhWqR/mQAW2e9kWHAngyDn
 A6mKbzSM6HFAeSXQpB9cMb6uvYRD44SngDFe3WXtEK8jiiQ70ikUm4E28I5KJOPg
 s+RyioHk0NFRHYSOOBqNG1NKz6ED7L3GbgbbzxkgMh21AAyI3X351t+PtGoLV5ew
 eqOsM7lbg9Scg1LvPk1JcoALS8USWqgar397rz9qGUs+OkPWBtEBCmTdMz/Eb+2t
 g/WHdLS5/ajSs5gNhT99W3DeqZMPDEkgBRSeyBBmY3CUD3gBL2wXEktRXv504zBR
 RC4oyUPX3c9E2ib6GATLE3kBLbcz9hTWbMxF+X3lLJvTVd/Qc2o=
 =v/ZC
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

 - Cache the AMD debug registers in per-CPU variables to avoid MSR
   writes where possible, when supporting a debug registers swap feature
   for SEV-ES guests

 - Add support for AMD's version of eIBRS called Automatic IBRS which is
   a set-and-forget control of indirect branch restriction speculation
   resources on privilege change

 - Add support for a new x86 instruction - LKGS - Load kernel GS which
   is part of the FRED infrastructure

 - Reset SPEC_CTRL upon init to accomodate use cases like kexec which
   rediscover

 - Other smaller fixes and cleanups

* tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd: Cache debug register values in percpu variables
  KVM: x86: Propagate the AMD Automatic IBRS feature to the guest
  x86/cpu: Support AMD Automatic IBRS
  x86/cpu, kvm: Add the SMM_CTL MSR not present feature
  x86/cpu, kvm: Add the Null Selector Clears Base feature
  x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf
  x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature
  KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
  x86/cpu, kvm: Add support for CPUID_80000021_EAX
  x86/gsseg: Add the new <asm/gsseg.h> header to <asm/asm-prototypes.h>
  x86/gsseg: Use the LKGS instruction if available for load_gs_index()
  x86/gsseg: Move load_gs_index() to its own new header file
  x86/gsseg: Make asm_load_gs_index() take an u16
  x86/opcode: Add the LKGS instruction to x86-opcode-map
  x86/cpufeature: Add the CPU feature bit for LKGS
  x86/bugs: Reset speculation control settings on init
  x86/cpu: Remove redundant extern x86_read_arch_cap_msr()
2023-02-21 14:51:40 -08:00
Linus Torvalds 6be3dafc7c Header fixes and a DocBook fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbpoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iUXw//daoNLMPL0EffCGdTtlM0th52C9MBPpqA
 OGXlurhCepDZek/1bKyAlK+5jqHXpMmFjk+wYWa+xPVN+FzKGzt3uGeFn/elJcYw
 7/3VJkbyibzR7uUq20KPknRqZiXEQ8/ZqVW/4oC1EjAAAWYZDWJKPwkK4DsS81RX
 dDWHdvl8dnKkcScmYOvFMQCbsPLVx1OTinQa/SBX7S4BLqAo+EoTXSNu/LXlBHtS
 FkykM2MDKW+wU3HN2iYaHk8ZWiZHX9/jHK1NDv+Uf6VyXPT2bBU1A6L3vpNDfncF
 pXFw00pe/sK7/9RbqKPoGK9KW8KFkwLqICV44Hj81srsorKjhwGqz6YRiEHxw5CX
 SY2JWHleZCYK9zzjwAb+zx72OJzHs/Pn/EYgHc3nPBAag7rsKajmIB6tMuo7/YW6
 cyoLAVk5szEYbvgQiAHs8KWHxNrKt8Pj77f77K38OW+emiIF8J9hzbtdN9nanHus
 Jg3vtJ1mPkUDKpbRhgMv/sSz7+vztm2iux346GOw55spjgWk2t1AM7RnRUSt66EW
 X2HE6iDqwjYCkFRCUVZHbtMHzW/dKd+ymVMDqVY+eaquMTGIw1b/5u/RYkm2i2Ic
 YGwxj/rrOJCjR3sPkxBkfYrnSEGj0wCHUGbXkdoFTu0v6AwuYzeL2i6vp0hXWZ/C
 WTPFwHr1VII=
 =gXpS
 -----END PGP SIGNATURE-----

Merge tag 'x86-asm-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Ingo Molnar:
 "Header fixes and a DocBook fix"

* tag 'x86-asm-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Fix compiler and kernel-doc warnings
  x86/lib: Include <asm/misc.h> to fix a missing prototypes warning at build time
2023-02-20 18:25:30 -08:00
Ingo Molnar 57a30218fa Linux 6.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
 E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
 nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
 TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
 s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
 ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
 SDc/Y/c=
 =zpaD
 -----END PGP SIGNATURE-----

Merge tag 'v6.2-rc6' into sched/core, to pick up fixes

Pick up fixes before merging another batch of cpuidle updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31 15:01:20 +01:00
Peter Zijlstra 69d4c0d321 entry, kasan, x86: Disallow overriding mem*() functions
KASAN cannot just hijack the mem*() functions, it needs to emit
__asan_mem*() variants if it wants instrumentation (other sanitizers
already do this).

  vmlinux.o: warning: objtool: sync_regs+0x24: call to memcpy() leaves .noinstr.text section
  vmlinux.o: warning: objtool: vc_switch_off_ist+0xbe: call to memcpy() leaves .noinstr.text section
  vmlinux.o: warning: objtool: fixup_bad_iret+0x36: call to memset() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __sev_get_ghcb+0xa0: call to memcpy() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __sev_put_ghcb+0x35: call to memcpy() leaves .noinstr.text section

Remove the weak aliases to ensure nobody hijacks these functions and
add them to the noinstr section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195542.028523143@infradead.org
2023-01-13 11:48:17 +01:00
H. Peter Anvin (Intel) 5a91f12660 x86/opcode: Add the LKGS instruction to x86-opcode-map
Add the instruction opcode used by LKGS to x86-opcode-map.

Opcode number is per public FRED draft spec v3.0.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230112072032.35626-3-xin3.li@intel.com
2023-01-12 13:06:36 +01:00
Anuradha Weeraman 65c24d7b0f x86/lib: Fix compiler and kernel-doc warnings
Fix the following W=1 warnings:

arch/x86/lib/cmdline.c:

  - Include <asm/cmdline.h> to fix missing-prototypes warnings.

  - Update comment for __cmdline_find_option_bool to fix a kernel-doc warning.

Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230103114725.108431-1-anuradha@debian.org
2023-01-03 18:46:21 +01:00
Jason A. Donenfeld 72bb8f8cc0 x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type
Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants,
whose namespace overlaps.

Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be
used from the same source file.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2023-01-03 18:46:06 +01:00
Mikulas Patocka 55d235361f x86/asm: Fix an assembler warning with current binutils
Fix a warning: "found `movsd'; assuming `movsl' was meant"

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-01-03 17:55:11 +01:00
Anuradha Weeraman 4651b9438b x86/lib: Include <asm/misc.h> to fix a missing prototypes warning at build time
Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230103034637.74679-1-anuradha@debian.org
2023-01-03 11:11:03 +01:00
Linus Torvalds 94a855111e - Add the call depth tracking mitigation for Retbleed which has
been long in the making. It is a lighterweight software-only fix for
 Skylake-based cores where enabling IBRS is a big hammer and causes a
 significant performance impact.
 
 What it basically does is, it aligns all kernel functions to 16 bytes
 boundary and adds a 16-byte padding before the function, objtool
 collects all functions' locations and when the mitigation gets applied,
 it patches a call accounting thunk which is used to track the call depth
 of the stack at any time.
 
 When that call depth reaches a magical, microarchitecture-specific value
 for the Return Stack Buffer, the code stuffs that RSB and avoids its
 underflow which could otherwise lead to the Intel variant of Retbleed.
 
 This software-only solution brings a lot of the lost performance back,
 as benchmarks suggest:
 
   https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
 
 That page above also contains a lot more detailed explanation of the
 whole mechanism
 
 - Implement a new control flow integrity scheme called FineIBT which is
 based on the software kCFI implementation and uses hardware IBT support
 where present to annotate and track indirect branches using a hash to
 validate them
 
 - Other misc fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
 VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
 VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
 gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
 iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
 Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
 ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
 ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
 73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
 i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
 Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
 zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
 =cRy1
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Add the call depth tracking mitigation for Retbleed which has been
   long in the making. It is a lighterweight software-only fix for
   Skylake-based cores where enabling IBRS is a big hammer and causes a
   significant performance impact.

   What it basically does is, it aligns all kernel functions to 16 bytes
   boundary and adds a 16-byte padding before the function, objtool
   collects all functions' locations and when the mitigation gets
   applied, it patches a call accounting thunk which is used to track
   the call depth of the stack at any time.

   When that call depth reaches a magical, microarchitecture-specific
   value for the Return Stack Buffer, the code stuffs that RSB and
   avoids its underflow which could otherwise lead to the Intel variant
   of Retbleed.

   This software-only solution brings a lot of the lost performance
   back, as benchmarks suggest:

       https://lore.kernel.org/all/20220915111039.092790446@infradead.org/

   That page above also contains a lot more detailed explanation of the
   whole mechanism

 - Implement a new control flow integrity scheme called FineIBT which is
   based on the software kCFI implementation and uses hardware IBT
   support where present to annotate and track indirect branches using a
   hash to validate them

 - Other misc fixes and cleanups

* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  x86/paravirt: Use common macro for creating simple asm paravirt functions
  x86/paravirt: Remove clobber bitmask from .parainstructions
  x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
  x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
  x86/Kconfig: Enable kernel IBT by default
  x86,pm: Force out-of-line memcpy()
  objtool: Fix weak hole vs prefix symbol
  objtool: Optimize elf_dirty_reloc_sym()
  x86/cfi: Add boot time hash randomization
  x86/cfi: Boot time selection of CFI scheme
  x86/ibt: Implement FineIBT
  objtool: Add --cfi to generate the .cfi_sites section
  x86: Add prefix symbols for function padding
  objtool: Add option to generate prefix symbols
  objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
  objtool: Slice up elf_create_section_symbol()
  kallsyms: Revert "Take callthunks into account"
  x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
  x86/retpoline: Fix crash printing warning
  x86/paravirt: Fix a !PARAVIRT build warning
  ...
2022-12-14 15:03:00 -08:00
Linus Torvalds 8b9ed79c2d - Move the 32-bit memmove() asm implementation out-of-line in order to
fix a 32-bit full LTO build failure with clang where it would fail at
 register allocation. Move it to an asm file and clean it up while at it,
 similar to what has been already done on 64-bit
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYYQMACgkQEsHwGGHe
 VUqs3xAAvc1nsW0FiR3DWE7+k9jti8OwDMRt51n/zX9wjVrYeJM/6pLNqu+rBnLQ
 A18dhRMVm2QKcxh06bcth/B1V0zvYUMRaOrucjAMkBQ/XAVsEEt3ZSgklj7rYm1S
 phG+YhzxJWP9KdmiN29kfBj6kS9QdYYHQyca1KWxTlggWSIwlCvIzQ8k9ip8s/5W
 9fxiU/P9KA8a4aEkzeQ+waytylus8Swbl19zjnmfHswCi+l2ekS9PfXDE190ec01
 ePSLn2dX1bbPCIIvKD7z6gWR7BOJLpVTv2ftg4XzcozZ7ebBaaQxXczhTvUPOtG0
 NOuYTz4hI+AcG5QZU8JgGbDXqS//NLIklwydfXgS1fNiwudmO62Y734mL8MV0aw7
 FVy7zTXamHb0D10ES0e9tL5GLSTYs0NZ6/193PNERUEdUWj/Ix9FPbuw53DYs3gh
 7BkZPRsrOJYmAtTG5fg44SMyG2MHsfsU7+3Xl0nHUKihgdZi7YdmuhLqXnhczCwe
 HvQFgUF3d5oRjcApYLJKHP88XnPH7o5ogIRXL1WEiZ0+SgChAzckKdVjjYrd3FLq
 ppnr7uZQCJaT4YoL+zqcQEQD0ukC4BfhlxmXj1IuSlpqnIXOf+JD0WOF3/twErLm
 g8PE+PYkm+am6TAFDDU/45/DqpEJ+vCD1SJNfRtDeS4piFV0OzQ=
 =hby2
 -----END PGP SIGNATURE-----

Merge tag 'x86_asm_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 asm updates from Borislav Petkov:

 - Move the 32-bit memmove() asm implementation out-of-line in order to
   fix a 32-bit full LTO build failure with clang where it would fail at
   register allocation.

   Move it to an asm file and clean it up while at it, similar to what
   has been already done on 64-bit

* tag 'x86_asm_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mem: Move memmove to out of line assembler
2022-12-13 14:40:54 -08:00
Ingo Molnar 0ce096db71 Linux 6.1-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
 Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
 FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
 1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
 qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
 WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
 1GfXkXs=
 =gt6P
 -----END PGP SIGNATURE-----

Merge tag 'v6.1-rc6' into x86/core, to resolve conflicts

Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c:

 # upstream:
 debc5a1ec0 ("KVM: x86: use a separate asm-offsets.c file")

 # retbleed work in x86/core:
 5d8213864a ("x86/retbleed: Add SKL return thunk")

... and these commits in include/linux/bpf.h:

  # upstram:
  18acb7fac2 ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")")

  # x86/core commits:
  931ab63664 ("x86/ibt: Implement FineIBT")
  bea75b3389 ("x86/Kconfig: Introduce function padding")

The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream.

 Conflicts:
	arch/x86/kernel/asm-offsets.c
	include/linux/bpf.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-11-21 23:01:51 +01:00
Alexander Potapenko 11385b2612 x86/uaccess: instrument copy_from_user_nmi()
Make sure usercopy hooks from linux/instrumented.h are invoked for
copy_from_user_nmi().  This fixes KMSAN false positives reported when
dumping opcodes for a stack trace.

Link: https://lkml.kernel.org/r/20221102110611.1085175-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:24 -08:00
Nick Desaulniers bce5a1e8a3 x86/mem: Move memmove to out of line assembler
When building ARCH=i386 with CONFIG_LTO_CLANG_FULL=y, it's possible
(depending on additional configs which I have not been able to isolate)
to observe a failure during register allocation:

  error: inline assembly requires more registers than available

when memmove is inlined into tcp_v4_fill_cb() or tcp_v6_fill_cb().

memmove is quite large and probably shouldn't be inlined due to size
alone. A noinline function attribute would be the simplest fix, but
there's a few things that stand out with the current definition:

In addition to having complex constraints that can't always be resolved,
the clobber list seems to be missing %bx. By using numbered operands
rather than symbolic operands, the constraints are quite obnoxious to
refactor.

Having a large function be 99% inline asm is a code smell that this
function should simply be written in stand-alone out-of-line assembler.

Moving this to out of line assembler guarantees that the
compiler cannot inline calls to memmove.

This has been done previously for 64b:
commit 9599ec0471 ("x86-64, mem: Convert memmove() to assembly file
and fix return value bug")

That gives the opportunity for other cleanups like fixing the
inconsistent use of tabs vs spaces and instruction suffixes, and the
label 3 appearing twice.  Symbolic operands, local labels, and
additional comments would provide this code with a fresh coat of paint.

Finally, add a test that tickles the `rep movsl` implementation to test
it for correctness, since it has implicit operands.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/all/20221018172155.287409-1-ndesaulniers%40google.com
2022-11-01 15:44:07 -07:00
Thomas Gleixner f5c1bb2afe x86/calldepth: Add ret/call counting for debug
Add a debuigfs mechanism to validate the accounting, e.g. vs. call/ret
balance and to gather statistics about the stuffing to call ratio.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.204285506@infradead.org
2022-10-17 16:41:16 +02:00
Peter Zijlstra 3b6c1747da x86/retpoline: Add SKL retthunk retpolines
Ensure that retpolines do the proper call accounting so that the return
accounting works correctly.

Specifically; retpolines are used to replace both 'jmp *%reg' and
'call *%reg', however these two cases do not have the same accounting
requirements. Therefore split things up and provide two different
retpoline arrays for SKL.

The 'jmp *%reg' case needs no accounting, the
__x86_indirect_jump_thunk_array[] covers this. The retpoline is
changed to not use the return thunk; it's a simple call;ret construct.

[ strictly speaking it should do:
	andq $(~0x1f), PER_CPU_VAR(__x86_call_depth)
  but we can argue this can be covered by the fuzz we already have
  in the accounting depth (12) vs the RSB depth (16) ]

The 'call *%reg' case does need accounting, the
__x86_indirect_call_thunk_array[] covers this. Again, this retpoline
avoids the use of the return-thunk, in this case to avoid double
accounting.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.996634749@infradead.org
2022-10-17 16:41:15 +02:00
Thomas Gleixner 5d8213864a x86/retbleed: Add SKL return thunk
To address the Intel SKL RSB underflow issue in software it's required to
do call depth tracking.

Provide a return thunk for call depth tracking on Intel SKL CPUs.

The tracking does not use a counter. It uses uses arithmetic shift
right on call entry and logical shift left on return.

The depth tracking variable is initialized to 0x8000.... when the call
depth is zero. The arithmetic shift right sign extends the MSB and
saturates after the 12th call. The shift count is 5 so the tracking covers
12 nested calls. On return the variable is shifted left logically so it
becomes zero again.

 CALL	 	   	RET
 0: 0x8000000000000000	0x0000000000000000
 1: 0xfc00000000000000	0xf000000000000000
...
11: 0xfffffffffffffff8	0xfffffffffffffc00
12: 0xffffffffffffffff	0xffffffffffffffe0

After a return buffer fill the depth is credited 12 calls before the next
stuffing has to take place.

There is a inaccuracy for situations like this:

   10 calls
    5 returns
    3 calls
    4 returns
    3 calls
    ....

The shift count might cause this to be off by one in either direction, but
there is still a cushion vs. the RSB depth. The algorithm does not claim to
be perfect, but it should obfuscate the problem enough to make exploitation
extremly difficult.

The theory behind this is:

RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguide the
speculation path to a gadget.

Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB which are never getting a corresponding
return executed. This stalls the prediction path until it gets resteered,

The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers. tried to attack this approach but failed.

There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about it.

The SAR/SHL usage was suggested by Andi Kleen.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.890071690@infradead.org
2022-10-17 16:41:15 +02:00
Thomas Gleixner cb855971d7 x86/putuser: Provide room for padding
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111146.746429822@infradead.org
2022-10-17 16:41:10 +02:00
Peter Zijlstra 1934dc9a8a x86/error_inject: Align function properly
Ensure inline asm functions are consistently aligned with compiler
generated and SYM_FUNC_START*() functions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.930201368@infradead.org
2022-10-17 16:40:59 +02:00