linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-15 23:25:07 +00:00

Ingo Molnar be6cb02779 x86: Align jump targets to 1-byte boundaries

The following NOP in a hot function caught my attention:

  >   5a:   66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the
default 16-byte alignment that GCC applies for jump targets.

I realize that x86 CPU manufacturers recommend 16-byte jump
target alignments (it's in the Intel optimization manual),
to help their relatively narrow decoder prefetch alignment
and uop cache constraints, but the cost of that is very
significant:

      text     data      bss       dec  filename
  12566391  1617840  1089536  15273767  vmlinux.align.16-byte
  12224951  1617840  1089536  14932327  vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no alignment at all)
we get an almost 3% reduction in kernel size (!) - and a
probably similar reduction in I$ footprint.

Now, the usual justification for jump target alignment is the
following:

 - modern decoders tend to have 16-byte (effective) decoder
   prefetch windows. (AMD documents it higher but measurements
   suggest the effective prefetch window on current uarchs is
   still around 16 bytes)

 - on Intel there's also the uop-cache with cachelines that have
   16-byte granularity and limited associativity.

 - older x86 uarchs had a penalty for decoder fetches that crossed
   16-byte boundaries. These limits are mostly gone from recent
   uarchs.

So if a forward jump target is aligned to cacheline boundary then
prefetches will start from a new prefetch-cacheline and there's
higher chance for decoding in fewer steps and packing tightly.

But I think that argument is flawed for typical optimized kernel
code flows: forward jumps often go to 'cold' (uncommon) pieces
of code, and aligning cold code to cache lines does not bring a
lot of advantages (they are uncommon), while it causes
collateral damage:

 - their alignment 'spreads out' the cache footprint, it shifts
   followup hot code further out

 - plus it slows down even 'cold' code that immediately follows 'hot'
   code (like in the above case), which could have benefited from the
   partial cacheline that comes off the end of hot code.

But even in the cache-hot case the 16 byte alignment brings
disadvantages:

 - it spreads out the cache footprint, possibly making the code
   fall out of the L1 I$.

 - On Intel CPUs, recent microarchitectures have plenty of
   uop cache (typically doubling every 3 years) - while the
   size of the L1 cache grows much less aggressively. So
   workloads are rarely uop cache limited.

The only situation where alignment might matter are tight
loops that could fit into a single 16 byte chunk - but those
are pretty rare in the kernel: if they exist they tend
to be pointer chasing or generic memory ops, which both tend
to be cache miss (or cache allocation) intensive and are not
decoder bandwidth limited.

So the balance of arguments strongly favors packing kernel
instructions tightly versus maximizing for decoder bandwidth:
this patch changes the jump target alignment from 16 bytes
to 1 byte (tightly packed, unaligned).

Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-15 11:04:28 +02:00
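The two pieces of arithmetic in the commit message can be checked with a short Python sketch. It is not part of the patch. The helper names `padding_to` and `reduction_pct` are hypothetical, and the sketch assumes the quoted `nopw` exists only to pad the next instruction up to a 16-byte boundary, as the message describes. The size figures are copied from the table in the message.

```python
# Size figures copied from the commit message's table.
TEXT_16 = 12566391   # text, vmlinux.align.16-byte
TEXT_1 = 12224951    # text, vmlinux.align.1-byte
DEC_16 = 15273767    # text + data + bss, 16-byte build
DEC_1 = 14932327     # text + data + bss, 1-byte build


def padding_to(addr, align):
    """Bytes of padding needed so that addr lands on a multiple of align.

    Hypothetical helper for illustration only.
    """
    return -addr % align


def reduction_pct(before, after):
    """Relative size reduction, in percent of the 'before' size."""
    return 100.0 * (before - after) / before


# The NOP quoted in the message sits at offset 0x5a and is 6 bytes long
# (66 0f 1f 44 00 00), so the next instruction starts at 0x60.
nop_pad = padding_to(0x5a, 16)

saved = TEXT_16 - TEXT_1
text_pct = reduction_pct(TEXT_16, TEXT_1)
dec_pct = reduction_pct(DEC_16, DEC_1)

print(f"padding at 0x5a for 16-byte alignment: {nop_pad} bytes")
print(f"padding at 0x5a for 1-byte alignment:  {padding_to(0x5a, 1)} bytes")
print(f"text saved: {saved} bytes ({text_pct:.2f}% of text, {dec_pct:.2f}% of total)")
```

The 6 bytes of padding match the length of the quoted NOP encoding. The text segment shrinks by about 2.7%, which is the "almost 3%" the message refers to. Data and bss are identical in both builds, so the whole saving comes from text.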
..
boot	* Avoid garbage names in efivarfs due to buggy firmware by zero'ing	2015-05-06 08:30:24 +02:00
configs	x86/build/defconfig: Enable USB_EHCI_TT_NEWSCHED=y	2015-02-19 02:21:14 +01:00
crypto	crypto: x86/sha512_ssse3 - fixup for asm function prototype change	2015-04-24 20:09:01 +08:00
ia32	x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code	2015-05-08 13:50:02 +02:00
include	x86/asm/uaccess: Unify the ALIGN_DESTINATION macro	2015-05-14 07:25:34 +02:00
kernel	x86/alternatives: Switch AMD F15h and later to the P6 NOPs	2015-05-11 10:26:05 +02:00
kvm	kvm: x86: fix kvmclock update protocol	2015-04-27 15:48:59 +02:00
lguest	x86/asm/entry: Remove SYSCALL_VECTOR	2015-05-10 12:34:28 +02:00
lib	x86/asm/uaccess: Get rid of copy_user_nocache_64.S	2015-05-14 07:25:35 +02:00
math-emu
mm	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-05-06 10:57:37 -07:00
net	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2014-12-10 15:48:20 -05:00
oprofile	x86/asm/entry: Change all 'user_mode_vm()' calls to 'user_mode()'	2015-03-23 11:14:17 +01:00
pci	x86/PCI/ACPI: Make all resources except [io 0xcf8-0xcff] available on PCI bus	2015-04-30 22:17:34 +02:00
platform	TTY/Serial patches for 4.1-rc1	2015-04-21 09:33:10 -07:00
power	x86/asm, x86/power/hibernate: Use local labels in asm	2015-04-15 11:37:51 +02:00
purgatory	Merge branches 'x86-build-for-linus', 'x86-cleanups-for-linus' and 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2014-12-10 12:35:46 -08:00
realmode	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-16 14:58:12 -08:00
syscalls	xen: features and fixes for 4.1-rc0	2015-04-16 14:01:03 -05:00
tools	x86, build: replace Perl script with Shell script	2015-01-26 13:37:18 -08:00
um	Merge branch 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc	2015-04-15 13:53:55 -07:00
vdso	x86: pvclock: Really remove the sched notifier for cross-cpu migrations	2015-04-27 15:49:30 +02:00
video
xen	x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code	2015-05-08 13:50:02 +02:00
.gitignore	x86/build: Add arch/x86/purgatory/ make generated files to gitignore	2014-10-09 09:29:46 +02:00
Kbuild
Kconfig	Initial ACPI support for arm64:	2015-04-24 08:23:45 -07:00
Kconfig.cpu
Kconfig.debug	x86, intel-mid: remove Intel MID specific serial support	2015-03-07 03:25:18 +01:00
Makefile	x86: Align jump targets to 1-byte boundaries	2015-05-15 11:04:28 +02:00
Makefile.um	kbuild: use relative path more to include Makefile	2015-04-02 16:42:08 +02:00
Makefile_32.cpu