linux-stable/arch/x86
Ingo Molnar be6cb02779 x86: Align jump targets to 1-byte boundaries
The following NOP in a hot function caught my attention:

  >   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the
default 16-byte alignment that GCC applies for jump targets.
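
As a quick check of why that NOP is exactly 6 bytes long, here is a minimal Python sketch of the padding arithmetic. It is not part of the patch; the helper name `padding_to_align` is hypothetical and used only for illustration.

```python
# Illustrative only: how many padding bytes are needed to move an offset
# up to the next multiple of a given alignment.
def padding_to_align(offset: int, alignment: int) -> int:
    """Return the number of padding bytes needed to align `offset`."""
    return -offset % alignment

nop_offset = 0x5a  # offset of the nopw in the disassembly above
nop_length = 6     # 66 0f 1f 44 00 00 is a 6-byte encoding

pad = padding_to_align(nop_offset, 16)
print(pad)                              # 6
print(hex(nop_offset + pad))            # 0x60
print(padding_to_align(nop_offset, 1))  # 0
```

So the NOP fills the gap from 0x5a up to the 16-byte boundary at 0x60, and with 1-byte alignment no padding would be emitted at that spot.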

I realize that x86 CPU manufacturers recommend 16-byte jump
target alignments (it's in the Intel optimization manual),
to help their relatively narrow decoder prefetch alignment
and uop cache constraints, but the cost of that is very
significant:

        text           data       bss         dec      filename
    12566391        1617840   1089536    15273767      vmlinux.align.16-byte
    12224951        1617840   1089536    14932327      vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no alignment at all)
we get an almost 3% reduction in kernel text size (!) - and
probably a similar reduction in I$ footprint.
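
The numbers in the table can be checked with a few lines of Python. This is only a sketch of the arithmetic; the 64-byte cache line size is an assumption (typical for x86) and does not come from the measurements above.

```python
# Sizes copied from the vmlinux table above (bytes).
text_16, data_16, bss_16, dec_16 = 12566391, 1617840, 1089536, 15273767
text_1, data_1, bss_1, dec_1 = 12224951, 1617840, 1089536, 14932327

# 'dec' is the sum of the three sections in both builds.
assert text_16 + data_16 + bss_16 == dec_16
assert text_1 + data_1 + bss_1 == dec_1

saved = text_16 - text_1
pct = 100 * saved / text_16

CACHE_LINE = 64  # assumed cache line size in bytes, not taken from the table
lines_saved = saved / CACHE_LINE

print(saved)            # 341440
print(f"{pct:.2f}%")    # 2.72%
print(lines_saved)      # 5335.0
```

Only the text section changes between the two builds, so the whole saving comes from dropped alignment padding.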

Now, the usual justification for jump target alignment is the
following:

 - modern decoders tend to have 16-byte (effective) decoder
   prefetch windows. (AMD documents it higher but measurements
   suggest the effective prefetch window on current uarchs is
   still around 16 bytes)

 - on Intel there's also the uop-cache with cachelines that have
   16-byte granularity and limited associativity.

 - older x86 uarchs had a penalty for decoder fetches that crossed
   16-byte boundaries. These limits are mostly gone from recent
   uarchs.

So if a forward jump target is aligned to a cacheline boundary then
prefetches will start from a new prefetch-cacheline and there's
a higher chance of decoding in fewer steps and packing tightly.

But I think that argument is flawed for typical optimized kernel
code flows: forward jumps often go to 'cold' (uncommon) pieces
of code, and aligning cold code to cache lines does not bring a
lot of advantages (they are uncommon), while it causes
collateral damage:

 - their alignment 'spreads out' the cache footprint, it shifts
   followup hot code further out

 - plus it slows down even 'cold' code that immediately follows 'hot'
   code (like in the above case), which could have benefited from the
   partial cacheline that comes off the end of hot code.

But even in the cache-hot case the 16-byte alignment brings
disadvantages:

 - it spreads out the cache footprint, possibly making the code
   fall out of the L1 I$.

 - On Intel CPUs, recent microarchitectures have plenty of
   uop cache (typically doubling every 3 years) - while the
   size of the L1 cache grows much less aggressively. So
   workloads are rarely uop cache limited.

The only situation where alignment might matter is tight
loops that could fit into a single 16-byte chunk - but those
are pretty rare in the kernel: if they exist they tend
to be pointer chasing or generic memory ops, which both tend
to be cache miss (or cache allocation) intensive and are not
decoder bandwidth limited.

So the balance of arguments strongly favors packing kernel
instructions tightly versus maximizing for decoder bandwidth:
this patch changes the jump target alignment from 16 bytes
to 1 byte (tightly packed, unaligned).

Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-15 11:04:28 +02:00
boot * Avoid garbage names in efivarfs due to buggy firmware by zero'ing 2015-05-06 08:30:24 +02:00
configs x86/build/defconfig: Enable USB_EHCI_TT_NEWSCHED=y 2015-02-19 02:21:14 +01:00
crypto crypto: x86/sha512_ssse3 - fixup for asm function prototype change 2015-04-24 20:09:01 +08:00
ia32 x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code 2015-05-08 13:50:02 +02:00
include x86/asm/uaccess: Unify the ALIGN_DESTINATION macro 2015-05-14 07:25:34 +02:00
kernel x86/alternatives: Switch AMD F15h and later to the P6 NOPs 2015-05-11 10:26:05 +02:00
kvm kvm: x86: fix kvmclock update protocol 2015-04-27 15:48:59 +02:00
lguest x86/asm/entry: Remove SYSCALL_VECTOR 2015-05-10 12:34:28 +02:00
lib x86/asm/uaccess: Get rid of copy_user_nocache_64.S 2015-05-14 07:25:35 +02:00
math-emu
mm Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-05-06 10:57:37 -07:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-10 15:48:20 -05:00
oprofile x86/asm/entry: Change all 'user_mode_vm()' calls to 'user_mode()' 2015-03-23 11:14:17 +01:00
pci x86/PCI/ACPI: Make all resources except [io 0xcf8-0xcff] available on PCI bus 2015-04-30 22:17:34 +02:00
platform TTY/Serial patches for 4.1-rc1 2015-04-21 09:33:10 -07:00
power x86/asm, x86/power/hibernate: Use local labels in asm 2015-04-15 11:37:51 +02:00
purgatory Merge branches 'x86-build-for-linus', 'x86-cleanups-for-linus' and 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 12:35:46 -08:00
realmode Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-16 14:58:12 -08:00
syscalls xen: features and fixes for 4.1-rc0 2015-04-16 14:01:03 -05:00
tools x86, build: replace Perl script with Shell script 2015-01-26 13:37:18 -08:00
um Merge branch 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc 2015-04-15 13:53:55 -07:00
vdso x86: pvclock: Really remove the sched notifier for cross-cpu migrations 2015-04-27 15:49:30 +02:00
video
xen x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code 2015-05-08 13:50:02 +02:00
.gitignore x86/build: Add arch/x86/purgatory/ make generated files to gitignore 2014-10-09 09:29:46 +02:00
Kbuild
Kconfig Initial ACPI support for arm64: 2015-04-24 08:23:45 -07:00
Kconfig.cpu
Kconfig.debug x86, intel-mid: remove Intel MID specific serial support 2015-03-07 03:25:18 +01:00
Makefile x86: Align jump targets to 1-byte boundaries 2015-05-15 11:04:28 +02:00
Makefile.um kbuild: use relative path more to include Makefile 2015-04-02 16:42:08 +02:00
Makefile_32.cpu