linux-stable/Documentation
Dave Chiluk 3dec71e388 sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
commit de53fd7aed upstream.

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74 ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 19:18:29 +01:00
..
ABI qmi_wwan: extend permitted QMAP mux_id value range 2019-07-21 09:04:26 +02:00
accounting
acpi This is the bulk of GPIO changes for the v4.13 series: 2017-07-07 12:40:27 -07:00
admin-guide x86/xen: Return from panic notifier 2019-11-06 12:43:13 +01:00
aoe
arm ARM: 8833/1: Ensure that NEON code always compiles with Clang 2019-04-05 22:31:34 +02:00
arm64 arm64: Expose Arm v8.4 features 2019-10-29 09:17:10 +01:00
auxdisplay
backlight
blackfin
block doc, block, bfq: better describe how to properly configure bfq 2017-08-31 13:55:26 -06:00
blockdev SCSI misc on 20170907 2017-09-07 21:11:05 -07:00
bus-devices
cdrom
cgroup-v1 mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
cma
connector
console
core-api doc: Fix RCU's docbook options 2017-10-19 22:26:11 -04:00
cpu-freq cpufreq: docs: Drop intel-pstate.txt from index.txt 2017-09-28 02:08:43 +02:00
cpuidle
cris
crypto KEYS: Add documentation for asymmetric keyring restrictions 2017-07-14 11:01:38 +10:00
dev-tools kmemcheck: rip it out 2018-02-22 15:42:24 +01:00
device-mapper dm thin: fix documentation relative to low water mark threshold 2018-04-26 11:02:07 +02:00
devicetree usb: dwc3: Allow disabling of metastability workaround 2019-11-12 19:18:18 +01:00
dmaengine Merge branch 'topic/dmatest' into for-linus 2017-09-06 21:55:10 +05:30
doc-guide sphinx.rst: Allow Sphinx version 1.6 at the docs 2017-08-26 15:50:27 -06:00
driver-api USB: core: Fix bug caused by duplicate interface PM usage counter 2019-05-08 07:20:46 +02:00
driver-model driver core: remove DRIVER_ATTR 2017-09-19 09:20:33 +02:00
early-userspace
EDID
extcon
fault-injection fault-inject: add /proc/<pid>/fail-nth 2017-07-14 15:05:13 -07:00
fb fbcon: remove restriction on margin color 2017-09-04 16:00:49 +02:00
features docs/features: parisc implements tracehook 2017-08-07 14:18:40 -06:00
filesystems mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps 2019-01-26 09:37:07 +01:00
firmware_class
fmc
fpga
frv
gpio Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2017-09-07 13:39:21 -07:00
gpu Merge tag 'drm-intel-next-2017-08-18' of git://anongit.freedesktop.org/git/drm-intel into drm-next 2017-08-22 10:03:07 +10:00
hid
hwmon hwmon: (ina2xx) fix sysfs shunt resistor read access 2018-10-03 17:00:58 -07:00
i2c i2c: i801: Add support for Intel Cedar Fork 2017-10-05 14:44:56 +02:00
ia64
ide
iio iio: adc: New driver for Cirrus Logic EP93xx ADC 2017-07-25 19:56:23 +01:00
infiniband Documentation: Hardware tag matching 2017-08-29 08:30:21 -04:00
input Documentation:input: fix typo 2017-08-30 15:18:24 -06:00
ioctl scsi: cxlflash: Introduce host ioctl support 2017-06-26 15:01:11 -04:00
isdn
kbuild kbuild: delete INSTALL_FW_PATH from kbuild documentation 2018-07-17 11:39:30 +02:00
kdump kexec/kdump: minor Documentation updates for arm64 and Image 2017-07-12 16:26:00 -07:00
kernel-hacking There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
laptops platform/x86: thinkpad_acpi: Fix warning about deprecated hwmon_device_register 2017-08-18 15:57:24 -07:00
leds
lightnvm
livepatch
locking Merge branch 'linus' into locking/core, to fix up conflicts 2017-09-04 11:01:18 +02:00
m68k
md
media media: media colorspaces*.rst: rename AdobeRGB to opRGB 2018-11-13 11:15:12 -08:00
memory-devices
metag
mic
mips
misc-devices
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: add tcp_min_snd_mss sysctl 2019-06-17 19:52:44 +02:00
nfc
nios2
nvdimm
nvmem NVMEM documentation fix: A minor typo 2017-08-24 13:31:58 -06:00
parisc
PCI
pcmcia
perf
phy
platform
power Merge branch 'pm-sleep' 2017-09-04 00:06:02 +02:00
powerpc powerpc updates for 4.13 2017-07-07 13:55:45 -07:00
pps drivers/pps: aesthetic tweaks to PPS-related content 2017-09-08 18:26:51 -07:00
process kbuild: verify that $DEPMOD is installed 2018-08-17 21:01:10 +02:00
pti
ptp
rapidio
RCU doc: Set down RCU's scheduling-clock-interrupt needs 2017-08-17 07:31:14 -07:00
s390
scheduler sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-11-12 19:18:29 +01:00
scsi
security docs: ReSTify table of contents in core.rst 2017-08-30 15:27:58 -06:00
serial
sh
sound sound updates for 4.13-rc1 2017-07-06 10:56:51 -07:00
sparc
sphinx doc: Cope with Sphinx logging deprecations 2019-06-09 09:18:17 +02:00
sphinx-static docs RTD theme: code-block with line nos - lines and line numbers don't line up. 2017-07-17 13:48:45 -06:00
spi
sysctl bpf: add bpf_jit_limit knob to restrict unpriv allocations 2019-08-25 10:50:03 +02:00
target
thermal
timers rcu: Eliminate NOCBs CPU-state Kconfig options 2017-06-08 18:52:43 -07:00
trace stm class: Document the stm_ftrace 2017-08-25 17:58:34 +03:00
translations kokr/memory-barriers.txt: Apply atomic_t.txt change 2017-09-08 10:10:53 -06:00
usb USB: rio500: Remove Rio 500 kernel driver 2019-10-17 13:43:20 -07:00
userspace-api Documentation: Add section about CPU vulnerabilities for Spectre 2019-07-21 09:04:31 +02:00
virtual KVM: Reject device ioctls from processes other than the VM's creator 2019-04-03 06:25:20 +02:00
vm hmm: heterogeneous memory management documentation 2017-09-08 18:26:45 -07:00
w1
watchdog watchdog: Revert "iTCO_wdt: all versions count down twice" 2017-09-09 17:41:24 +02:00
wimax
x86 x86/speculation/mds: Improve CPU buffer clear documentation 2019-05-21 18:50:13 +02:00
xtensa of: update ePAPR references to point to Devicetree Specification 2017-06-22 11:22:06 -05:00
.gitignore
00-INDEX linux-kselftest-4.13-rc1-update 2017-07-07 14:04:47 -07:00
atomic_bitops.txt Documentation/locking/atomic: Add documents for new atomic_t APIs 2017-08-10 12:29:00 +02:00
atomic_t.txt x86/atomic: Fix smp_mb__{before,after}_atomic() 2019-07-31 07:28:26 +02:00
bcache.txt bcache.txt: standardize document format 2017-07-14 13:51:27 -06:00
bt8xxgpio.txt bt8xxgpio.txt: standardize document format 2017-07-14 13:51:27 -06:00
btmrvl.txt btmrvl.txt: standardize document format 2017-07-14 13:51:27 -06:00
bus-virt-phys-mapping.txt bus-virt-phys-mapping.txt: standardize document format 2017-07-14 13:51:28 -06:00
cachetlb.txt cachetlb.txt: standardize document format 2017-07-14 13:51:28 -06:00
cgroup-v2.txt cgroup: add cgroup.stat interface with basic hierarchy stats 2017-08-02 12:05:20 -07:00
Changes
circular-buffers.txt circular-buffers.txt: standardize document format 2017-07-14 13:51:29 -06:00
clk.txt clk.txt: standardize document format 2017-07-14 13:51:29 -06:00
CodingStyle
conf.py docs: Fix conf.py for Sphinx 2.0 2019-06-09 09:18:17 +02:00
cpu-load.txt cpu-load: standardize document format 2017-07-14 13:51:30 -06:00
cputopology.txt cputopology.txt: standardize document format 2017-07-14 13:51:30 -06:00
crc32.txt crc32.txt: standardize document format 2017-07-14 13:51:30 -06:00
dcdbas.txt dcdbas.txt: standardize document format 2017-07-14 13:51:31 -06:00
debugging-modules.txt
debugging-via-ohci1394.txt debugging-via-ohci1394.txt: standardize document format 2017-07-14 13:51:34 -06:00
dell_rbu.txt dell_rbu.txt: standardize document format 2017-07-14 13:58:12 -06:00
digsig.txt digsig.txt: standardize document format 2017-07-14 13:51:31 -06:00
DMA-API-HOWTO.txt DMA-API-HOWTO.txt: standardize document format 2017-07-14 13:51:32 -06:00
DMA-API.txt dma-coherent: remove the DMA_MEMORY_MAP and DMA_MEMORY_IO flags 2017-09-01 11:59:17 +02:00
DMA-attributes.txt DMA-attributes.txt: standardize document format 2017-07-14 13:51:33 -06:00
DMA-ISA-LPC.txt DMA-ISA-LPC.txt: standardize document format 2017-07-14 13:51:33 -06:00
docutils.conf
dontdiff Remove gperf usage from toolchain 2017-08-19 11:02:53 -07:00
efi-stub.txt efi-stub.txt: standardize document format 2017-07-14 13:51:34 -06:00
eisa.txt eisa.txt: standardize document format 2017-07-14 13:51:34 -06:00
errseq.rst Documentation: add some docs for errseq_t 2017-07-29 09:01:02 -04:00
flexible-arrays.txt flexible-arrays.txt: standardize document format 2017-07-14 13:51:35 -06:00
futex-requeue-pi.txt futex-requeue-pi.txt: standardize document format 2017-07-14 13:51:35 -06:00
gcc-plugins.txt gcc-plugins.txt: standardize document format 2017-07-14 13:51:36 -06:00
highuid.txt highuid.txt: standardize document format 2017-07-14 13:51:36 -06:00
hw_random.txt hw_random.txt: standardize document format 2017-07-14 13:51:37 -06:00
hwspinlock.txt hwspinlock.txt: standardize document format 2017-07-14 13:51:37 -06:00
index.rst x86/speculation/mds: Add mds_clear_cpu_buffers() 2019-05-14 19:18:43 +02:00
Intel-IOMMU.txt Intel-IOMMU.txt: standardize document format 2017-07-14 13:51:38 -06:00
intel_txt.txt intel_txt.txt: standardize document format 2017-07-14 13:51:38 -06:00
io-mapping.txt io-mapping.txt: standardize document format 2017-07-14 13:51:38 -06:00
io_ordering.txt io_ordering.txt: standardize document format 2017-07-14 13:51:39 -06:00
iostats.txt iostats.txt: update it to cover recent Kernels 2017-07-14 13:51:40 -06:00
IPMI.txt IPMI.txt: standardize document format 2017-07-14 13:51:40 -06:00
IRQ-affinity.txt IRQ-affinity.txt: standardize document format 2017-07-14 13:51:41 -06:00
IRQ-domain.txt IRQ-domain.txt: standardize document format 2017-07-14 13:51:41 -06:00
IRQ.txt IRQ.txt: add a markup for its title 2017-07-14 13:51:42 -06:00
irqflags-tracing.txt irqflags-tracing.txt: standardize document format 2017-07-14 13:51:42 -06:00
isa.txt isa.txt: standardize document format 2017-07-14 13:51:43 -06:00
isapnp.txt isapnp.txt: promote title level 2017-07-14 13:51:43 -06:00
kernel-doc-nano-HOWTO.txt
kernel-per-CPU-kthreads.txt kernel-per-CPU-kthreads.txt: standardize document format 2017-07-14 13:51:43 -06:00
kobject.txt kobject.txt: standardize document format 2017-07-14 13:51:44 -06:00
kprobes.txt docs: kprobes.txt: Fix whitespacing 2017-07-14 13:58:14 -06:00
kref.txt kref.txt: standardize document format 2017-07-14 13:51:45 -06:00
ldm.txt ldm.txt: standardize document format 2017-07-14 13:51:45 -06:00
lockup-watchdogs.txt lockup-watchdogs.txt: standardize document format 2017-07-14 13:51:46 -06:00
logo.gif
logo.txt
lsm.txt
lzo.txt lzo.txt: standardize document format 2017-07-14 13:51:46 -06:00
mailbox.txt mailbox.txt: standardize document format 2017-07-14 13:51:47 -06:00
Makefile doc: Makefile: if sphinx is not found, run a check script 2017-08-24 13:18:30 -06:00
memory-barriers.txt Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-09-04 11:52:29 -07:00
memory-hotplug.txt memory-hotplug.txt: standardize document format 2017-07-14 13:57:53 -06:00
men-chameleon-bus.txt men-chameleon-bus.txt: standardize document format 2017-07-14 13:57:54 -06:00
nommu-mmap.txt nommu-mmap.txt: don't use all upper case on titles 2017-07-14 13:57:55 -06:00
ntb.txt This series converts a number of top-level documents to the RST format 2017-07-15 12:58:58 -07:00
numastat.txt numastat.txt: standardize document format 2017-07-14 13:57:56 -06:00
padata.txt padata.txt: standardize document format 2017-07-14 13:57:56 -06:00
parport-lowlevel.txt parport-lowlevel.txt: standardize document format 2017-07-14 13:57:57 -06:00
percpu-rw-semaphore.txt percpu-rw-semaphore.txt: standardize document format 2017-07-14 13:57:58 -06:00
phy.txt phy.txt: standardize document format 2017-07-14 13:57:58 -06:00
pi-futex.txt pi-futex.txt: standardize document format 2017-07-14 13:57:59 -06:00
pnp.txt pnp.txt: standardize document format 2017-07-14 13:57:59 -06:00
preempt-locking.txt preempt-locking.txt: standardize document format 2017-07-14 13:58:00 -06:00
printk-formats.txt lib/vsprintf: Remove atomic-unsafe support for %pCr 2018-07-03 11:24:48 +02:00
pwm.txt pwm: Standardize document format 2017-07-06 08:23:30 +02:00
rbtree.txt rbtree: cache leftmost node internally 2017-09-08 18:26:48 -07:00
remoteproc.txt remoteproc.txt: standardize document format 2017-07-14 13:58:02 -06:00
rfkill.txt rfkill.txt: standardize document format 2017-07-14 13:58:02 -06:00
robust-futex-ABI.txt robust-futex-ABI.txt: standardize document format 2017-07-14 13:58:03 -06:00
robust-futexes.txt futex: Update comments and docs about return values of arch futex code 2019-07-03 13:16:03 +02:00
rpmsg.txt rpmsg.txt: standardize document format 2017-07-14 13:58:04 -06:00
rtc.txt rtc: add generic nvmem support 2017-07-07 13:14:14 +02:00
SAK.txt SAK.txt: standardize document format 2017-07-14 13:58:04 -06:00
sgi-ioc4.txt sgi-ioc4.txt: standardize document format 2017-07-14 13:58:05 -06:00
siphash.txt siphash.txt: standardize document format 2017-07-14 13:58:06 -06:00
SM501.txt SM501.txt: standardize document format 2017-07-14 13:58:06 -06:00
smsc_ece1099.txt smsc_ece1099.txt: standardize document format 2017-07-14 13:58:07 -06:00
speculation.txt Documentation: Document array_index_nospec 2018-02-07 11:12:22 -08:00
static-keys.txt jump_label: Provide hotplug context variants 2017-08-10 12:28:59 +02:00
SubmittingPatches
svga.txt svga.txt: standardize document format 2017-07-14 13:58:08 -06:00
switchtec.txt
sync_file.txt
tee.txt tee.txt: standardize document format 2017-07-14 13:58:14 -06:00
this_cpu_ops.txt this_cpu_ops.txt: standardize document format 2017-07-14 13:58:08 -06:00
unaligned-memory-access.txt unaligned-memory-access.txt: standardize document format 2017-07-14 13:58:09 -06:00
vfio-mediated-device.txt vfio/mdev: Check globally for duplicate devices 2018-08-03 07:50:22 +02:00
vfio.txt vfio.txt: standardize document format 2017-07-14 13:58:10 -06:00
video-output.txt
xillybus.txt xillybus.txt: standardize document format 2017-07-14 13:58:11 -06:00
xz.txt xz.txt: standardize document format 2017-07-14 13:58:11 -06:00
zorro.txt zorro.txt: standardize document format 2017-07-14 13:58:12 -06:00