No description
Find a file
Jiri Wiesner 8868106251 clocksource: Skip watchdog check for large watchdog intervals
commit 6446495535 upstream.

There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:

  clocksource: timekeeping watchdog on CPU373:
  Marking clocksource 'tsc' as unstable because the skew is too large:
  clocksource:   'hpet' wd_nsec: 14523447520
  clocksource:   'tsc'  cs_nsec: 14524115132

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:

  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds.  The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportinal
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.

The intention of 2e27e793e2 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135 ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.

The root causes of the skew being directly proportinal to the length of
the readout interval are:

  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
  * the conversion to nanoseconds is imprecise for large readout intervals

Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second).  To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.

As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.

Fixes: 2e27e793e2 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23 08:42:22 +01:00
arch um: net: Fix return type of uml_net_start_xmit() 2024-02-23 08:42:13 +01:00
block blk-iocost: Fix an UBSAN shift-out-of-bounds warning 2024-02-23 08:42:21 +01:00
certs certs/blacklist_hashes.c: fix const confusion in certs blacklist 2022-06-22 14:13:17 +02:00
crypto crypto: api - Disallow identical driver names 2024-02-23 08:41:52 +01:00
Documentation net: sysfs: Fix /sys/class/net/<iface> path 2024-02-23 08:42:17 +01:00
drivers vhost: use kzalloc() instead of kmalloc() followed by memset() 2024-02-23 08:42:22 +01:00
fs ceph: fix deadlock or deadcode of misusing dget() 2024-02-23 08:42:15 +01:00
include hrtimer: Report offline hrtimer enqueue 2024-02-23 08:42:22 +01:00
init rootfs: Fix support for rootfstype= when root= is given 2024-01-25 14:37:52 -08:00
io_uring io_uring/rw: ensure io->bytes_done is always initialized 2024-01-25 14:37:52 -08:00
ipc ipc/sem: Fix dangling sem_array access in semtimedop race 2022-12-08 11:24:00 +01:00
kernel clocksource: Skip watchdog check for large watchdog intervals 2024-02-23 08:42:22 +01:00
lib debugobjects: Stop accessing objects after releasing hash bucket lock 2024-02-23 08:42:02 +01:00
LICENSES
mm mm/sparsemem: fix race in accessing memory_section->usage 2024-02-23 08:42:00 +01:00
net net/af_iucv: clean up a try_then_request_module() 2024-02-23 08:42:21 +01:00
samples samples/hw_breakpoint: fix building without module unloading 2023-09-23 11:01:09 +02:00
scripts stddef: Introduce DECLARE_FLEX_ARRAY() helper 2024-02-23 08:41:54 +01:00
security lsm: new security_file_ioctl_compat() hook 2024-02-23 08:41:53 +01:00
sound ALSA: hda: intel-dspcfg: add filters for ARL-S and ARL 2024-02-23 08:42:12 +01:00
tools selftests: net: avoid just another constant wait 2024-02-23 08:42:19 +01:00
usr usr/include/Makefile: add linux/nfc.h to the compile-test coverage 2022-02-01 17:25:48 +01:00
virt KVM: use __vcalloc for very large allocations 2024-02-23 08:41:55 +01:00
.clang-format
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore kbuild: generate Module.symvers only when vmlinux exists 2021-05-19 10:12:59 +02:00
.mailmap
COPYING
CREDITS
Kbuild
Kconfig
MAINTAINERS Remove DECnet support from kernel 2023-06-21 15:45:38 +02:00
Makefile Linux 5.10.209 2024-01-25 14:37:57 -08:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.