linux-stable/drivers
Jiaqi Yan 44b8f8bf24 mm: memory-failure: add memory failure stats to sysfs
Patch series "Introduce per NUMA node memory error statistics", v2.

Background
==========

In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over the hardware patrol scrubber is the ability
to make statistics visible to system administrators.  The statistics fall
into two categories:

* Memory error statistics, for example, how many memory errors are
  encountered and how many of them are recovered by the kernel.  Note that
  these memory errors are non-fatal to the kernel: during machine check
  exception (MCE) handling, the kernel has already classified the MCE's
  severity as not requiring a panic (either action required or action
  optional).

* Scanner statistics, for example, how many times the scanner has fully
  scanned a NUMA node and how many errors were first detected by the
  scanner.

The memory error statistics are useful to userspace and are not actually
specific to scanner-detected memory errors; they are the focus of this
patchset.

Motivation
==========

Memory error stats are important to userspace, but what the kernel
provides today is insufficient.  With visible stats, datacenter
administrators can better monitor a machine's memory health.  For example,
while memory errors are inevitable on servers with 10+ TB of memory,
starting server maintenance when there are only 1~2 recovered memory
errors could be an overreaction; in a cloud production environment,
maintenance usually means live migrating all the workloads running on the
server, which usually causes nontrivial disruption to the customer.
Providing insight into the scope of memory errors on a system helps to
determine the appropriate follow-up action.  In addition, the kernel's
existing memory error stats need to be standardized so that userspace can
rely on them.

Today the kernel provides the following memory error info to userspace,
but each source is insufficient or has disadvantages:
* HardwareCorrupted in /proc/meminfo: total number of bytes poisoned, but
  not broken down per NUMA node
* ras:memory_failure_event: only available after being explicitly enabled
* /dev/mcelog: provides much useful info about the MCEs, but doesn't
  capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text
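
To illustrate the first limitation, here is a minimal userspace sketch
(not part of this patchset) that extracts HardwareCorrupted from the text
of /proc/meminfo.  The helper name hardware_corrupted_kb is hypothetical.
The field is reported in kB and is a single system-wide value, so nothing
in it says which NUMA node the poisoned pages belong to.

```python
import re


def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if it is absent.

    meminfo_text is the full text of /proc/meminfo (or a sample of it).
    """
    match = re.search(
        r"^HardwareCorrupted:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


# Example with sample text; on a live system, pass the contents of
# /proc/meminfo instead.
sample = "MemTotal:       16384 kB\nHardwareCorrupted:     8 kB\n"
print(hardware_corrupted_kb(sample))
```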

Exposing memory error stats is also a good start for the in-kernel memory
error detector.  Today the data sources of memory error stats are either
direct memory error consumption or hardware patrol scrubber detection
(signaled as either UCNA or SRAO).  Once the in-kernel memory scanner is
implemented, it will be the main source, as it is usually configured to
scan memory DIMMs constantly and faster than the hardware patrol scrubber.

How Implemented
===============

As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of any software or hardware scanner.  Therefore we
implement the memory error statistics independently of the in-kernel
memory error detector.  The patchset exposes the following per NUMA node
memory error counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the
kernel's attempted recoveries, how they were resolved: how many are
recovered, ignored, failed, or delayed, respectively.  This approach can be
easier to extend for future use cases than /proc/meminfo, trace events,
and logs.  The following math holds for the statistics:

* total = recovered + ignored + failed + delayed

These memory error stats are reset during machine boot.
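
As an illustration of how userspace might consume these counters, the
following is a minimal sketch, not part of the patchset.  The function
names read_memory_failure_stats and is_consistent are hypothetical, and
the sketch assumes each sysfs file holds a single decimal integer.  The
five files are read one after another rather than atomically, so the
equality may be observed to fail briefly if an error is handled between
reads.

```python
import os
import re

COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def read_memory_failure_stats(node_root="/sys/devices/system/node"):
    """Map NUMA node id -> {counter name: value}.

    Nodes without a memory_failure directory (for example on a kernel
    that lacks these stats) are skipped.
    """
    stats = {}
    for entry in os.listdir(node_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        stats_dir = os.path.join(node_root, entry, "memory_failure")
        if not os.path.isdir(stats_dir):
            continue
        counters = {}
        for name in COUNTERS:
            with open(os.path.join(stats_dir, name)) as f:
                counters[name] = int(f.read().strip())
        stats[int(match.group(1))] = counters
    return stats


def is_consistent(counters):
    """Check total = recovered + ignored + failed + delayed."""
    resolved = sum(counters[name] for name in COUNTERS if name != "total")
    return counters["total"] == resolved
```

A monitoring agent could call read_memory_failure_stats() periodically
and report is_consistent() per node alongside the raw values.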

The 1st commit introduces these sysfs entries.  The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery.  The 3rd commit adds documentation for the introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6


This patch (of 3):

Today the kernel provides the following memory error info to userspace,
but each has its own disadvantage:

* HardwareCorrupted in /proc/meminfo: total number of bytes poisoned, but
  not broken down per NUMA node

* ras:memory_failure_event: only available after being explicitly enabled

* /dev/mcelog: provides much useful info about the MCEs, but
  doesn't capture how memory_failure recovered memory MCEs

* kernel logs: userspace needs to process log text

Expose per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the
kernel's attempted recoveries, how they were resolved: how many are
recovered, ignored, failed, or delayed, respectively.  The following math
holds for the statistics:

* total = recovered + ignored + failed + delayed
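
One way to see why this equality holds is a small userspace model of the
accounting, sketched below.  It is an assumption-laden illustration, not
the kernel implementation: the class NodeMemoryFailureStats and its
methods are hypothetical, and the model simply assumes that every
recovery attempt bumps total together with exactly one resolution
counter.

```python
from collections import Counter

RESOLUTIONS = ("recovered", "ignored", "failed", "delayed")


class NodeMemoryFailureStats:
    """Toy model of one NUMA node's memory failure counters."""

    def __init__(self):
        self.counts = Counter()

    def record(self, resolution):
        """Account one recovery attempt that ended with `resolution`."""
        if resolution not in RESOLUTIONS:
            raise ValueError(f"unknown resolution: {resolution}")
        # Both counters change together, which is what keeps
        # total equal to the sum of the four resolution counters.
        self.counts["total"] += 1
        self.counts[resolution] += 1

    def holds(self):
        """Check total = recovered + ignored + failed + delayed."""
        resolved = sum(self.counts[r] for r in RESOLUTIONS)
        return self.counts["total"] == resolved


node = NodeMemoryFailureStats()
for outcome in ("recovered", "recovered", "failed", "delayed"):
    node.record(outcome)
print(dict(node.counts), node.holds())
```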

Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:28 -08:00
accel Fix mismerge due to devnode now taking a 'const *' device 2022-12-16 13:04:15 -06:00
accessibility
acpi Merge branches 'acpi-resource' and 'acpi-video' 2023-01-13 11:11:05 +01:00
amba
android mm: remove zap_page_range and create zap_vma_pages 2023-01-18 17:12:55 -08:00
ata ata: pata_cs5535: Don't build on UML 2023-01-14 07:38:48 +09:00
atm treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
auxdisplay
base mm: memory-failure: add memory failure stats to sysfs 2023-02-02 22:33:28 -08:00
bcma
block zram: correctly handle all next_arg() cases 2023-01-18 17:12:56 -08:00
bluetooth treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
bus Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
cdrom
char mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() 2023-01-18 17:12:56 -08:00
clk
clocksource
comedi
connector
counter
cpufreq cpufreq: amd-pstate: fix kernel hang issue while amd-pstate unregistering 2023-01-10 20:31:08 +01:00
cpuidle powerpc updates for 6.2 2022-12-19 07:13:33 -06:00
crypto MTD changes: 2023-01-12 05:56:06 -06:00
cxl
dax
dca
devfreq
dio
dma dmaengine updates for v6.2 2022-12-19 08:54:17 -06:00
dma-buf Merge drm/drm-fixes into drm-misc-fixes 2023-01-03 08:32:12 +01:00
edac EDAC/highbank: Fix memory leak in highbank_mc_probe() 2023-01-03 17:03:57 +01:00
eisa
extcon Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
firewire
firmware kernel hardening fixes for v6.2-rc4 2023-01-14 10:04:00 -06:00
fpga Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
fsi
gnss
gpio gpio: sifive: Fix refcount leak in sifive_gpio_probe 2023-01-02 13:01:14 +01:00
gpu drm fixes for 6.2-rc4 2023-01-13 07:18:59 -06:00
greybus
hid treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
hsi
hte
hv
hwmon
hwspinlock
hwtracing
i2c Core got a new helper 'i2c_client_get_device_id', designware got some 2022-12-15 14:47:10 -08:00
i3c
idle
iio Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
infiniband RDMA/mlx5: Fix validation of max_rd_atomic caps for DC 2023-01-01 09:40:35 +02:00
input xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
interconnect
iommu mm: discard __GFP_ATOMIC 2023-02-02 22:33:13 -08:00
ipack
irqchip RISC-V Patches for the 6.2 Merge Window, Part 1 2022-12-14 15:23:49 -08:00
isdn treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
leds treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
macintosh
mailbox - qcom: enable sc8280xp, sm8550 and sm4250 support 2022-12-21 09:31:18 -08:00
mcb
md block: handle bio_split_to_limits() NULL return 2023-01-04 09:05:23 -07:00
media treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
memory
memstick
message
mfd - New Drivers 2022-12-21 09:19:24 -08:00
misc drivers/misc/open-dice: don't touch VM_MAYSHARE 2023-01-18 17:12:57 -08:00
mmc
most
mtd mtd: cfi: allow building spi-intel standalone 2023-01-02 12:08:53 +01:00
mux
net Including fixes from rxrpc. 2023-01-12 18:20:44 -06:00
nfc nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame() 2023-01-09 07:34:13 +00:00
ntb
nubus
nvdimm
nvme block-6.2-2023-01-13 2023-01-13 17:41:19 -06:00
nvmem Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
of Revert "mm: kmemleak: alloc gray object for reserved region with direct map" 2023-01-31 16:44:09 -08:00
opp
parisc parisc: led: Fix potential null-ptr-deref in start_task() 2022-12-17 23:19:38 +01:00
parport
pci pci-v6.2-fixes-1 2023-01-13 17:32:22 -06:00
pcmcia treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
peci
perf RISC-V Patches for the 6.2 Merge Window, Part 1 2022-12-14 15:23:49 -08:00
phy phy-for-6.2 2022-12-19 08:40:58 -06:00
pinctrl
platform platform/x86: thinkpad_acpi: Fix profile mode display in AMT mode 2023-01-13 11:40:30 +01:00
pnp
power power supply and reset changes for the v6.2 series 2022-12-17 08:39:31 -06:00
powercap
pps
ps3
ptp
pwm pwm: Changes for v6.2-rc1 2022-12-21 09:41:28 -08:00
rapidio
ras
regulator regulator: qcom-rpmh: PM8550 ldo11 regulator is an nldo 2023-01-03 15:54:38 +00:00
remoteproc
reset
rpmsg
rtc - New Drivers 2022-12-21 09:19:24 -08:00
s390 block-2023-01-06 2023-01-06 13:12:42 -08:00
sbus
scsi SCSI fixes on 20230114 2023-01-14 07:57:25 -06:00
sh
siox
slimbus
soc ARM: SoC fixes for 6.2 2022-12-19 16:07:59 -06:00
soundwire soundwire updates for 6.2 2022-12-19 08:47:33 -06:00
spi spi: Merge rename of spi-cs-setup-ns DT property 2023-01-11 14:15:22 +00:00
spmi
ssb
staging treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
target
tc
tee
thermal thermal: int340x: Add missing attribute for data rate base 2022-12-30 19:48:37 +01:00
thunderbolt
tty xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
ufs Merge branch '6.2/scsi-queue' into 6.2/scsi-fixes 2022-12-30 16:29:34 +00:00
uio
usb xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
vdpa vdpa_sim_net: should not drop the multicast/broadcast packet 2022-12-28 05:28:11 -05:00
vfio Driver Core changes for 6.2-rc1 2022-12-16 03:54:54 -08:00
vhost vhost_vdpa: fix the crash in unmap a large memory 2022-12-28 05:28:11 -05:00
video xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
virt Char/Misc driver changes for 6.2-rc1 2022-12-16 03:49:24 -08:00
virtio virtio: Implementing attribute show with sysfs_emit 2022-12-28 05:28:11 -05:00
vlynq
w1
watchdog linux-watchdog 6.2-rc1 tag 2022-12-17 08:34:01 -06:00
xen xen: branch for v6.2-rc4 2023-01-12 17:02:20 -06:00
zorro
Kconfig
Makefile