Commit graph

1609 commits

Author SHA1 Message Date
Sherry Sun
dfc6014e3b EDAC/synopsys: Do not print an error with back-to-back snprintf() calls
handle_error() currently calls snprintf() a couple of times in
succession to output the message for a CE/UE, therefore overwriting each
part of the message which was formatted with the previous snprintf()
call. As a result, only the part of the message from the last snprintf()
call will be printed.

The simplest and most effective way to fix this problem is to combine
the whole string into one which to supply to a single snprintf() call.

 [ bp: Massage. ]

Fixes: b500b4a029 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Cc: Manish Narani <manish.narani@xilinx.com>
Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com
2020-02-27 16:44:25 +01:00
Robert Richter
4d59588c09 EDAC/sysfs: Remove csrow objects on errors
All created csrow objects must be removed in the error path of
edac_create_csrow_objects(). The objects have been added as devices.

They need to be removed by doing a device_del() *and* put_device() call
to also free their memory. The missing put_device() leaves a memory
leak. Use device_unregister() instead of device_del() which properly
unregisters the device doing both.

Fixes: 7adc05d2dc ("EDAC/sysfs: Drop device references properly")
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200212120340.4764-4-rrichter@marvell.com
2020-02-13 13:29:41 +01:00
Robert Richter
216aa145aa EDAC/mc: Fix use-after-free and memleaks during device removal
A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and
DEBUG_KMEMLEAK set, revealed several issues when removing an mci device:

1) Use-after-free:

On 27.11.19 17:07:33, John Garry wrote:
> [   22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180

The use-after-free is caused by the mci_for_each_dimm() macro called in
edac_remove_sysfs_mci_device(). The iterator was introduced with

  c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator").

The iterator loop calls device_unregister(&dimm->dev), which removes
the sysfs entry of the device, but also frees the dimm struct in
dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(),
the dimm struct is accessed again, after having been freed already.

The fix is to free all the mci device's subsequent dimm and csrow
objects at a later point, in _edac_mc_free(), when the mci device itself
is being freed.

This keeps the data structures intact and the mci device can be
fully used until its removal. The change allows the safe usage of
mci_for_each_dimm() to release dimm devices from sysfs.

2) Memory leaks:

Following memory leaks have been detected:

 # grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
       1     [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0      # mci->csrows
      16     [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0      # csr->channels
      16     [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0      # csr->channels[chn]
       1     [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0      # mci->dimms
      34     [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()

All leaks are from memory allocated by edac_mc_alloc().

Note: The test above shows that edac_mc_alloc() was called here from
ghes_edac_register(), thus both functions show up in the stack trace
but the module causing the leaks is edac_mc. The comments with the data
structures involved were made manually by analyzing the objdump.

The data structures listed above and created by edac_mc_alloc() are
not properly removed during device removal, which is done in
edac_mc_free().

There are two paths implemented to remove the device depending on device
registration, _edac_mc_free() is called if the device is not registered
and edac_unregister_sysfs() otherwise.

The implemenations differ. For the sysfs case, the mci device removal
lacks the removal of subsequent data structures (csrows, channels,
dimms). This causes the memory leaks (see mci_attr_release()).

 [ bp: Massage commit message. ]

Fixes: c498afaf7d ("EDAC: Introduce an mci_for_each_dimm() iterator")
Fixes: faa2ad09c0 ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.")
Fixes: 7a623c0390 ("edac: rewrite the sysfs code to use struct device")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200212120340.4764-3-rrichter@marvell.com
2020-02-13 13:28:52 +01:00
Linus Torvalds
6a1000bd27 ioremap changes for 5.6
- remove ioremap_nocache given that is is equivalent to
    ioremap everywhere
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl4vKHwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMPGBAAuVNUZaZfWYHpiVP2oRcUQUguFiD3NTbknsyzV2oH
 J9P0GfeENSKwE9OOhZ7XIjnCZAJwQgTK/ppQY5yiQ/KAtYyyXjXEJ6jqqjiTDInr
 +3+I3t/LhkgrK7tMrb7ylTGa/d7KhaciljnOXC8+b75iddvM9I1z2pbHDbppZMS9
 wT4RXL/cFtRb85AfOyPLybcka3f5P2gGvQz38qyimhJYEzHDXZu9VO1Bd20f8+Xf
 eLBKX0o6yWMhcaPLma8tm0M0zaXHEfLHUKLSOkiOk+eHTWBZ3b/w5nsOQZYZ7uQp
 25yaClbameAn7k5dHajduLGEJv//ZjLRWcN3HJWJ5vzO111aHhswpE7JgTZJSVWI
 ggCVkytD3ESXapvswmACSeCIDMmiJMzvn6JvwuSMVB7a6e5mcqTuGo/FN+DrBF/R
 IP+/gY/T7zIIOaljhQVkiEIIwiD/akYo0V9fheHTBnqcKEDTHV4WjKbeF6aCwcO+
 b8inHyXZSKSMG//UlDuN84/KH/o1l62oKaB1uDIYrrL8JVyjAxctWt3GOt5KgSFq
 wVz1lMw4kIvWtC/Sy2H4oB+RtODLp6yJDqmvmPkeJwKDUcd/1JKf0KsZ8j3FpGei
 /rEkBEss0KBKyFAgBSRO2jIpdj2epgcBcsdB/r5mlhcn8L77AS6mHbA173kY4pQ/
 Kdg=
 =TUCJ
 -----END PGP SIGNATURE-----

Merge tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap

Pull ioremap updates from Christoph Hellwig:
 "Remove the ioremap_nocache API (plus wrappers) that are always
  identical to ioremap"

* tag 'ioremap-5.6' of git://git.infradead.org/users/hch/ioremap:
  remove ioremap_nocache and devm_ioremap_nocache
  MIPS: define ioremap_nocache to ioremap
2020-01-27 13:03:00 -08:00
Linus Torvalds
30f5a75640 Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:

 - Misc fixes to the MCE code all over the place, by Jan H. Schönherr.

 - Initial support for AMD F19h and other cleanups to amd64_edac, by
   Yazen Ghannam.

 - Other small cleanups.

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/mce_amd: Make fam_ops static global
  EDAC/amd64: Drop some family checks for newer systems
  EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
  x86/amd_nb: Add Family 19h PCI IDs
  EDAC/mce_amd: Always load on SMCA systems
  x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
  x86/mce: Fix use of uninitialized MCE message string
  x86/mce: Fix mce=nobootlog
  x86/mce: Take action on UCNA/Deferred errors again
  x86/mce: Remove mce_inject_log() in favor of mce_log()
  x86/mce: Pass MCE message to mce_panic() on failed kernel recovery
  x86/mce/therm_throt: Mark throttle_active_work() as __maybe_unused
2020-01-27 09:19:35 -08:00
Linus Torvalds
b62061b82a A garden variety of small fixes all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl4uvKUACgkQEsHwGGHe
 VUoG5RAAoSdq9LvBPWTtJUMckxZ124hAqvtDOqvAbCnde3V2yICZLfOuGZSuWBy0
 +ED2Ebwnf6mxzB2vEEI1OQ0+LmmRXvc+UbI2z7SnC/4sLZnS7h9wDrxN41nR0dyB
 cMx0JB0ClyMW71WJo4Jbp9B+fiB4rBu/gmrXc0kC3L6ShmYvWzGdrPIV31qXPPYn
 zjqOn+tBENi7J+8omVsFg/KtAK9BTEk7FLuAz2mevLJzZGwycQBk/sigDHTuXJC2
 AACPmaQAqZgOYEOWID+F1USgD7x3yfYEZUBQWokqmDJDkN6agvcScQWoYh2PGobR
 wxJhGZR+Uq3NU0BsdoKlI4OCI/Hf+WYBAhpSdlu1/Wle3Wq4wGc91wisFtnpXU6s
 2VfiUtpf9WTROUeyHvhXcKgnjTwmRxSOBmaIhSYSL/aQ34AJaMmaeS8Vj1vpamrp
 1dotRIQKmjI8yDRs4JBfsJX3KtMjJ+u1qiz7SfwqBv1dqypSamTkvjpbPhItUrAz
 Hv9QHWQAOmCLBqWw3Jx2W9gwd7Z1gYmpvo9mqyUy81eF/BDNeJTi1zRkNL2jqElU
 gu6gaOheYr7UumA9LDsZUqh3cGgGii73pp2Fc1A2QvL9HeD++UVoXOlwjo8VEs1b
 +u+DRDKNy3OwgCLMGfK0zuWfvTRqgV+OnlPnBH2/2PnYpE9tawU=
 =fR3M
 -----END PGP SIGNATURE-----

Merge tag 'edac_for_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:
 "A totally boring branch this time around: a garden variety of small
  fixes all over the place"

* tag 'edac_for_5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/amd64: Do not warn when removing instances
  EDAC/sifive: Fix return value check in ecc_register()
  EDAC/aspeed: Remove unneeded semicolon
  EDAC: remove set but not used variable 'ecc_loc'
  EDAC: skx_common: downgrade message importance on missing PCI device
  EDAC/Kconfig: Fix Kconfig indentation
2020-01-27 09:16:22 -08:00
Borislav Petkov
7e5d6cf353 EDAC/amd64: Do not warn when removing instances
On machines which do not populate all nodes with DIMMs, the driver
doesn't initialize an instance there. However, the instance removal
remove_one_instance() path will warn unconditionally, which is wrong.

Remove the WARN_ON() even if the warning is innocent because it causes a
splat in dmesg.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200117115939.5524-1-bp@alien8.de
2020-01-17 13:00:06 +01:00
Wei Yongjun
6cd18453b6 EDAC/sifive: Fix return value check in ecc_register()
In case of error, the function edac_device_alloc_ctl_info() returns a
NULL pointer, not ERR_PTR(). Replace the IS_ERR() test in the return
value check with a NULL test.

Fixes: 91abaeaaff ("EDAC/sifive: Add EDAC platform driver for SiFive SoCs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200115150303.112627-1-weiyongjun1@huawei.com
2020-01-17 01:37:51 +01:00
Borislav Petkov
86e9f9d60e EDAC/mce_amd: Make fam_ops static global
... and do not kmalloc a three-pointer struct. Which simplifies
mce_amd_init() a bit.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200116163403.GF27148@zn.tnic
2020-01-16 21:52:48 +01:00
Yazen Ghannam
dcd01394ce EDAC/amd64: Drop some family checks for newer systems
In general, "pvt->umc != NULL" is used to check if the system is Family
17h+. However, there are a few places that are using direct family
checks.

Replace the remaining family checks with a check for "pvt->umc != NULL".

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-6-Yazen.Ghannam@amd.com
2020-01-16 17:09:29 +01:00
Yazen Ghannam
2eb61c91c3 EDAC/amd64: Add family ops for Family 19h Models 00h-0Fh
Add family ops to support AMD Family 19h systems. Existing Family 17h
functions can be used. Also, add Family 19h to the list of families to
automatically load the module.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-5-Yazen.Ghannam@amd.com
2020-01-16 17:09:23 +01:00
Yazen Ghannam
9f6aef8631 EDAC/mce_amd: Always load on SMCA systems
MCA error decoding on SMCA systems is not dependent on family. Return
success early if the system supports the SMCA feature.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-3-Yazen.Ghannam@amd.com
2020-01-16 17:09:13 +01:00
Yazen Ghannam
89a76171bf x86/MCE/AMD, EDAC/mce_amd: Add new Load Store unit McaType
Add support for a new version of the Load Store unit bank type as
indicated by its McaType value, which will be present in future SMCA
systems.

Add the new (HWID, MCATYPE) tuple. Reuse the same name, since this is
logically the same to the user.

Also, add the new error descriptions to edac_mce_amd.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200110015651.14887-2-Yazen.Ghannam@amd.com
2020-01-16 17:09:02 +01:00
Yash Shah
13cf4cf030 riscv: move sifive_l2_cache.h to include/soc
The commit 9209fb5189 ("riscv: move sifive_l2_cache.c to drivers/soc")
moves the sifive L2 cache driver to driver/soc. It did not move the
header file along with the driver. Therefore this patch moves the header
file to driver/soc

Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: updated to fix the include guard]
Fixes: 9209fb5189 ("riscv: move sifive_l2_cache.c to drivers/soc")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2020-01-12 10:12:44 -08:00
Christoph Hellwig
4bdc0d676a remove ioremap_nocache and devm_ioremap_nocache
ioremap has provided non-cached semantics by default since the Linux 2.6
days, so remove the additional ioremap_nocache interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
2020-01-06 09:45:59 +01:00
Christoph Hellwig
9209fb5189 riscv: move sifive_l2_cache.c to drivers/soc
The sifive_l2_cache.c is in no way related to RISC-V architecture
memory management.  It is a little stub driver working around the fact
that the EDAC maintainers prefer their drivers to be structured in a
certain way that doesn't fit the SiFive SOCs.

Move the file to drivers/soc and add a Kconfig option for it, as well
as the whole drivers/soc boilerplate for CONFIG_SOC_SIFIVE.

Fixes: a967a289f1 ("RISC-V: sifive_l2_cache: Add L2 cache controller driver for SiFive SoCs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
[paul.walmsley@sifive.com: keep the MAINTAINERS change specific to the L2$ controller code]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-12-20 03:40:24 -08:00
Xu Wang
a651c6c644 EDAC/aspeed: Remove unneeded semicolon
Remove unneeded semicolon reported by coccinelle.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andrew Jeffery <andrew@aj.id.au>
Cc: James Morse <james.morse@arm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-aspeed@lists.ozlabs.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Stefan Schaeckeler <sschaeck@cisco.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1576648806-1114-1-git-send-email-vulab@iscas.ac.cn
2019-12-19 07:27:09 +01:00
yu kuai
2403ed2f44 EDAC: remove set but not used variable 'ecc_loc'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/edac/i5100_edac.c: In function ‘i5100_read_log’:
drivers/edac/i5100_edac.c:489:11: warning: variable ‘ecc_loc’
set but not used [-Wunused-but-set-variable]

It is never used, and so can be removed.
Signed-off-by: yu kuai <yukuai3@huawei.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20191216110121.46698-1-yukuai3@huawei.com
2019-12-16 13:54:02 -08:00
Aristeu Rozanski
854bb48018 EDAC: skx_common: downgrade message importance on missing PCI device
Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being
available which leads the module to be automatically loaded in virtual machines
as well. That will fail due the missing PCI devices. In both drivers the first
function to make use of the PCI devices is skx_get_hi_lo() will simply print

	EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com
2019-12-10 14:14:43 -08:00
Krzysztof Kozlowski
a483e22791 EDAC/Kconfig: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with a command like:

$ sed -e 's/^        /\t/' -i */Kconfig

 [ bp: make it a single line. ]

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191120134206.15588-1-krzk@kernel.org
2019-12-09 19:07:40 +01:00
Thor Thayer
5781823fd0 EDAC/altera: Use the Altera System Manager driver
Simplify by using the Altera System Manager driver that abstracts the
differences between ARM32 and ARM64. Also allows the removal of the
Arria10 test function since this is handled by the System Manager
driver.

Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Meng.Li@windriver.com
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1574361048-17572-4-git-send-email-thor.thayer@linux.intel.com
2019-11-22 10:18:29 +01:00
Thor Thayer
08a260d968 EDAC/altera: Cleanup the ECC Manager
Cleanup the ECC Manager peripheral test in probe function as suggested
by James. Remove the check for Stratix10.

Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1573156890-26891-2-git-send-email-thor.thayer@linux.intel.com
2019-11-22 10:16:43 +01:00
Meng Li
56d9e7bd3f EDAC/altera: Use fast register IO for S10 IRQs
When an IRQ occurs, regmap_{read,write,...}() is invoked in atomic
context. Regmap must indicate register IO is fast so that a spinlock is
used instead of a mutex to avoid sleeping in atomic context:

  lock_acquire
  __mutex_lock
  mutex_lock_nested
  regmap_lock_mutex
  regmap_write
  a10_eccmgr_irq_unmask
  unmask_irq.part.0
  irq_enable
  __irq_startup
  irq_startup
  __setup_irq
  request_threaded_irq
  devm_request_threaded_irq
  altr_sdram_probe

Mark it so.

 [ bp: Massage. ]

Fixes: 3dab6bd526 ("EDAC, altera: Add support for Stratix10 SDRAM EDAC")
Reported-by: Meng Li <Meng.Li@windriver.com>
Signed-off-by: Meng Li <Meng.Li@windriver.com>
Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: stable <stable@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1574361048-17572-2-git-send-email-thor.thayer@linux.intel.com
2019-11-22 10:14:56 +01:00
Robert Richter
16214bd9e4 EDAC/ghes: Do not warn when incrementing refcount on 0
The following warning from the refcount framework is seen during ghes
initialization:

  EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  ------------[ cut here ]------------
  refcount_t: increment on 0; use-after-free.
  WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked
 [...]
  Call trace:
   refcount_inc_checked
   ghes_edac_register
   ghes_probe
   ...

It warns if the refcount is incremented from zero. This warning is
reasonable as a kernel object is typically created with a refcount of
one and freed once the refcount is zero. Afterwards the object would be
"used-after-free".

For GHES, the refcount is initialized with zero, and that is why this
message is seen when initializing the first instance. However, whenever
the refcount is zero, the device will be allocated and registered. Since
the ghes_reg_mutex protects the refcount and serializes allocation and
freeing of ghes devices, a use-after-free cannot happen here.

Instead of using refcount_inc() for the first instance, use
refcount_set(). This can be used here because the refcount is zero at
this point and can not change due to its protection by the mutex.

Fixes: 23f61b9fc5 ("EDAC/ghes: Fix locking and memory barrier issues")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: John Garry <john.garry@huawei.com>
Cc: <huangming23@huawei.com>
Cc: James Morse <james.morse@arm.com>
Cc: <linuxarm@huawei.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: <tanxiaofei@huawei.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <wanghuiqiang@huawei.com>
Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com
2019-11-22 09:53:08 +01:00
Robert Richter
787d899914 EDAC: Unify the mc_event tracepoint call
The code in ghes_edac.c and edac_mc.c for grain_bits calculation and
calling trace_mc_event() is now the same. Move it to a single location
in edac_raw_mc_handle_error().

The only difference is the missing IS_ENABLED(CONFIG_RAS) switch, but
this is needed for ghes too.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-13-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
501eb40d2b EDAC/ghes: Remove intermediate buffer pvt->detail_location
detail_location[] is used to collect two location strings so they can
be passed as one to trace_mc_event(). Instead of having an extra copy
step, assemble the location string in other_detail[] from the
beginning.

Using other_detail[] to call trace_mc_event() is now the same as in
edac_mc.c and code can be unified.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-12-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
7088e29e04 EDAC/ghes: Fix grain calculation
The current code to convert a physical address mask to a grain
(defined as granularity in bytes) is:

	e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);

This is broken in several ways:

1) It calculates to wrong grain values. E.g., a physical address mask
of ~0xfff should give a grain of 0x1000. Without considering
PAGE_MASK, there is an off-by-one. Things are worse when also
filtering it with ~PAGE_MASK. This will calculate to a grain with the
upper bits set. In the example it even calculates to ~0.

2) The grain does not depend on and is unrelated to the kernel's
page-size. The page-size only matters when unmapping memory in
memory_failure(). Smaller grains are wrongly rounded up to the
page-size, on architectures with a configurable page-size (e.g. arm64)
this could round up to the even bigger page-size of the hypervisor.

Fix this with:

	e->grain = ~mem_err->physical_addr_mask + 1;

The grain_bits are defined as:

	grain = 1 << grain_bits;

Change also the grain_bits calculation accordingly, it is the same
formula as in edac_mc.c now and the code can be unified.

The value in ->physical_addr_mask coming from firmware is assumed to
be contiguous, but this is not sanity-checked. However, in case the
mask is non-contiguous, a conversion to grain_bits effectively
converts the grain bit mask to a power of 2 by rounding it up.

Suggested-by: James Morse <james.morse@arm.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-11-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
7c10493170 EDAC/ghes: Use standard kernel macros for page calculations
Use standard macros for page calculations.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-10-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
0d8292e003 EDAC/mc: Reduce indentation level in edac_mc_handle_error()
Reduce the indentation level in edac_mc_handle_error() a bit.

No functional changes.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-7-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
47bec6b4c3 EDAC/mc: Remove needless zero string termination
The e string to which this is pointing to has already been cleared
earlier in the function so remove the needless zero string termination.

 [ bp: Correct the commit message. ]

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-6-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
d260e8ff51 EDAC/mc: Do not BUG_ON() in edac_mc_alloc()
No need to crash the system in case edac_mc_alloc() is called with
invalid arguments, just warn and return. This would cause a checkpatch
warning when touching the code later, so just fix it.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-5-rrichter@marvell.com
2019-11-10 12:40:14 +01:00
Robert Richter
c498afaf7d EDAC: Introduce an mci_for_each_dimm() iterator
Introduce an mci_for_each_dimm() iterator. It returns a pointer to
a struct dimm_info. This makes the declaration and use of an index
obsolete and avoids access to internal data of struct mci (direct array
access etc).

 [ bp: push the struct dimm_info *dimm; declaration into the
   CONFIG_EDAC_DEBUG block. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-4-rrichter@marvell.com
2019-11-10 12:39:40 +01:00
Robert Richter
977b1ce7c1 EDAC: Remove EDAC_DIMM_OFF() macro
The EDAC_DIMM_OFF() macro takes 5 arguments to get the DIMM's index.
Simplify this by storing the index in struct dimm_info to avoid its
calculation and remove the EDAC_DIMM_OFF() macro. The index can be
directly used then.

Another advantage is that edac_mc_alloc() could be used even if the
exact size of the layers is unknown. Only the number of DIMMs would be
needed.

Rename iterator variable to idx, while at it. The name is more handy,
esp. when searching for it in the code.

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-3-rrichter@marvell.com
2019-11-09 11:23:49 +01:00
Robert Richter
bc9ad9e40d EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() function
The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info.
Clean up this interface to only pass the mci struct and replace this
macro with a new function edac_get_dimm().

Also introduce an edac_get_dimm_by_index() function for later use.
This allows it to get a DIMM pointer only by a given index. This can
be useful if the DIMM's position within the layers of the memory
controller or the exact size of the layers are unknown.

Small style changes made for some hunks after applying the semantic
patch.

Semantic patch used:

@@ expression mci, a, b,c; @@

-EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c)
+edac_get_dimm(mci, a, b, c)

 [ bp: Touchups. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tero Kristo <t-kristo@ti.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com
2019-11-09 10:32:32 +01:00
Borislav Petkov
7fdfee926b EDAC/amd64: Get rid of the ECC disabled long message
This message keeps flooding dmesg on boxes where ECC is disabled or the
DIMMs do not support ECC but the module gets auto-probed. What's even
worse is that autoprobing happens on every CPU due to the CPU-family
matching the driver does and uevent being generated for each CPU device.

What is more, this message is becoming even more useless on newer
systems where forcing ECC is not recommended and it should be done in
the BIOS so the BIOS can do all the necessary work, i.e., just setting a
bit in an MSR is not enough anymore.

So get rid of it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20191106160607.GC28380@zn.tnic
2019-11-09 10:06:36 +01:00
Robert Richter
23f61b9fc5 EDAC/ghes: Fix locking and memory barrier issues
The ghes registration and refcount is broken in several ways:

 * ghes_edac_register() returns with success for a 2nd instance
   even if a first instance's registration is still running. This is
   not correct as the first instance may fail later. A subsequent
   registration may not finish before the first. Parallel registrations
   must be avoided.

 * The refcount was increased even if a registration failed. This
   leads to stale counters preventing the device from being released.

 * The ghes refcount may not be decremented properly on unregistration.
   Always decrement the refcount once ghes_edac_unregister() is called to
   keep the refcount sane.

 * The ghes_pvt pointer is handed to the irq handler before registration
   finished.

 * The mci structure could be freed while the irq handler is running.

Fix this by adding a mutex to ghes_edac_register(). This mutex
serializes instances to register and unregister. The refcount is only
increased if the registration succeeded. This makes sure the refcount is
in a consistent state after registering or unregistering a device.

Note: A spinlock cannot be used here as the code section may sleep.

The ghes_pvt is protected by ghes_lock now. This ensures the pointer is
not updated before registration was finished or while the irq handler is
running. It is unset before unregistering the device including necessary
(implicit) memory barriers making the changes visible to other CPUs.
Thus, the device can not be used anymore by an interrupt.

Also, rename ghes_init to ghes_refcount for better readability and
switch to refcount API.

A refcount is needed because there can be multiple GHES structures being
defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error
Source, "Some platforms may describe multiple Generic Hardware Error
Source structures with different notification types, ...").

Another approach to use the mci's device refcount (get_device()) and
have a release function does not work here. A release function will be
called only for device_release() with the last put_device() call. The
device must be deleted *before* that with device_del(). This is only
possible by maintaining an own refcount.

 [ bp: touchups. ]

Fixes: 0fe5f281f7 ("EDAC, ghes: Model a single, logical memory controller")
Fixes: 1e72e673b9 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")
Co-developed-by: James Morse <james.morse@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com
2019-11-08 16:28:28 +01:00
Yazen Ghannam
582f94b590 EDAC/amd64: Check for memory before fully initializing an instance
Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Return an error code when ECC is disabled. This check happens after
checking for memory. The module should explicitly fail to load if memory
is populated on a node and ECC is disabled.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-6-Yazen.Ghannam@amd.com
2019-11-06 11:10:11 +01:00
Yazen Ghannam
1c9b08bac5 EDAC/amd64: Use cached data when checking for ECC
...now that the data is available earlier.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-5-Yazen.Ghannam@amd.com
2019-11-06 11:07:57 +01:00
Yazen Ghannam
5e4c55276a EDAC/amd64: Save max number of controllers to family type
The maximum number of memory controllers is fixed within a family/model
group. In most cases, this has been fixed at 2, but some systems may
have up to 8.

The struct amd64_family_type already contains family/model-specific
information, and this can be used rather than adding model checks to
various functions.

Create a new field in struct amd64_family_type for max_mcs.
Set this when setting other family type information, and use this when
needing the maximum number of memory controllers possible for a system.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-4-Yazen.Ghannam@amd.com
2019-11-06 11:07:01 +01:00
Yazen Ghannam
80355a3b2d EDAC/amd64: Gather hardware information early
Split out gathering hardware information from init_one_instance()
into a separate function hw_info_get(). This is necessary so that
the information can be cached earlier and used to check if memory is
populated and if ECC is enabled on a node.

Also, define a function hw_info_put() to back out changes made in
hw_info_get().

Check for an allocated PCI device (Function 0 for Family 17h or Function
1 for pre-Family 17h) before freeing, since hw_info_put() may be called
before PCI siblings are reserved.

Drop the family check when freeing pvt->umc. This will be NULL on
pre-Family 17h systems. However, kfree() is safe and will check for a
NULL pointer before freeing.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-3-Yazen.Ghannam@amd.com
2019-11-06 11:04:49 +01:00
Yazen Ghannam
38ddd4d157 EDAC/amd64: Make struct amd64_family_type global
The struct amd64_family_type doesn't change between multiple nodes and
instances of the module, so make it global.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106012448.243970-2-Yazen.Ghannam@amd.com
2019-11-06 10:58:12 +01:00
Yazen Ghannam
466503d6b1 EDAC/amd64: Set grain per DIMM
The following commit introduced a warning on error reports without a
non-zero grain value.

  3724ace582 ("EDAC/mc: Fix grain_bits calculation")

The amd64_edac_mod module does not provide a value, so the warning will
be given on the first reported memory error.

Set the grain per DIMM to cacheline size (64 bytes). This is the current
recommendation.

Fixes: 3724ace582 ("EDAC/mc: Fix grain_bits calculation")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191022203448.13962-7-Yazen.Ghannam@amd.com
2019-10-25 15:36:36 +02:00
Markus Elfring
5bbab3cf21 EDAC/aspeed: Use devm_platform_ioremap_resource() in aspeed_probe()
Simplify this function implementation by using a known wrapper function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joel Stanley <joel@jms.id.au>
Cc: Andrew Jeffery <andrew@aj.id.au>
Cc: James Morse <james.morse@arm.com>
Cc: kernel-janitors@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-aspeed@lists.ozlabs.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Stefan Schaeckeler <sschaeck@cisco.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/baabb9e9-a1b2-3a04-9fb6-aa632de5f722@web.de
2019-10-24 11:17:29 +02:00
Tony Luck
e80634a75a EDAC, skx: Retrieve and print retry_rd_err_log registers
Skylake logs some additional useful information in per-channel
registers in addition the the architectural status/addr/misc
logged in the machine check bank.

Pick up this information and add it to the EDAC log:

	retry_rd_err_[five 32-bit register values]

Sorry, no definitions for these registers. OEMs and DIMM vendors
will be able to use them to isolate which cells in the DIMM are
causing problems.

	correrrcnt[per rank corrected error counts]

Note that if additional errors are logged while these registers are
being read, you may see a jumble of values some from earlier errors,
others from later errors (since the registers report the most recent
logged error). The correrrcnt registers provide error counts per possible
rank. If these counts only change by one since the previous error logged
for this channel, then it is safe to assume that the registers logged
provide a coherent view of one error.

With this change EDAC logs look like this:

EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 -  err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])

Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-10-18 15:27:58 -07:00
Tony Luck
29b8e84fbc EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode.
Simplifies the code a little.

Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-10-18 15:27:48 -07:00
Borislav Petkov
3a5e7ec903 Merge branch 'edac-urgent' into edac-for-next
Pick up urgent change into next queue.

Signed-off-by: Borislav Petkov <bp@suse.de>
2019-10-17 13:47:12 +02:00
James Morse
1e72e673b9 EDAC/ghes: Fix Use after free in ghes_edac remove path
ghes_edac models a single logical memory controller, and uses a global
ghes_init variable to ensure only the first ghes_edac_register() will
do anything.

ghes_edac is registered the first time a GHES entry in the HEST is
probed. There may be multiple entries, so subsequent attempts to
register ghes_edac are silently ignored as the work has already been
done.

When a GHES entry is unregistered, it calls ghes_edac_unregister(),
which free()s the memory behind the global variables in ghes_edac.

But there may be multiple GHES entries, the next call to
ghes_edac_unregister() will dereference the free()d memory, and attempt
to free it a second time.

This may also be triggered on a platform with one GHES entry, if the
driver is unbound/re-bound and unbound. The re-bind step will do
nothing because of ghes_init, the second unbind will then do the same
work as the first.

Doing the unregister work on the first call is unsafe, as another
CPU may be processing a notification in ghes_edac_report_mem_error(),
using the memory we are about to free.

ghes_init is already half of the reference counting. We only need
to do the register work for the first call, and the unregister work
for the last. Add the unregister check.

This means we no longer free ghes_edac's memory while there are
GHES entries that may receive a notification.

This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE.

 [ bp: merge into a single patch. ]

Fixes: 0fe5f281f7 ("EDAC, ghes: Model a single, logical memory controller")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com
Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com
2019-10-17 11:27:05 +02:00
Hanna Hawa
9816b4af43 EDAC/device: Rework error logging API
Make the main workhorse the "count" functions which can log a @count
of errors. Have the current APIs edac_device_handle_{ce,ue}() call
the _count() variants and this way keep the exported symbols number
unchanged.

 [ bp: Rewrite. ]

Signed-off-by: Hanna Hawa <hhhawa@amazon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: benh@amazon.com
Cc: dwmw@amazon.co.uk
Cc: hanochu@amazon.com
Cc: James Morse <james.morse@arm.com>
Cc: jonnyc@amazon.com
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: ronenk@amazon.com
Cc: talel@amazon.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190923191741.29322-2-hhhawa@amazon.com
2019-10-09 13:01:42 +02:00
Mauro Carvalho Chehab
f05390d30e EDAC: skx_common: get rid of unused type var
drivers/edac/skx_common.c: In function ‘skx_mce_output_error’:
	drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
	  478 |  char *type, *optype;
	      |        ^~~~

Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30 15:41:54 -03:00
Mauro Carvalho Chehab
323014d85d EDAC: sb_edac: get rid of unused vars
There are several vars unused on this driver, probably because
it was a modified copy of another driver. Get rid of them.

	drivers/edac/sb_edac.c: In function ‘knl_get_dimm_capacity’:
	drivers/edac/sb_edac.c:1343:16: warning: variable ‘sad_size’ set but not used [-Wunused-but-set-variable]
	 1343 |  u64 sad_base, sad_size, sad_limit = 0;
	      |                ^~~~~~~~
	drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’:
	drivers/edac/sb_edac.c:2955:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
	 2955 |  char *type, *optype, msg[256];
	      |        ^~~~
	drivers/edac/sb_edac.c: In function ‘sbridge_unregister_mci’:
	drivers/edac/sb_edac.c:3203:22: warning: variable ‘pvt’ set but not used [-Wunused-but-set-variable]
	 3203 |  struct sbridge_pvt *pvt;
	      |                      ^~~
	At top level:
	drivers/edac/sb_edac.c:266:18: warning: ‘correrrthrsld’ defined but not used [-Wunused-const-variable=]
	  266 | static const u32 correrrthrsld[] = {
	      |                  ^~~~~~~~~~~~~
	drivers/edac/sb_edac.c:257:18: warning: ‘correrrcnt’ defined but not used [-Wunused-const-variable=]
	  257 | static const u32 correrrcnt[] = {
	      |                  ^~~~~~~~~~

Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30 15:41:54 -03:00