Commit Graph

46 Commits

Author SHA1 Message Date
Stanislaw Gruszka 002bf2fbc0 PCI/AER: Block runtime suspend when handling errors
PM runtime can be done simultaneously with AER error handling.  Avoid that
by using pm_runtime_get_sync() before and pm_runtime_put() after reset in
pcie_do_recovery() for all recovering devices.

pm_runtime_get_sync() will increase dev->power.usage_count counter to
prevent any possible future request to runtime suspend a device.  It will
also resume a device, if it was previously in D3hot state.

I tested with igc device by doing simultaneous aer_inject and rpm
suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control and can
reproduce:

  igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
  pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
  pcieport 0000:00:1c.2: AER: subordinate device reset failed
  pcieport 0000:00:1c.2: AER: device recovery failed
  igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

The problem disappears when this patch is applied.

Link: https://lore.kernel.org/r/20240212120135.146068-1-stanislaw.gruszka@linux.intel.com
Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Cc: <stable@vger.kernel.org>
2024-03-07 17:58:07 -06:00
Christoph Hellwig 5e69a33c5c PCI/ERR: Recognize disconnected devices in report_error_detected()
When a device is already unplugged by pciehp by the time the AER handler is
invoked, the PCIe device will already be in the pci_channel_io_perm_failure
state.  In that case simply return PCI_ERS_RESULT_DISCONNECT instead of
trying to do a state transition that will fail.

Also untangle the state transition failure from the lack of methods to
improve the debugging output in case it happens again.

Link: https://lore.kernel.org/r/20220601074024.3481035-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
2022-06-08 15:08:40 -05:00
Bjorn Helgaas e0217c5ba1 Revert "PCI: Use to_pci_driver() instead of pci_dev->driver"
This reverts commit 2a4d9408c9.

Robert reported a NULL pointer dereference caused by the PCI core
(local_pci_probe()) calling the i2c_designware_pci driver's
.runtime_resume() method before the .probe() method.  i2c_dw_pci_resume()
depends on initialization done by i2c_dw_pci_probe().

Prior to 2a4d9408c9 ("PCI: Use to_pci_driver() instead of
pci_dev->driver"), pci_pm_runtime_resume() avoided calling the
.runtime_resume() method because pci_dev->driver had not been set yet.

2a4d9408c9 and b5f9c644eb ("PCI: Remove struct pci_dev->driver"),
removed pci_dev->driver, replacing it by device->driver, which *has* been
set by this time, so pci_pm_runtime_resume() called the .runtime_resume()
method when it previously had not.

Fixes: 2a4d9408c9 ("PCI: Use to_pci_driver() instead of pci_dev->driver")
Link: https://lore.kernel.org/linux-i2c/CAP145pgdrdiMAT7=-iB1DMgA7t_bMqTcJL4N0=6u8kNY3EU0dw@mail.gmail.com/
Reported-by: Robert Święcki <robert@swiecki.net>
Tested-by: Robert Święcki <robert@swiecki.net>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-11-11 13:36:22 -06:00
Uwe Kleine-König 2a4d9408c9 PCI: Use to_pci_driver() instead of pci_dev->driver
Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.

Replace pci_dev->driver with to_pci_driver().  This is a step toward
removing pci_dev->driver.

[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-10-18 09:20:15 -05:00
Bjorn Helgaas 171d149ce8 PCI/ERR: Factor out common dev->driver expressions
Save the struct pci_driver pointer from pdev->driver instead of repeating
it several times.  No functional change.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-10-12 17:37:15 -05:00
Keith Busch 387c72cdd7 PCI/ERR: Retain status from error notification
Overwriting the frozen detected status with the result of the link reset
loses the NEED_RESET result that drivers are depending on for error
handling to report the .slot_reset() callback. Retain this status so
that subsequent error handling has the correct flow.

Link: https://lore.kernel.org/r/20210104230300.1277180-4-kbusch@kernel.org
Reported-by: Hinko Kocevar <hinko.kocevar@ess.eu>
Tested-by: Hedi Berriche <hedi.berriche@hpe.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
Acked-by: Hedi Berriche <hedi.berriche@hpe.com>
2021-02-23 17:10:42 -06:00
Keith Busch 7d7cbeaba5 PCI/ERR: Clear status of the reporting device
Error handling operates on the first Downstream Port above the detected
error, but the error may have been reported by a downstream device.
Clear the AER status of the device that reported the error rather than
the first Downstream Port.

Link: https://lore.kernel.org/r/20210104230300.1277180-2-kbusch@kernel.org
Tested-by: Hedi Berriche <hedi.berriche@hpe.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Sean V Kelley <sean.v.kelley@intel.com>
Acked-by: Hedi Berriche <hedi.berriche@hpe.com>
2021-02-23 17:10:10 -06:00
Qiuxu Zhuo 5790862255 PCI/ERR: Recover from RCiEP AER errors
Add support for handling AER errors detected by Root Complex Integrated
Endpoints (RCiEPs).  These errors are signaled to software natively via a
Root Complex Event Collector (RCEC) or non-natively via ACPI APEI if the
platform retains control of AER or uses a non-standard RCEC-like device.

When recovering from RCiEP errors, the Root Error Command and Status
registers are in the AER Capability of an associated RCEC (if any), not in
a Root Port.  In the non-native case, the platform is responsible for those
registers and we can't touch them.

[bhelgaas: commit log, etc]
Co-developed-by: Sean V Kelley <sean.v.kelley@intel.com>
Link: https://lore.kernel.org/r/20201121001036.8560-13-sean.v.kelley@intel.com
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-12-05 15:26:02 -06:00
Sean V Kelley a175102b0a PCI/ERR: Recover from RCEC AER errors
A Root Complex Event Collector (RCEC) collects and signals AER errors that
were detected by Root Complex Integrated Endpoints (RCiEPs), but it may
also signal errors it detects itself.  This is analogous to errors detected
and signaled by a Root Port.

Update the AER service driver to claim RCECs in addition to Root Ports.
Add support for handling RCEC-detected AER errors.  This does not
include handling RCiEP-detected errors that are signaled by the RCEC.

Note that we expect these errors only from the native AER and APEI paths,
not from DPC or EDR.

[bhelgaas: split from combined RCEC/RCiEP patch, commit log]
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-12-05 15:25:58 -06:00
Sean V Kelley aa344bc8b7 PCI/ERR: Clear AER status only when we control AER
In some cases a bridge may not exist as the hardware controlling may be
handled only by firmware and so is not visible to the OS. This scenario is
also possible in future use cases involving non-native use of RCECs by
firmware. In this scenario, we expect the platform to retain control of the
bridge and to clear error status itself.

Clear error status only when the OS has native control of AER.

Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 05e9ae19ab PCI/ERR: Add pci_walk_bridge() to pcie_do_recovery()
Consolidate subordinate bus checks with pci_walk_bus() into
pci_walk_bridge() for walking below potentially AER affected bridges.

Link: https://lore.kernel.org/r/20201121001036.8560-10-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 3d7d8fc78f PCI/ERR: Avoid negated conditional for clarity
Reverse the sense of the Root Port/Downstream Port conditional for clarity.
No functional change intended.

Link: https://lore.kernel.org/r/20201121001036.8560-9-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 0791721d80 PCI/ERR: Use "bridge" for clarity in pcie_do_recovery()
pcie_do_recovery() may be called with "dev" being either a bridge (Root
Port or Switch Downstream Port) or an Endpoint.  The bulk of the function
deals with the bridge, so if we start with an Endpoint, we reset "dev" to
be the bridge leading to it.

For clarity, replace "dev" in the body of the function with "bridge".  No
functional change intended.

Link: https://lore.kernel.org/r/20201121001036.8560-8-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 480ef7cb9f PCI/ERR: Simplify by computing pci_pcie_type() once
Instead of calling pci_pcie_type(dev) twice, call it once and save the
result.  No functional change intended.

Link: https://lore.kernel.org/r/20201121001036.8560-7-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 5d69dcc9f8 PCI/ERR: Simplify by using pci_upstream_bridge()
Use pci_upstream_bridge() in place of dev->bus->self.  No functional change
intended.

Link: https://lore.kernel.org/r/20201121001036.8560-6-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2020-12-04 11:18:58 -06:00
Sean V Kelley 8f1bbfbc35 PCI/ERR: Rename reset_link() to reset_subordinates()
reset_link() appears to be misnamed.  The point is to reset any devices
below a given bridge, so rename it to reset_subordinates() to make it clear
that we are passing a bridge with the intent to reset the devices below it.

Link: https://lore.kernel.org/r/20201121001036.8560-5-sean.v.kelley@intel.com
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC
Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2020-12-04 11:18:58 -06:00
Jonathan Cameron 068c29a248 PCI/ERR: Clear PCIe Device Status errors only if OS owns AER
pcie_clear_device_status() resets the error bits in the PCIe Device Status
Register (PCI_EXP_DEVSTA).

Previously we did this unconditionally, but on ACPI systems, the _OSC AER
bit negotiates control of the AER capability.  Per sec 4.5.1 of the System
Firmware Intermediary _OSC and DPC Updates ECN [1], this bit also covers
other error enable/status bits including the following:

  Correctable Error Reporting Enable
  Non-Fatal Error Reporting Enable
  Fatal Error Reporting Enable
  Unsupported Request Reporting Enable

These bits are all in the PCIe Device Control register (the ECN omitted
"Reporting", but I think that's a typo), so by implication the _OSC AER bit
also applies to the error status bits in the PCIe Device Status register:

  Correctable Error Detected
  Non-Fatal Error Detected
  Fatal Error Detected
  Unsupported Request Detected

Clear the PCIe Device Status error bits only when the OS controls the AER
capability and related error enable/status bits.  If platform firmware
controls the AER capability, firmware is responsible for clearing these
bits.

One call path leading here is:

  ghes_do_proc
    ghes_handle_aer
      aer_recover_queue
        schedule_work(&aer_recover_work)
  ...
  aer_recover_work_func
    pcie_do_recovery
      pcie_clear_device_status

[1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24,
    2020, affecting PCI Firmware Specification, Rev. 3.2
    https://members.pcisig.com/wg/PCI-SIG/document/14076
[bhelgaas: commit log, move test from pcie_clear_device_status() to callers]
Link: https://lore.kernel.org/r/20200622113523.891666-1-Jonathan.Cameron@huawei.com
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-07-22 15:41:03 -05:00
Bjorn Helgaas 600a5b4fc8 PCI/ERR: Rename pci_aer_clear_device_status() to pcie_clear_device_status()
pci_aer_clear_device_status() clears the error bits in the PCIe Device
Status Register (PCI_EXP_DEVSTA).  Every PCIe device has this register,
regardless of whether it supports AER.

Rename pci_aer_clear_device_status() to pcie_clear_device_status() to make
clear that it is PCIe-specific but not AER-specific.  Move it to
drivers/pci/pci.c, again since it's not AER-specific.  No functional change
intended.

Link: https://lore.kernel.org/r/20200717195619.766662-1-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-07-22 15:38:35 -05:00
Luc Van Oostenryck 16d79cd4e2 PCI: Use 'pci_channel_state_t' instead of 'enum pci_channel_state'
The method struct pci_error_handlers.error_detected() is defined and
documented as taking an 'enum pci_channel_state' for the second argument,
but most drivers use 'pci_channel_state_t' instead.

This 'pci_channel_state_t' is not a typedef for the enum but a typedef for
a bitwise type in order to have better/stricter typechecking.

Consolidate everything by using 'pci_channel_state_t' in the method's
definition, in the related helpers and in the drivers.

Enforce use of 'pci_channel_state_t' by replacing 'enum pci_channel_state'
with an anonymous 'enum'.

Note: Currently, from a typechecking point of view this patch changes
nothing because only the constants defined by the enum are bitwise, not the
enum itself (sparse doesn't have the notion of 'bitwise enum'). This may
change in some not too far future, hence the patch.

[bhelgaas: squash in
  https://lore.kernel.org/r/20200702162651.49526-3-luc.vanoostenryck@gmail.com
  https://lore.kernel.org/r/20200702162651.49526-4-luc.vanoostenryck@gmail.com]
Link: https://lore.kernel.org/r/20200702162651.49526-2-luc.vanoostenryck@gmail.com
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-07-07 17:11:52 -05:00
Kuppuswamy Sathyanarayanan 894020fdd8 PCI/AER: Rationalize error status register clearing
The AER interfaces to clear error status registers were a confusing mess:

  - pci_cleanup_aer_uncorrect_error_status() cleared non-fatal errors
    from the Uncorrectable Error Status register.

  - pci_aer_clear_fatal_status() cleared fatal errors from the
    Uncorrectable Error Status register.

  - pci_cleanup_aer_error_status_regs() cleared the Root Error Status
    register (for Root Ports), the Uncorrectable Error Status register,
    and the Correctable Error Status register.

Rename them to make them consistent:

  From                                     To
  ---------------------------------------- -------------------------------
  pci_cleanup_aer_uncorrect_error_status() pci_aer_clear_nonfatal_status()
  pci_aer_clear_fatal_status()             pci_aer_clear_fatal_status()
  pci_cleanup_aer_error_status_regs()      pci_aer_clear_status()

Since pci_cleanup_aer_error_status_regs() (renamed to
pci_aer_clear_status()) is only used within drivers/pci/, move the
declaration from <linux/aer.h> to drivers/pci/pci.h.

[bhelgaas: commit log, add renames]
Link: https://lore.kernel.org/r/d1310a75dc3d28f7e8da4e99c45fbd3e60fe238e.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-03-28 13:19:05 -05:00
Kuppuswamy Sathyanarayanan e8e5ff2aee PCI/ERR: Return status of pcie_do_recovery()
As per the DPC Enhancements ECN [1], sec 4.5.1, table 4-4, if the OS
supports Error Disconnect Recover (EDR), it must invalidate the software
state associated with child devices of the port without attempting to
access the child device hardware. In addition, if the OS supports DPC, it
must attempt to recover the child devices if the port implements the DPC
Capability. If the OS continues operation, the OS must inform the firmware
of the status of the recovery operation via the _OST method.

Return the result of pcie_do_recovery() so we can report it to firmware via
_OST.

[1] Downstream Port Containment Related Enhancements ECN, Jan 28, 2019,
    affecting PCI Firmware Specification, Rev. 3.2
    https://members.pcisig.com/wg/PCI-SIG/document/12888

Link: https://lore.kernel.org/r/eb60ec89448769349c6722954ffbf2de163155b5.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-03-28 13:19:01 -05:00
Kuppuswamy Sathyanarayanan b6cf1a42f9 PCI/ERR: Remove service dependency in pcie_do_recovery()
Previously we passed the PCIe service type parameter to pcie_do_recovery(),
where reset_link() looked up the underlying pci_port_service_driver and its
.reset_link() function pointer. Instead of using this roundabout way, we
can just pass the driver-specific .reset_link() callback function when
calling pcie_do_recovery() function.

This allows us to call pcie_do_recovery() from code that is not a PCIe port
service driver, e.g., Error Disconnect Recover (EDR) support.

Remove pcie_port_find_service() and pcie_port_service_driver.reset_link
since they are now unused.

Link: https://lore.kernel.org/r/60e02b87b526cdf2930400059d98704bf0a147d1.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-03-28 13:18:54 -05:00
Kuppuswamy Sathyanarayanan 6d2c894415 PCI/ERR: Update error status after reset_link()
Commit bdb5ac8577 ("PCI/ERR: Handle fatal error recovery") uses
reset_link() to recover from fatal errors.  But during fatal error
recovery, if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using
reset_link()) pcie_do_recovery() will report the recovery result as
failure.  Update the status of error after reset_link().

You can reproduce this issue by triggering a SW DPC using "DPC Software
Trigger" bit in "DPC Control Register".  You should see recovery failed
dmesg log as below:

  pcieport 0000:00:16.0: DPC: containment event, status:0x1f27 source:0x0000
  pcieport 0000:00:16.0: DPC: software trigger detected
  pci 0000:04:00.0: AER: can't recover (no error_detected callback)
  pcieport 0000:00:16.0: AER: device recovery failed

Fixes: bdb5ac8577 ("PCI/ERR: Handle fatal error recovery")
Link: https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com
[bhelgaas: split pci_channel_io_frozen simplification to separate patch]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
2020-03-28 11:52:22 -05:00
Kuppuswamy Sathyanarayanan b5dfbeacf7 PCI/ERR: Combine pci_channel_io_frozen cases
pcie_do_recovery() had two "if (state == pci_channel_io_frozen)" cases
right after each other.  Combine them to make this easier to read.  No
functional change intended.

Link: https://lore.kernel.org/r/20200317170654.GA23125@infradead.org
[bhelgaas: split from https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-03-28 11:50:56 -05:00
Bjorn Helgaas 8d077c3ce0 PCI/AER: Factor message prefixes with dev_fmt()
Define dev_fmt() with the common prefix of log messages so we don't have to
repeat it in every printk.  No functional change intended.

Link: https://lore.kernel.org/r/20191213225709.GA213811@google.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-01-23 16:39:53 -06:00
Yicong Yang 01daacfb90 PCI/AER: Log which device prevents error recovery
PCI error recovery will fail if any device under the Root Port doesn't have
an error_detected callback.  Currently only the failure result is printed,
which is not enough to identify the driver that lacks the callback.

Log a message to identify the device with no error_detected callback.

[bhelgaas: tweak log message]
Link: https://lore.kernel.org/r/1576237474-32021-1-git-send-email-yangyicong@hisilicon.com
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-01-23 16:39:02 -06:00
Mika Westerberg ca78410403 PCI: Get rid of dev->has_secondary_link flag
In some systems, the Device/Port Type in the PCI Express Capabilities
register incorrectly identifies upstream ports as downstream ports.

d0751b98df ("PCI: Add dev->has_secondary_link to track downstream PCIe
links") addressed this by adding pci_dev.has_secondary_link, which is set
for downstream ports.  But this is confusing because pci_pcie_type()
sometimes gives the wrong answer, and it's not obvious that we should use
pci_dev.has_secondary_link instead.

Reduce the confusion by correcting the type of the port itself so that
pci_pcie_type() returns the actual type regardless of what the Device/Port
Type register claims it is.  Update the users to call pci_pcie_type() and
pcie_downstream_port() accordingly, and remove pci_dev.has_secondary_link
completely.

Link: https://lore.kernel.org/linux-pci/20190703133953.GK128603@google.com/
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20190822085553.62697-2-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2019-09-07 07:45:31 -05:00
YueHaibing 479e01a402 PCI/ERR: Remove duplicated include from err.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-10-02 16:04:40 -05:00
Keith Busch a6bd101b8f PCI: Unify device inaccessible
Bring surprise removals and permanent failures together so we no longer
need separate flags.  The implementation enforces that error handling will
not be able to override a surprise removal's permanent channel failure.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-10-02 16:04:40 -05:00
Keith Busch 7b42d97e99 PCI/ERR: Always report current recovery status for udev
A device still participates in error recovery even if it doesn't have
the error callbacks.

Always provide the status for user event watchers.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-10-02 16:04:40 -05:00
Keith Busch 542aeb9c8f PCI/ERR: Simplify broadcast callouts
There is no point in having a generic broadcast function if it needs to
have special cases for each callback it broadcasts.

Abstract the error broadcast to only the necessary information and removes
the now unnecessary helper to walk the bus.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-10-02 16:04:40 -05:00
Keith Busch bfcb79fca1 PCI/ERR: Run error recovery callbacks for all affected devices
If an Endpoint reported an error with ERR_FATAL, we previously ran driver
error recovery callbacks only for the Endpoint's driver.  But if we reset a
Link to recover from the error, all downstream components are affected,
including the Endpoint, any multi-function peers, and children of those
peers.

Initiate the Link reset from the deepest Downstream Port that is
reliable, and call the error recovery callbacks for all its children.

If a Downstream Port (including a Root Port) reports an error, we assume
the Port itself is reliable and we need to reset its downstream Link.  In
all other cases (Switch Upstream Ports, Endpoints, Bridges, etc), we assume
the Link leading to the component needs to be reset, so we initiate the
reset at the parent Downstream Port.

This allows two other clean-ups.  First, we currently only use a Link
reset, which can only be initiated using a Downstream Port, so we can
remove checks for Endpoints.  Second, the Downstream Port where we initiate
the Link reset is reliable (unlike components downstream from it), so the
special cases for error detect and resume are no longer necessary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-09-26 14:23:15 -05:00
Keith Busch bdb5ac8577 PCI/ERR: Handle fatal error recovery
We don't need to be paranoid about the topology changing while handling an
error.  If the device has changed in a hotplug capable slot, we can rely on
the presence detection handling to react to a changing topology.

Restore the fatal error handling behavior that existed before merging DPC
with AER with 7e9084b367 ("PCI/AER: Handle ERR_FATAL with removal and
re-enumeration of devices").

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-09-26 14:23:14 -05:00
Keith Busch c4eed62a21 PCI/ERR: Use slot reset if available
The secondary bus reset may have link side effects that a hotplug capable
port may incorrectly react to.  Use the slot specific reset for hotplug
ports, fixing the undesirable link down-up handling during error
recovering.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[bhelgaas: fold in
https://lore.kernel.org/linux-pci/20180926152326.14821-1-keith.busch@intel.com
for issue reported by Stephen Rothwell <sfr@canb.auug.org.au>]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
2018-09-21 12:18:10 -05:00
Lukas Wunner a50ac6bfd6 PCI: Simplify disconnected marking
Commit 89ee9f7680 ("PCI: Add device disconnected state") iterates over
the devices on a parent bus, marks each as disconnected, then marks
each device's children as disconnected using pci_walk_bus().

The same can be achieved more succinctly by calling pci_walk_bus() on
the parent bus.  Moreover, this does not need to wait until acquiring
pci_lock_rescan_remove(), so move it out of that critical section.

The critical section in err.c contains a pci_dev_get() / pci_dev_put()
pair which was apparently copy-pasted from pciehp_pci.c.  In the latter
it serves the purpose of holding the struct pci_dev in place until the
Command register is updated.  err.c doesn't do anything like that, hence
the pair is unnecessary.  Remove it.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Oza Pawandeep <poza@codeaurora.org>
Cc: Sinan Kaya <okaya@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2018-09-17 16:34:35 -05:00
Bjorn Helgaas 3a48dc6fc2 Merge branch 'pci/virtualization'
- To avoid bus errors, enable PASID only if entire path supports End-End
    TLP prefixes (Sinan Kaya)

  - Unify slot and bus reset functions and remove hotplug knowledge from
    callers (Sinan Kaya)

  - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
    fix guest reboot issues (Alex Williamson)

  - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD Controller
    (Bjorn Helgaas)

* pci/virtualization:
  PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
  PCI: Delay after FLR of Intel DC P3700 NVMe
  PCI: Disable Samsung SM961/PM961 NVMe before FLR
  PCI: Export pcie_has_flr()
  PCI: Rename pci_try_reset_bus() to pci_reset_bus()
  PCI: Deprecate pci_reset_bus() and pci_reset_slot() functions
  PCI: Unify try slot and bus reset API
  PCI: Hide pci_reset_bridge_secondary_bus() from drivers
  IB/hfi1: Use pci_try_reset_bus() for initiating PCI Secondary Bus Reset
  PCI: Handle error return from pci_reset_bridge_secondary_bus()
  PCI/IOV: Tidy pci_sriov_set_totalvfs()
  PCI: Enable PASID only if entire path supports End-End TLP prefixes

# Conflicts:
#	drivers/pci/hotplug/pciehp_hpc.c
2018-08-15 14:59:06 -05:00
Bjorn Helgaas 3c3ab37f4c Merge branch 'pci/aer'
- Decode AER errors with names similar to "lspci" (Tyler Baicar)

  - Expose AER statistics in sysfs (Rajat Jain)

  - Clear AER status bits selectively based on the type of recovery (Oza
    Pawandeep)

  - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
    Gagniuc)

  - Don't clear AER status bits if we're using the "Firmware-First"
    strategy where firmware owns the registers (Alexandru Gagniuc)

* pci/aer:
  PCI/AER: Don't clear AER bits if error handling is Firmware-First
  PCI/AER: Remove duplicate PCI_EXP_AER_FLAGS definition
  PCI/portdrv: Remove pcie_portdrv_err_handler.slot_reset
  PCI/AER: Clear device status bits during ERR_COR handling
  PCI/AER: Clear device status bits during ERR_FATAL and ERR_NONFATAL
  PCI/AER: Remove ERR_FATAL code from ERR_NONFATAL path
  PCI/AER: Factor out ERR_NONFATAL status bit clearing
  PCI/AER: Clear only ERR_NONFATAL bits during non-fatal recovery
  PCI/AER: Clear only ERR_FATAL status bits during fatal recovery
  PCI/AER: Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST
  PCI/AER: Add sysfs attributes for rootport cumulative stats
  PCI/AER: Add sysfs attributes to provide AER stats and breakdown
  PCI/AER: Define aer_stats structure for AER capable devices
  PCI/AER: Move internal declarations to drivers/pci/pci.h
  PCI/AER: Adopt lspci names for AER error decoding
  PCI/AER: Expose internal API for obtaining AER information

# Conflicts:
#	drivers/pci/pci.h
2018-08-15 14:58:45 -05:00
Thomas Tai bd91b56cb3 PCI/AER: Work around use-after-free in pcie_do_fatal_recovery()
When an fatal error is received by a non-bridge device, the device is
removed, and pci_stop_and_remove_bus_device() deallocates the device
structure.  The freed device structure is used by subsequent code to send
uevents and print messages.

Hold a reference on the device until we're finished using it.  This is not
an ideal fix because pcie_do_fatal_recovery() should not use the device at
all after removing it, but that's too big a project for right now.

Fixes: 7e9084b367 ("PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices")
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
[bhelgaas: changelog, reduce get/put coverage]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-26 12:13:04 -05:00
Oza Pawandeep ec752f5d54 PCI/AER: Clear device status bits during ERR_FATAL and ERR_NONFATAL
Clear the device status bits while handling both ERR_FATAL and ERR_NONFATAL
cases.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: rename to pci_aer_clear_device_status(), declare internal to PCI
core instead of exposing it everywhere]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-20 15:27:11 -05:00
Oza Pawandeep 43ec03a9e5 PCI/AER: Remove ERR_FATAL code from ERR_NONFATAL path
broadcast_error_message() is only used for ERR_NONFATAL events, when the
state is always pci_channel_io_normal, so remove the unused alternate path.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-20 15:27:10 -05:00
Bjorn Helgaas 7ab92e89bf PCI/AER: Clear only ERR_FATAL status bits during fatal recovery
During recovery from fatal errors, we previously called
pci_cleanup_aer_uncorrect_error_status(), which cleared *all* uncorrectable
error status bits (both ERR_FATAL and ERR_NONFATAL).

Instead, call a new pci_aer_clear_fatal_status() that clears only the
ERR_FATAL bits (as indicated by the PCI_ERR_UNCOR_SEVER register).

Based-on-patch-by: Oza Pawandeep <poza@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-20 15:27:07 -05:00
Sinan Kaya 381634cad1 PCI: Hide pci_reset_bridge_secondary_bus() from drivers
Rename pci_reset_bridge_secondary_bus() to pci_bridge_secondary_bus_reset()
and move the declaration from linux/pci.h to drivers/pci.h to be used
internally in PCI directory only.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-19 18:04:23 -05:00
Sinan Kaya 1842623850 PCI: Handle error return from pci_reset_bridge_secondary_bus()
Commit 01fd61c0b9 ("PCI: Add a return type for
pci_reset_bridge_secondary_bus()") added a return value to the function to
return if a device is accessible following a reset.  Callers are not
checking the value.

Pass error code up high in the stack if device is not accessible.

Fixes: 01fd61c0b9 ("PCI: Add a return type for pci_reset_bridge_secondary_bus()")
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-19 18:04:23 -05:00
Oza Pawandeep 0b91439d35 PCI/AER: Pass service type to pcie_do_fatal_recovery()
Pass the service type to pcie_do_fatal_recovery() instead of assuming AER.
We will make DPC also use pcie_do_fatal_recovery(), and it needs to do
things a little differently for AER and DPC.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: split to separate patch]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-06-02 19:29:26 -05:00
Oza Pawandeep f252d0621a PCI/portdrv: Add generic pcie_port_find_service()
Add generic pcie_port_find_service() routine.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2018-05-17 16:49:30 -05:00
Oza Pawandeep 2e28bc84cf PCI/AER: Factor out error reporting to drivers/pci/pcie/err.c
Move the error reporting callbacks from aerdrv_core.c to err.c, where they
can be used by DPC in addition to AER.

As part of aerdrv_core.c, these callbacks were built under CONFIG_PCIEAER.
Moving them to the new err.c means they will now be built under
CONFIG_PCIEPORTBUS, so adjust the definition of pci_uevent_ers() to match.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
[bhelgaas: in reset_link(), initialize "driver" even if CONFIG_PCIEAER is
unset, update pci_uevent_ers() #ifdef wrapper]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-05-17 16:48:23 -05:00