PCI/ERR: Handle fatal error recovery

We don't need to be paranoid about the topology changing while handling an
error.  If the device has changed in a hotplug capable slot, we can rely on
the presence detection handling to react to a changing topology.

Restore the fatal error handling behavior that existed before DPC was
merged with AER in commit 7e9084b367 ("PCI/AER: Handle ERR_FATAL with
removal and re-enumeration of devices").

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
Author:    Keith Busch, 2018-09-20 10:27:12 -06:00
Committer: Bjorn Helgaas
Commit:    bdb5ac8577
Parent:    c4eed62a21
5 changed files with 28 additions and 102 deletions

--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt

@@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event: ERR_NONFATAL
+STEP 0: Error Event
 -------------------
 A PCI bus error is detected by the PCI hardware. On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Slot Reset
+STEP 3: Link Reset
+------------------
+The platform resets the link. This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+STEP 4: Slot Reset
 ------------------
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -314,7 +320,7 @@ Failure).
 
 >>> However, it probably should.
 
-STEP 4: Resume Operations
+STEP 5: Resume Operations
 -------------------------
 The platform will call the resume() callback on all affected device
 drivers if all drivers on the segment have returned
@@ -326,7 +332,7 @@ a result code.
 At this point, if a new error happens, the platform will restart
 a new error recovery sequence.
 
-STEP 5: Permanent Failure
+STEP 6: Permanent Failure
 -------------------------
 A "permanent failure" has occurred, and the platform cannot recover
 the device. The platform will call error_detected() with a
@@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
 for additional detail on real-life experience of the causes of
 software errors.
 
-STEP 0: Error Event: ERR_FATAL
--------------------
-PCI bus error is detected by the PCI hardware. On powerpc, the slot is
-isolated, in that all I/O is blocked: all reads return 0xffffffff, all
-writes are ignored.
-
-STEP 1: Remove devices
---------------------
-Platform removes the devices depending on the error agent, it could be
-this port for all subordinates or upstream component (likely downstream
-port)
-
-STEP 2: Reset link
---------------------
-The platform resets the link. This is a PCI-Express specific step and is
-done whenever a fatal error has been detected that can be "solved" by
-resetting the link.
-
-STEP 3: Re-enumerate the devices
---------------------
-Initiates the re-enumeration.
-
 Conclusion; General Remarks
 ---------------------------
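
For orientation, the sequence above is driven through a driver's struct
pci_error_handlers callbacks. A minimal sketch of the driver side, assuming a
hypothetical "foo" driver (the names and bodies are illustrative only and are
not part of this patch):

#include <linux/pci.h>

/* Hypothetical driver callbacks illustrating the documented steps. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   enum pci_channel_state state)
{
	/* STEP 1: stop I/O; the slot may be isolated (reads return ~0) */
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	return PCI_ERS_RESULT_NEED_RESET;	/* request STEP 4 (Slot Reset) */
}

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
	/* STEP 4: re-initialize the device after the link/slot reset */
	pci_restore_state(pdev);
	return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
	/* STEP 5: restart normal I/O */
}

static const struct pci_error_handlers foo_err_handlers = {
	.error_detected	= foo_error_detected,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};

A driver hooks this structure up through the err_handler field of its
struct pci_driver.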

--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h

@@ -433,8 +433,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
 #endif
 
 /* PCI error reporting and recovery */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
-void pcie_do_nonfatal_recovery(struct pci_dev *dev);
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service);
 
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 #ifdef CONFIG_PCIEASPM
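
For reference, the new state argument carries the existing enum
pci_channel_state values; an abridged sketch is shown below (the exact
definition lives in include/linux/pci.h, so treat the comments here as
approximate):

enum pci_channel_state {
	pci_channel_io_normal = 1,	/* I/O channel is in normal state */
	pci_channel_io_frozen,		/* I/O to channel is blocked */
	pci_channel_io_perm_failure,	/* PCI card is dead */
};

Non-fatal errors map to pci_channel_io_normal and fatal errors to
pci_channel_io_frozen, as the AER and DPC call sites below show.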

--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c

@@ -1010,9 +1010,11 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 					info->status);
 		pci_aer_clear_device_status(dev);
 	} else if (info->severity == AER_NONFATAL)
-		pcie_do_nonfatal_recovery(dev);
+		pcie_do_recovery(dev, pci_channel_io_normal,
+				 PCIE_PORT_SERVICE_AER);
 	else if (info->severity == AER_FATAL)
-		pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
+		pcie_do_recovery(dev, pci_channel_io_frozen,
+				 PCIE_PORT_SERVICE_AER);
 	pci_dev_put(dev);
 }
@@ -1048,9 +1050,11 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity == AER_NONFATAL)
-			pcie_do_nonfatal_recovery(pdev);
+			pcie_do_recovery(pdev, pci_channel_io_normal,
+					 PCIE_PORT_SERVICE_AER);
 		else if (entry.severity == AER_FATAL)
-			pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_AER);
+			pcie_do_recovery(pdev, pci_channel_io_frozen,
+					 PCIE_PORT_SERVICE_AER);
 		pci_dev_put(pdev);
 	}
 }

--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c

@@ -216,7 +216,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 	reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
 	ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
 
-	dev_warn(dev, "DPC %s detected, remove downstream devices\n",
+	dev_warn(dev, "DPC %s detected\n",
 		 (reason == 0) ? "unmasked uncorrectable error" :
 		 (reason == 1) ? "ERR_NONFATAL" :
 		 (reason == 2) ? "ERR_FATAL" :
@@ -233,7 +233,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);
+	pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
 
 	return IRQ_HANDLED;
 }
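
Regarding the "only triggers on ERR_FATAL" comment above: that policy comes
from how the DPC control register is programmed at probe time. A rough sketch
using the standard config-space accessors (the surrounding probe context and
the variables pdev, cap and ctl are assumptions, not part of this patch):

u16 ctl;

/* 'cap' is the port's DPC extended capability offset */
pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
ctl &= ~0x3;				/* clear the DPC trigger enable field */
ctl |= PCI_EXP_DPC_CTL_EN_FATAL;	/* trigger on ERR_FATAL only */
ctl |= PCI_EXP_DPC_CTL_INT_EN;		/* raise the DPC interrupt */
pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL, ctl);

With the trigger limited to ERR_FATAL, passing pci_channel_io_frozen to
pcie_do_recovery() in the handler is always the appropriate channel state.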

--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c

@@ -271,83 +271,20 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	return result_data.result;
 }
 
-/**
- * pcie_do_fatal_recovery - handle fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is fatal. Once being invoked, removes the devices
- * beneath this AER agent, followed by reset link e.g. secondary bus reset
- * followed by re-enumeration of devices.
- */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
-{
-	struct pci_dev *udev;
-	struct pci_bus *parent;
-	struct pci_dev *pdev, *temp;
-	pci_ers_result_t result;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
-		udev = dev;
-	else
-		udev = dev->bus->self;
-
-	parent = udev->subordinate;
-	pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
-
-	pci_lock_rescan_remove();
-	pci_dev_get(dev);
-	list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
-					 bus_list) {
-		pci_stop_and_remove_bus_device(pdev);
-	}
-
-	result = reset_link(udev, service);
-
-	if ((service == PCIE_PORT_SERVICE_AER) &&
-	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		pci_aer_clear_fatal_status(dev);
-		pci_aer_clear_device_status(dev);
-	}
-
-	if (result == PCI_ERS_RESULT_RECOVERED) {
-		if (pcie_wait_for_link(udev, true))
-			pci_rescan_bus(udev->bus);
-		pci_info(dev, "Device recovery from fatal error successful\n");
-	} else {
-		pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-		pci_info(dev, "Device recovery from fatal error failed\n");
-	}
-
-	pci_dev_put(dev);
-	pci_unlock_rescan_remove();
-}
-
-/**
- * pcie_do_nonfatal_recovery - handle nonfatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-void pcie_do_nonfatal_recovery(struct pci_dev *dev)
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service)
 {
 	pci_ers_result_t status;
-	enum pci_channel_state state;
-
-	state = pci_channel_io_normal;
 
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
 			report_error_detected);
 
+	if (state == pci_channel_io_frozen &&
+	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
 	if (status == PCI_ERS_RESULT_CAN_RECOVER)
 		status = broadcast_error_message(dev,
 			state,