                      PCI Bus EEH Error Recovery
                      --------------------------
                           Linas Vepstas
                       <linas@austin.ibm.com>
                          12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of both user data and kernel data),
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching such accesses is a
powerful feature, as it prevents what would otherwise have been silent
memory corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device were physically unplugged from the slot.
Isolation applies to accesses to PCI memory, I/O space, and PCI
config space.  Interrupts, however, will continue to be delivered.

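The all-ff's convention for isolated slots can be illustrated with a
small helper.  This is only a sketch: read_is_all_ones() is a
hypothetical name, not a kernel function, and it merely tests whether a
value read with a given access width has every bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): true if a value read with the
 * given access width (1, 2 or 4 bytes) has every bit set, which is what
 * an isolated slot returns for 8/16/32-bit reads. */
bool read_is_all_ones(uint32_t value, unsigned int width_bytes)
{
    assert(width_bytes == 1 || width_bytes == 2 || width_bytes == 4);

    /* Build a mask covering exactly the bits of this access width. */
    uint32_t mask = (width_bytes == 4) ? 0xffffffffu
                                       : ((1u << (8 * width_bytes)) - 1u);
    return (value & mask) == mask;
}
```

An all-ones result is only a hint.  A healthy device may legitimately
return the same value, so a positive result still has to be confirmed
by asking the firmware.
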
Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.


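The generic recovery policy described in this section (reset, restore
the config space, reinitialize the driver, and give up after a few
attempts) might be modelled roughly as follows.  Every name here
(slot_ops, recover_slot(), the fake_* helpers) is made up for
illustration, and the simulated card stands in for real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slot operations; none of these names are kernel APIs. */
struct slot_ops {
    void (*reset)(void *ctx);          /* pulse the slot's reset line */
    void (*restore_config)(void *ctx); /* BARs, latency timer, and so on */
    bool (*reinit_driver)(void *ctx);  /* true if the device came back */
};

enum recovery_result { RECOVERED, CARD_DEAD };

/* Try reset + reconfigure + driver reinit up to max_resets times. */
enum recovery_result recover_slot(const struct slot_ops *ops, void *ctx,
                                  int max_resets, int *resets_used)
{
    assert(ops != NULL && max_resets > 0);
    for (int attempt = 1; attempt <= max_resets; attempt++) {
        ops->reset(ctx);
        ops->restore_config(ctx);
        if (resets_used != NULL)
            *resets_used = attempt;
        if (ops->reinit_driver(ctx))
            return RECOVERED;
    }
    return CARD_DEAD; /* caller should report this to the sysadmin */
}

/* Simulated card for illustration: works again after heals_after resets. */
struct fake_card { int resets; int heals_after; };

void fake_reset(void *ctx)   { ((struct fake_card *)ctx)->resets++; }
void fake_restore(void *ctx) { (void)ctx; }
bool fake_reinit(void *ctx)
{
    const struct fake_card *card = ctx;
    return card->resets >= card->heals_after;
}

const struct slot_ops fake_ops = { fake_reset, fake_restore, fake_reinit };
```

With max_resets set to 3 or 4, a card that never comes back yields
CARD_DEAD, which corresponds to the point where the dead card has to be
replaced using the PCI hotplug tools.

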
Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the hotplug/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged. The former is performed
by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).

The default include/asm-ppc64/io.h macros readb(), inb(), insb(),
etc. include a check to see if the i/o read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.

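The check-on-read idea can be sketched in user-space C.  This is not
the io.h code: checked_read8(), the callback types, the eeh_stats
counters and the fake_* device are all assumptions made for this
example.  It only shows the shape of the logic, where the expensive
firmware query happens only when a read returns all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a raw MMIO read and the firmware query. */
typedef uint8_t (*raw_read8_fn)(const void *addr);
typedef bool (*slot_frozen_fn)(const void *addr);

struct eeh_stats {
    unsigned long false_positives; /* all-ones reads that were legitimate */
    unsigned long slot_freezes;    /* all-ones reads confirmed by firmware */
};

/* Read one byte; only an all-ones result triggers the firmware query. */
uint8_t checked_read8(const void *addr, raw_read8_fn raw_read,
                      slot_frozen_fn slot_frozen, struct eeh_stats *stats)
{
    assert(raw_read != 0 && slot_frozen != 0 && stats != 0);

    uint8_t val = raw_read(addr);
    if (val == 0xff) {
        if (slot_frozen(addr))
            stats->slot_freezes++;
        else
            stats->false_positives++;
    }
    return val;
}

/* Simulated device for illustration: one register plus a frozen flag. */
struct fake_dev { uint8_t reg; bool frozen; };

uint8_t fake_read8(const void *addr)
{
    const struct fake_dev *dev = addr;
    return dev->frozen ? 0xff : dev->reg;
}

bool fake_frozen(const void *addr)
{
    return ((const struct fake_dev *)addr)->frozen;
}
```

A register that genuinely holds 0xff increments false_positives, which
is the kind of count reported in /proc/ppc64/eeh.
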
If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will
print a stack trace to syslog (/var/log/messages).  This stack trace
has proven to be very useful to device-driver authors for finding
out at what point the EEH error was detected, as the error itself
usually occurs slightly beforehand.

Next, the EEH code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the
failure.  Device drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events.  The event will include a pointer to the pci device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.

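For readers unfamiliar with notifier chains, the following is a
simplified user-space model of the pattern, not the kernel's
notifier_block API (which also has priorities and return codes).  The
names eeh_event_model, notifier_model, register_notifier_model() and
notify_all() are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event payload; the real event carries a pci device
 * pointer, a device node and some state info. */
struct eeh_event_model {
    const char *slot_name;
    int state;
};

struct notifier_model {
    void (*call)(struct notifier_model *self,
                 const struct eeh_event_model *ev);
    struct notifier_model *next;
    int seen; /* number of events delivered to this receiver */
};

struct notifier_model *chain_head = NULL;

/* Add a receiver to the front of the chain. */
void register_notifier_model(struct notifier_model *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = chain_head;
    chain_head = nb;
}

/* Deliver one event to every receiver; returns how many were called. */
int notify_all(const struct eeh_event_model *ev)
{
    int called = 0;
    for (struct notifier_model *nb = chain_head; nb != NULL; nb = nb->next) {
        nb->call(nb, ev);
        called++;
    }
    return called;
}

/* Example receiver: just counts the events it sees. */
void count_event(struct notifier_model *self,
                 const struct eeh_event_model *ev)
{
    (void)ev;
    self->seen++;
}
```

Each receiver decides for itself what to do with the event, which is
the "do as they wish" property mentioned above.
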
To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes hotplug events to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).


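The order of operations in the default handler can be written down as
a small step table.  This is an illustrative model only: the enum, the
step names, run_recovery_steps() and the logging hook are hypothetical,
and stopping at the first failing step is an assumption of this sketch
rather than something the text above states.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical step identifiers, in the order described above. */
enum step {
    SAVE_BARS, UNCONFIG_ADAPTER, WAIT_USERSPACE, RESET_SLOT,
    RESTORE_BARS, CONFIGURE_BRIDGES, ENABLE_SLOT, NUM_STEPS
};

const char *const step_names[NUM_STEPS] = {
    "save_bars", "unconfig_adapter", "wait_userspace", "reset_slot",
    "restore_bars", "configure_bridges", "enable_slot"
};

/* Run every step in order through a caller-supplied hook.  Stops at the
 * first hook that returns nonzero (an assumption of this sketch) and
 * returns the number of steps completed. */
int run_recovery_steps(int (*do_step)(enum step s, void *ctx), void *ctx)
{
    assert(do_step != NULL);
    int done = 0;
    for (int s = 0; s < NUM_STEPS; s++) {
        if (do_step((enum step)s, ctx) != 0)
            break;
        done++;
    }
    return done;
}

/* Example hook: append the step name to a text log instead of touching
 * any hardware. */
struct step_log { char text[256]; };

int log_step(enum step s, void *ctx)
{
    struct step_log *sl = ctx;
    if (strlen(sl->text) + strlen(step_names[s]) + 2 > sizeof(sl->text))
        return -1;
    if (sl->text[0] != '\0')
        strcat(sl->text, " ");
    strcat(sl->text, step_names[s]);
    return 0;
}
```

In the real handler the wait_userspace step is the 5 second sleep, and
the first and last two steps correspond to the eeh.c and rpaphp calls
named in the text.

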
Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that causes a device
driver's close function to be called during the first phase of an
EEH reset, using the pcnet32 device driver as the example.

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in drivers/pci/pci-driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in net/core/dev.c
                      {
                        calls
                        dev_close()  // in net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
                  }
                }
              }
            }
          }
        }
      }
    }


    To summarize: in drivers/pci/pci-driver.c,
    struct device_driver->remove() is just pci_device_remove()
    which calls struct pci_driver->remove() which is pcnet32_remove_one()
    which calls unregister_netdev()  (in net/core/dev.c)
    which calls dev_close()  (in net/core/dev.c)
    which calls dev->stop() which is pcnet32_close()
    which then does the appropriate shutdown.

---
Following is the analogous stack trace for events sent to user-space
when the pci device is unconfigured.

    rpaphp_unconfig_pci_adapter() {             // in rpaphp_pci.c
      calls
      pci_remove_bus_device (struct pci_dev *) { // in drivers/pci/remove.c
        calls
        pci_destroy_dev (struct pci_dev *) {
          calls
          device_unregister (&dev->dev) {      // in drivers/base/core.c
            calls
            device_del(struct device * dev) {  // in drivers/base/core.c
              calls
              kobject_del() {                  // in lib/kobject.c
                calls
                kobject_hotplug() {            // in lib/kobject.c
                  calls
                  kset_hotplug() {             // in lib/kobject.c
                    calls
                    kset->hotplug_ops->hotplug() which is really just
                    a call to
                    dev_hotplug() {           // in drivers/base/core.c
                      calls
                      dev->bus->hotplug() which is really just a call to
                      pci_hotplug () {      // in drivers/pci/hotplug.c
                        which prints device name, etc....
                      }
                    }
                    then kset_hotplug() calls
                    call_usermodehelper () with
                       argv[0]=hotplug_path[] which is "/sbin/hotplug"
                    --> event to userspace,
                  }
                }
                kobject_del() then calls sysfs_remove_dir(), which would
                trigger any user-space daemon that was watching sysfs
                to notice the delete event.
              }
            }
          }
        }
      }
    }


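On the user-space side, the scripts started by /sbin/hotplug decide
what to run for each event.  The sketch below assumes that a network
agent is told the event type and interface name (conventionally passed
in environment variables such as ACTION and INTERFACE; this is an
assumption, not something this document specifies) and maps them to the
ifdown/ifup commands mentioned earlier.  net_agent_command() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the command a network hotplug agent might run for one event.
 * Returns 0 on success, or -1 if the event is not handled or the
 * output buffer is too small. */
int net_agent_command(const char *action, const char *iface,
                      char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);

    const char *verb;
    if (action == NULL || iface == NULL || iface[0] == '\0')
        return -1;
    if (strcmp(action, "add") == 0)
        verb = "ifup";      /* device (re)appeared */
    else if (strcmp(action, "remove") == 0)
        verb = "ifdown";    /* device is going away */
    else
        return -1;

    int n = snprintf(out, out_len, "%s %s", verb, iface);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

A real agent would obtain the two strings with getenv() and then run
the resulting command.  During an EEH reset it would see a remove event
followed, a few seconds later, by an add event for the same interface.

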
Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the pci
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


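The idea of adding an EEH reset to the SCSI chain of escalating resets
can be pictured with a small model.  Nothing here is SCSI subsystem
code: reset_level, escalate_resets() and the simulated fault are
hypothetical, and placing the EEH slot reset last, after the HBA reset,
is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One level of an escalating reset chain. */
struct reset_level {
    const char *name;
    bool (*try_reset)(void *ctx); /* true if the fault was cleared */
};

/* Walk the chain from the mildest to the most drastic level.  Returns
 * the name of the first level that succeeded, or NULL if all failed. */
const char *escalate_resets(const struct reset_level *levels, size_t n,
                            void *ctx)
{
    assert(levels != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (levels[i].try_reset(ctx))
            return levels[i].name;
    }
    return NULL;
}

/* Simulated fault: only a reset at level `needed` or above clears it. */
struct fake_fault { int needed; int attempts; };

bool fake_try(void *ctx, int level)
{
    struct fake_fault *fault = ctx;
    fault->attempts++;
    return level >= fault->needed;
}

bool try_device(void *ctx) { return fake_try(ctx, 0); }
bool try_bus(void *ctx)    { return fake_try(ctx, 1); }
bool try_hba(void *ctx)    { return fake_try(ctx, 2); }
bool try_eeh(void *ctx)    { return fake_try(ctx, 3); }

const struct reset_level example_chain[] = {
    { "device reset",   try_device },
    { "bus reset",      try_bus },
    { "HBA reset",      try_hba },
    { "EEH slot reset", try_eeh }, /* the proposed extra level */
};
```

Because the whole chain runs below the block layer, a fault that only
the last level can clear would still be invisible to mounted file
systems, apart from the delay.

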
Conclusions
-----------
There's forward progress ...