Managing Disk Failure and RAID Recovery

Understanding Disk Failure and RAID Recovery

RAIDs 1, 4, and 5 can survive a disk failure. A RAID 1 device survives as long as at least one of its mirrored disks is still working. Its read performance is degraded because fewer data sources are available, but its write performance might actually improve because it no longer writes to the failed mirrors. While the replacement disk is synchronizing, both read and write performance are degraded. A RAID 5 can survive a single disk failure at a time. A RAID 4 can survive a single disk failure at a time if the disk is not the parity disk.

Disks can fail for many reasons such as the following:

  • Disk crash

  • Disk pulled from the system

  • Drive cable removed or loose

  • I/O errors

When a disk fails, the RAID removes the failed disk from membership in the RAID, and operates in a degraded mode until the failed disk is replaced by a spare. Degraded mode is resolved for a single disk failure in one of the following ways:

  • Spare Exists: If the RAID has been assigned a spare disk, the MD driver automatically activates the spare disk as a member of the RAID, then the RAID begins synchronizing (RAID 1) or reconstructing (RAID 4 or 5) the missing data.

  • No Spare Exists: If the RAID does not have a spare disk, the RAID operates in degraded mode until you configure and add a spare. When you add the spare, the MD driver detects the RAID’s degraded mode, automatically activates the spare as a member of the RAID, then begins synchronizing (RAID 1) or reconstructing (RAID 4 or 5) the missing data.
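Degraded mode can also be spotted from the kernel's status file. The following is a minimal sketch, assuming the usual /proc/mdstat layout in which each array's member map shows "U" for a working slot and "_" for a missing one (for example, "[_U]"). The function name md_degraded_arrays is hypothetical and is not part of mdadm or EVMS.

```shell
#!/bin/sh
# List md arrays whose member map (for example "[_U]") shows a missing slot.
# Reads /proc/mdstat-style text on standard input.
md_degraded_arrays() {
    awk '
        # Remember the array named on a line such as "md1 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A member map with at least one "_" means a slot is missing
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    '
}
```

On a running system, it would be called as md_degraded_arrays < /proc/mdstat. An array that prints here is running without full redundancy until a spare has been added and has finished synchronizing or reconstructing.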

Identifying the Failed Drive

On failure, md automatically removes the failed drive as a component device in the RAID array. To determine which device is a problem, use mdadm and look for the device that has been reported as “removed”.

  • Enter the following at a terminal console prompt:

    mdadm -D /dev/md1
    

    Replace /dev/md1 with the actual path for your RAID.

For example, an mdadm report for a RAID 1 device consisting of /dev/sda2 and /dev/sdb2 might look like this:

blue6:~ # mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jul  2 01:14:07 2006
     Raid Level : raid1
     Array Size : 180201024 (171.85 GiB 184.53 GB)
    Device Size : 180201024 (171.85 GiB 184.53 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Tue Aug 15 18:31:09 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 8a9f3d46:3ec09d23:86e1ffbc:ee2d0dd8
         Events : 0.174164
    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

The “Total Devices : 1”, “Active Devices : 1”, and “Working Devices : 1” indicate that only one of the two devices is currently active. The RAID is operating in a “degraded” state.

The “Failed Devices : 0” might be confusing. This value is non-zero only for the brief period when the md driver has found a problem on the drive and is preparing to remove it from the RAID. After the failed drive is removed, the value reads “0” again.

In the devices list at the end of the report, the “removed” state for Device 0 indicates that the device has been removed from the software RAID definition, not that the device has been physically removed from the system. The report does not specifically identify the failed device, but it does list the working device or devices. If you have a record of which devices were members of the RAID, you can identify the failed device by the process of elimination. In this example, the failed device is /dev/sda2.
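The process of elimination can be scripted. The following is a minimal sketch, assuming the report layout shown in the example above, where each working member appears on a row that contains "active" and ends with its device path. The function name md_missing_members is hypothetical, and you must supply the list of original members yourself because the report does not name the removed device.

```shell
#!/bin/sh
# Print each expected member that an "mdadm -D" report (read from standard
# input) does not list as an active device.
# Arguments: the devices you recorded as members of the RAID.
md_missing_members() {
    # Collect device paths from table rows whose state contains "active".
    active=$(awk '/active/ && $NF ~ /^\/dev\// { print $NF }')
    for dev in "$@"; do
        case " $(echo $active) " in
            *" $dev "*) ;;          # still listed as an active member
            *) echo "$dev" ;;       # not listed: candidate failed device
        esac
    done
}
```

For the example array, the call would be mdadm -D /dev/md1 | md_missing_members /dev/sda2 /dev/sdb2. The result is only as reliable as your record of the original members.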

The “Spare Devices : 0” indicates that you do not have a spare assigned to the RAID. You must assign a spare device to the RAID so that it can be automatically added to the array and replace the failed device.

Replacing a Failed Device with a Spare

When a component device fails, the md driver replaces the failed device with a spare device assigned to the RAID. You can either keep a spare device assigned to the RAID as a hot standby to use as an automatic replacement, or assign a spare device to the RAID as needed.

Important

Even if you correct the problem that caused the disk to fail, the RAID does not automatically accept the disk back into the array because it is a “faulty object” in the RAID and is no longer synchronized with the RAID.

If a spare is available, md automatically removes the failed disk, replaces it with the spare disk, then begins to synchronize the data (for RAID 1) or reconstruct the data from parity (for RAIDs 4 or 5).

If a spare is not available, the RAID operates in degraded mode until you assign a spare device to the RAID.

To assign a spare device to the RAID:

  1. Prepare the disk as needed to match the other members of the RAID.

  2. In EVMS, select Actions > Add > Spare Disk to a Region (the addspare plug-in for the EVMS GUI).

  3. Select the RAID device you want to manage from the list of Regions, then click Next.

  4. Select the device to use as the spare disk.

  5. Click Add.

    The md driver automatically begins the replacement and reconstruction or synchronization process.

  6. Monitor the status of the RAID to verify the process has begun.

    For information about how to monitor RAID status, see Section 6.6, “Monitoring Status for a RAID”.

  7. Continue with Section 6.5.4, “Removing the Failed Disk”.
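For reference, mdadm can also add a device to an array from the command line. The following is a minimal sketch, not part of the EVMS procedure above. It assumes that the array is /dev/md1 and that the prepared spare is /dev/sdc2, which is a hypothetical device name. The helper md_add_spare and its DRY_RUN variable are also hypothetical. If EVMS manages the array, check that changing it directly with mdadm is appropriate for your setup before you run the command.

```shell
#!/bin/sh
# Add a prepared partition to an md array as a spare (hypothetical helper).
# With DRY_RUN=1, which is the default here, the command is only printed.
md_add_spare() {
    array=$1    # for example /dev/md1
    spare=$2    # for example /dev/sdc2 (assumed device name)
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mdadm $array --add $spare"
    else
        # On a degraded array, md starts using the new device immediately;
        # /proc/mdstat then shows the recovery progress.
        mdadm "$array" --add "$spare" && cat /proc/mdstat
    fi
}
```

Calling md_add_spare /dev/md1 /dev/sdc2 prints the command for review. Setting DRY_RUN=0 runs it, which requires root privileges.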

Removing the Failed Disk

You can remove the failed disk at any time after it has been replaced with the spare disk. EVMS does not make the device available for other use until you remove it from the RAID. After you remove it, the disk appears in the Available-Objects list in the EVMS GUI, where it can be used for any purpose.

Note

If you pull a disk or if it is totally unusable, EVMS no longer recognizes the failed disk as part of the RAID.

The RAID device can be active and in use when you remove its faulty object.

  1. In EVMS, select Actions > Remove > Faulty Object from a Region (the remfaulty plug-in for the EVMS GUI).

  2. Select the RAID device you want to manage from the list of Regions, then click Next.

  3. Select the failed disk.

  4. Click Remove.
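For reference, mdadm offers a command-line removal as well. The following is a minimal sketch, not part of the EVMS procedure above. It assumes that the array is /dev/md1 and that the failed member is /dev/sda2, as in the earlier example report. The helper md_remove_failed and its DRY_RUN variable are hypothetical. The --remove option only accepts devices that are not active in the array, so it applies to a device that md has already marked as failed.

```shell
#!/bin/sh
# Remove a failed member from an md array (hypothetical helper).
# With DRY_RUN=1, which is the default here, the command is only printed.
md_remove_failed() {
    array=$1     # for example /dev/md1
    failed=$2    # for example /dev/sda2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mdadm $array --remove $failed"
    else
        # mdadm refuses to remove a device that is still active.
        mdadm "$array" --remove "$failed"
    fi
}
```

Calling md_remove_failed /dev/md1 /dev/sda2 prints the command for review. Setting DRY_RUN=0 runs it, which requires root privileges.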


SUSE® Linux Enterprise Server Storage Administration Guide 10