Monitoring Status for a RAID

Monitoring Status with EVMSGUI

The Regions tab in the EVMS GUI (evmsgui) reports any software RAID devices that are defined and whether they are currently active.

Monitoring Status with /proc/mdstat

A summary of RAID and status information (active/not active) is available in the /proc/mdstat file.

  1. Open a terminal console, then log in as the root user or equivalent.

  2. View the /proc/mdstat file by entering the following at the console prompt:

    cat /proc/mdstat
    
  3. Evaluate the information.

    The following table shows an example output and how to interpret the information.

    Status Information:
        Personalities : [raid5] [raid4]
    Description:
        List of the RAID personalities (RAID levels) that the kernel currently supports.
    Interpretation:
        The raid5 and raid4 personalities are loaded. This line does not list the RAIDs that are defined on the server.

    Status Information:
        md0 : active raid5 sdg1[0] sdk1[4] sdj1[3] sdi1[2]
    Description:
        <device> : <active | not active>
        <RAID level>
        <storage object>[RAID order]
    Interpretation:
        The RAID is active and available as /dev/evms/md/md0.
        The RAID level is raid5.
        The active segments are sdg1, sdi1, sdj1, and sdk1, as ordered in the RAID.
        The RAID numbering of 0 to 4 indicates that the RAID has 5 segments, and the second segment [1] is missing from the list. Based on the segment names, the missing segment is sdh1.

    Status Information:
        35535360 blocks level 5, 128k chunk, algorithm 2 [5/4] [U_UUU]
    Description:
        <number of blocks> blocks
        level <0 | 1 | 4 | 5>
        <stripe size in KB> chunk
        algorithm <0 | 1 | 2 | 3>
        [number of devices/number of working devices]
        [U_UUU]
    Interpretation:
        The blocks are 1 KB in size, so the RAID provides about 33.89 GiB (36.39 GB) of data capacity, which matches the Array Size that mdadm -D reports for this RAID. Space used for parity is not included in this number.
        The stripe size is 128 KB.
        The RAID is using the left-symmetric algorithm (algorithm 2).
        The RAID has 5 devices, of which 4 are working ([5/4]). In the status map, U marks a device that is up and _ marks a device that is missing or failed, so the second device is down.

    Status Information:
        unused devices: <none>
    Interpretation:
        All segments in the RAID are in use.
        There are no spare devices available on the server.
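The status map at the end of the blocks line is a convenient thing to check from a script. The following is a minimal sketch, not part of the original procedure: degraded_arrays is a hypothetical helper name, and it assumes that the input follows the /proc/mdstat layout shown in the example above. It prints each array whose status map contains an underscore.

```shell
#!/bin/sh
# Sketch: list md arrays whose status map (for example [U_UUU]) contains
# an underscore, which marks a missing or failed member.
# degraded_arrays is a hypothetical helper name. It reads mdstat-format
# text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        # Remember the name of the array that the following lines describe.
        /^md[0-9]+ :/ { array = $1 }
        {
            # Look for a field made up only of U and _ inside brackets.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[U_]+\]$/ && index($i, "_") > 0)
                    print array, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without an argument, degraded_arrays reads /proc/mdstat on the local server. For the example output above, it would report md0 together with its status map.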

Monitoring Status with mdadm

To view the RAID status with the mdadm command, enter the following at a terminal prompt:

mdadm -D /dev/mdx

Replace mdx with the name of the RAID device, such as md0.

Example 1: A Disk Fails

In the following example, only four of the five devices in the RAID are active (Raid Devices : 5, Total Devices : 4). When the RAID was created, its component devices were numbered 0 to 4 and ordered according to their alphabetic appearance in the list where they were chosen: /dev/sdg1, /dev/sdh1, /dev/sdi1, /dev/sdj1, and /dev/sdk1. From the pattern of the filenames of the other devices, you can determine that the device that was removed was named /dev/sdh1.

/dev/md0:
Version : 00.90.03
Creation Time : Sun Apr 16 11:37:05 2006
Raid Level : raid5
Array Size : 35535360 (33.89 GiB 36.39 GB)
Device Size : 8883840 (8.47 GiB 9.10 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Apr 17 05:50:44 2006
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 128K
UUID : 2e686e87:1eb36d02:d3914df8:db197afe
Events : 0.189
Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 0 1 removed
2 8 129 2 active sync /dev/sdi1
3 8 145 3 active sync /dev/sdj1
4 8 161 4 active sync /dev/sdk1
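The Array Size and Device Size values in this report are related by simple arithmetic, which can serve as a sanity check when reading a report. The following sketch assumes that both values are reported in KiB and uses the RAID 5 rule that one device's worth of space holds parity. The variable names are illustrative only.

```shell
#!/bin/sh
# Values taken from the mdadm -D report above.
raid_devices=5
device_size_kib=8883840

# RAID 5 uses one device's worth of space for parity, so the usable
# capacity is (devices - 1) * per-device size.
array_size_kib=$(( (raid_devices - 1) * device_size_kib ))
echo "Array Size : $array_size_kib"

# Convert KiB to GiB (1 GiB = 1048576 KiB) and to decimal GB.
awk -v kib="$array_size_kib" 'BEGIN {
    printf "%.2f GiB %.2f GB\n", kib / 1048576, kib * 1024 / 1e9
}'
```

The result is 35535360 KiB, or 33.89 GiB (36.39 GB), which agrees with the Array Size line in the report.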

Example 2: Spare Disk Replaces the Failed Disk

In the following mdadm report, only 4 of the 5 disks are active and in good condition (Active Devices : 4, Working Devices : 5). The failed disk was automatically detected and removed from the RAID (Failed Devices : 0). The spare was activated as the replacement disk and has assumed the device name of the failed disk (/dev/sdh1). The faulty object (the failed disk that was removed from the RAID) is not identified in the report. The RAID is running in degraded mode while the data is rebuilt (State : clean, degraded, recovering). The rebuild onto the spare is in progress (spare rebuilding /dev/sdh1) and is 3% complete (Rebuild Status : 3% complete).

mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sun Apr 16 11:37:05 2006
Raid Level : raid5
Array Size : 35535360 (33.89 GiB 36.39 GB)
Device Size : 8883840 (8.47 GiB 9.10 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon Apr 17 05:50:44 2006
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 128K
Rebuild Status : 3% complete
UUID : 2e686e87:1eb36d02:d3914df8:db197afe
Events : 0.189
Number Major Minor RaidDevice State
0 8 97 0 active sync /dev/sdg1
1 8 113 1 spare rebuilding /dev/sdh1
2 8 129 2 active sync /dev/sdi1
3 8 145 3 active sync /dev/sdj1
4 8 161 4 active sync /dev/sdk1
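When you only need the state and the rebuild progress, you can filter them out of the report. The following is a minimal sketch under the assumption that the report uses the "Field : value" layout shown in the two examples above. raid_state is a hypothetical helper name, not an mdadm feature.

```shell
#!/bin/sh
# raid_state is a hypothetical helper. It reads "mdadm -D" output on
# stdin and prints the State value and, if present, the Rebuild Status.
raid_state() {
    awk -F' : ' '
        # Remove trailing blanks from the value before printing it.
        function value(v) { sub(/[ \t]+$/, "", v); return v }
        $1 ~ /^[ \t]*State$/          { print "state=" value($2) }
        $1 ~ /^[ \t]*Rebuild Status$/ { print "rebuild=" value($2) }
    '
}

# Usage (requires root and an existing array):
#   mdadm -D /dev/md0 | raid_state
```

For the report in Example 2, this would print the state "clean, degraded, recovering" and the rebuild status "3% complete". The column header line that ends in State is ignored because it contains no " : " separator.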

Monitoring a Remirror or Reconstruction

You can follow the progress of the synchronization or reconstruction process by examining the /proc/mdstat file.

You can control the speed of synchronization by setting parameters in the /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max files. To speed up the process, echo a larger number into the speed_limit_min file.
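As an illustration, the following sketch reads both limits and raises the minimum. The helper names and the RAID_SYSCTL_DIR variable are assumptions introduced here so that the functions can be tried against a scratch directory; by default they use the real files, which requires root. The kernel interprets the values as kilobytes per second for each device, and a suitable number depends on your hardware.

```shell
#!/bin/sh
# show_raid_speed_limits and set_raid_speed_min are hypothetical helpers.
# RAID_SYSCTL_DIR is an assumption added for illustration; it defaults
# to the directory named in the text above.
RAID_SYSCTL_DIR="${RAID_SYSCTL_DIR:-/proc/sys/dev/raid}"

# Print the current minimum and maximum synchronization speeds.
show_raid_speed_limits() {
    for f in speed_limit_min speed_limit_max; do
        printf '%s = %s\n' "$f" "$(cat "$RAID_SYSCTL_DIR/$f")"
    done
}

# Write a new minimum speed. $1 must be a non-negative whole number.
set_raid_speed_min() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_raid_speed_min <number>" >&2
            return 1
            ;;
    esac
    echo "$1" > "$RAID_SYSCTL_DIR/speed_limit_min"
}

# Example (as root): set_raid_speed_min 50000
# To follow the progress while the RAID rebuilds: watch -n 5 cat /proc/mdstat
```

A value written this way lasts only until the next reboot.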

Configuring mdadm to Send an E-Mail Alert for RAID Events

You might want to configure the mdadm service to send an e-mail alert for software RAID events. Monitoring is only meaningful for RAIDs 1, 4, 5, 6, 10 or multipath arrays because only these have missing, spare, or failed drives to monitor. RAID 0 and Linear RAIDs do not provide fault tolerance so they have no interesting states to monitor.

The following table identifies RAID events and indicates which events trigger e-mail alerts. All events cause the alert program to run, if one is specified with the --program option or with a PROGRAM line in mdadm.conf. The program is run with two or three arguments: the event name, the array device (such as /dev/md1), and possibly a second device. For Fail, Fail Spare, and Spare Active, the second device is the relevant component device. For Move Spare, the second device is the array that the spare was moved from.
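To make the argument convention concrete, here is a minimal sketch of an alert program. It is an illustration only: the function name format_raid_event, the raid-event syslog tag, and the message wording are all assumptions, and a real handler would likely do more than write a log entry.

```shell
#!/bin/sh
# Hypothetical alert program for mdadm monitoring.
# mdadm passes: $1 = event name, $2 = array device,
#               $3 = second device (only for some events).
format_raid_event() {
    event=$1
    array=$2
    device=${3:-}
    if [ -n "$device" ]; then
        echo "RAID event $event on $array (device: $device)"
    else
        echo "RAID event $event on $array"
    fi
}

# When mdadm runs this file as a program, record the event in syslog.
# Nothing happens if fewer than two arguments are given.
if [ "$#" -ge 2 ]; then
    format_raid_event "$@" | logger -t raid-event
fi
```

If the script were saved under a path of your choice (for example, the hypothetical /usr/local/sbin/raid-event.sh) and made executable, you could name it with the --program option of mdadm --monitor.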

Table 6.8. RAID Events in mdadm

RAID Event: Device Disappeared
Trigger E-Mail Alert: No
Description: An md array that was previously configured appears to no longer be configured. (syslog priority: Critical)

    If mdadm was told to monitor an array which is RAID0 or Linear, then it reports DeviceDisappeared with the extra information Wrong-Level. This is because RAID0 and Linear do not support the device-failed, hot-spare, and resynchronize operations that are monitored.

RAID Event: Rebuild Started
Trigger E-Mail Alert: No
Description: An md array started reconstruction. (syslog priority: Warning)

RAID Event: Rebuild NN
Trigger E-Mail Alert: No
Description: Where NN is 20, 40, 60, or 80. This indicates the percent completed for the rebuild. (syslog priority: Warning)

RAID Event: Rebuild Finished
Trigger E-Mail Alert: No
Description: An md array that was rebuilding is no longer rebuilding, either because it finished normally or was aborted. (syslog priority: Warning)

RAID Event: Fail
Trigger E-Mail Alert: Yes
Description: An active component device of an array has been marked as faulty. (syslog priority: Critical)

RAID Event: Fail Spare
Trigger E-Mail Alert: Yes
Description: A spare component device that was being rebuilt to replace a faulty device has failed. (syslog priority: Critical)

RAID Event: Spare Active
Trigger E-Mail Alert: No
Description: A spare component device that was being rebuilt to replace a faulty device has been successfully rebuilt and has been made active. (syslog priority: Info)

RAID Event: New Array
Trigger E-Mail Alert: No
Description: A new md array has been detected in the /proc/mdstat file. (syslog priority: Info)

RAID Event: Degraded Array
Trigger E-Mail Alert: Yes
Description: A newly noticed array appears to be degraded. This message is not generated when mdadm notices a drive failure that causes degradation. It is generated only when mdadm notices that an array is degraded when it first sees the array. (syslog priority: Critical)

RAID Event: Move Spare
Trigger E-Mail Alert: No
Description: A spare drive has been moved from one array in a spare group to another to allow a failed drive to be replaced. (syslog priority: Info)

RAID Event: Spares Missing
Trigger E-Mail Alert: Yes
Description: The mdadm.conf file indicates that an array should have a certain number of spare devices, but mdadm detects that the array has fewer than this number when it first sees the array. (syslog priority: Warning)

RAID Event: Test Message
Trigger E-Mail Alert: Yes
Description: An array was found at startup, and the --test flag was given. (syslog priority: Info)


To configure an e-mail alert:

  1. At a terminal console, log in as the root user.

  2. Edit the /etc/mdadm/mdadm.conf file to add your e-mail address for receiving alerts. For example, specify the MAILADDR value (using your own e-mail address, of course):

    DEVICE partitions
    ARRAY /dev/md0 level=raid1 num-devices=2
         UUID=1c661ae4:818165c3:3f7a4661:af475fda
         devices=/dev/sdb3,/dev/sdc3
    MAILADDR yourname@example.com
    

    The MAILADDR line gives an e-mail address that alerts should be sent to when mdadm is running in --monitor mode with the --scan option. There should be only one MAILADDR line in mdadm.conf, and it should have only one address.

  3. Start mdadm monitoring by entering the following at the terminal console prompt:

    mdadm --monitor --mail=yourname@example.com --delay=1800 /dev/md0
    

    The --monitor option causes mdadm to periodically poll a number of md arrays and to report on any events noticed. mdadm never exits once it decides that there are arrays to be checked, so it should normally be run in the background.

    In addition to reporting events in this mode, mdadm might move a spare drive from one array to another if they are in the same spare-group and if the destination array has a failed drive but no spares.

    Listing the devices to monitor is optional. If any devices are listed on the command line, mdadm monitors only those devices. Otherwise, all arrays listed in the configuration file are monitored. Further, if the --scan option is added to the command, any other md devices that appear in /proc/mdstat are also monitored.

    For more information about using mdadm, see the mdadm(8) and mdadm.conf(5) man pages.

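  4. To run mdadm monitoring as a service with the /etc/init.d/mdadmd script, set the mail address and the RAID devices in the /etc/sysconfig/mdadm file, then verify that the mdadmd service is enabled for the runlevels you use. For example:

    suse:~ # egrep 'MAIL|RAIDDEVICE' /etc/sysconfig/mdadm
    MDADM_MAIL="yourname@example.com"
    MDADM_RAIDDEVICES="/dev/md0"
    MDADM_SEND_MAIL_ON_START=no
    suse:~ # chkconfig mdadmd --list
    mdadmd      0:off  1:off  2:off  3:on   4:off  5:on   6:off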
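Because the configuration file should contain only one MAILADDR line with a single address, it can be worth checking the file after you edit it. The following is a minimal sketch: check_mailaddr is a hypothetical helper name, and for simplicity it assumes that the keyword is written in uppercase at the start of the line, as in the example in step 2.

```shell
#!/bin/sh
# check_mailaddr is a hypothetical helper. It takes the path to an
# mdadm.conf-style file and reports whether the file has exactly one
# MAILADDR line that carries exactly one address.
check_mailaddr() {
    awk '
        $1 == "MAILADDR" {
            lines++
            if (NF != 2) bad = 1    # keyword plus exactly one address
            addr = $2
        }
        END {
            if (lines == 1 && !bad) { print "ok: " addr; exit 0 }
            print "problem: expected one MAILADDR line with one address"
            exit 1
        }
    ' "$1"
}

# Usage: check_mailaddr /path/to/mdadm.conf
```

The function returns a nonzero status when the check fails, so it can be combined with other commands in a script.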
    

SUSE® Linux Enterprise Server Storage Administration Guide 10