software RAID

From Glee

Array Information

mdadm --detail /dev/md0

Array Creation

Examples :

mdadm --create /dev/md2 --metadata=0.90 --level=raid1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md0 --metadata=1.1 --level=raid5 --chunk=256 --raid-devices=6 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing
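The missing keyword in the second example creates the array in degraded mode, with one member absent. Once the final disk is available, its partition (here hypothetically /dev/sdf1) can be added to complete the array :

```shell
# Add the last member to the degraded array; md will start rebuilding
mdadm --manage --add /dev/md0 /dev/sdf1
```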

Example /etc/mdadm.conf content (use mdadm --detail --scan to get the details) :

ARRAY /dev/md0 level=raid5 num-devices=6 UUID=d81538a6:6abe192b:a59b3942:49fa9370
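If the file doesn't exist yet, it can be generated from the running arrays; review the result afterwards and remove any duplicate ARRAY lines :

```shell
# Append the current array definitions to the configuration file
mdadm --detail --scan >> /etc/mdadm.conf
```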

Fixing Mismatch Warnings

On RHEL5 there is a weekly cron job which starts a check of all software RAID arrays and produces output if anything is wrong. If that is the case, you will receive emails like this one :

From: root (Cron Daemon)
To: root
Subject: Cron <root@h01> run-parts /etc/cron.weekly

/etc/cron.weekly/99-raid-check:

WARNING: mismatch_cnt is not 0 on /dev/md1

This means that there are mismatched blocks between RAID members. For a RAID-1 mirror, this typically means that some data differs between the two disks, which is not normal.

Diagnosing

For some reason, the mdadm tool doesn't include support for managing these mismatches. It all needs to be done using the /sys/block/md*/md/ pseudo-files.

To see the mismatch count for all of the software RAID arrays (note that the value is only updated when a check is run) :

# cat /sys/block/md*/md/mismatch_cnt
0
256

To force a repair of the software RAID array where the value is non-zero (md1 in this example, matching the warning above) :

# echo repair > /sys/block/md1/md/sync_action
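With several arrays on one machine, the scan can be scripted. A minimal sketch (the SYSFS variable is only a convenience so the function can be pointed at a different tree; it defaults to the real /sys) :

```shell
#!/bin/sh
# Report every software RAID array whose mismatch_cnt is non-zero.
# SYSFS defaults to /sys but may be overridden, e.g. for testing.
check_mismatches() {
    sysfs="${SYSFS:-/sys}"
    for md in "$sysfs"/block/md*/md; do
        [ -f "$md/mismatch_cnt" ] || continue
        cnt=$(cat "$md/mismatch_cnt")
        if [ "$cnt" -ne 0 ]; then
            echo "$md: $cnt mismatched blocks"
            # To start a repair right away, uncomment:
            # echo repair > "$md/sync_action"
        fi
    done
}

check_mismatches
```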

You can then follow the repair progress with something like this :

# watch cat /proc/mdstat

Which will result in something like this :

Every 2.0s: cat /proc/mdstat

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      726802368 blocks [2/2] [UU]
      [>....................]  resync =  0.5% (4312832/726802368) finish=323.4min speed=37231K/sec

md0 : active raid1 sdb1[1] sda1[0]
      5242816 blocks [2/2] [UU]

unused devices: <none>

If the repair causes excessive I/O load, you can cap its speed. The current speed (in KB/s) can be seen in the speed= value above or in /sys/block/md1/md/sync_speed :

# cat /sys/block/md1/md/sync_speed
31202
# echo 25000 > /sys/block/md1/md/sync_speed_max
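The sync_speed_max pseudo-file applies to a single array; the kernel also exposes global defaults which apply to all arrays (values in KB/s as well) :

```shell
# System-wide minimum and maximum sync speed, in KB/s
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max
# Lower the global ceiling for all arrays
echo 25000 > /proc/sys/dev/raid/speed_limit_max
```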

At the very start of the repair, the mismatch count goes back to zero, but it may increase again as mismatches are found and corrected. This is why, once the repair has finished, you should run a full check, after which the value should be back to zero :

# echo check > /sys/block/md1/md/sync_action
# watch cat /proc/mdstat
# cat /sys/block/md*/md/mismatch_cnt
0
0
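Since a check or repair can run for hours, a script may want to wait for the array to become idle again before continuing. A minimal sketch, polling the sync_action pseudo-file (SYSFS again defaults to /sys and is only there as a convenience) :

```shell
#!/bin/sh
# Wait until the given array (e.g. "md1") has finished syncing.
# SYSFS defaults to /sys but may be overridden, e.g. for testing.
wait_for_idle() {
    sysfs="${SYSFS:-/sys}"
    action="$sysfs/block/$1/md/sync_action"
    [ -f "$action" ] || { echo "$1: no such array"; return 1; }
    while [ "$(cat "$action")" != "idle" ]; do
        sleep 10
    done
    echo "$1: sync finished"
}

# Example usage: wait_for_idle md1
```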

Replacing a Disk

Example commands for sdb2 member of md1.

Manually fail and remove a disk that starts reporting errors :

mdadm --manage --fail /dev/md1 /dev/sdb2
mdadm --manage --remove /dev/md1 /dev/sdb2

Also see badblocks for testing the removed disk for errors.

Once a dead disk has been replaced, manual steps are needed to get the RAID arrays running again: the partitions need to be recreated first, then the members re-added to their arrays :

mdadm --manage --add /dev/md1 /dev/sdb2
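If the new disk is identical to the surviving one, a common way to prepare the partitions is to copy the partition table with sfdisk (double-check the device names, the direction matters) :

```shell
# Dump the partition table of the healthy sda and apply it to the new sdb
sfdisk -d /dev/sda | sfdisk /dev/sdb
# Then re-add the member partition to the array
mdadm --manage --add /dev/md1 /dev/sdb2
```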