====== RAID - mdadm - Troubleshooting - Disk Failure ======

  cat /proc/mdstat

returns:

  Personalities : [raid6] [raid5] [raid4]
  md0 : active raid5 sde[4] sdd[2](F) sdc[1] sdb[0]
        3144192 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

  unused devices:

**NOTE:** The **(F)** indicates that the /dev/sdd device has failed.

----

===== Confirm the failure =====

  mdadm --detail /dev/md0

returns:

  /dev/md0:
          Version : 1.2
    Creation Time : Tue Sep  6 18:31:41 2011
       Raid Level : raid5
       Array Size : 3144192 (3.00 GiB 3.22 GB)
    Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
     Raid Devices : 4
    Total Devices : 4
      Persistence : Superblock is persistent

      Update Time : Thu Sep  8 16:14:21 2011
            State : clean, degraded
   Active Devices : 3
  Working Devices : 3
   Failed Devices : 1
    Spare Devices : 0

           Layout : left-symmetric
       Chunk Size : 512K

             Name : raidtest.loc:0  (local to host raidtest.loc)
             UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
           Events : 75

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1       8       32        1      active sync   /dev/sdc
         2       0        0        2      removed
         4       8       64        3      active sync   /dev/sde

         2       8       48        -      faulty spare   /dev/sdd

----

===== Remove the failed device from the array =====

  mdadm --manage --remove /dev/md0 /dev/sdd

returns:

  mdadm: hot removed /dev/sdd from /dev/md0

----

===== Obtain the serial number of the failed device =====

This allows you to refer to your documentation to know which bay the failed disk is in.

  * Some systems have no visible indication of which bay a specific disk is loaded into, so unless this was documented at the time the RAID was created, there is a risk that the wrong physical disk is pulled instead of the actual broken one.

  smartctl -a /dev/sdd | grep -i serial

returns:

  Serial Number:    VB455d882e-8013d7c9

----

===== Replace the failed drive =====

Physically remove the failed drive and replace it with a new one.
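Before adding the replacement, it is worth confirming that it is at least as large as the disk it replaces, or mdadm will refuse to add it to the array. A minimal sketch of that comparison, assuming the sizes have already been collected in bytes (e.g. with ''blockdev --getsize64 /dev/sdd''):

```shell
# size_ok OLD_BYTES NEW_BYTES
# Succeeds (exit 0) when the replacement disk is at least as large as
# the failed one. Sizes in bytes, e.g. from: blockdev --getsize64 /dev/sdd
size_ok() {
    [ "$2" -ge "$1" ]
}
```

The device name above is the one used on this page; substitute whatever the replacement enumerated as on your system.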
Add the replacement drive to the array:

  mdadm --add /dev/md0 /dev/sdd

returns:

  mdadm: added /dev/sdd

----

===== Check the status of the RAID =====

  mdadm --detail /dev/md0

returns:

  /dev/md0:
          Version : 1.2
    Creation Time : Tue Sep  6 18:31:41 2011
       Raid Level : raid5
       Array Size : 3144192 (3.00 GiB 3.22 GB)
    Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
     Raid Devices : 4
    Total Devices : 4
      Persistence : Superblock is persistent

      Update Time : Thu Sep  8 17:03:44 2011
            State : clean, degraded, recovering
   Active Devices : 3
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 1

           Layout : left-symmetric
       Chunk Size : 512K

   Rebuild Status : 36% complete

             Name : raidtest.loc:0  (local to host raidtest.loc)
             UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
           Events : 85

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1       8       32        1      active sync   /dev/sdc
         5       8       48        2      spare rebuilding   /dev/sdd
         4       8       64        3      active sync   /dev/sde

----

**NOTE:** Wait for the RAID to rebuild.

  * Once the rebuild completes, the array is fully redundant again.

**WARNING:** If another drive fails before a replacement has been added and the rebuild has completed, all data could be lost! There are several ways to reduce this risk:

  * Add a hot spare, which causes the rebuild to start right away when a drive fails, minimizing the "danger time".
    * This simply means adding another drive while the array already has enough. It is picked up automatically, and a rebuild starts as soon as a drive fails.
  * Use a RAID level that can survive two failures, such as RAID 6.

  mdadm --add /dev/md0 /dev/sdf

returns:

  mdadm: added /dev/sdf

----

===== Grow the array =====

  mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/backup

returns:

  mdadm: level of /dev/md0 changed to raid6

**NOTE:** Notice the **backup-file** argument. Ensure:

  * The backup file is NOT located on the array itself.
  * There is sufficient space for the backup file.
  * The file is deleted automatically after the reshape is complete.
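A rebuild or reshape can take a long time, and polling /proc/mdstat by eye gets old. A small sketch of a helper that extracts the progress percentage from mdstat-formatted text; it reads stdin so it can be exercised on a captured sample, and on a live system would be invoked as ''rebuild_pct < /proc/mdstat'':

```shell
# rebuild_pct: print the progress percentage of a running
# recovery/resync/reshape from mdstat-formatted input on stdin.
rebuild_pct() {
    grep -oE '(recovery|resync|reshape) = *[0-9.]+%' |
        grep -oE '[0-9.]+'
}
```

Prints nothing (and returns non-zero) when no rebuild is in progress, which also makes it usable as a condition in a polling loop.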
----

===== Check =====

  cat /proc/mdstat

returns:

  Personalities : [raid6] [raid5] [raid4]
  md0 : active raid6 sdf[6] sdd[5] sde[4] sdc[1] sdb[0]
        3144192 blocks super 1.2 level 6, 512k chunk, algorithm 18 [5/4] [UUUU_]
        [==>..................]  reshape = 13.2% (139264/1048064) finish=4.4min speed=3373K/sec

  unused devices:

----

===== Another Check =====

  mdadm --detail /dev/md0

returns:

  /dev/md0:
          Version : 1.2
    Creation Time : Tue Sep  6 18:31:41 2011
       Raid Level : raid6
       Array Size : 3144192 (3.00 GiB 3.22 GB)
    Used Dev Size : 1048064 (1023.67 MiB 1073.22 MB)
     Raid Devices : 5
    Total Devices : 5
      Persistence : Superblock is persistent

      Update Time : Thu Sep  8 18:54:26 2011
            State : clean
   Active Devices : 5
  Working Devices : 5
   Failed Devices : 0
    Spare Devices : 0

           Layout : left-symmetric
       Chunk Size : 512K

             Name : raidtest.loc:0  (local to host raidtest.loc)
             UUID : e0748cf9:be2ca997:0bc183a6:ba2c9ebf
           Events : 2058

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1       8       32        1      active sync   /dev/sdc
         5       8       48        2      active sync   /dev/sdd
         4       8       64        3      active sync   /dev/sde
         6       8       80        4      active sync   /dev/sdf

**NOTE:** If for whatever reason you want to go back down, e.g. from 6 raid devices to 5, you can also do that.

  * The drive that is freed up then ends up as a hot spare.
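Finally, the **(F)** check from the top of this page can be automated for monitoring (e.g. from cron). A sketch of a helper that prints the name of every device flagged failed; like the helper above it reads stdin so it can be tested on captured text, and on a live system would be invoked as ''list_failed < /proc/mdstat'':

```shell
# list_failed: print each device marked "(F)" (failed) in
# mdstat-formatted input on stdin, one device name per line.
list_failed() {
    grep -oE '[a-z]+[0-9]*\[[0-9]+\]\(F\)' |
        sed -E 's/\[[0-9]+\]\(F\)//'
}
```

Any non-empty output means a degraded array, so a cron job can simply alert whenever this prints something.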