====== ZFS - Troubleshooting - Replace a Disk ======

===== Check the Pool =====

Verify that a disk is bad and that it needs to be replaced.

<code bash>
zpool status
</code>

returns:

<code>
  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  UNAVAIL      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0

errors: No known data errors
</code>

**NOTE:** This shows that one disk is unavailable.

  * This is ata-ST3300831A_5NF0552X.

----

===== Add a New Disk =====

  * Add a new disk.
  * Optionally remove the old disk.

**NOTE:** The new disk is ata-ST3500320AS_9QM03ATQ.

  * This can be seen at /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ.
  * Only remove the old drive at this point if it is a redundant setup.

----

===== Replace the Old Device =====

<code bash>
zpool replace testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
zpool offline testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
zpool detach testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
</code>

**NOTE:** In the replace command, the old device is specified first, followed by the new device.

  * If the pool is a redundant configuration, data will be copied from the other good disks to the new disk.
  * If the pool is not redundant, data will be copied from the old device to the new device.
  * Once that is complete, the old device can be physically removed.
  * zpool detach only works on mirror and replacing vdevs; when a replacement completes, ZFS normally drops the old device from the pool by itself.

==== Potential Issues ====

If the bad device has already been removed from the system, this might fail with the following error:
<code>
cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool
</code>

  * This is because the label of the drive that died no longer exists in the system.
  * Therefore the bad device cannot be specified by ID.
  * In this case, try specifying it by device name or by GUID.

----

There are various ways to determine a GUID:

<code bash>
zdb                 # Find GUID.
zdb -l /dev/sda1    # In case the 'zdb' command does not work.
zpool status -g     # Find GUID.
zpool status -L     # Find device name, resolving links.
</code>

----

Try to get the GUID using zdb:

<code bash>
zdb
</code>

returns:

<code>
testpool:
    version: 28
    name: 'testpool'
    state: 0
    txg: 162804
    pool_guid: 14829240649900366534
    hostname: 'BigMamba'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14829240649900366534
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5355850150368902284
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 32
            ashift: 9
            asize: 791588896768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11426107064765252810
                path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
                phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15935140517898495532
                path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
                phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7183706725091321492
                path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
                phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 17196042497722925662
                path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
                phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
    features_for_read:
</code>

**NOTE:** The GUID of the failed disk (the children[1] entry, whose path contains ata-ST3300831A_5NF0552X) is 15935140517898495532.

Use the GUID to offline the old device:

<code bash>
zpool offline testpool 15935140517898495532
</code>

And check that this has worked:

<code bash>
zpool status
</code>

returns:

<code>
  pool: testpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun 9 00:28:24 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  OFFLINE      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0

errors: No known data errors
</code>

Then replace the device:

<code bash>
zpool replace testpool 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
</code>

And check again that this has worked:

<code bash>
zpool status
</code>

returns:

<code>
  pool: testpool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun 9 01:44:36 2013
        408M scanned out of 419G at 20,4M/s, 5h50m to go
        101M resilvered, 0,10% done
config:

        NAME                            STATE     READ WRITE CKSUM
        testpool                        DEGRADED     0     0     0
          raidz1-0                      DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP     ONLINE       0     0     0
            replacing-1                 OFFLINE      0     0     0
              ata-ST3300831A_5NF0552X   OFFLINE      0     0     0
              ata-ST3500320AS_9QM03ATQ  ONLINE       0     0     0  (resilvering)
            ata-ST3200822A_5LJ1CHMS     ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C     ONLINE       0     0     0

errors: No known data errors
</code>

**NOTE:** If the old disk has already been removed from the system and a new device has replaced it under the same device name, the following commands can be used instead:

<code bash>
zpool offline testpool sdd
zpool remove testpool sdd
zpool attach -f testpool sdc sdd
</code>

  * Here sdd is the replaced device and sdc is an existing healthy device.
  * Attaching one device to another in this way builds a mirror, so this sequence suits mirrored pools rather than the raidz1 example above.

----

===== Wait For Resilvering to Complete =====

Before the pool is back to normal, it needs to sync data over to the new disk.

  * It will remain in a degraded status while the data syncs.
  * This data syncing process is called resilvering.
  * It may take a __very__ long time depending on the size of the disks and on how much data is on them.
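Because the resilver can run for hours, the wait can be scripted rather than checked by hand. The following is a minimal sketch, not a definitive implementation: the function names resilver_in_progress and wait_for_resilver are hypothetical, and the check assumes that zpool status prints the phrase "resilver in progress" while the resilver runs, as in the sample output on this page. Other ZFS versions may word this differently.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and succeeds
# (exit code 0) only if that text reports a running resilver.
# Assumption: the wording "resilver in progress" matches the ZFS version in use.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Hypothetical helper: polls the pool until no resilver is reported any more.
#   $1 = pool name
#   $2 = seconds to sleep between checks (defaults to 60)
wait_for_resilver() {
    pool="$1"
    interval="${2:-60}"
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
    # Print the final status so any errors found during the resilver are visible.
    zpool status "$pool"
}
```

After sourcing these functions, a call such as wait_for_resilver testpool 300 would check every five minutes and print the pool status once the resilver is no longer reported. The loop only detects that the resilver has stopped; the printed status should still be read to confirm the pool is ONLINE.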
The status of the resilvering can be checked:

<code bash>
zpool status testpool
</code>

----

===== Physically Remove the Old Drive =====

Physically remove the old drive.

  * If it is hot-swappable then just pull it out.
  * Otherwise, shut down the system before removing the device.

----

===== References =====

https://docs.joyent.com/private-cloud/troubleshooting/disk-replacement