ZFS - Troubleshooting - Replace a Disk

Check the Pool

Verify that a disk is bad and that it needs to be replaced.

zpool status

returns:

  pool: testpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun  9 00:28:24 2013
config:
 
        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  UNAVAIL      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0
 
errors: No known data errors

NOTE: This shows that one disk is unavailable.

  • This is ata-ST3300831A_5NF0552X.
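
The check above can be scripted. The following is a minimal sketch, assuming the zpool status layout shown above. The function name list_unhealthy_devices is hypothetical, and it reads the status text from standard input so it can be tried without a pool. Pool and vdev rows that are not ONLINE are printed along with the disk itself.

```shell
#!/bin/sh
# Hypothetical helper: print the name and state of every row in the config
# table of 'zpool status' output whose state is not ONLINE.
# Reads the status text from standard input.
list_unhealthy_devices() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # table header
        /^errors:/                    { in_config = 0 }        # end of table
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage (on a system with ZFS):
#   zpool status testpool | list_unhealthy_devices
```

An empty result suggests that every row in the table is ONLINE.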

Add a New Disk

  • Add a new disk.
  • Optionally remove the old disk.

NOTE: The new disk is ata-ST3500320AS_9QM03ATQ.

  • This can be seen at /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ.
  • Only remove the old drive at this point if it is a redundant setup.
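
Before running any zpool command, it can help to confirm that the new disk's by-id link exists and to see which device node it points to. This is a minimal sketch: the function name resolve_disk_link is hypothetical, and it assumes a readlink that supports -f (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: print the device node that a /dev/disk/by-id link
# resolves to, or fail with a message if the link does not exist.
resolve_disk_link() {
    [ -e "$1" ] || { echo "not found: $1" >&2; return 1; }
    readlink -f "$1"
}

# Usage:
#   resolve_disk_link /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
```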

Replace the Old Device

zpool replace testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ
zpool offline testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X
zpool detach testpool /dev/disk/by-id/ata-ST3300831A_5NF0552X

NOTE: Here the old device is specified first followed by the new device.

  • If the pool is a redundant configuration, data will be copied from other good disks to the new disk.
  • If the pool is not redundant, data will be copied from the old device to the new device.
  • Once that is complete, the old device can be physically removed.

Potential Issues

If the bad device has already been removed from the system, the zpool offline command above might fail with the following error:

cannot offline /dev/disk/by-id/ata-ST3300831A_5NF0552X: no such device in pool
  • This is because the label of the drive that died does not exist in the system any more.
  • Therefore the bad device cannot be specified by ID.
  • In this case, try specifying it by device name or by GUID.

There are various ways to determine a GUID:

zdb               # Find GUID.
zdb -l /dev/sda1  # In case the 'zdb' command does not work.
zpool status -g   # Find GUID.
zpool status -L   # Find device name, resolving links.

Try to get the GUID using zdb:

zdb
testpool:
    version: 28
    name: 'testpool'
    state: 0
    txg: 162804
    pool_guid: 14829240649900366534
    hostname: 'BigMamba'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 14829240649900366534
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5355850150368902284
            nparity: 1
            metaslab_array: 31
            metaslab_shift: 32
            ashift: 9
            asize: 791588896768
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 11426107064765252810
                path: '/dev/disk/by-id/ata-ST3300620A_5QF0MJFP-part2'
                phys_path: '/dev/gptid/73b31683-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15935140517898495532
                path: '/dev/disk/by-id/ata-ST3300831A_5NF0552X-part2'
                phys_path: '/dev/gptid/746c949a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 7183706725091321492
                path: '/dev/disk/by-id/ata-ST3200822A_5LJ1CHMS-part2'
                phys_path: '/dev/gptid/7541115a-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 17196042497722925662
                path: '/dev/disk/by-id/ata-ST3200822A_3LJ0189C-part2'
                phys_path: '/dev/gptid/760a94ee-537f-11e2-bad7-50465d4eb8b0'
                whole_disk: 1
                create_txg: 4
    features_for_read:

NOTE: The GUID can be ascertained as 15935140517898495532.
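
Picking the GUID out of the zdb listing by eye is error-prone, so the lookup can be scripted. This is a minimal sketch, assuming the zdb layout shown above, in which each child's guid line comes before its path line. The function name guid_for_path is hypothetical, and it reads the zdb output from standard input.

```shell
#!/bin/sh
# Hypothetical helper: print the guid of the first vdev whose 'path:' line
# contains the given string. Reads 'zdb' output from standard input.
guid_for_path() {
    awk -v needle="$1" '
        # Remember the most recent guid; appending "" keeps it a string,
        # since GUIDs are too large to survive floating-point conversion.
        $1 == "guid:" { guid = $2 "" }
        # The first path line containing the needle belongs to that guid.
        $1 == "path:" && index($0, needle) { print guid; exit }
    '
}

# Usage (on a system with ZFS):
#   zdb | guid_for_path ata-ST3300831A_5NF0552X
```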

Use the GUID to offline the old device:

zpool offline testpool 15935140517898495532

And check this has worked:

zpool status
  pool: testpool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 2h4m with 0 errors on Sun Jun  9 00:28:24 2013
config:
 
        NAME                         STATE     READ WRITE CKSUM
        testpool                     DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP  ONLINE       0     0     0
            ata-ST3300831A_5NF0552X  OFFLINE      0     0     0
            ata-ST3200822A_5LJ1CHMS  ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C  ONLINE       0     0     0
 
errors: No known data errors

Then replace the old device with the new one:

zpool replace testpool 15935140517898495532 /dev/disk/by-id/ata-ST3500320AS_9QM03ATQ

And check again this has worked:

zpool status
  pool: testpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun  9 01:44:36 2013
    408M scanned out of 419G at 20,4M/s, 5h50m to go
    101M resilvered, 0,10% done
config:
 
        NAME                            STATE     READ WRITE CKSUM
        testpool                        DEGRADED     0     0     0
          raidz1-0                      DEGRADED     0     0     0
            ata-ST3300620A_5QF0MJFP     ONLINE       0     0     0
            replacing-1                 OFFLINE      0     0     0
              ata-ST3300831A_5NF0552X   OFFLINE      0     0     0
              ata-ST3500320AS_9QM03ATQ  ONLINE       0     0     0  (resilvering)
            ata-ST3200822A_5LJ1CHMS     ONLINE       0     0     0
            ata-ST3200822A_3LJ0189C     ONLINE       0     0     0
 
errors: No known data errors

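NOTE: If the old disk has already been removed from the system and the new device has taken over the same device name, the following commands can be used instead. The detach and attach form applies only to mirror vdevs (zpool remove cannot drop a mirror member). For a raidz vdev such as the one in this example, zpool replace with a single device argument replaces the device with whatever disk now sits at that name.

# Mirror vdev:
zpool offline testpool sdd
zpool detach testpool sdd
zpool attach -f testpool sdc sdd
# raidz vdev:
zpool replace testpool sdd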

Wait For Resilvering to Complete

Before the pool returns to normal, ZFS must sync data over to the new disk.

  • It will remain in a degraded status while the data syncs.
  • This data syncing process is called resilvering.
  • It may take a very long time depending on the size of the disks and on how much data is on them.

The status of the resilvering can be checked:

zpool status testpool
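
Rather than re-running the status command by hand, the check can be polled. This is a minimal sketch, assuming the "resilver in progress" wording shown in the status output above. The function names and the 60-second interval are hypothetical choices. Newer OpenZFS releases may also offer zpool wait -t resilver, which would make the loop unnecessary where it is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the 'zpool status' text on
# standard input reports a resilver in progress.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Hypothetical polling loop: re-check the named pool every 60 seconds
# (an arbitrary interval) until no resilver is reported.
wait_for_resilver() {
    while zpool status "$1" | resilver_in_progress; do
        sleep 60
    done
}

# Usage (on a system with ZFS):
#   wait_for_resilver testpool && zpool status testpool
```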

Physically Remove the Old Drive

Physically remove the old drive.

  • If it is hot-swappable then just pull it out.
  • Otherwise, shut down the system before removing the device.

zfs/troubleshooting/replace_a_disk.txt · Last modified: 2021/10/14 00:58 by peter