OK, I think I broke it. Too much reliance on zfs being the fix for cheap componentry and haphazard procedures.

The home PC has an SSD boot drive and a zpool made of three disks – a 2x1TB mirror and a 1x3TB. It was supposed to be four disks, but the cabling and power supply never let me add the mirror disk to the second vdev. Backup is through the Ubuntu backup program, to an external USB drive.
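For reference, a pool shaped like this can be built in two steps; a sketch, with hypothetical device names standing in for the real ones:

    # create the pool with a two-disk mirror vdev
    sudo zpool create zdata mirror /dev/sda /dev/sdb
    # add the 3TB as a second, unmirrored vdev; -f is needed because
    # the replication levels don't match (the "hate my data" part)
    sudo zpool add -f zdata /dev/sdc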

A few weeks ago I started getting SMART errors telling me that one of the counters was increasing on one of the 1TB drives, then last week I got the nasty message saying:

Device: /dev/sdc [SAT], FAILED SMART self-check. BACK UP DATA NOW!
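To see what smartd is complaining about, smartctl will dump the drive’s health verdict and raw attribute counters (substituting the right device, of course):

    # overall pass/fail health assessment
    sudo smartctl -H /dev/sdc
    # full attribute table, including the counters that keep climbing
    sudo smartctl -A /dev/sdc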

First attempt

So on the weekend I shut down, swapped out the 1TB for a spare 2TB drive that I had lying around, booted up and did a zpool replace (via zpool replace zdata ${OLD} ${NEW}). All good so far. The zfs pool commenced resilvering.
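The sequence, roughly, with ${OLD} and ${NEW} standing in for the actual device names:

    # copy everything from the old device onto the new one
    sudo zpool replace zdata ${OLD} ${NEW}
    # check on resilver progress as it runs
    sudo zpool status zdata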

Apart from being stopped for a few minutes while I worked out that I needed to add a new GPT partition to the new drive, it looked as though it was working… until it wasn’t.
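For the record, laying down a fresh GPT with a single big partition is a one-liner with parted; a sketch, with /dev/sdX as a placeholder for the new drive:

    # new GPT label plus one partition spanning the whole disk
    sudo parted --script /dev/sdX mklabel gpt mkpart zfs 1MiB 100%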

During the remirroring I started getting some read errors from the other 1TB drive, so it looked as though I’d lost a few files.
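This is where zpool status -v earns its keep: it lists any files affected by permanent, unrecoverable errors:

    # -v names the files with unrecoverable errors
    sudo zpool status -v zdata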

Ideally I should have had the new drive connected up at the same time as the 2x1TB drives, but cabling and power wouldn’t let me. I’m not sure how I could have worked around that.

Second attempt

Bugger. Cancel the current “zpool replace” operation with zpool detach zdata ${NEW}, power the box down, swap the new drive back out for the old one and bring it back up. It automatically resilvered and, despite the increasing error count, got back to a state where all files were readable.
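A minimal sketch of backing out of a half-done replace, same placeholders as before:

    # detach the incoming drive to cancel the replace;
    # the pool reverts to its previous layout
    sudo zpool detach zdata ${NEW}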

Some reading and I realised I could boot without my SSD, so…

This time I shut the box down, removed the SSD boot drive and plugged the new 2TB drive in its place, then booted up from a USB key. All four hard drives were now visible, so I ran zpool replace zdata ${OLD} ${NEW} and held my breath. Some time later it completed, after lots of read errors from the known-to-be-dying drive. I could now shut the box down, unplug the bad drive and plug in a second 2TB drive, then repeat the zpool replace to swap out the other 1TB drive. Once again this completed successfully, so after removing the 1TB drive, plugging the SSD back in and rebooting, I had my system back, now with a 2x2TB mirror and a 1x3TB vdev.
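One wrinkle when working from a live USB environment: the pool has to be imported by hand first, and the -f is usually needed because the live system’s hostid won’t match the one recorded in the pool:

    # force-import the pool under the live environment
    sudo zpool import -f zdata
    # ...run the zpool replace dance...
    # export cleanly before rebooting into the real system
    sudo zpool export zdata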

Third time unlucky, or perhaps lucky

A week passed, and then one of the two “new” 2TB drives suddenly developed a number of checksum errors, so I decided to replace it with a third 2TB drive I had lying around. A zpool scrub reported no problems, and I decided to again boot from USB and replace the drive while the bad one was still available, to avoid a repeat of the problems from my first attempt.
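The scrub-and-check routine, for completeness; the clear at the end resets the per-device error counters so that any fresh errors stand out:

    # walk the whole pool, verifying every block against its checksum
    sudo zpool scrub zdata
    # inspect the result and the per-device error counters
    sudo zpool status -v zdata
    # reset the counters once you've noted them
    sudo zpool clear zdata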

No such luck. I ran “zpool replace” and stepped back to wait… and wait… and wait. Ten days into the replace, with “zpool status” alternating between reporting 200GB done and 1.5GB done, I gave up: a zpool detach of the third new drive to kill the replace, power down, unplug the bad drive, and bring the box back up with half a mirror and one new drive. This time “zpool replace” completed in about 45 minutes!
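The sequence that finally worked, roughly; once the bad drive is physically gone, ${OLD} refers to the now-missing device (by name or by the GUID shown in zpool status):

    # kill the stalled replace by detaching the incoming drive
    sudo zpool detach zdata ${NEW}
    # ...power down, pull the bad drive, boot back up...
    # resilver onto the new drive from the surviving half of the mirror
    sudo zpool replace zdata ${OLD} ${NEW}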

Done

I think I’m done. The box is in no worse condition than when I started, and I still “hate my data” according to all the zfs conventions. I should either add a mirror to the unmirrored vdev, or replace the current pool with either a single 2-drive mirror or a 3-drive raidz1 set. Now if only zfs allowed removal of vdevs.
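Adding that mirror would just be a zpool attach; a sketch, assuming ${THREE_TB} is the existing lone drive and ${NEW_3TB} a matching new one:

    # attach turns the single-disk vdev into a two-way mirror
    sudo zpool attach zdata ${THREE_TB} ${NEW_3TB}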

Lessons? Don’t just make a backup: check that it can actually be read back before playing games like this. Don’t mix your home development mucking-about box with your home server.