
Using mdadm to recover from a dead disk in a Linux RAID-1 array

Yes, it's that time of the year again. A disk in my desktop-replacement laptop with 2 disks and a RAID-1 has died. Time for recovery.

This laptop has been running 24/7 for the last 3 years or so, so it's not too surprising that a disk dies. Surprisingly though, for the first time in a long series of dead disks, smartctl -a does indeed show errors for this disk. Here's a short snippet of those:

  $ smartctl -a /dev/sda
  [...]
  Error 1341 occurred at disk power-on lifetime: 17614 hours (733 days + 22 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   40 41 02 1f c0 9c 40  Error: UNC at LBA = 0x009cc01f = 10272799

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   60 f8 08 20 c0 9c 40 00  41d+01:51:50.974  READ FPDMA QUEUED
   60 08 00 18 c0 9c 40 00  41d+01:51:50.972  READ FPDMA QUEUED
   ef 10 02 00 00 00 a0 00  41d+01:51:50.972  SET FEATURES [Reserved for Serial ATA]
   ec 00 00 00 00 00 a0 00  41d+01:51:50.971  IDENTIFY DEVICE
   ef 03 45 00 00 00 a0 00  41d+01:51:50.971  SET FEATURES [Set transfer mode]

  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed: read failure       90%     20511         156170102
  [...]

The status of the degraded RAID array looks like this:

  $ cat /proc/mdstat
  Personalities : [raid1] 
  md1 : active raid1 sdb7[1]
       409845696 blocks [2/1] [_U]
  md0 : active raid1 sda6[0] sdb6[1]
       291776 blocks [2/2] [UU]

The [_U] means that one of the two disks has failed; normally it should read [UU]. There are actually two RAID-1s: a small md0 (sda6 + sdb6) for /boot, and the main md1 (sda7 + sdb7), which holds the OS and my data. Apparently (at first, at least) only sda7 was faulty and got kicked out of the array:

  $ dmesg | grep kick
  md: kicking non-fresh sda7 from array!
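
For a more detailed view of the degraded array than /proc/mdstat gives you, mdadm --detail can be used (output omitted here):

  $ mdadm --detail /dev/md1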

Anyway, so I ordered a replacement disk, removed the dead disk (I had checked the serial number and brand beforehand, so I wouldn't accidentally remove the wrong one), inserted the new disk, and rebooted.

Note: For this to work you have to have installed the bootloader (usually GRUB) onto both disks beforehand, otherwise you won't be able to boot from the surviving disk (which is exactly what you'll want to do if one of them dies, of course). In my case, sda was now dead, so I put sdb into its place (physically, by using the other SATA connector/port), and the new replacement disk became the new sdb.
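
For reference, putting GRUB onto both disks (while both are still healthy) is just a matter of running grub-install once per disk, e.g.:

  $ grub-install /dev/sda
  $ grub-install /dev/sdb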

After the reboot, the new disk needs to be partitioned like the other RAID disk. This can be done easily by copying the partition layout of the "good" disk (now sda after the reboot) onto the empty disk (sdb):

  $ sfdisk -d /dev/sda | sfdisk /dev/sdb
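
To double-check that the copy worked, you can list and compare both partition tables (output omitted here):

  $ sfdisk -l /dev/sda
  $ sfdisk -l /dev/sdb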

Specifically, the RAID partitions need to have the partition type/ID "fd" ("Linux raid autodetect"); check that this is the case. Then you can add the new disk to the RAIDs:

  $ mdadm /dev/md0 --add /dev/sdb6
  $ mdadm /dev/md1 --add /dev/sdb7

After a few hours the RAID will be re-synced properly and all is good again. You can check the progress via:

  $ watch -n 1 cat /proc/mdstat

You should probably not reboot during the resync (though I'm not 100% sure if that would be an issue in practice; please leave a comment if you know).

Also, don't forget to install GRUB on the new disk so you can still boot when the next disk dies:

  $ grub-mkdevicemap
  $ grub-install /dev/sdb

And it might be a good idea to use S.M.A.R.T. to check the new disk, just in case. I did a quick run for the new disk via:

  $ smartctl -t short /dev/sdb # Wait a few minutes after this.
  $ smartctl -a /dev/sdb
  [...]
  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%        22         -
  [...]

Looks good. So far.

Using the HP Pavilion dv7-3127eg laptop with Debian GNU/Linux


Yep, so I bought a new laptop recently, my IBM/Lenovo Thinkpad T40p was slowly getting really unbearably sloooow (Celeron 1.5 GHz, 2 GB RAM max). After comparing some models I set out to buy a certain laptop in a local store, which they didn't have in stock, so I spontaneously got another model, the HP Pavilion dv7-3127eg (HP product number VY554EA).

Why this one? Well, the killer feature for me was that it has two SATA disks, which allows me to run a RAID-1 in my laptop. This lets me sleep better at night, knowing that the next dying disk will not necessarily lead to data loss (yes, I do still perform regular backups, of course).

Other pros: It's much faster than the old notebook (an AMD Turion II Dual-Core Mobile M520 at 2.3 GHz per core), it has 4 GB RAM (8 GB max), and it uses an AMD RS780 / SB700 chipset which is supported by the Free-Software / Open-Source BIOS / firmware project coreboot, so this might make the laptop a good coreboot target in the long run. I'll probably start working on that when I'm willing to open / dissect it, or when the warranty expires, whichever happens first.

Anyway, I set up a page at randomprojects.org which contains lots more details about using Linux on this laptop:

http://randomprojects.org/wiki/HP_Pavilion_dv7-3127eg

Most of the hardware is supported out of the box, though I haven't tested everything yet. There may be issues with suspend-to-disk / suspend-to-RAM; sometimes it seems to hang (maybe just a simple config change in /etc/hibernate/disk.cfg is needed).

Cons: Pretty big and heavy (but that's OK, I use it mostly as "semi-mobile desktop replacement"), glossy screen, loud fans (probably due to the two disks).

For reference, here's an lspci of the box:

  $ lspci -tvnn
  -[0000:00]-+-00.0  Advanced Micro Devices [AMD] RS780 Host Bridge Alternate [1022:9601]
           +-02.0-[01]--+-00.0  ATI Technologies Inc M96 [Mobility Radeon HD 4650] [1002:9480]
           |            \-00.1  ATI Technologies Inc RV710/730 [1002:aa38]
           +-04.0-[02-07]--
           +-05.0-[08]----00.0  Atheros Communications Inc. AR9285 Wireless Network Adapter (PCI-Express) [168c:002b]
           +-06.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller [10ec:8168]
           +-0a.0-[0a]--
           +-11.0  ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode] [1002:4391]
           +-12.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller [1002:4397]
           +-12.1  ATI Technologies Inc SB700 USB OHCI1 Controller [1002:4398]
           +-12.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller [1002:4396]
           +-13.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller [1002:4397]
           +-13.1  ATI Technologies Inc SB700 USB OHCI1 Controller [1002:4398]
           +-13.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller [1002:4396]
           +-14.0  ATI Technologies Inc SBx00 SMBus Controller [1002:4385]
           +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA) [1002:4383]
           +-14.3  ATI Technologies Inc SB700/SB800 LPC host controller [1002:439d]
           +-14.4-[0b]--
           +-18.0  Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration [1022:1200]
           +-18.1  Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Address Map [1022:1201]
           +-18.2  Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] DRAM Controller [1022:1202]
           +-18.3  Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Miscellaneous Control [1022:1203]
           \-18.4  Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] Link Control [1022:1204]

Full lspci -vvvxxxxnnn, lsusb -vvv, and a much more detailed list of tested hardware components is available in the wiki.

How to set up an encrypted USB-disk software-RAID-1 on Debian GNU/Linux using mdadm and cryptsetup

This is what I set up for backups recently, using a cheap USB enclosure which can house 2 SATA disks and presents them to the system as 2 USB mass-storage devices (over a single USB cable). Without any further introduction, here's the HOWTO:

First, create one big partition of the exact same size on each of the two disks (/dev/sdc and /dev/sdd in my case). The cfdisk details are omitted here.

  $ cfdisk /dev/sdc
  $ cfdisk /dev/sdd
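
One way to make sure both partitions end up with exactly the same size is to partition the first disk interactively and then simply copy its layout onto the second disk, just like in the RAID-1 recovery described above:

  $ sfdisk -d /dev/sdc | sfdisk /dev/sdd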

Then, create a new RAID array using the mdadm utility:

  $ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

The array is named md0, consists of the two devices /dev/sdc1 and /dev/sdd1 (--raid-devices=2), and is a RAID-1 array (--level=1), i.e. the data is simply mirrored onto both disks, so if one of them fails you don't lose anything. After the array has been created it will be synchronized so that both disks contain the same data (this initial sync will take a long time). You can watch the current status via:

  $ cat /proc/mdstat
  Personalities : [raid1]
  md0 : active raid1 sdd1[1] sdc1[0]
        1465135869 blocks super 1.1 [2/2] [UU]
        [>....................]  resync =  0.0% (70016/1465135869) finish=2440.6min speed=10002K/sec
  unused devices: <none>

Some more info is also available from mdadm:

  $ mdadm --detail --scan
  ARRAY /dev/md0 metadata=1.01 name=foobar:0 UUID=1234578:1234578:1234578:1234578

  $ mdadm --detail /dev/md0
  /dev/md0:
          Version : 1.01
    Creation Time : Sat Feb  6 23:58:51 2010
       Raid Level : raid1
       Array Size : 1465135869 (1397.26 GiB 1500.30 GB)
    Used Dev Size : 1465135869 (1397.26 GiB 1500.30 GB)
     Raid Devices : 2
    Total Devices : 2
      Persistence : Superblock is persistent
      Update Time : Sun Feb  7 00:03:21 2010
            State : active, resyncing
   Active Devices : 2
  Working Devices : 2
   Failed Devices : 0
    Spare Devices : 0
   Rebuild Status : 0% complete
             Name : foobar:0  (local to host foobar)
             UUID : 1234578:1234578:1234578:1234578
           Events : 1
      Number   Major   Minor   RaidDevice State
         0       8       33        0      active sync   /dev/sdc1
         1       8       49        1      active sync   /dev/sdd1
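
If you wanted the array to be assembled automatically at boot (not needed here, since this backup array is assembled by hand below), you could append the scan output to mdadm's config file:

  $ mdadm --detail --scan >> /etc/mdadm/mdadm.conf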

Next, you'll want to create a big partition on the RAID device (cfdisk details omitted)...

  $ cfdisk /dev/md0

...and then encrypt all the (future) data on the device using dm-crypt+LUKS and cryptsetup:

  $ cryptsetup --verbose --verify-passphrase luksFormat /dev/md0p1
  Enter your desired passphrase here (twice)
  $ cryptsetup luksOpen /dev/md0p1 myraid

After opening the encrypted container with cryptsetup luksOpen you can create a filesystem on it (ext3 in my case; -j enables the journal, and -m 0 reserves no blocks for the superuser, which is fine for a pure backup/data disk):

  $ mkfs.ext3 -j -m 0 /dev/mapper/myraid

That's about it. In the future you can access the RAID data by using the steps below.

Starting the RAID and mounting the drive:

  $ mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1
  $ cryptsetup luksOpen /dev/md0p1 myraid
  $ mount -t ext3 /dev/mapper/myraid /mnt

Shutting down the RAID:

  $ umount /mnt
  $ cryptsetup luksClose myraid
  $ mdadm --stop /dev/md0
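
If you do this regularly, the start/stop steps can be wrapped in a tiny script. Here's a minimal sketch using the device names and mount point from above (name and location of the script are up to you):

  #!/bin/sh
  # Start (assemble, decrypt, mount) or stop (unmount, close, stop) the backup RAID.
  set -e
  case "$1" in
    start)
      mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1
      cryptsetup luksOpen /dev/md0p1 myraid
      mount -t ext3 /dev/mapper/myraid /mnt
      ;;
    stop)
      umount /mnt
      cryptsetup luksClose myraid
      mdadm --stop /dev/md0
      ;;
    *)
      echo "Usage: $0 start|stop" >&2
      exit 1
      ;;
  esac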

That's all. Performance is shitty due to all the data being shoved out over one USB cable (and USB itself being too slow for these amounts of data), but I don't care too much about that as this setup is meant for backups, not performance-critical stuff.

Update 04/2011: Thanks to Bohdan Zograf there's a Belorussian translation of this article now!

Recovering from a dead disk in a Linux software-RAID5 system using mdadm


As I wrote quite a while ago, I set up a RAID5 with three IDE disks at home, which I'm using as backup (yes, I know that RAID != backup) and storage space.

A few days ago, the RAID was put to a real-life test for the first time, as one of the disks died. Here's what that looks like in dmesg:

raid5: raid level 5 set md1 active with 3 out of 3 devices, algorithm 2
RAID5 conf printout:
 --- rd:3 wd:3
 disk 0, o:1, dev:hda2
 disk 1, o:1, dev:hdg2
 disk 2, o:1, dev:hde2
[...]
hdg: dma_timer_expiry: dma status == 0x21
hdg: DMA timeout error
hdg: 4 bytes in FIFO
hdg: dma timeout error: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown
hdg: dma_timer_expiry: dma status == 0x21
hdg: DMA timeout error
hdg: 252 bytes in FIFO
hdg: dma timeout error: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown
hdg: dma_timer_expiry: dma status == 0x21
hdg: DMA timeout error
hdg: 252 bytes in FIFO
hdg: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdg: DMA disabled
ide3: reset: success
hdg: dma_timer_expiry: dma status == 0x21
hdg: DMA timeout error
hdg: 252 bytes in FIFO
hdg: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdg: DMA disabled
ide3: reset: success
hdg: status timeout: status=0x80 { Busy }
ide: failed opcode was: 0xea
hdg: drive not ready for command
hdg: lost interrupt
hdg: task_out_intr: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown
hdg: lost interrupt
hdg: task_out_intr: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown

That's when I realized that something was horribly wrong.

Not long after that, the messages below appeared in dmesg. As you can see, the software RAID automatically realized that a drive had died and removed the faulty disk from the array. I did not lose any data, and the system did not freeze up; I could continue working as if nothing had happened (as it should be).

 md: super_written gets error=-5, uptodate=0
 raid5: Disk failure on hdg2, disabling device.
 raid5: Operation continuing on 2 devices.
 RAID5 conf printout:
  --- rd:3 wd:2
  disk 0, o:1, dev:hda2
  disk 1, o:0, dev:hdg2
  disk 2, o:1, dev:hde2
 RAID5 conf printout:
  --- rd:3 wd:2
  disk 0, o:1, dev:hda2
  disk 2, o:1, dev:hde2

This is how you can check the current RAID status:

 $ cat /proc/mdstat
 Personalities : [raid6] [raid5] [raid4] 
 md1 : active raid5 hda2[0] hde2[2] hdg2[3](F)
       584107136 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

The "U_U" means two of the disks are OK, and one is faulty/removed. The desired state is "UUU", which means all three disks are OK.

The next steps are to replace the dead drive with a new one, but first you should know exactly which disk you need to remove (in my case: hda, hde, or hdg). If you remove the wrong one, you're screwed. The RAID will be dead and all your data will be lost (RAID5 can survive only one dead disk at a time).

The safest way (IMHO) to know which disk to remove is to write down the serial number of the disk, e.g. using smartctl, and then check the back side of each disk for the matching serial number.

 $ smartctl -i /dev/hda | grep Serial
 $ smartctl -i /dev/hde | grep Serial
 $ smartctl -i /dev/hdg | grep Serial

(Ideally, you should note down the serial numbers before one of the disks dies.)

Now power down the PC and remove the correct drive. Get a new drive which is at least as big as the one you removed. As this is software RAID you have quite a lot of flexibility: the new drive doesn't have to be from the same vendor or series; it doesn't even have to be of the same type (e.g. I got a SATA disk instead of another IDE one).

Insert the drive into some other PC in order to partition it correctly (e.g. using fdisk or cfdisk). In my case I needed a 1 GB /boot partition for GRUB, and the rest of the drive is another partition of the type "Linux RAID auto", which the software-RAID will then recognize.
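
Just as a sketch, with a reasonably recent sfdisk such a layout (a 1 GB bootable /boot partition, rest as "Linux raid autodetect") could also be created non-interactively like this; /dev/sda is only an assumption for what the new disk will be called, and the old sfdisk from back then used a different input format, so double-check before running it:

 $ printf 'label: dos\n,1GiB,83,*\n,,fd\n' | sfdisk /dev/sda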

Then, put the drive into the RAID PC and power it up. After a successful boot (remember, 2 out of 3 disks in a RAID5 are sufficient for a working system) you'll have to add the new drive to the RAID:

 $ mdadm --manage /dev/md1 --add /dev/sda2
 mdadm: added /dev/sda2

My new SATA drive ended up being /dev/sda, so the RAID partition on it is /dev/sda2, which I added using mdadm. The RAID immediately starts restoring/resyncing all data onto that drive, which may take a while (2-3 hours, depending on the RAID size and some other factors). You can check the current progress with:

 $ cat /proc/mdstat 
 Personalities : [raid6] [raid5] [raid4] 
 md1 : active raid5 sda2[3] hda2[0] hde2[2]
       584107136 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
       [>....................]  recovery =  0.1% (473692/292053568) finish=92.3min speed=52632K/sec

As soon as this process is finished you'll see this in dmesg:

 md: md1: recovery done.
 RAID5 conf printout:
  --- rd:3 wd:3
  disk 0, o:1, dev:hda2
  disk 1, o:1, dev:sda2
  disk 2, o:1, dev:hde2

In /proc/mdstat you'll see "UUU" again, which means your RAID is fully functional and redundant (with three disks) again. Yay.

 $ cat /proc/mdstat
 Personalities : [raid6] [raid5] [raid4] 
 md1 : active raid5 sda2[1] hda2[0] hde2[2]
       584107136 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

Btw, another nice utility you might find useful is hddtemp, which can check the temperature of the drives. You should take care that they don't get too hot, especially if the RAID runs 24/7.

 $ hddtemp /dev/hda
 /dev/hda: SAMSUNG HD300LD: 38 °C
 $ hddtemp /dev/hde
 /dev/hde: SAMSUNG HD300LD: 44 °C
 $ hddtemp /dev/sda
 /dev/sda: SAMSUNG HD322HJ: 32 °C

RAID5 + dm-crypt + LVM + ext3 Debian install and benchmarks

OK, so I've set up a RAID5 at home because I'm getting tired of failed disk drives and data losses.

Some notes:

  • The system consists of 3 x 300 GB IDE drives in software RAID5 (standard Linux kernel and mdadm), thus ca. 600 GB usable storage space.
  • I've used the stock Debian installer to set up all of this, no custom hacks or anything needed.
  • Each drive is on its own IDE bus/controller (1x onboard/internal, 2x on a PCI IDE controller card), as broken IDE disks (lacking hot-swap capabilities) often take down the whole IDE bus with them; it's not a good idea to put two disks on one IDE bus.
  • The software stack is: RAID5 at the bottom, dm-crypt on top of that to encrypt the whole RAID, and LVM on top of that to partition the system into /, /usr, /var, /tmp, /home, and swap (see the sketch after this list).
  • /boot is on an extra 1 GB partition (replicated on each drive) as GRUB doesn't work on RAIDed disks and I want to use GRUB, not LILO. GRUB is also installed on the MBR of each drive, so if one of them fails, the other two can still come up.
  • I installed and configured smartmontools to check the status of the drives, and hddtemp to check their temperature.
  • Stability tests so far: While the system is running, pull out one of the IDE drives (yes, they're not hot-swappable and that may not be such a good idea, usually). The system survived without data loss. Time for rebuilding the array: ca. 1 hour. Second test: while the system is running, pull the plug. The system survived that, too.
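
The Debian installer handled all of this layering for me, but done by hand the stack would look roughly like the following sketch (the volume-group and logical-volume names are made up, only a few of the logical volumes are shown, and the partitions are the ones from the RAID5 article above):

  $ mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/hda2 /dev/hdg2 /dev/hde2
  $ cryptsetup --verbose --verify-passphrase luksFormat /dev/md1
  $ cryptsetup luksOpen /dev/md1 md1_crypt
  $ pvcreate /dev/mapper/md1_crypt
  $ vgcreate vg0 /dev/mapper/md1_crypt
  $ lvcreate -L 10G -n root vg0
  $ lvcreate -L 2G -n swap_1 vg0
  $ lvcreate -l 100%FREE -n home vg0
  $ mkfs.ext3 /dev/vg0/root
  $ mkfs.ext3 /dev/vg0/home
  $ mkswap /dev/vg0/swap_1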

Some stats from bonnie++ if anybody cares:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bonsai           2G 26727  72 39426  19 16690   7 28725  65 34164   7 215.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
bonsai,2G,26727,72,39426,19,16690,7,28725,65,34164,7,215.3,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

(Now, if I only knew what all those figures mean ;-)

No, neither the software RAID5, nor the dm-crypt layer, nor LVM causes any noticeable performance degradation (subjectively, at least; I don't care enough to actually measure it). The CPU is idling all the time.

Power consumption is rather high (partly due to the mainboard and CPU, but also because of the disks + fans) and the system is pretty loud, both of which suck in the long run. For later, I plan an ultra-silent, ultra-low-power RAID5 with 2.5" disks attached via USB to a (silent, low-power) NSLU2.
