Using mdadm to recover from a dead disk in a Linux RAID-1 array

Yes, it's that time of the year again. A disk in my desktop-replacement laptop with 2 disks and a RAID-1 has died. Time for recovery.

This laptop has been running 24/7 for the last 3 years or so, so it's not too surprising that a disk died. What is surprising, though: for the first time in a long series of dead disks, smartctl -a actually shows errors for this disk. Here's a short snippet of those:

  $ smartctl -a /dev/sda
  Error 1341 occurred at disk power-on lifetime: 17614 hours (733 days + 22 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   -- -- -- -- -- -- --
   40 41 02 1f c0 9c 40  Error: UNC at LBA = 0x009cc01f = 10272799

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   60 f8 08 20 c0 9c 40 00  41d+01:51:50.974  READ FPDMA QUEUED
   60 08 00 18 c0 9c 40 00  41d+01:51:50.972  READ FPDMA QUEUED
   ef 10 02 00 00 00 a0 00  41d+01:51:50.972  SET FEATURES [Reserved for Serial ATA]
   ec 00 00 00 00 00 a0 00  41d+01:51:50.971  IDENTIFY DEVICE
   ef 03 45 00 00 00 a0 00  41d+01:51:50.971  SET FEATURES [Set transfer mode]

  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed: read failure       90%     20511         156170102

The status of the degraded RAID array looks like this:

  $ cat /proc/mdstat
  Personalities : [raid1] 
  md1 : active raid1 sdb7[1]
       409845696 blocks [2/1] [_U]
  md0 : active raid1 sda6[0] sdb6[1]
       291776 blocks [2/2] [UU]

The [_U] means that one of the two disks has failed; it should normally be [UU]. There are actually two RAID-1 arrays, a small md0 (sda6 + sdb6) for /boot and the main md1 (sda7 + sdb7) which holds the OS and my data. Apparently (at first at least), only sda7 was faulty and got kicked out of the array:

  $ dmesg | grep kick
  md: kicking non-fresh sda7 from array!
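
To spot a degraded array without reading the output by eye, something like the following sketch could be used. The function name degraded_arrays is made up for this example, and it assumes the /proc/mdstat layout shown above (an "mdN :" line followed by a status line ending in [UU], [_U] and so on):

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# the name of every array whose member map (e.g. [_U]) has a missing disk.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ {     # status line with a "_" slot
            print name
        }
    '
}
```

Usage would be degraded_arrays < /proc/mdstat, which for the output above should print md1.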

Anyway, so I ordered a replacement disk, removed the dead disk (I checked the serial number and brand beforehand, so I wouldn't accidentally remove the wrong one), inserted the new disk and rebooted.

Note: In order for this to work you have to have (previously) installed the bootloader (usually GRUB) onto both disks, otherwise you won't be able to boot from either of them (which you'll want to do if one of them dies, of course). In my case, sda was now dead, so I put sdb into its place (physically, by using the other SATA connector/port) and the new replacement disk would become the new sdb.

After the reboot, the new disk needs to be partitioned like the other RAID disk. This can be done easily by copying the partition layout of the "good" disk (now sda after the reboot) onto the empty disk (sdb):

  $ sfdisk -d /dev/sda | sfdisk /dev/sdb

Specifically, the RAID partitions need to have the type/ID "fd" ("Linux raid autodetect"); check that this is the case. Then, you can add the new disk to the RAIDs:

  $ mdadm /dev/md0 --add /dev/sdb6
  $ mdadm /dev/md1 --add /dev/sdb7
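
To double-check the partition types before running the --add commands above, the sfdisk dump can be filtered. This is only a sketch: raid_partitions is a hypothetical helper, and it assumes the dump marks the type either as Id=fd (older sfdisk versions) or as type=fd (newer ones).

```shell
# Hypothetical helper: read "sfdisk -d" output on stdin and print the
# partitions whose type is fd ("Linux raid autodetect").
raid_partitions() {
    awk '/^\/dev\// && /(Id|type)= *fd(,| |$)/ { print $1 }'
}
```

For example, sfdisk -d /dev/sdb | raid_partitions should list sdb6 and sdb7 if the layout was copied correctly.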

After a few hours the RAID will be re-synced properly and all is good again. You can check the progress via:

  $ watch -n 1 cat /proc/mdstat

You should probably not reboot during the resync (though I'm not 100% sure if that would be an issue in practice; please leave a comment if you know).

Also, don't forget to install GRUB on the new disk so you can still boot when the next disk dies:

  $ grub-mkdevicemap
  $ grub-install /dev/sdb

And it might be a good idea to use S.M.A.R.T. to check the new disk, just in case. I did a quick run for the new disk via:

  $ smartctl -t short /dev/sdb # Wait a few minutes after this.
  $ smartctl -a /dev/sdb
  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%        22         -
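
If you don't want to eyeball the self-test log, a filter along these lines might help. It is a sketch only: failed_selftests is a hypothetical name, and it assumes the log format shown above, where every entry starts with "# <number>" and a clean run says "Completed without error".

```shell
# Hypothetical helper: read "smartctl -a" output on stdin and print every
# self-test log entry that is not reported as "Completed without error".
failed_selftests() {
    awk '/^ *# *[0-9]+ / && !/Completed without error/ { print }'
}
```

With smartctl -a /dev/sdb | failed_selftests, empty output means no logged self-test reported a problem. Note that a test that is still running or was aborted would also show up here.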

Looks good. So far.

How to set up an encrypted USB-disk software-RAID-1 on Debian GNU/Linux using mdadm and cryptsetup

This is what I set up for backups recently using a cheap USB-enclosure which can house 2 SATA disks and shows them as 2 USB mass-storage devices to my system (using only one USB cable). Without any further introduction, here goes the HOWTO:

First, create one big partition on each of the two disks (/dev/sdc and /dev/sdd in my case) of the exact same size. The cfdisk details are omitted here.

  $ cfdisk /dev/sdc
  $ cfdisk /dev/sdd

Then, create a new RAID array using the mdadm utility:

  $ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

The array is named md0, consists of the two devices (--raid-devices=2) /dev/sdc1 and /dev/sdd1, and it's a RAID-1 array, i.e. data is simply mirrored on both disks so if one of them fails you don't lose data (--level=1). After this has been done the array will be synchronized so that both disks contain the same data (this process will take a long time). You can watch the current status via:

  $ cat /proc/mdstat
  Personalities : [raid1]
  md0 : active raid1 sdd1[1] sdc1[0]
        1465135869 blocks super 1.1 [2/2] [UU]
        [>....................]  resync =  0.0% (70016/1465135869) finish=2440.6min speed=10002K/sec
  unused devices: <none>
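
For scripts, the progress line can be reduced to just the interesting bits. The following is a sketch under the assumption that the status line looks like the one above ("resync = 0.0% ... finish=2440.6min"); resync_status is a made-up name.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "<array> <percent done> <estimated time left>" for each array that is
# currently resyncing or recovering.
resync_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(resync|recovery) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       { pct = $i; sub(/^=/, "", pct) }
                if ($i ~ /^finish=/) { eta = $i; sub(/^finish=/, "", eta) }
            }
            print name, pct, eta
        }
    '
}
```

Run as resync_status < /proc/mdstat; for the output above it should print "md0 0.0% 2440.6min".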

Some more info is also available from mdadm:

  $ mdadm --detail --scan
  ARRAY /dev/md0 metadata=1.01 name=foobar:0 UUID=1234578:1234578:1234578:1234578

  $ mdadm --detail /dev/md0
          Version : 1.01
    Creation Time : Sat Feb  6 23:58:51 2010
       Raid Level : raid1
       Array Size : 1465135869 (1397.26 GiB 1500.30 GB)
    Used Dev Size : 1465135869 (1397.26 GiB 1500.30 GB)
     Raid Devices : 2
    Total Devices : 2
      Persistence : Superblock is persistent
      Update Time : Sun Feb  7 00:03:21 2010
            State : active, resyncing
   Active Devices : 2
  Working Devices : 2
   Failed Devices : 0
    Spare Devices : 0
   Rebuild Status : 0% complete
             Name : foobar:0  (local to host foobar)
             UUID : 1234578:1234578:1234578:1234578
           Events : 1
      Number   Major   Minor   RaidDevice State
         0       8       33        0      active sync   /dev/sdc1
         1       8       49        1      active sync   /dev/sdd1

Next, you'll want to create a big partition on the RAID device (cfdisk details omitted)...

  $ cfdisk /dev/md0

...and then encrypt all the (future) data on the device using dm-crypt+LUKS and cryptsetup:

  $ cryptsetup --verbose --verify-passphrase luksFormat /dev/md0p1
  Enter your desired passphrase here (twice)
  $ cryptsetup luksOpen /dev/md0p1 myraid

After opening the encrypted container with cryptsetup luksOpen you can create a filesystem on it (ext3 in my case):

  $ mkfs.ext3 -j -m 0 /dev/mapper/myraid

That's about it. In future you can access the RAID data by using the steps below.

Starting the RAID and mounting the drive:

  $ mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1
  $ cryptsetup luksOpen /dev/md0p1 myraid
  $ mount -t ext3 /dev/mapper/myraid /mnt

Shutting down the RAID:

  $ umount /mnt
  $ cryptsetup luksClose myraid
  $ mdadm --stop /dev/md0
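
The two sequences above can be wrapped into a small script so the order is never mixed up. This is a minimal sketch using the device names from this article; the function names and the DRY_RUN switch are inventions of this example, not part of mdadm or cryptsetup.

```shell
# Device, mapping name and mount point as used in this article.
RAID_DEV=/dev/md0
RAID_MEMBERS="/dev/sdc1 /dev/sdd1"
CRYPT_DEV=/dev/md0p1
CRYPT_NAME=myraid
MOUNT_POINT=/mnt

# Print the command instead of running it when DRY_RUN is set
# (DRY_RUN is a convention made up for this sketch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

raid_start() {
    # $RAID_MEMBERS is intentionally unquoted so it splits into two devices.
    run mdadm --assemble "$RAID_DEV" $RAID_MEMBERS &&
    run cryptsetup luksOpen "$CRYPT_DEV" "$CRYPT_NAME" &&
    run mount -t ext3 "/dev/mapper/$CRYPT_NAME" "$MOUNT_POINT"
}

raid_stop() {
    # Reverse order: unmount, close the encrypted mapping, stop the array.
    run umount "$MOUNT_POINT" &&
    run cryptsetup luksClose "$CRYPT_NAME" &&
    run mdadm --stop "$RAID_DEV"
}
```

Setting DRY_RUN=1 before calling raid_start or raid_stop only prints the commands, which is handy for checking the device names before touching anything.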

That's all. Performance is shitty due to all the data being shoved out over one USB cable (and USB itself being too slow for these amounts of data), but I don't care too much about that as this setup is meant for backups, not performance-critical stuff.

Update 04/2011: Thanks to Bohdan Zograf there's a Belorussian translation of this article now!

Note to self: Missing lvm2 and cryptsetup packages lead to non-working initrd very, very soon

I recently almost died from a heart attack because after a really horrible crash (don't ask), Debian unstable on my laptop wouldn't boot anymore. The system hung at "Waiting for root filesystem...", and I was in panic mode as I feared I had lost all my data (and as usual my backups were waaay too old).

At first I suspected that something had actually been erased or mangled by the crash, either at the dm-crypt layer, or the LVM layer, or the ext3 filesystem on top of those. After several hours of messing with live CDs, cryptsetup, lvm commands (such as pvscan, pvs, vgchange, vgs, vgck) and finally fsck I still had not managed to successfully boot my laptop.

I finally was able to boot by changing the initrd from initrd.img-2.6.30-2-686 to initrd.img-2.6.30-2-686.bak in the GRUB2 menu (at boot-time), at which point it was clear that something was wrong with my current initrd. A bit of debugging and some initrd comparisons revealed the cause:

Both the cryptsetup and lvm2 packages were no longer installed on my laptop, which made all update-initramfs invocations (e.g. upon kernel package updates) create initrds which did not contain the required dm-crypt and LVM support. Hence, no booting for me. I only noticed because of the crash, as I usually do not reboot the laptop very often (two or three times per year maybe).

Now, as to why those packages were removed I have absolutely no idea. I did not remove them knowingly, so I suspect some dist-upgrade did it and I didn't notice (but I do carefully check which packages dist-upgrade tries to remove, usually)...
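
A check like the following could catch this before the next reboot. It is a sketch, not a tested safety net: missing_boot_tools is a hypothetical name, and it assumes the initrd contains binaries literally named cryptsetup and lvm, which may differ between releases.

```shell
# Hypothetical helper: read a file listing (one path per line) on stdin and
# print each tool from the list below that has no matching binary in it.
missing_boot_tools() {
    listing=$(cat)
    for tool in cryptsetup lvm; do
        # Match the tool name as the last path component only.
        if ! printf '%s\n' "$listing" | grep -Eq "(^|/)$tool\$"; then
            echo "$tool"
        fi
    done
}
```

On Debian, lsinitramfs (from initramfs-tools) prints such a listing, so lsinitramfs /boot/initrd.img-$(uname -r) | missing_boot_tools should print nothing on a healthy system.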

Resizing ext3-on-LVM-on-dmcrypt file systems, moving disk space from one LV to another

Back in 2008 I wrote a small article about resizing LVM physical volumes. I had to do something similar, but slightly more complicated, recently. My /usr logical volume (LV) was getting full on my laptop disk, thus I wanted to shrink another LV and move some of that space to /usr. Here's one way you can do that.

Requirements: a Live CD containing all required utilities (cryptsetup, LVM tools, resize2fs), I used grml.

Important: If you plan to perform any of these steps, make sure you have recent backups! I take no responsibility for any data loss you might experience. You have been warned!

First, shut down the laptop and boot using the Live CD. Then, open the dm-crypt device (/dev/hda3 in my case) by entering your passphrase:

  $ cryptsetup luksOpen /dev/hda3 foo

Activate all (newly available) LVM volume groups in that encrypted device:

  $ vgchange -a y

(maybe you also need a vgscan and/or lvscan, not sure)

Check how much free space we have for putting into our /usr LV:

  $ vgdisplay | grep Free
  Free  PE / Size       0 / 0   

OK, so we have none. Thus, we need to shrink another LV (/home, in my case) and put that newly freed space into the /usr LV. In order to do that, we have to check the current size of the /home LV:

  $ mount -t ext3 /dev/vg-whole/lv-home /mnt
  $ df --block-size=1M | grep -C 1 /mnt
  $ umount /mnt

(if you know how to find out the size of an ext3 file system without mounting it, please let me know) Update: See comments for suggestions.
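
One way to get the size without mounting: dumpe2fs -h (from e2fsprogs) prints the superblock, which includes a "Block count" and a "Block size" field. The sketch below multiplies the two and converts to MiB; fs_size_mib is a hypothetical name. Note that this is the full size of the file system, which is not necessarily the same number as the total that df reports.

```shell
# Hypothetical helper: read "dumpe2fs -h" output on stdin and print the
# file system size in MiB (block count * block size / 2^20).
fs_size_mib() {
    awk -F: '
        $1 == "Block count" { count = $2 + 0 }
        $1 == "Block size"  { size  = $2 + 0 }
        END { printf "%d\n", count * size / 1048576 }
    '
}
```

Usage would be something like dumpe2fs -h /dev/vg-whole/lv-home 2>/dev/null | fs_size_mib.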

Write down the total number of 1M blocks on the file system (116857 in my case); we'll need that later. Now run 'fsck' on the /home LVM logical volume, which is needed for the 'resize2fs' step afterwards. This will take quite a while.

  $ fsck -f /dev/vg-whole/lv-home

Next step is resizing the ext3 file system in the /home LVM logical volume, making it 1GB smaller than before (of course you must have >= 1 GB of free space on /home for that to work). We use fancy bash calculations to do the math.

Note: I'm not so sure about the sizes here, in my first attempt something went wrong and resize2fs said "filesystem too small" or the like. Maybe I'm confusing the size units from 'df' and 'resize2fs', or the bash calculation goes wrong? Please leave a comment if you know more!

  $ resize2fs /dev/vg-whole/lv-home $((116857-1024))M
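
The arithmetic itself can be checked in isolation before touching the disk. A small sketch, assuming all sizes are in units of 2^20 bytes (which is what df --block-size=1M prints); new_fs_size and fits_in_lv are hypothetical helper names:

```shell
# All sizes in MiB.
new_fs_size() {
    # $1: current file system size, $2: amount to shrink by
    echo "$(( $1 - $2 ))M"
}

fits_in_lv() {
    # $1: file system size, $2: LV size after lvreduce; succeeds if it fits
    [ "$1" -le "$2" ]
}

new_fs_size 116857 1024    # prints 115833M
```

The second helper encodes the one rule that must hold after the next step: the file system may be smaller than the LV, but never larger.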

Then, we can safely reduce the LV itself. Note: order is very important here, you must shrink the ext3 filesystem first, and then shrink the LV! Doing it the other way around will destroy your filesystem!

  $ lvreduce -L -1G /dev/vg-whole/lv-home

Now that we have 1 GB of free space to spend on LVs, we assign that space to the /usr LVM logical volume like this:

  $ lvextend -L +1G /dev/vg-whole/lv-usr

As usual, we then run 'fsck' on the filesystem in order to be able to use 'resize2fs' to resize it to the biggest possible size (that's the default if resize2fs gets no parameters):

  $ fsck -f /dev/vg-whole/lv-usr
  $ resize2fs /dev/vg-whole/lv-usr

That's it. You can now shut down the Live CD system and boot into the normal OS with the new space allocations:

  $ vgchange -a n
  $ cryptsetup luksClose foo
  $ halt

Towards a moderately paranoid Debian laptop setup [Update]

I was planning to set up my laptop from scratch for a while now... so I did.


  • First, go home. No, really! Do all of this at home in a non-hostile, firewalled network. You don't want to be in a crowded place such as a conference where people can shoulder-surf your passwords, nor do you want your network traffic sniffed or MITM'd in a hostile network.
  • Backup all your data! You'll be wiping your whole drive soon, so make sure you have recent, tested backups.
  • Get the most recent Debian-installer ISO image (currently etch-beta3), as well as the MD5SUMS and MD5SUMS.sign files.
  • Run gpg --verify MD5SUMS.sign, which will fail but tell you the signing key ID (88C7C1F7 in this case). Get the key and re-run the verification: gpg --recv-key --keyserver 88C7C1F7 && gpg --verify MD5SUMS.sign. The output should now say "Good signature from [...]".
  • Now check the MD5 checksums via md5sum -c MD5SUMS. The output should contain debian-testing-i386-binary-1.iso: OK.
  • As you now have (somewhat) verified the integrity of the ISO image, burn it on a CD-R: wodim debian-testing-i386-binary-1.iso.
  • Put trusted versions of some files on a USB thumb drive (or CD-ROM); at least a firewall script, but maybe also your bashrc, bash_logout, inputrc, vimrc, muttrc.
  • Disconnect your laptop from any kinds of networks. Pull all ethernet cables. Disable WLAN (via hardware killswitch). Disable Bluetooth. Disable/remove Firewire, USB, serial, whatever.
  • Put on your tin-foil hat (optional).
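
The checksum step from the list above can be narrowed down to the one image you actually downloaded, so md5sum doesn't complain about all the other files listed in MD5SUMS. A minimal sketch; verify_one is a hypothetical name, and it assumes the usual "hash  filename" layout of the checksum file:

```shell
# Hypothetical helper: check a single file against a checksum list.
# $1: checksum list (e.g. MD5SUMS), $2: file name as listed in it
verify_one() {
    # Pick the matching line (text or "*binary" notation) and let
    # md5sum verify it; "-" makes md5sum read the list from stdin.
    awk -v f="$2" '$2 == f || $2 == "*" f' "$1" | md5sum -c -
}
```

For example, verify_one MD5SUMS debian-testing-i386-binary-1.iso, run in the directory that holds both files. The exit status is non-zero if the checksum doesn't match or the file isn't listed at all.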


  • Set a good BIOS boot password (which you need to boot any OS). Set a (different) good BIOS setup password (which you need to enter the BIOS).
  • Disable all boot possibilities in the BIOS, except for CD-ROM. This means it should not be possible to boot via USB, hard drive, network, PXE, Firewire, floppy, whatever. The BIOS setup password helps to prevent tampering with this setting.
  • Finally, never rely on BIOS passwords alone for security! They can often be circumvented very easily.

Installation / Setting up full-disk encryption using dm-crypt

  • Insert the installer CD and boot in expert-mode (don't hit ENTER when you boot, but rather type "expert").
  • As for networking: select "Do not configure the network at this time". We'll fix and enable networking later.
  • Partitioning:

    • Select manual partitioning. Remove all partitions (if any). Create a 100 MB /boot (ext3) as primary partition, and make the rest of the hard drive one huge partition which has "Use as:" set to "physical volume for encryption".
    • The standard options for cipher, key size, IV mode etc. should be fine (AES, 256 bit, CBC-ESSIV-SHA256, dm-crypt).
    • After the erasing is done (this is important!), use the whole encrypted space as "physical volume for LVM". Then select "Configure the Logical Volume Manager". Create one big volume group and a bunch of logical volumes for the various partitions we'll use (lv-root, lv-usr, lv-var, lv-tmp, lv-swap, lv-home).
    • It is extremely important that your swap space is encrypted (in this case it is, as all partitions except for /boot reside on a dm-crypt device)! Never set up unencrypted swap!
  • Enable shadow passwords. Allow login as root (I feel confident that I won't do stupid things as root).
  • Choose a good root password, and a (different) good user password. Don't enter a full name for the user.
  • Choose the latest kernel (old kernels might have security issues). Do not participate in popcon.
  • Do not install any tasks (no "desktop", no "base system"). We want the smallest installation possible, and add only the packages we really need. Fewer packages means fewer security issues (statistically).
  • That's it. Eject the CD-ROM, reboot, change the BIOS to only allow booting from hard drive.

Post-installation tasks

  • Insert the USB thumb drive, copy all config-files to /root and /home/uwe. Log out and log in again to make ~/.bashrc and ~/.inputrc take effect.
  • Enable the firewall: mkdir /etc/rc.boot && cp fw_laptop /etc/rc.boot && chmod 700 /etc/rc.boot/fw_laptop && sh /etc/rc.boot/fw_laptop
  • Shut down all networked daemons (if any): /etc/init.d/foo stop.
  • Tighten home-directory permissions: chmod 700 /root /home/uwe.
  • Edit /etc/passwd: give all users except for root, sync, uucp and your user account /usr/sbin/nologin as login shell. None of these accounts really needs a valid login shell (nologin will log any login attempts for those accounts).
  • Edit /etc/group: remove your user account from the dialout, cdrom, and floppy group. The groups audio, video, and plugdev can stay.
  • Edit /etc/fstab: add some mount options such as ro, nosuid, noexec, or nodev as you see fit. Example:
    /dev/mapper/vg--whole-lv--root /     ext3 defaults,errors=remount-ro      0 0
    /dev/sda2                      /boot ext3 defaults,nodev,nosuid,noexec,ro 0 0
    /dev/mapper/vg--whole-lv--home /home ext3 defaults,nodev,nosuid           0 0
    /dev/mapper/vg--whole-lv--tmp  /tmp  ext3 defaults,nodev,nosuid           0 0
    /dev/mapper/vg--whole-lv--usr  /usr  ext3 defaults,nodev,ro               0 0
    /dev/mapper/vg--whole-lv--var  /var  ext3 defaults,nodev                  0 0
    /dev/mapper/vg--whole-lv--swap none  swap sw                              0 0
    /dev/scd0 /media/cdrom iso9660 noauto,nodev,nosuid,noexec,uid=uwe,gid=uwe 0 0
  • If you have read-only (ro) file systems, configure Apt so that it can remount them read-write when installing/removing packages. Add this to /etc/apt/apt.conf:
      Pre-Invoke { "mount -o remount,rw /usr; mount -o remount,rw /boot"; }
      Post-Invoke { "mount -o remount,ro /usr; mount -o remount,ro /boot"; }
  • Fix the GRUB configuration. Replace the "password foo" line (which contains the GRUB password in plain-text) from your /boot/grub/menu.lst with a "password --md5 $1$1234567890..." line, where the MD5 hash ($1$1234567890...) can be generated with grub-md5-crypt. Additionally, add such a password line after each "title" line in the GRUB config-file, so that nobody can boot any OS installed on the laptop without a password!
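
To verify the /etc/passwd step from the list above, the accounts that still have a real login shell can be listed. This is a sketch; accounts_with_shell is a hypothetical name, and it treats any shell ending in /nologin or /false as "no login":

```shell
# Hypothetical helper: read passwd-format lines on stdin and print accounts
# that still have a login shell, other than the ones named as arguments.
accounts_with_shell() {
    awk -F: -v keep=" $* " '
        $7 !~ /\/(nologin|false)$/ && index(keep, " " $1 " ") == 0 { print $1 }
    '
}
```

Running accounts_with_shell root sync uucp uwe < /etc/passwd should print nothing once all other accounts have been switched to /usr/sbin/nologin.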

Networking, Upgrading and Apt-secure

  • Now that we have a small, hardened system, it should be reasonably safe to enable networking. Add this to /etc/network/interfaces:
    auto eth0
    iface eth0 inet dhcp
      pre-up /etc/rc.boot/fw_laptop

    Run /etc/init.d/networking restart. The firewall script will run every time the network is started.

  • Now add this (tweak as you see fit) to /etc/apt/sources.list:
    deb unstable main
    deb-src unstable main
  • Time for upgrading: apt-get update && apt-get dist-upgrade. All packages are GnuPG-signed and will be verified by Apt. The installer already ships the required key (for 2006), so everything should just work. Still, you should read about SecureApt.
  • Install the rest of your system now, and restore your data from backups.
  • Use sysv-rc-conf to disable all daemons you don't want to start per default: sysv-rc-conf foo off.
  • Install and set up Samhain (or any other file integrity checker): apt-get install samhain. You want to be notified if your system files are being tampered with (e.g. replaced by a rootkit).
  • Install and configure Tor for anonymous browsing. More details here.
  • Install and configure more security-related programs, e.g. logcheck, snort, rkhunter, chkrootkit, tiger, sxid, etc.


Now install and set up SELinux. This section is based on notes from Erich Schubert (thanks!), and will soon appear in the SELinuxSetup wiki page, too.

  • Install the base packages and an SELinux policy: apt-get install selinux-basics selinux-policy-refpolicy-targeted.
  • Edit /boot/grub/menu.lst and add selinux=1 to your kernel command line to enable SELinux upon booting.
  • In /etc/pam.d/login uncomment the "session required multiple" line. Do the same in /etc/pam.d/ssh if you have ssh installed.
  • In /etc/default/rcS set FSCKFIX=yes.
  • In /etc/init.d/ search for "Update motd" and comment the two lines below that line. Then rm /var/run/motd.
  • If you have exim installed, you must either install postfix or write an exim policy, as none currently exists. But even postfix needs some fixing (no pun intended ;-). Disable chroot-support (change all "chroot" fields to "n" in /etc/postfix/ and execute echo 'SYNC_CHROOT="n"' >> /etc/default/postfix).
  • Use check-selinux-installation to check for common SELinux problems on Debian (such as the above mentioned).
  • touch /.autorelabel. Reboot. touch /.autorelabel (again). Reboot (again).
  • Done. You should now have a working SELinux system. If no critical audit errors appear and you feel comfortable with SELinux, enable enforcing mode via setenforce 1 or by adding enforcing=1 to the kernel command line in /boot/grub/menu.lst.


  • Never leave your laptop unattended!
  • Always lock your terminal (using vlock) when you move more than 30 cm away from the laptop!
  • Don't run insecure and/or closed-source software (which you can never trust!). No NVIDIA/ATI drivers, no VMware, no Google Earth, no Flash Plugin (except for Gnash maybe), no Adobe Acrobat. You get the idea.
  • Keep the number of installed packages small and try to configure each of them as securely as possible.
  • Never enable networking or WLAN or Bluetooth if you don't absolutely have to.
  • Trust no one. Don't let other people use your laptop, don't give out shell accounts.

Further ideas

  • The /boot partition is still unencrypted, so an attacker can tamper with it. Boot from a CD-R, forbid booting from hard drive (BIOS). Sign/mark the CD-R physically, so you'll know when someone replaced your CD-R with his own, back-doored one.
  • Another idea is to use an additional USB thumb drive or CD-ROM or smartcard for two-factor authentication.
  • Install another Debian into a QEMU image. Use it as a sandbox for stuff you don't trust: qemu -snapshot -net none foo.img.
  • At all costs, disable Firewire! If possible via hardware or BIOS, or at least don't load the drivers and/or fix them (page 19).
  • Replace the proprietary, closed-source BIOS with LinuxBIOS, if possible.

That's it. You can take off that stupid tin-foil hat now.

Update 2006-09-29: Fixed typos. Mentioned sxid. Added two-factor authentication.
