Rebuild software RAID1 after disk failure – CentOS

If you run a server 24/7, there’s a good chance that one of its hard drives will fail after a few years.
When that happens, you’ll be glad to have your data stored on RAID, but you probably won’t know how to repair it.
Here’s how I did it:

1and1 dedicated server example

rescue:~# cat /proc/mdstat
md1 : active raid1 sda1[2] sdb1[1]
      4194240 blocks [2/1] [_U]
md3 : active raid1 sda3[2] sdb3[1]
      1947222016 blocks [2/1] [_U]
unused devices: <none>

The important part is [2/1] [_U]: the array expects two members but only one is active, and the underscore marks the missing one.
The correct status is [2/2] [UU], and that’s what we need to see at the end of this article.
To correct this, we need to:

  1. Find which drive is the new one
  2. Recreate partition table on it
  3. Put it into the raid
  4. Wait for the RAID to rebuild (could take several hours, depending on disk size)
  5. Add GRUB
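
Before starting on those steps, the [2/1] [_U] check from above can be scripted. This is only a minimal sketch: the function name degraded_arrays is my own, and it assumes the status line ends with the [UU]-style field, as it does in the output shown here.

```shell
# List md arrays whose status shows a missing member, e.g. [_U].
# Reads /proc/mdstat by default; a file path can be passed instead.
degraded_arrays() {
  awk '
    /^md[0-9]+ *:/ { name = $1 }                      # remember the current array
    $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }   # "_" marks a missing member
  ' "${1:-/proc/mdstat}"
}
```

Running degraded_arrays with no arguments on the rescue system should print one array name per line, or nothing if every array is complete.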

Find the broken drive

When 1and1 told me they had replaced a drive, I didn’t know which one.

rescue:~# lsblk 
sda               8:0    0  1.8T  0 disk  
sdb               8:16   0  1.8T  0 disk  
|-sdb1            8:17   0    4G  0 part  
| `-md1           9:1    0    4G  0 raid1 /mnt/md1
|-sdb2            8:18   0    2G  0 part  
`-sdb3            8:19   0  1.8T  0 part  
  `-md3           9:3    0  1.8T  0 raid1 
    |-vg00-usr  253:0    0    4G  0 lvm   
    |-vg00-var  253:1    0  904G  0 lvm   
    `-vg00-home 253:2    0    4G  0 lvm   

This output tells us that it’s sda that has been replaced. It has no partition table and therefore no data on it. This is very important to know.
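
If you would rather not rely on reading the tree by eye, the same conclusion can be drawn with a small filter. This is a sketch under assumptions: blank_disks is a name I made up, and it expects an lsblk that supports the PKNAME column (older rescue images may not).

```shell
# Print disks that have no partitions, given "NAME TYPE PKNAME" rows on stdin.
blank_disks() {
  awk '
    $2 == "disk" { disks[$1] = 1 }      # every whole disk
    $2 == "part" { parent[$3] = 1 }     # disks that own at least one partition
    END { for (d in disks) if (!(d in parent)) print d }
  ' | sort
}

# On a live system:
#   lsblk -rn -o NAME,TYPE,PKNAME | blank_disks
```

With the layout shown above this should print only sda. If it prints nothing, or more than one disk, stop and check by hand before going any further.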

Re-create partition table

sfdisk -d /dev/sdb | sfdisk /dev/sda

This recreates exactly the same partition table on /dev/sda as the one currently on /dev/sdb.
That’s why it was important to find out which disk was replaced: if you do this the other way around, you lose all your data.

In general:

sfdisk -d /path/to/working/disk | sfdisk /path/to/new/disk
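
Because the direction matters so much, it can help to wrap the copy in a function that keeps a dump of the good table and supports a dry run. This is a hedged sketch, not what I actually ran: copy_partition_table, the DRY_RUN variable and the /tmp backup path are all my own inventions.

```shell
# Copy the partition table from a working disk to a new one.
# Usage: copy_partition_table /dev/WORKING /dev/NEW
# With DRY_RUN=1 the commands are only printed, not executed.
copy_partition_table() {
  src=$1
  dst=$2
  if [ -z "$src" ] || [ -z "$dst" ]; then
    echo "usage: copy_partition_table WORKING_DISK NEW_DISK" >&2
    return 2
  fi
  if [ "$src" = "$dst" ]; then
    echo "source and target are the same disk" >&2
    return 1
  fi
  backup="/tmp/$(basename "$src").sfdisk"   # hypothetical backup location
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sfdisk -d $src > $backup"
    echo "sfdisk $dst < $backup"
    return 0
  fi
  sfdisk -d "$src" > "$backup" || return 1  # keep a dump of the good table
  sfdisk "$dst" < "$backup"                 # write it to the new disk
}
```

Run it first as DRY_RUN=1 copy_partition_table /dev/sdb /dev/sda and read the two printed commands out loud before running it for real.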

Extend RAID

The following part is a bit more involved. From the cat /proc/mdstat output above, you can see that I have 2 RAID arrays:
md1 : active raid1 sda1[2] sdb1[1]
md3 : active raid1 sda3[2] sdb3[1]

Array md1 is made of partition sda1 from the 1st drive (/dev/sda) and partition sdb1 from the 2nd drive (/dev/sdb).
The same goes for md3, which is made of 2 identical partitions, sda3 and sdb3.

And since it’s /dev/sda that needs to be added back, these are the commands:

mdadm --manage /dev/md1 --add /dev/sda1
mdadm --manage /dev/md3 --add /dev/sda3

You can read this as “add the /dev/sda1 partition into the /dev/md1 array and add the /dev/sda3 partition into the /dev/md3 array”.
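
With more arrays than two, typing these by hand gets error-prone. The sketch below derives the commands from mdstat and only prints them; nothing is executed. The function name add_commands is hypothetical, and it assumes plain sdX-style disk names where the partition number directly follows the disk name.

```shell
# Print the mdadm --add command for every array that contains a partition
# of the healthy disk. mdstat text is read from stdin.
# Usage: add_commands HEALTHY_DISK NEW_DISK < /proc/mdstat
add_commands() {
  awk -v good="$1" -v new="$2" '
    /^md[0-9]+ *:/ {
      for (i = 3; i <= NF; i++) {
        member = $i
        sub(/\[.*/, "", member)                      # sdb1[1] -> sdb1
        if (index(member, good) == 1) {
          part = substr(member, length(good) + 1)    # sdb1 -> 1
          if (part ~ /^[0-9]+$/)
            printf "mdadm --manage /dev/%s --add /dev/%s%s\n", $1, new, part
        }
      }
    }
  '
}
```

For my layout, add_commands sdb sda < /proc/mdstat should print the same two commands shown above, which you can then review and paste.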

Wait for it..

Now if you look at mdstat again, you should see the array being rebuilt. Basically, data from the /dev/sdb drive is being copied to the /dev/sda drive (which is the point of RAID1).

rescue:~# cat /proc/mdstat
md1 : active raid1 sda1[2] sdb1[1]
      4194240 blocks [2/1] [_U]
      [=====>...............]  recovery = 25.0% (1050112/4194240) finish=0.4min speed=116679K/sec
md3 : active raid1 sda3[2] sdb3[1]
      1947222016 blocks [2/1] [_U]
unused devices: <none>
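
Instead of re-running cat every few minutes, a small loop can wait until no array shows an underscore any more. This is a minimal sketch: wait_for_rebuild and its arguments are my own naming, and the 60 second default interval is arbitrary.

```shell
# Block until no array in mdstat shows a missing member ("_").
# Usage: wait_for_rebuild [MDSTAT_FILE] [SECONDS_BETWEEN_CHECKS]
wait_for_rebuild() {
  file=${1:-/proc/mdstat}
  interval=${2:-60}
  # A status such as [_U] or [U_] means the array is still degraded.
  while grep -Eq '\[[U_]*_[U_]*\]' "$file"; do
    grep -E 'recovery|resync' "$file" || true   # show progress lines, if any
    sleep "$interval"
  done
  echo "all arrays in sync"
}
```

Once it prints "all arrays in sync", cat /proc/mdstat should show [2/2] [UU] for every array.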


Add GRUB

This step is optional; it depends on where GRUB was installed, if it was installed at all.
The best way to find out is to simply restart the machine and see if CentOS boots. If it does, you’re done.
If it doesn’t, you need to tell the server where to look for the operating system on its hard drives.
This was my case: GRUB was installed only on /dev/sda, so after the replacement I ended up with none.

You need to mount the drive with the operating system (CentOS in my case, located on the smaller array, /dev/md1):

rescue:~# mount /dev/md1 /mnt
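
Depending on the rescue system, the chroot in the next step may not see /dev/sda and /dev/sdb unless /dev, /proc and /sys are bind-mounted into it first. I can't say for sure whether your rescue image needs this, so treat the following as an assumption-laden sketch; prepare_chroot and DRY_RUN are names I made up.

```shell
# Bind-mount the rescue system's /dev, /proc and /sys into the target root.
# Usage: prepare_chroot [TARGET]   (TARGET defaults to /mnt)
# With DRY_RUN=1 the mount commands are only printed.
prepare_chroot() {
  target=${1:-/mnt}
  for fs in dev proc sys; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "mount --bind /$fs $target/$fs"
    else
      mount --bind "/$fs" "$target/$fs" || return 1
    fi
  done
}
```

If grub later complains that it cannot find /dev/sda inside the chroot, missing bind mounts are the first thing I would check.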

Chroot into it. That way you will be making changes directly to the CentOS installation and not to the rescue Linux you are currently running.

rescue:~# chroot /mnt

And finally, install GRUB on both drives (that way, even if /dev/sda fails again, we will still be able to boot from /dev/sdb).

rescue:~# grub
grub> device (hd0) /dev/sda 
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd1) /dev/sdb
grub> root (hd1,0)
grub> setup (hd1)
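
The same session can be generated instead of typed. The sketch below only prints the GRUB commands for the disks you pass in. It assumes, as in my case, that the boot files live on the first partition of each disk, and that you are using GRUB legacy, whose shell accepts commands on stdin in batch mode. grub_commands is a hypothetical helper name.

```shell
# Print GRUB legacy commands that install the boot loader on each given disk.
# Assumes the boot files are on the first partition (index 0) of every disk.
grub_commands() {
  n=0
  for disk in "$@"; do
    printf 'device (hd%d) %s\nroot (hd%d,0)\nsetup (hd%d)\n' "$n" "$disk" "$n" "$n"
    n=$((n + 1))
  done
  echo quit
}

# Inside the chroot, after reviewing the output:
#   grub_commands /dev/sda /dev/sdb | grub --batch
```

Printing first and piping second keeps the step reviewable, which matters when the disk names are the one thing you must not get wrong.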

Don’t worry about data loss: if you get the drive names and paths right (2nd step), there really isn’t anything to break.