If you are running a server 24/7, there’s a good chance that after a few years one of its hard drives will fail.
When that happens, you’ll be glad to have your data stored in RAID, but you probably won’t know how to repair it.
Here’s how I did it:
1and1 dedicated server example
rescue:~# cat /proc/mdstat
md1 : active raid1 sda1[2] sdb1[1]
      4194240 blocks [2/1] [_U]

md3 : active raid1 sda3[2] sdb3[1]
      1947222016 blocks [2/1] [_U]

unused devices: <none>
The important part is [2/1] [_U] – this tells us that one of the two drives is no longer in the array.
The correct status is [2/2] [UU], and that’s what we need to see by the end of this article.
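If you want a more detailed view than /proc/mdstat gives, mdadm itself can describe each array (a quick cross-check, assuming mdadm is available in the rescue system, which it normally is):

# Detailed per-array status; a degraded mirror typically reports
# "State : clean, degraded" and lists the missing member as "removed"
mdadm --detail /dev/md1
mdadm --detail /dev/md3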
To correct this, we need to:
- Find which drive is the new one
- Recreate the partition table on it
- Put it back into the RAID
- Wait for the RAID to rebuild (this can take several hours, depending on disk size)
- Add GRUB
Find the broken drive
When 1and1 told me they had replaced a drive, I didn’t know which one.
rescue:~# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda               8:0    0  1.8T  0 disk
sdb               8:16   0  1.8T  0 disk
|-sdb1            8:17   0    4G  0 part
| `-md1           9:1    0    4G  0 raid1 /mnt/md1
|-sdb2            8:18   0    2G  0 part
`-sdb3            8:19   0  1.8T  0 part
  `-md3           9:3    0  1.8T  0 raid1
    |-vg00-usr  253:0    0    4G  0 lvm
    |-vg00-var  253:1    0  904G  0 lvm
    `-vg00-home 253:2    0    4G  0 lvm
This output tells us that it’s sda that has been replaced: it has no partition table and therefore no data on it. This is very important to know for the next step.
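If you want to be extra sure before touching anything, you can cross-check that the suspected new disk really is empty (an optional sanity check, not something 1and1 asks for):

# The replaced disk should show no partitions at all
fdisk -l /dev/sda
# The surviving disk should still list sdb1, sdb2 and sdb3
fdisk -l /dev/sdb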
Re-create partition table
sfdisk -d /dev/sdb | sfdisk /dev/sda
This will recreate exactly the same partition table on /dev/sda as we currently have on /dev/sdb.
That’s why it was important to find out which disk failed – if you run this the other way around, you will lose all your data.
sfdisk -d /path/to/working/disk | sfdisk /path/to/new/disk
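After the copy it’s worth checking that both disks now carry the same layout. A minimal sanity check could look like this (just compare the two dumps by eye – sizes and types should match, only the device names differ):

# Dump both partition tables for comparison
sfdisk -d /dev/sda
sfdisk -d /dev/sdb
# lsblk should now show sda1, sda2 and sda3 mirroring sdb
lsblk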
Extend RAID
The following part is a bit more involved. Looking at the cat /proc/mdstat output above, you’ll see that I have two RAID arrays:
md1 : active raid1 sda1[2] sdb1[1]
md3 : active raid1 sda3[2] sdb3[1]
Array md1 is made of partition sda1 from the 1st drive (/dev/sda) and partition sdb1 from the 2nd drive (/dev/sdb).
The same goes for md3, which is made of the two matching partitions, sda3 and sdb3.
And since it’s /dev/sda that needs to be added back in, these are the commands:
mdadm --manage /dev/md1 --add /dev/sda1
mdadm --manage /dev/md3 --add /dev/sda3
You can read this as “add the /dev/sda1 partition into the /dev/md1 array, and add the /dev/sda3 partition into the /dev/md3 array”.
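In generic form, for anyone whose array or partition names differ from mine (replace the placeholders with whatever your own /proc/mdstat shows):

# Add a partition from the new disk back into the matching array
mdadm --manage /dev/mdX --add /dev/sdYZ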
Wait for it..
Now if you look at mdstat again, you should see that the arrays are being rebuilt – basically, data from the /dev/sdb drive is being copied to the /dev/sda drive (which is the whole point of RAID1).
rescue:~# cat /proc/mdstat
md1 : active raid1 sda1[2] sdb1[1]
      4194240 blocks [2/1] [_U]
      [=====>...............]  recovery = 25.0% (1050112/4194240) finish=0.4min speed=116679K/sec

md3 : active raid1 sda3[2] sdb3[1]
      1947222016 blocks [2/1] [_U]
        resync=DELAYED

unused devices: <none>
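Instead of re-running cat /proc/mdstat by hand, you can keep an eye on the rebuild with watch (assuming it’s installed in the rescue system):

# Refresh the RAID status every 10 seconds until recovery reaches 100%
watch -n 10 cat /proc/mdstat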
GRUB
This step is optional – it really depends on where you had GRUB installed, if you had it installed at all.
The best way to find out is to simply restart the machine and see if CentOS boots. If it does, you’re done.
If it doesn’t, you need to tell the server where to look for the operating system on its hard drives.
That was my case – GRUB was installed on /dev/sda, so after the disk swap I was left with no bootloader at all.
You need to mount the drive with the operating system (CentOS, in my case located on the smaller array, /dev/md1):
rescue:~# mount /dev/md1 /mnt
Chroot into it – that way you will be making changes directly to the CentOS installation and not to the rescue Linux you are currently running.
rescue:~# chroot /mnt
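Depending on the rescue system, grub may complain about missing device nodes inside the chroot. If that happens, exit the chroot, bind-mount the kernel’s /dev (and optionally /proc and /sys) into it, then chroot again – an extra step my rescue system happened not to need:

# Run these from the rescue system, outside the chroot
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt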
And finally, install GRUB on both drives (that way, even if /dev/sda fails again, we will still be able to boot from /dev/sdb).
rescue:~# grub

grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)

grub> device (hd1) /dev/sdb
grub> root (hd1,0)
grub> setup (hd1)
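As a side note, newer CentOS releases (7 and up) ship GRUB 2 instead of the legacy grub shell used above. On such a system the equivalent step would roughly be the following – an assumption based on the standard GRUB 2 tooling, not something I tested on this particular server:

# GRUB 2 variant: install the bootloader onto both drives
grub2-install /dev/sda
grub2-install /dev/sdb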
Don’t worry about data loss – as long as you get the drive names and paths right (see the 2nd step), there really isn’t anything to break.