Author Information

sentono

How To Replace a Failed Drive in Linux RAID

Some time back I had the distinct displeasure of waking up to a string of e-mail messages reporting that several RAID arrays on a remote system had degraded. The remote system was still running, but one of its hard drives was pretty much dead.

Upon logging in, I found that four of the six RAID devices on one mirrored pair were running in degraded mode: four partitions on /dev/sdf had failed, and the only two still working were the /boot and swap partitions. (The system runs three RAID1 mirrored pairs, for a total of six physical drives.)

Checking the SMART status of /dev/sdf showed that SMART information on the drive could not be read. It was absolutely on its last legs. Luckily, I had a spare 300GB drive with which to replace it, so swapping the drive and rebuilding the RAID devices would be easy.

Still remote, I had to mark the two operational partitions on /dev/sdf as faulty, which was done using:

# mdadm --manage /dev/md0 --fail /dev/sdf2
# mdadm --manage /dev/md1 --fail /dev/sdf3

Checking the RAID status output, I verified all of the RAID devices associated with /dev/sdf were in a failed state:

# cat /proc/mdstat

Personalities : [raid1]

md6 : active raid1 sdc1[1] sda1[0]
      312568576 blocks [2/2] [UU]
...

md0 : active raid1 sdf2[1](F) sde2[0]
      1959808 blocks [2/1] [U_]

The output above is shortened for brevity as there are eight md devices.
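With eight md devices it is easy to miss one when reading the status by eye. As a minimal sketch, the helper below (degraded_arrays is a made-up name, not part of mdadm) prints every array whose status field contains an underscore, which is how /proc/mdstat marks a missing member. It assumes the two-line-per-array layout shown above.

```shell
#!/bin/sh
# degraded_arrays [FILE]: print the name of every md array whose status
# field (for example [U_]) shows a missing member.
# FILE defaults to /proc/mdstat; the function name is hypothetical.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        /blocks/ && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Run with no argument it reads the live /proc/mdstat; passing a saved copy of the file works the same way.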

The next step was to remove /dev/sdf from all of the RAID devices:

# mdadm --manage /dev/md0 --remove /dev/sdf2
# mdadm --manage /dev/md1 --remove /dev/sdf3
# mdadm --manage /dev/md2 --remove /dev/sdf5
...
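Typing one command per array invites mistakes when a drive belongs to six of them. The sketch below only prints the mdadm commands for review and does not run them. plan_removal is a hypothetical helper, and it assumes the standard mdstat format in which a failed member is tagged (F) directly after its slot number.

```shell
#!/bin/sh
# plan_removal DISK [FILE]: print (do not run) the mdadm commands needed
# to fail and remove every partition of DISK (for example sdf).
# Members already tagged (F) only get a --remove line.
# FILE defaults to /proc/mdstat; the function name is hypothetical.
plan_removal() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)          # strip "[1]" and "(F)"
                if (part ~ ("^" disk "[0-9]+$")) {
                    if ($i !~ /\(F\)/)
                        print "mdadm --manage /dev/" $1 " --fail /dev/" part
                    print "mdadm --manage /dev/" $1 " --remove /dev/" part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

Calling plan_removal sdf lists the commands; once they look right they can be pasted into a root shell one at a time.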

Once all of the /dev/sdf devices were removed, the system could be halted and the physical drive replaced. If you do not have a drive of the exact same size, then you need to use a larger drive; if the replacement drive is smaller, rebuilding the arrays will fail.

Once the drive was replaced and the system had booted again, it was a matter of creating the same partition layout on the new drive as on the old one. Because these were all mirrored RAID1 arrays, we could use the working drive (/dev/sde) as a template:

# sfdisk -d /dev/sde | sfdisk /dev/sdf

This creates the exact same partition layout on /dev/sdf as exists on /dev/sde. Once this is done, run fdisk -l on each drive to verify the partition layout is identical. The next and final step is to add all of the new partitions to the existing RAID arrays. This is done using:
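Comparing two fdisk listings by eye works, but the check can also be scripted. This is a rough sketch under a few assumptions: the dumps come from sfdisk -d, the disks use an MBR (dos) label so the partition lines carry no per-disk identifiers, and the helper names normalize_dump and same_layout are invented for illustration.

```shell
#!/bin/sh
# normalize_dump FILE DEVICE: keep only the partition lines of an
# "sfdisk -d" dump, mask the device name, and drop all whitespace.
normalize_dump() {
    grep "^$2" "$1" | sed -e "s|^$2|DISK|" -e 's/[[:space:]]//g'
}

# same_layout DUMP_A DEVICE_A DUMP_B DEVICE_B: succeed only when both
# dumps contain partitions and describe the same starts, sizes and types.
same_layout() {
    a=$(normalize_dump "$1" "$2")
    b=$(normalize_dump "$3" "$4")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

To use it, save sfdisk -d /dev/sde and sfdisk -d /dev/sdf to two files, then call same_layout with each file followed by its device path. An exit status of 0 means the layouts match.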

# mdadm --manage /dev/md0 --add /dev/sdf2
# mdadm --manage /dev/md1 --add /dev/sdf3
# mdadm --manage /dev/md2 --add /dev/sdf5
...

As you add the new devices to the existing array, the information in the array will be properly reconstructed. Depending on the size of the partition, the re-sync could take a few minutes to a few hours. You can cat /proc/mdstat to see the progress.
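If you would rather see one line per rebuilding array than the full status dump, the percentage can be pulled out of the recovery line. This is a sketch only: rebuild_progress is a hypothetical name, and it assumes the usual mdstat progress line of the form "recovery = 8.5% (...) finish=... speed=...".

```shell
#!/bin/sh
# rebuild_progress [FILE]: print "ARRAY PERCENT" for each array that is
# currently recovering or resyncing. Arrays whose rebuild is queued but
# not yet started show no percentage and are therefore skipped.
# FILE defaults to /proc/mdstat; the function name is hypothetical.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

When it prints nothing, no array is rebuilding at that moment. For a live view of the full status, watch cat /proc/mdstat does the job without any scripting.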

With the size of drives available today, my primary concern is data integrity, and for that, nothing beats RAID1. The hardest part of replacing the drive and reconstructing the RAID arrays was figuring out which of the six drives in the system was the faulty one. The longest part was the reconstruction; it runs in the background and may make the system a little sluggish, but the system stays online and available.

The total downtime of this exercise was perhaps 20 minutes. If uptime and data integrity are important, seriously consider using RAID1. It has saved me numerous times from dying or faulty hardware and the effort required to use it is minimal.
