Storage Hard drive problem in a RAID5 array

Discussion in 'Hardware' started by dullonien, 13 Apr 2011.

  1. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    One of the drives in my RAID5 array came up with an error today. It was working fine all day, until I restarted after installing Windows updates (including an optional update of the RAID driver). I first tried a system restore just in case the new driver had messed things up, but no luck.

    I removed the defective drive from the array and went to test it in another PC. According to SpeedFan, the S.M.A.R.T. readings are OK (if not great) at around 90% fitness, with the only worrying values being the 'Power on hours count' at 88 and the 'seek error rate' at 49. I also downloaded Seagate SeaTools to see what that would make of it. The Short Generic test passed without any problems, and it's in the middle of doing a Long Generic test.

    I have also tried formatting the drive (quick format, mind you), re-inserting it into the RAID array and rebuilding, but the drive still shows an error. I'm using four 1.5TB Seagate LP drives on the on-board RAID controller of the EVGA 680i SLI motherboard in my server. The drives get plenty of airflow, with two 120mm Gentle Typhoons blowing air over them in my Antec 300 case.

    What should I do now? The drive is still under warranty, but considering it passed the Short Generic test in SeaTools, it doesn't look like Seagate would replace it, because there doesn't seem to be anything hugely wrong (I'll wait and see if the Long Generic test picks something up).

    If there is nothing wrong with the drive, how should I go about wiping it and putting it back into the RAID array? Is it just a case of re-formatting the drive, or do I have to wipe it in some other way? I'd prefer not to have to buy a new drive to replace it. The RAID array is running with just 3 drives at the moment, and if another drive fails I'll have lost all the data, so I need this fixed as quickly as possible.

    Thanks in advance for any info anyone has.

    Edit: here's a link to the S.M.A.R.T report for the drive, if it helps.
     
    Last edited: 13 Apr 2011
  2. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    Just realised that this should be in the Tech Support sub-forum, sorry. Feel free to move it if a mod wishes.
     
  3. tehBoris

    tehBoris What's a Dremel?

    Joined:
    30 Jan 2011
    Posts:
    616
    Likes Received:
    25
    Don't rely on SMART, it's not as reliable as actually testing the drive (e.g. the OS hard disk on my computer has had 'bad' SMART status for months and is fine).

    Use a tool that will check for bad sectors or other access errors, it's really your only choice. I'd use SpinRite personally, but it's not free.
     
  4. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    Well, I've decided to use SeaTools, as that's what Seagate suggest using, and it's what they'd probably test the drive with if I sent it in for warranty replacement.

    Yeah, I know that problems in SMART don't always matter. The OS drive in my server is a 6 year old 80GB Seagate drive that's been showing that the 'spin retry count' has passed its threshold for years. It's been in multiple builds and is still going strong.

    I forgot to mention that the 'failing' drive is only 4 months old, I think. One of the 4 drives spent a year in my IcyBox NAS before that, but I think I can see signs of the tape that was holding down the temp sensor on one of the other drives. Is there any way of checking when each drive was manufactured? If it is the drive from the NAS that's failing, then that would make sense, as the power brick on my NAS was half broken for quite a while before I built this server. The power brick used to hiss, and it took between 5 and 10 attempts before it would power the NAS up. Finally the hard drives stopped spinning up and I had to power them from a spare desktop PSU to get the thing to work. That certainly could have caused some damage.
     
  5. Deders

    Deders Modder

    Joined:
    14 Nov 2010
    Posts:
    4,053
    Likes Received:
    106
    Both my drives, one of which is fairly new, give confusing readings in the SMART monitoring tool I've got. A few of the values are above the supposed threshold, but it always says the drives are fine. Might be worth looking up what the readings actually mean before scaring yourself?
     
  6. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    This isn't really about the readings from the SMART monitor. The RAID controller simply refuses to use this drive, so I've got to actually fix it or replace it to get my RAID array back working properly.

    If it passes the two tests in SeaTools, then I believe Seagate will refuse to replace it under warranty.

    I suppose the other possible reasons why the RAID controller is reporting the drive as faulty could be a dodgy SATA cable or power connector. I'll have to try replacing these when I've got a bit more time (got a crit in uni on Friday, so not got time at the min). Not an easy job in the Antec 300 with 6 drives running, as it doesn't give me much room inside the case.

    With the drive removed, the RAID array is back working with 3 drives, but if another fails I'm screwed.
     
  7. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    Right, I've done a bit more investigation here, and I think for some very weird reason that the RAID array has somehow split into two, one with 3 drives and the other with just the single drive (showing as an error). Here's a screenshot:

    [IMG]

    Am I correct in saying that they are in fact two RAID5 arrays? If I delete the top RAID 5 array with the single hard drive, then I will hopefully be able to rebuild the proper array using that hard drive. I'm just a bit scared that I've not got it right, so I don't want to delete the array until I'm 100% sure it won't affect the other one (which is still working atm).

    Why would it have done this? Very odd!
     
  8. tehBoris

    tehBoris What's a Dremel?

    Joined:
    30 Jan 2011
    Posts:
    616
    Likes Received:
    25
    It would appear that the controller has decided that the other disk was part of a different array. Delete that array, then add the disk to the existing array; that's what has to be done. Remember to have your backups at hand should it all go wrong.
     
  9. dullonien

    dullonien Master of the unfinished.

    Joined:
    22 Dec 2005
    Posts:
    1,282
    Likes Received:
    29
    Yeah, that's what it looks like. I simply haven't got any backups of this data (apart from the music); the rest is TV shows and movies. I can't afford the hard discs required to back up 3TB. That was one of the reasons to go RAID 5, since it adds a degree of redundancy, and the data won't be lost if one hard drive fails. I'll give it a go and pray nothing goes wrong. I don't see how it could affect the working array, considering it thinks they're separate. But then again, something like this shouldn't have happened in the first place. I'll refrain from upgrading the RAID drivers from now on.

    Edit: good, nothing went wrong. The array is in the process of rebuilding now. Thanks a lot.
     
  10. Pookeyhead

    Pookeyhead It's big, and it's clever.

    Joined:
    30 Jan 2004
    Posts:
    10,962
    Likes Received:
    573

    :duh:

    Someone sticky this thread... so when people argue that RAID is a backup (which they do all the time), we can just point them here.


    Ok.. one reason your RAID spat out the drive could be that you are not using enterprise grade server drives. The problem is ERC, Error Recovery Control. This feature is also called CCTL (Command Completion Time Limit) by Samsung and Hitachi and TLER (Time-Limited Error Recovery) by Western Digital. All drives suffer from errors occasionally, and the drive will re-allocate a sector once in a while. Normally this is not a problem, but home consumer drives go to greater lengths to recover and re-allocate this data, because home systems have no redundancy. This sounds like a good thing, but while the drive is busy trying to recover, it stops responding to the controller, and in a RAID environment the controller will usually drop a disk that stays unresponsive for more than a few seconds and report the RAID as degraded. This is because in a RAID environment with redundancy, the individual disk doesn't have to manage data re-allocation at all, as the host controller will just rebuild the missing data from the other disks.

    Seagate drives are a complete b4stard for re-allocating sectors. I've had brand new ones with re-allocated sectors on them! If the drive went to its usual lengths to recover data and re-allocate it, the RAID controller may well have just decided to drop it from the RAID.

    Enterprise drives do not do this... they'll just instantly pass the error to the host controller and not time out like consumer drives do.
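
    As a rough illustration of the timeout behaviour described above, here is a toy model in Python. It is only a sketch: the function names are made up for this example, and the 8 second controller timeout, 30 second recovery time and 7 second ERC limit are assumed numbers, not values for any particular drive or controller.

```python
# Toy model only: all numbers below are illustrative assumptions,
# not measured values for any real drive or controller.

def effective_stall(recovery_seconds, erc_limit_seconds=None):
    """How long the drive stays silent while retrying a bad sector.

    With an ERC/TLER/CCTL limit set, the drive gives up and reports
    the error to the controller once the limit is reached.
    """
    if erc_limit_seconds is None:
        return recovery_seconds
    return min(recovery_seconds, erc_limit_seconds)


def controller_verdict(stall_seconds, controller_timeout_seconds):
    """The controller drops any disk that stalls past its timeout."""
    if stall_seconds > controller_timeout_seconds:
        return "dropped from array"
    return "stays in array"


CONTROLLER_TIMEOUT = 8.0  # assumed controller patience, in seconds
DEEP_RECOVERY = 30.0      # assumed time a drive might spend retrying

consumer = controller_verdict(
    effective_stall(DEEP_RECOVERY), CONTROLLER_TIMEOUT
)
enterprise = controller_verdict(
    effective_stall(DEEP_RECOVERY, erc_limit_seconds=7.0), CONTROLLER_TIMEOUT
)

print("consumer drive:", consumer)
print("ERC-limited drive:", enterprise)
```

    The point of the sketch is that the same bad sector gives two different outcomes: the drive without a limit goes quiet for longer than the controller will wait, while the limited drive reports the error in time and lets the controller sort it out from the other disks.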

    This doesn't explain why you had 2 RAIDs... but it does seem to explain why you have lost an otherwise normally functioning drive in the first place, and possibly why the controller refuses to recognise the disk.

    Anyway.. looks like the problem is over... for now... so...

    Back up!

    If you have 3TB of data in one place, you really need to back it up. RAID is NOT a backup. It's redundancy, and only redundancy so long as the controller is playing nice. It may be worth offloading some data somewhere else, splitting these drives and making a smaller RAID5 and a JBOD mirror in another NAS. That way you will have a backup. Or sell something and buy more drives. Having 3TB of data on an on-board hosted RAID array with no backup would scare me.
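
    To show what "redundancy" means here, below is a minimal Python sketch of how RAID 5 style parity works on a single stripe, assuming plain XOR parity across three data blocks. The block contents and helper names are invented for illustration, and a real controller also rotates parity across the drives, which is left out.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]


def rebuild(stripe, lost_index):
    """Recompute the block on one lost drive from the surviving drives."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Losing any single drive is recoverable.
rebuilt = rebuild(stripe, 1)
print("rebuilt block matches original:", rebuilt == stripe[1])

# Losing two drives is not: XOR of the two survivors only gives the XOR
# of the two missing blocks, not either block on its own.
two_lost = xor_blocks([stripe[0], stripe[3]])
print("two-drive loss recoverable:", two_lost in (stripe[1], stripe[2]))
```

    One missing block can always be worked out from the rest, which is why the array keeps running on 3 drives. With two blocks missing there is not enough information left, and parity does nothing at all against accidental deletion or a controller that scrambles the array.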
     
