Other Pooched Dell Server...

Discussion in 'Hardware' started by Dozer42, 22 Mar 2009.

  1. Dozer42

    Dozer42 What's a Dremel?

    Joined:
    22 Feb 2009
    Posts:
    29
    Likes Received:
    2
    Hi guys,

    Got a real bad problem here. I have an accountant with a relatively
    new Dell PowerEdge 2900 server, less than a year old.

    It's been running great until this morning.

    Today they came in and the workstations weren't getting any real
    response from the server. He went to check the server and it was
    responding extremely sluggishly, on the verge of total lockup. It
    couldn't even pull task manager up, couldn't reboot, had to take it
    down hard with the power button.

    I just spent all day working on it with Dell support, and it's still
    pooched.

    The strange part is, after a reboot it will work fine for 30 minutes
    or so, and then sluggishly grind to a halt, making it very difficult
    to figure out if what we're doing to fix it is working or not.

    I ran every test on the planet that I could think of: CPU temp
    monitors while stressing all 4 Xeon cores with Prime95, Everest
    Corporate's stress test, Dell PowerEdge Diagnostics 2.9, and I even
    burned their 32-bit bootable diag CD, which is running overnight now.

    Details: Windows 2003 SBS, 4GB RAM, Quad Core Xeon, dual SAS 15k hard
    drives in a RAID 1 setup, backups onto a WD My Book, additional backups
    on another machine, plus we were planning to add some Mozy.com action
    in there. It's running Exchange Server, DNS, DHCP, NOD32, and not too
    much else. There have been no recent hardware/software changes beyond
    loading the latest Malicious Software Removal Tool.

    Dell insists it's a software problem and it's going to take some work
    to get them to dispatch someone if it's hardware (especially with all
    the tests coming back clean). Nothing extraordinary in the event logs
    either.

    They had me go through and grab all the latest drivers and firmware
    for everything. Updated the computer's BIOS to 2.5.0, the firmware on
    the SAS RAID card, firmware on the hard drives themselves, drivers for
    the SAS card, drivers for the modem, etc. Nothing helped.

    We did, at one point, think that we had fixed the problem. Dell had me
    pull out the aftermarket PCI modem we had added, and BOOM, the system
    was back up to full speed. Booting in 5 mins instead of 15 minutes,
    etc. I also made sure to reseat all the connections while I was in
    there. But then, 30 minutes later, it ground to a halt again. :(

    I'm pretty positive it's a hardware issue, and likely the motherboard
    or SAS controller. On one of the bootups it took about 2 minutes to
    get through the first POST screen, and another 2-3 minutes to light up
    the drives and recognize the RAID array, then it took another 10
    minutes to boot. If it were software, it wouldn't be having so much
    trouble before it even gets to the OS. Another time after firmware
    updates it got stuck at Applying Computer Settings for 5-10 minutes.

    I'm exhausted and gave up on it for the night. Those tests will take a
    few hours to run anyway, so I told them I'd be back at noon tomorrow.
    He's called his employees that were supposed to be working tomorrow
    and told them not to come in. Worst possible time of the year for this
    to happen, and this is why we replaced the server last year to prevent
    things like this. Dell says they wouldn't be able to get someone out
    till Tuesday, as he has 'Next Business Day' support, and they can't
    enter the request until Monday. Ugh.

    Any suggestions on other ways to troubleshoot this?

    Thanks.
     
  2. mm vr

    mm vr The cheesecake is a lie

    Joined:
    18 Nov 2007
    Posts:
    2,968
    Likes Received:
    84
    Maybe it's bad RAM? I have no idea how FB-DIMMs work, but it could be that the RAM is bad and gives no errors because the memory's error correction just keeps retrying until it gets it right.
     
  3. Burnout21

    Burnout21 Mmmm biscuits

    Joined:
    9 Sep 2005
    Posts:
    8,616
    Likes Received:
    197
    Seems like you have covered a lot of things.

    Going back to temps, how hot is the RAID controller running? Does it burn your finger around the point it starts to slow down?

    It could be flying at times because it's been off and has cooled down for a period of time.
     
  4. Dozer42

    Dozer42 What's a Dremel?

    Joined:
    22 Feb 2009
    Posts:
    29
    Likes Received:
    2
    Tried all the excellent suggestions here, such as unplugging all external USB devices and stopping all backup services. I also removed the cover and blew a strong fan at it (plus the built-in fans go into super-high noisy mode with the cover off).

    And 30 minutes later, it crashed.

    What led up to this was one of the people couldn't access an accounting/payroll program they use called CFS. So, I decided to back that directory up on the server so we could reinstall.

    Boom, halfway through copying the files it locked up.

    I thought for sure we'd found the problem, a corruption on the hard drive causing the drive to re-read constantly to correct the errors.

    Did a chkdsk /r and waited a couple of hours; it checked every single sector on the hard drive and found no problems at all. Plus I suspect the Dell 32-bit utility CD that I booted off of and ran all last night also flogged every sector of the hard drive, with no errors.

    So, we still don't honestly know what the problem is, but we were able to rename the CFS directory and then reinstall, and everything appears stable for now. It was up and running perfectly for about 3 hours, so we're just going to pray it stays that way.

    Now that we think about it, the crashes always happened when we tried to get into CFS or get at that data. Either way, we're hoping the problem is solved!
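
One way to narrow down a lockup like the one seen while copying the CFS directory is to read every file in the suspect directory and time each read, so that a file that stalls or fails stands out. The following is only a minimal sketch, assuming a Python 3 interpreter is available on the server or on a workstation that can reach the directory. The function names and the script name in the usage note are hypothetical, not part of any Dell or CFS tooling.

```python
import os
import sys
import time


def time_file_reads(root, chunk_size=1024 * 1024):
    """Read every file under `root` and time it.

    Returns a list of (seconds, bytes_read, path, error) tuples,
    slowest first. `error` is None when the read succeeded.
    """
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            total = 0
            error = None
            start = time.perf_counter()
            try:
                with open(path, "rb") as fh:
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)
            except OSError as exc:
                # Record the failure instead of aborting the whole scan.
                error = str(exc)
            elapsed = time.perf_counter() - start
            results.append((elapsed, total, path, error))
    # Slowest reads first, since those are the interesting ones.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # The directory to scan is passed on the command line.
    for seconds, size, path, error in time_file_reads(sys.argv[1])[:10]:
        status = error if error else "ok"
        print(f"{seconds:8.3f}s {size:>12} bytes  {status}  {path}")
```

Run as, for example, `python time_reads.py <directory>`. It prints the ten slowest files. If the machine locks up partway through, this sketch will not tell you which file it was on; printing each path before reading it would be a simple extension.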
     
  5. Splynncryth

    Splynncryth 0x665E3FF6,0x46CC,...

    Joined:
    31 Dec 2002
    Posts:
    1,510
    Likes Received:
    18
    I'm not sure what Dell uses in place of the BMC Intel uses in their 5000 series chipset servers, but that would be the thing to check. It should log thermal events, fan issues, power problems, etc.

    Have you set the BIOS to display POST messages instead of a logo to see if the BIOS reports anything? Does Dell have an error manager or event viewer in their BIOS?

    The fact that it takes the amount of time you mention makes it sound like it is either thermal or software. It could be that a service has a memory leak, or some other bad behavior. You will have to check the Windows event logs to see if there are any problems there.
     
  6. trigger

    trigger Procrastinator

    Joined:
    22 Mar 2004
    Posts:
    1,106
    Likes Received:
    37
    Have you tried booting it up with a Linux rescue CD and leaving it for 30 mins or so and seeing whether the problem persists? If it still happens at least that's some pretty strong evidence to shove in Dell's face that it is the hardware.
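
To turn a "leave it for 30 minutes and see" test into something you can show to support, it helps to log disk latency at regular intervals rather than judging by feel. The sketch below assumes Python 3 is available in whatever environment is booted (the rescue CD or the installed OS). The function names, interval, and sample count are illustrative choices only. Note that the read-back will usually be served from the OS cache, so the write-plus-fsync timing is the more meaningful number.

```python
import os
import sys
import tempfile
import time


def probe_once(directory, size=4 * 1024 * 1024):
    """Write a scratch file, force it to disk, read it back.

    Returns (write_seconds, read_seconds). The scratch file is
    always removed afterwards.
    """
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the data through to the disk
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as fh:
            data = fh.read()
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    if data != payload:
        raise IOError("read-back data does not match what was written")
    return write_s, read_s


def run_probe(directory, interval_s=60, count=45):
    """Take `count` samples, `interval_s` apart; return the samples."""
    samples = []
    for i in range(count):
        write_s, read_s = probe_once(directory)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        samples.append((stamp, write_s, read_s))
        print(f"{stamp}  write={write_s:.3f}s  read={read_s:.3f}s", flush=True)
        if i < count - 1:
            time.sleep(interval_s)
    return samples


if __name__ == "__main__" and len(sys.argv) > 1:
    # Directory on the disk under test is passed on the command line.
    run_probe(sys.argv[1])
```

With the defaults this samples once a minute for about 45 minutes, which covers the roughly 30-minute window described in the first post. A write time that climbs steadily before the stall would be a concrete data point either way.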
     
