Hi guys, got a real bad problem here. I have an accountant client with a relatively new Dell PowerEdge 2900 server, less than a year old. It had been running great until this morning: the workstations weren't getting any real response from the server, and when he went to check it, the server itself was responding extremely sluggishly, on the verge of total lockup. I couldn't even pull up Task Manager or reboot it; I had to take it down hard with the power button. I just spent all day working on it with Dell support, and it's still pooched.

The strange part is that after a reboot it will run fine for 30 minutes or so and then sluggishly grind to a halt, which makes it very difficult to tell whether anything we're doing to fix it is actually working. I ran every test on the planet I could think of: CPU temperature monitoring while stressing all 4 Xeon cores with Prime95, Everest Corporate's stress test, Dell PowerEdge Diagnostics 2.9, and I even burned Dell's 32-bit bootable diagnostics CD, which is running overnight right now.

Details: Windows 2003 SBS, 4GB RAM, quad-core Xeon, dual 15k SAS hard drives in a RAID 1 setup, backups onto a WD MyBook, additional backups on another machine, plus we were planning to add some Mozy.com action in there. It's running Exchange, DNS, DHCP, NOD32, and not too much else. There have been no recent hardware or software changes beyond loading the latest Malicious Software Removal Tool.

Dell insists it's a software problem, and it's going to take some work to get them to dispatch someone if it's hardware (especially with all the tests coming back clean). Nothing extraordinary in the event logs either. They had me go through and grab the latest drivers and firmware for everything: updated the system BIOS to 2.5.0, the firmware on the SAS RAID card, the firmware on the hard drives themselves, the drivers for the SAS card, the drivers for the modem, etc. Nothing helped. At one point we did think we had fixed the problem.
Dell had me pull out the aftermarket PCI modem we had added, and BOOM, the system was back up to full speed, booting in 5 minutes instead of 15. I also made sure to reseat all the connections while I was in there. But then, 30 minutes later, it ground to a halt again.

I'm pretty positive it's a hardware issue, likely the motherboard or the SAS controller. On one of the bootups it took about 2 minutes to get through the first POST screen, another 2-3 minutes to light up the drives and recognize the RAID array, and then another 10 minutes to boot. If it were software, it wouldn't be having so much trouble before it even gets to the OS. Another time, after the firmware updates, it got stuck at "Applying Computer Settings" for 5-10 minutes.

I'm exhausted and gave up on it for the night. Those tests will take a few hours to run anyway, so I told them I'd be back at noon tomorrow. He's called the employees who were supposed to be working tomorrow and told them not to come in. It's the worst possible time of the year for this to happen, and preventing exactly this kind of outage is why we replaced the server last year. Dell says they won't be able to get someone out till Tuesday, since he has Next Business Day support and they can't enter the request until Monday. Ugh.

Any suggestions on other ways to troubleshoot this? Thanks.
Maybe it's bad RAM? I have no idea how FB-DIMMs work, but it could be that the RAM is bad yet reports no errors, because the error correction in the memory just keeps retrying until it gets it right.
Seems like you have covered a lot of things. Going back to temps: how hot is the RAID controller running? Does it burn your finger around the point the system starts to slow down? It could be flying at times simply because it's been off and has cooled down for a while.
Tried all the excellent suggestions here, such as unplugging all the external USB devices and stopping all backup services, and I removed the cover and blew a strong fan at it (plus the built-in fans go into super-high noisy mode with the cover off). And 30 minutes later, it crashed.

What led up to this crash was that one of the employees couldn't access an accounting/payroll program they use called CFS. So I decided to back that directory up on the server so we could reinstall, and boom, halfway through copying the files it locked up. I thought for sure we'd found the problem: corruption on the hard drive causing the drive to re-read constantly to correct the errors. I did a chkdsk /r and waited a couple of hours while it checked every single sector on the drive, and it found no problems at all. I also suspect the Dell 32-bit utility CD that I booted off of and ran all last night flogged every sector of the drive too, and no errors there either.

So we still don't honestly know what the problem is, but we were able to rename the CFS directory and then reinstall, and everything appears stable for now. It's been up and running perfectly for about 3 hours, so we're just going to pray it stays that way. Now that we think about it, the crashes always happened when we tried to get into CFS or get at that data. Either way, we're hoping the problem is solved!
I'm not sure what Dell uses in place of the BMC that Intel uses in its 5000-series chipset servers, but that would be the thing to check. It should log thermal events, fan issues, power problems, etc. Have you set the BIOS to display POST messages instead of a logo, to see if the BIOS reports anything? Does Dell have an error manager or event viewer in its BIOS? The fact that it takes the amount of time you mention makes it sound like it's either thermal or software. It could be that a service has a memory leak or some other bad behavior. You'll have to check the Windows event logs to see if there are any problems there.
Have you tried booting it up with a Linux rescue CD and leaving it for 30 mins or so and seeing whether the problem persists? If it still happens at least that's some pretty strong evidence to shove in Dell's face that it is the hardware.
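To make the rescue-CD test concrete, here's a minimal Python sketch you could run from the live environment: it samples the 1-minute load average at intervals, so if the box bogs down while idling with no Windows software in the picture, you'll see it in the log. The sample count and interval are made-up knobs, and this relies on os.getloadavg(), which is available on Linux but not Windows:

```python
import os
import time

def sample_load(samples=60, interval=30.0):
    """Return a list of (elapsed_seconds, 1-minute load average) tuples,
    one per sample, taken interval seconds apart."""
    readings = []
    start = time.monotonic()
    for _ in range(samples):
        load1, _, _ = os.getloadavg()  # 1-, 5-, 15-minute averages
        readings.append((round(time.monotonic() - start, 1), load1))
        time.sleep(interval)
    return readings

if __name__ == "__main__":
    # Sample every 30 s for half an hour; an idle box should stay near 0.
    for elapsed, load in sample_load(samples=60, interval=30.0):
        print(f"t={elapsed:7.1f}s  load={load:.2f}")
```

If the load stays flat for an hour under Linux but the machine still physically stalls (slow POST, slow RAID init), that's hardware; if Linux idles happily forever, the finger points back at the Windows side.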