No matter what I try I just can't get my main rigs to be 100% stable. You may have seen some of my other recent threads about it, but I've just put together another smaller dedicated folding machine and that's now crashing randomly as well! The main rig runs for about 2 days and then reboots itself randomly. It's running Vista x64 on an MSI K9A2 with an AMD 7750 and 3 x 9800GX2s. Like I say, it will run fine for a couple of days and then reboot itself. Everything that could make it reboot is turned off, i.e. auto updates, screen saver, hibernate etc. There's no info in the Event Viewer besides "the recent shutdown at xx:xx was unexpected". I've stressed it with various things including IntelBurnTest, which is hardcore, and it passes fine. That one aside, the other rig I've just built runs Win XP on an Asus P5K with an E8400 and 2 x 9800GTs. Again, it was stress tested hard with no errors and will run happily, but XP will just crash randomly, and it has done so in the early hours of the morning for the last 4 days, again with no explanation in the event log. Over the last month or so it's been really annoying that I can't just leave the machines to crack on; they take constant attention and messing about to try and stop them crashing! I can't see that I'm missing anything at all. I've tried swapping hardware about, reinstalled Windows and started again a couple of times, and I'm just getting fed up with it! When they run, they run sweet, but then they just crash for what seems like no reason! Can someone who is a guru at big rigs give me a rundown of how they set up a dedicated machine from scratch? Like I say, I'm sure I'm not missing anything, but a rundown of everything will confirm I'm doing it all OK. Also, do any of you run AV on folding machines? Cheers for any help.
Not a guru in anyone's book, but here's my tuppence. I've not been able to get 4 cards to run stably on my K9A2, and I'm starting to think that it's just that some of them have better power management than others. I know it's a simple one and you've probably checked it, but automatic updates launching at 3am was causing my machine problems until I realized the second time it happened (on XP).
I'm seriously useless at this sort of stuff, but here are a couple of things to check. Are you installing from the same Windows Vista disk each time? If so, is it an original or an "acquired" copy? Have you checked that the BIOS isn't set to turn the PC off for some reason, temperature or something? You say that you've rebuilt the system. Did you use any of the original parts from the first system? It might be worth having a look at those parts first if you did. Sorry I couldn't be more help, but hopefully it might help a little.
Silly question... could it be the northbridge overheating on your mobo? I had that problem a while back on my PC, so I just took the side of the case off... no other part of my PC was overheating, just the mobo.
Where do they run? Are they in the server room? Having seen that your new rig on the P5K is crashing as well, I would be pointing the problem at a dodgy power supply (not the PSU, but the supply itself) or maybe a UPS. This sucks because you have some awesome hardware to fold with, and I know how frustrating it must be not being able to leave them to fold unattended.
Thanks for your replies people. The Vista x64 & XP I'm using are legit work VLEs, and these 2 machines are at work in our air-conned server room, so I have no issues with temperatures: CPUs (and NBs) don't go over 50 degrees and GPUs never go over 60 degrees. Unicorn - I've thought about the power source mate, but can't see why it would be an issue. We have a large UPS which conditions and surge-protects the power. We have a dedicated 100A 3-phase supply to the room and we're not even using half its capacity, plus we never ever have servers on the same supply crash and reboot. It's so strange. I'm a bit of a perfectionist over everything I build and always make sure everything is done properly, and I normally never get problems, but this is starting to drive me up the wall lol. I asked about AV as I've not bothered installing any on the folding rigs, as I didn't want the scanning to take CPU time haha. I will install AVG on one of the boxes now and make sure I've not got any virus lurking about, although the other machines on this little network have AV and nothing has been picked up. Both machines have been running all day absolutely fine and temps are excellent, but I bet when I check tomorrow the E8400 rig will have crashed in the night, and it will be a day or 2 before the 3 x GX2 rig reboots itself lol, oh what fun. Anyway, if anyone has any more suggestions chuck em my way, not sure where to go from here.......
Rob, I don't really have anything to add except, assuming that is the P5K that came from me, I've never had an unexplained reboot on it. It was never used in a dedicated folding PC, but it has been folding on the CPU, as well as serving as a testbed for GPUs - it's had a pair of GTX295s, a 9800GX2 and an 8800GTS at the same time, as well as an 8800GS. It's run for weeks at a time without any issues. Prior to that it had 3 months continuous uptime as a Linux router without a reboot. Not that me saying any of that helps, and I don't really have any suggestions.
I assume you've run StressCPU2 and the latest nVidia GPU mem checker? How much juice is your machine sucking from the wall, and what is the PSU rated at? Lastly, make sure no one's accessing your lovely GPU-laden machine and trying to play games on it - causes mine to crash every time!!
That's a good call actually Doc. I have had problems with VNC being insecure before, so what network monitoring utility (if any) do you use on the machine?
Clive - This machine is locking up, not rebooting, mate. I can't see it being the board, but it is running an overclock. I am going to set the BIOS to defaults tomorrow and run the E8400 @ stock speed and see what happens. Saying that though, I've stressed the crap out of it with Prime95 and then tried IntelBurnTest, which made it go nearly 10 degrees hotter than Prime95, but it still passed after letting it run for a few hours. This machine has a Hi-Power 800 watt PSU in it, but it's only running 2 x 9800GT cards so I'm not exactly pushing it. My main rig has a 1250W Coolermaster PSU which has 6 x 28amp 12V rails, so running 3 x GX2s still isn't pushing it as hard as it could handle; I could add a 4th GX2 as well. I'm using DameWare Mini Remote Control to get remotely onto the boxes and I've wondered if that could be responsible, but I'm not connected at 2/3/4am when they crash/reboot!
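For anyone wanting to sanity-check the PSU numbers above, here is a rough sketch in Python. It uses only the figures quoted in the post (6 rails, 28 amps each, 12V, 1250W rating). The function names, the 80% efficiency figure and the 900W wall reading are assumptions made up for illustration, not measurements from either rig.

```python
# Rough 12 V power-budget check using the figures quoted in the post.
# Function names, the efficiency value and the wall reading are
# illustrative assumptions, not measured data.

def rail_sum_watts(rails, amps_per_rail, volts=12.0):
    """Sum of the per-rail label limits."""
    return rails * amps_per_rail * volts

def usable_watts(rails, amps_per_rail, rated_watts, volts=12.0):
    """The PSU's total rating caps what the rails can supply together."""
    return min(rail_sum_watts(rails, amps_per_rail, volts), rated_watts)

def headroom_watts(wall_watts, rated_watts, efficiency=0.80):
    """Estimate spare capacity from a wall-meter reading.

    efficiency is an ASSUMED figure: DC load = wall draw * efficiency.
    """
    dc_load = wall_watts * efficiency
    return rated_watts - dc_load

if __name__ == "__main__":
    print(rail_sum_watts(6, 28))       # sum of rail labels
    print(usable_watts(6, 28, 1250))   # capped by the total rating
    print(headroom_watts(900, 1250))   # hypothetical 900 W wall reading
```

The point of the `min` is that the six rail labels add up to more than the 1250W total, so the total rating is the figure to compare the measured load against, not the sum of the rails.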
Never tried StressCPU2, but I've read everywhere that IntelBurnTest is one of the harshest, if not the harshest, on CPUs? I've run the Nvidia memtest proggy on the 2 9800GTs in the other machine for a few hours and no errors were detected, although I've not run the memtest on any of the GX2 graphics cards yet....
Sorry, got it into my head that it was randomly rebooting. Not that it makes a lot of difference. As you say, take everything back to stock, kill the overclocks for the moment and see how you go from there.
IntelBurnTest is positively evil! But StressCPU2 is based on Gromacs code, so you're actually comparing apples with apples. If I can 'pass' an overclock using StressCPU2 for a couple of hours, it won't have problems folding. Can't say the same about Prime. I remember one instance of an overnight pass with Prime - started folding the next day and got an MCE within about an hour. So I kind of trust StressCPU2 more than the other stress tests - well, at least for validating stability for folding.
Haven't read your other posts, but after trying the E8400 at stock, how about stressing the machine one GPU at a time? So one 9800 running at 100% load for 3/4 days, then try the next one, then the next one and so on. That would rule out a GPU issue, and in my experience the GPU is the most likely component to fail (after the mobo of course).
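The one-card-at-a-time idea above can be written down as a simple plan. This is a minimal sketch in Python: the card labels and the pass/fail results are hypothetical and get filled in by hand after each soak. Nothing here talks to the hardware or to any stress tool.

```python
# Sketch of a one-GPU-at-a-time soak plan. Card labels and recorded
# results are hypothetical; fill them in by hand after each run.

def soak_plan(cards, days_per_card):
    """Return (card, start_day, end_day) tuples, testing one card at a time."""
    plan = []
    start = 0
    for card in cards:
        plan.append((card, start, start + days_per_card))
        start += days_per_card
    return plan

def suspects(results):
    """Cards whose solo soak ended in a lock-up or reboot (value False)."""
    return [card for card, passed in results.items() if not passed]

if __name__ == "__main__":
    for card, start, end in soak_plan(["GT-A", "GT-B"], days_per_card=3):
        print(f"{card}: day {start} to day {end}")
    print(suspects({"GT-A": True, "GT-B": False}))
```

The total soak time is the number of cards multiplied by the days per card, so the plan gets long quickly on a rig with several cards.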
I'm HOPING that I've solved the problem on the E8400 rig. I ran IntelBurnTest for 4 hours on the E8400 overclocked to 3.6GHz and it never missed a beat, but I dropped it back to 3GHz anyway just to see, and it still locked up in the night, so it's not CPU related... So yesterday I ran FurMark on the 2 9800GTs for many hours and found that after 3-4 hours one of the cards would lock up with pixels all over the screen. I reset it and tried again, and after the same sort of time, 3 hours +, it locked up again! I took that card out and put in another 9800GT from another PC, and it's run happily all night and is still going now, so I will leave it going over the weekend. If all is OK the CPU is going back to 3.6GHz on Mon. Now I need to do the same on the 3 x 9800GX2 rig, which could take some time lol. I think I will let it finish the CPChimps challenge first though, as we are close now. It only reboots itself every few days so it's not as bad as the other rig!