At work we have been using Copperegg to monitor EC2 instances on AWS. It was fine up until last week, when an alert didn't trigger even though the CPU load had been above 80% for over 5 minutes. The alert should have triggered after 1 minute. The support team weren't helpful at all: they agreed that it should have triggered but said there wasn't a bug, and that was it. Unfortunately this is unacceptable, as I can no longer trust it to be reliable. I could deal with the phantom alerts that were triggered for no reason, but missing alerts are a deal breaker. So, I'm looking for similar alternatives that can do the same job and also integrate with Pager Duty. I've been looking at New Relic and Datadog, but neither is quite up to standard. Both need at least 5 minutes of server downtime before they alert me. I need this to be 1 minute, like I get with Copperegg. So, what's out there for me to try?
What hardware do you run? I switched to Dell R805s, and Dell's own DRAC seems to do monitoring/alerts just fine.
OP says EC2 instances on AWS. That's Amazon's cloud computing service, so there's no direct access to the underlying hardware; it's all virtual.
I use Icinga, a fork of Nagios. I'm not using it for CPU levels; it's mainly used for checking services like HTTP/SSH/Ping etc. There are more things you can do with it, but our network has SMTP locked out, so we are limited in what we can monitor. Must speak to the Comms team about that. I found it easy to set up, and it is very customisable. I've not tried setting alerts as low as 1 minute, as sometimes our mail server takes longer than that to process the alert. Interested to see what else there is out there. Best of luck.
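For anyone comparing tools, the rule the OP describes (alert once CPU has stayed above 80% for 1 minute) is simple enough to check independently of any vendor. Below is only a rough sketch of that rule in Python, not how Copperegg, New Relic, Datadog or Icinga implement it. The class name `SustainedThresholdAlert` and its parameters are made up for illustration, and collecting the CPU samples and notifying Pager Duty are left out entirely.

```python
from typing import Optional


class SustainedThresholdAlert:
    """Hypothetical helper: fires once a metric has stayed above
    `threshold` for at least `hold_seconds` without dropping back."""

    def __init__(self, threshold: float = 80.0, hold_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        # Timestamp of the first sample in the current breach, if any.
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, cpu_percent: float) -> bool:
        """Record one sample; return True if the alert should be firing."""
        if cpu_percent <= self.threshold:
            # Load dropped back to or below the threshold: reset the timer.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.hold_seconds


# Example: samples every 15 seconds at 90% CPU.
alert = SustainedThresholdAlert(threshold=80.0, hold_seconds=60.0)
results = [alert.observe(t, 90.0) for t in (0, 15, 30, 45, 60)]
```

With samples like these, the check only starts returning True once a full 60 seconds of continuous breach has been seen, and a single sample at or below 80% resets the timer. The sampling interval therefore matters: a tool that only polls every 5 minutes cannot honour a 1 minute rule.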