Ultron - CloudWatch Slack bot

Ultron is quite well known for its dedication to conquest and extermination, so there was no better name for our Slack bot, whose job is to find the AWS EC2 instances behind an ELB that are not performing well, with CPU utilisation either much lower or much higher than that of their peers.

AWS CloudWatch is great at providing detailed monitoring, down to one-minute granularity, for ELBs, EC2 instances and the other AWS services. We at Wego use it extensively to monitor our servers' performance, and have alerts configured that post to Slack and send out e-mails informing us about any issue. Though it is of great help, it lacked something we needed, which prompted us to hack together our own solution using the AWS SDK, with Slack for notifications.

To give you some background on what we needed that CloudWatch lacked: we have multiple EC2 instances running behind an ELB. Even though we can configure alerts to trigger if CPU utilisation goes beyond a threshold, CloudWatch doesn't let us know if one or more of our servers deviates from the others by a given amount. If we set a narrow absolute threshold to catch such servers, it would cause a lot of false positives and trigger unnecessary notifications during peak or off-peak hours. At times an instance behind an ELB gets a request that takes longer and consumes more CPU than the requests on other instances, or it might have a garbage collection issue causing memory locks.

This gave us a good opportunity to write our own solution and hook it up to a Slack bot for notifications, considering bots are the most happening thing nowadays. Every M minutes, our solution measures the difference in CPU utilisation between all instances on an ELB over a period of N minutes (currently N=5 and M=2 in our setup). An alert is raised if the usage is beyond a predefined threshold for 80% of those N minutes. This helped us identify not only the instances that were having problems but also the services that were not stable and were having spikes in their usage.
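To make the rule above concrete, here is a minimal sketch in Python. It is not Ultron's actual code. It assumes the per-minute CPU samples have already been fetched into a plain dict, and it assumes "difference" means the distance from the per-minute median across all instances. The function name, the 20-point threshold and the sample data are all made up for illustration.

```python
from statistics import median


def find_outliers(cpu_by_instance, threshold=20.0, min_fraction=0.8):
    """Return instance ids whose CPU deviates from the group for most of the window.

    cpu_by_instance maps an instance id to its per-minute CPU percentages
    (N samples, oldest first). The baseline for each minute is the median
    across all instances, which is an assumption of this sketch.
    """
    window = min((len(s) for s in cpu_by_instance.values()), default=0)
    if window == 0 or len(cpu_by_instance) < 2:
        return []  # nothing to compare against

    breaches = dict.fromkeys(cpu_by_instance, 0)
    for minute in range(window):
        baseline = median(s[minute] for s in cpu_by_instance.values())
        for instance_id, samples in cpu_by_instance.items():
            if abs(samples[minute] - baseline) > threshold:
                breaches[instance_id] += 1

    # Flag an instance only if it was out of line for at least 80% of the minutes.
    return [i for i, count in breaches.items() if count / window >= min_fraction]


# Hypothetical samples for N=5 minutes on three instances.
samples = {
    "i-a": [30, 32, 31, 29, 30],
    "i-b": [31, 30, 29, 30, 32],
    "i-c": [80, 85, 30, 90, 88],
}
print(find_outliers(samples))
```

Here i-c is far from the group in four of the five minutes, so it is the only one flagged. A single one-minute spike would not be enough, which is what keeps the false positives down.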

The initial results are satisfying: Ultron has given us insights into our servers that CloudWatch lacked, into problems that would usually be identified only by someone experiencing a slow service or downtime caused by troublesome instances.

Since writing our initial version of Ultron, we just couldn't sit back and take a back seat on the bot bandwagon, so we made Ultron 2.0, so to speak. The next version of Ultron is based on the Slack Real Time Messaging API, so it's two-way communication now, with stubbornness and attitude.

At times an instance goes down, and not all devs have access to the AWS web console to check which instance is Out of Service or which instances have high CPU usage. Now they can just ask Ultron with Ultron cloudwatch <elb-name> and get the status and usage of the instances behind an ELB.
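As a rough illustration of how a bot might recognise that command in an incoming message, here is a small Python sketch. It is not Ultron's real parser. The function name is hypothetical, and the pattern assumes ELB names contain only letters, digits and hyphens.

```python
import re

# Assumed message shape: "Ultron cloudwatch <elb-name>", case-insensitive.
COMMAND = re.compile(
    r"^\s*ultron\s+cloudwatch\s+(?P<elb>[A-Za-z0-9-]+)\s*$",
    re.IGNORECASE,
)


def parse_cloudwatch_command(text):
    """Return the ELB name from a cloudwatch command, or None if it doesn't match."""
    match = COMMAND.match(text)
    return match.group("elb") if match else None


print(parse_cloudwatch_command("Ultron cloudwatch web-prod"))
print(parse_cloudwatch_command("good morning"))
```

The first call gives back the ELB name and the second gives None, so the bot can ignore ordinary chatter and only look up instance status when it is actually asked.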