Ultron is well known for its dedication to conquest and extermination, so there was no better name for our Slack bot, whose job is to find AWS EC2 instances behind an ELB that are not performing well, showing either low or high CPU Utilisation compared to the others.
AWS CloudWatch is great at providing detailed monitoring, down to one-minute granularity, over EC2 instances and the other services AWS provides. We at Wego use it extensively to monitor our servers' performance and have alerts configured that integrate into Slack and send out e-mails informing us about any issue. Though it is of great help, it lacked something we needed, which prompted us to hack together our own solution using the AWS SDK and Slack for notifications.
To give you some background on what we needed that CloudWatch lacked: we have multiple EC2 instances running behind an ELB. Even though we can configure Alerts to trigger if CPU Utilisation goes beyond a threshold, CloudWatch doesn't let us know if one or more of our servers is running above or below the others by a given threshold value. If we set a narrow threshold to catch any server that went outside it, that could cause a lot of false positives and trigger unnecessary notifications during peak or off-peak hours. At times, an instance behind an ELB could get a request that takes longer and consumes more CPU than on the other instances, or might have a garbage collection issue causing memory locks.
This gave us a good opportunity to write our own solution and hook it up with a Slack bot for notifications, considering bots are the most happening thing nowadays. We created a solution that measures the difference in CPU Utilisation between all instances over a period of N minutes, running on an M-minute interval (2 currently in our setup). The alert is raised if the usage stays beyond a predefined threshold for N minutes. This helped us identify not only the instances that were having problems but also services that were not stable and had spikes in their usage.
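The comparison described above can be sketched roughly as follows. This is a minimal illustration, not Ultron's actual code: the function name `find_outliers`, the choice to compare each instance against the mean of its peers, and the shape of the `samples` input (one CPU datapoint per minute per instance, as might be fetched from CloudWatch) are all assumptions made for the example.

```python
from statistics import mean


def find_outliers(samples, threshold, n_points):
    """Return the ids of instances that look out of line with their peers.

    samples   -- hypothetical input: dict of instance id -> list of CPU
                 utilisation percentages, oldest first, one per minute
    threshold -- allowed difference (in percentage points) from the peers
    n_points  -- number of most recent datapoints (the "N minutes") that
                 must all exceed the threshold before we flag an instance
    """
    # Not enough data (or nothing to compare against): flag nothing.
    if len(samples) < 2 or any(len(s) < n_points for s in samples.values()):
        return []

    outliers = []
    for instance_id, series in samples.items():
        peers = [s for other, s in samples.items() if other != instance_id]
        # Flag only if every one of the last n_points datapoints differs
        # from the peer average by more than the threshold, so a single
        # short spike does not raise an alert.
        if all(
            abs(series[i] - mean(p[i] for p in peers)) > threshold
            for i in range(-n_points, 0)
        ):
            outliers.append(instance_id)
    return sorted(outliers)


# Example with made-up instance ids and readings.
example = {
    "i-aaa": [20, 22, 21],
    "i-bbb": [23, 21, 19],
    "i-ccc": [70, 75, 72],
}
print(find_outliers(example, threshold=30, n_points=3))  # → ['i-ccc']
```

Comparing against the peers rather than a fixed CPU limit is what keeps this quiet during peak or off-peak hours: when every instance rises or falls together, the differences stay small and nothing is flagged.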
The initial results are satisfactory: the bot has given us insights into our servers that CloudWatch lacked and that would usually be discovered only when someone experienced slow service or downtime caused by troublesome instances.
Since writing our initial version of Ultron, we just couldn't sit back and take a back seat on the bot bandwagon, so we made Ultron 2.0, so to speak. The next version of Ultron is based on the Real Time Messaging API, so it's two-way communication now, with stubbornness and attitude.
At times an instance would go down, and not all devs have access to the AWS Web Console to check which instance is Out of Service or which instance(s) have high CPU usage. Now they can just ask Ultron with Ultron cloudwatch <elb-name> and get the status and usage of the instance(s) behind an