Ultron - CloudWatch Slack bot
Ultron is quite well known for its dedication to conquest and extermination, so there was no better name for our Slack bot, whose job is to find the AWS EC2 instances that are not performing well, showing either unusually low or unusually high CPU utilisation compared to the other instances behind the same ELB.
AWS CloudWatch is great at providing detailed monitoring, down to one-minute granularity, for ELBs, EC2 instances and the other services AWS provides. We at Wego use it extensively to monitor our servers' performance and have alerts configured that integrate into Slack and send out e-mails informing us about any issue. Though it is of great help, it lacked something we needed, which prompted us to hack together our own solution using the AWS SDK, with Slack for notifications.
To give you some background on what we needed and CloudWatch lacked: we have multiple EC2 instances running behind an ELB. Even though we can configure alerts to trigger if CPU utilisation goes beyond a threshold, CloudWatch doesn't let us know if one or more of our servers is out of line with the others by a given threshold value. If we set a narrow absolute threshold to catch any server that goes outside it, that could cause a lot of false positives and trigger unnecessary notifications during peak or off-peak hours. At times, an instance behind an ELB could get a request that takes longer and consumes more CPU than on the other instances, or it might have a garbage collection issue causing memory locks.
This gave us a good opportunity to write our own solution and hook it up with a Slack bot for notifications, considering bots are the most happening thing nowadays. We created a solution that measures the difference in CPU utilisation over a period of N minutes across all instances on an ELB, and runs every M minutes (currently N=5 and M=2 in our setup). An alert is raised if, for 80% of the N minutes, the usage is beyond a predefined threshold. This helped us identify not only the instances that were having problems but also the services that were not stable and were having spikes in their usage.
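As a rough illustration of that check, here is a minimal Python sketch. It is not our production code: the function name find_outliers, the 20-point threshold and the choice of the per-minute median across all instances as the reference value are assumptions made for the example. The per-minute samples would come from CloudWatch's CPUUtilization metric; fetching them is left out so the logic stays self-contained.

```python
from statistics import median


def find_outliers(cpu_by_instance, threshold=20.0, required_fraction=0.8):
    """Flag instances whose CPU is out of line with their peers.

    cpu_by_instance maps an instance id to its per-minute CPU % samples
    (N samples each, oldest first, all lists the same length).
    threshold is a hypothetical allowed gap in percentage points, and the
    median of all instances is an assumed reference value.
    Returns {instance_id: fraction_of_minutes_beyond_threshold}.
    """
    if not cpu_by_instance:
        return {}
    minutes = len(next(iter(cpu_by_instance.values())))
    if minutes == 0:
        return {}

    # Reference value for each minute: the median across all instances.
    reference = [
        median(samples[m] for samples in cpu_by_instance.values())
        for m in range(minutes)
    ]

    flagged = {}
    for instance_id, samples in cpu_by_instance.items():
        # Count the minutes in which this instance strays too far.
        breaches = sum(
            1 for m in range(minutes)
            if abs(samples[m] - reference[m]) > threshold
        )
        fraction = breaches / minutes
        if fraction >= required_fraction:
            flagged[instance_id] = fraction
    return flagged


# Example with N=5 one-minute samples per instance.
samples = {
    "i-a": [30, 32, 31, 29, 30],
    "i-b": [31, 30, 33, 30, 28],
    "i-c": [85, 90, 35, 88, 92],
}
flagged = find_outliers(samples)
```

Comparing against the median rather than a fixed number is what keeps the check quiet during peak or off-peak hours: when every instance rises or falls together, nobody is flagged.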
The initial results are satisfactory: Ultron has given us insights into our servers that CloudWatch lacked and that would usually only be discovered by someone experiencing slow service or downtime caused by troublesome instances.
Since writing our initial version of Ultron, we just couldn't sit back and take a back seat on the bot bandwagon, so we made Ultron 2.0, so to speak. The next version of Ultron is based on the Slack Real Time Messaging API, so it's two-way communication now, with stubbornness and attitude.
At times an instance would go down, and not all devs have access to the AWS Web Console to check which instance is Out of Service or which instance(s) are having high CPU usage. Now they can just ask Ultron by typing Ultron cloudwatch <elb-name> and get the status and usage of the instance(s) behind an ELB.
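To give a feel for the command handling, here is a small Python sketch of how a message like the one above could be parsed and answered. The helper names parse_command and format_reply, the reply layout and the sample data are all hypothetical; the real bot receives messages over the Slack Real Time Messaging API and looks up the instances through the AWS SDK, both of which are left out here.

```python
def parse_command(text):
    """Return the ELB name from 'Ultron cloudwatch <elb-name>', else None."""
    parts = text.strip().split()
    if (
        len(parts) == 3
        and parts[0].lower() == "ultron"
        and parts[1].lower() == "cloudwatch"
    ):
        return parts[2]
    return None


def format_reply(elb_name, instances):
    """Build a plain-text reply.

    instances is a list of (instance_id, state, cpu_percent) tuples;
    this shape is an assumption made for the example.
    """
    lines = [f"Status for {elb_name}:"]
    for instance_id, state, cpu in instances:
        lines.append(f"{instance_id}: {state}, CPU {cpu:.1f}%")
    return "\n".join(lines)


# Illustrative data only, not real instances.
elb = parse_command("Ultron cloudwatch web-elb")
reply = format_reply(
    elb,
    [("i-a", "InService", 31.5), ("i-b", "OutOfService", 0.0)],
)
```

Anything that does not match the expected three words returns None, so the bot can ignore ordinary chatter or answer with a help message.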