Stark - The Indigenous Manager
As a service provider that works with partners across the globe, we at Wego use Amazon EC2 for all our services. We have a VPC in which hundreds of servers power multiple services and websites.
When working with datasets measured in terabytes, as Wego does, we at times commission EC2 instances just to process this data and then decommission them at the end of the task. As these are long-running tasks that need instances to run for days, someone has to start and stop the instances manually every time.
Having a developer oversee the start/stop process takes dedicated time for a minuscule operation, yet the business unit cannot do it themselves, as actioning anything on the EC2 console or the servers is beyond their access control. The time the servers need to stay up varies, and a task can't be guaranteed to finish within office hours, when it could be monitored. These are huge servers, with 128GB of RAM or more, so keeping the instances up but unutilised incurs unnecessary cost. Hence we needed a solution flexible enough for all developers to add the tasks they support, and for the business to run those tasks without technical intervention.
To develop such a solution we decided to use Jenkins, as most of the business unit is familiar with its UI and with executing tasks through it. To use Jenkins for this purpose we wrote a Ruby project, initially prototyping a solution to run our Spark cluster.
We started off using the AWS CLI tool to manage the instance start/stop cycle. With the CLI tool we had to write our own polling functions to make sure the server had started before we initialised the task, and had stopped gracefully before we terminated the job. One might ask why we did not use the AWS SDK for Ruby. We eventually did, just not initially, as V2 was a failing build at the time, which made it unreliable. The AWS SDK is a comprehensive kit, though at times confusing for lack of documentation or examples to work with, but it took away the hassle of maintaining our own polling feature.
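A hand-rolled polling function of the kind described above could look roughly like the sketch below. It is not Stark's actual code: `poll_until`, `instance_state`, `fetch_instance_state`, `PollTimeout`, and the retry defaults are hypothetical names and values chosen for illustration. It assumes the `aws` CLI is installed and configured, and it reads the instance state from the JSON that `aws ec2 describe-instances` prints.

```ruby
require 'json'
require 'open3'

# Raised when the condition never becomes true (hypothetical name).
class PollTimeout < StandardError; end

# Calls the block up to max_attempts times, sleeping `interval` seconds
# between attempts, and returns true as soon as the block is truthy.
def poll_until(max_attempts: 60, interval: 5)
  max_attempts.times do |attempt|
    return true if yield
    sleep(interval) unless attempt == max_attempts - 1
  end
  raise PollTimeout, "condition not met after #{max_attempts} attempts"
end

# Pulls the state name ("pending", "running", "stopped", ...) of the first
# instance out of `aws ec2 describe-instances` JSON output.
# Returns nil when the output contains no instance.
def instance_state(describe_json)
  JSON.parse(describe_json).dig('Reservations', 0, 'Instances', 0, 'State', 'Name')
end

# Shells out to the AWS CLI for a single instance and returns its state.
def fetch_instance_state(instance_id)
  out, status = Open3.capture2(
    'aws', 'ec2', 'describe-instances',
    '--instance-ids', instance_id, '--output', 'json'
  )
  raise "aws CLI failed for #{instance_id}" unless status.success?
  instance_state(out)
end

# Usage (the instance id is a placeholder):
#   poll_until { fetch_instance_state('i-0123456789abcdef0') == 'running' }
```

With the SDK, a waiter such as `Aws::EC2::Client#wait_until(:instance_running, instance_ids: [...])` should cover the same ground, which is the maintenance burden the SDK removed.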
Once we had the instance management automated, then came the tricky part: how to monitor whether the task had finished on a remote machine, so that we could stop the instance(s). We have all used SSH, mostly for logging in to remote machines securely. But SSH is a secure shell, which means it not only lets you log in to a remote machine securely, it also lets you run commands remotely in a secure manner. So running a command such as
    ssh user@wego ps aux | grep spark
would get us a list of all the Spark processes. This makes it quite simple to know whether a service is still running on the remote machine or has finished, in which case we should get an empty process list.
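The same check can be sketched in Ruby as follows. This is an illustration under assumptions, not the code Stark uses: `remote_ps`, `matching_processes`, and `task_finished?` are hypothetical helpers, and the sketch assumes key-based SSH access so that `ssh` never prompts for a password.

```ruby
require 'open3'

# Runs `ps aux` on the remote machine over SSH and returns its raw output.
def remote_ps(user, host)
  out, status = Open3.capture2('ssh', "#{user}@#{host}", 'ps', 'aux')
  raise "ssh to #{host} failed" unless status.success?
  out
end

# Keeps only the process lines that match the pattern (Spark by default).
def matching_processes(ps_output, pattern = /spark/i)
  ps_output.lines.map(&:chomp).select { |line| line.match?(pattern) }
end

# The task counts as finished once no matching process is left.
def task_finished?(ps_output, pattern = /spark/i)
  matching_processes(ps_output, pattern).empty?
end

# Usage (user and host taken from the command in the text):
#   task_finished?(remote_ps('user', 'wego'))
```

Filtering in Ruby rather than piping through `grep` is a design choice of this sketch: `grep` exits with a non-zero status when nothing matches, which is easy to mistake for a failed SSH call, whereas here an empty list is simply the "finished" signal.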
Combining the instance management and the process listing gave us a manager for our periodic tasks, one that our business unit could execute from Jenkins without developer intervention. As it was an indigenous solution, we decided to name it Stark.
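How the two pieces fit together can be sketched as a single run cycle. The method `run_managed_task` and its parameters are hypothetical and only illustrate the flow. Each step is passed in as a callable, so the sketch does not assume any particular AWS or SSH code behind it.

```ruby
# Starts the instances, launches the task, checks periodically whether it
# has finished, and always stops the instances afterwards.
# Returns :finished, or :timed_out if max_checks is exhausted.
def run_managed_task(start:, launch:, finished:, stop:, interval: 60, max_checks: 1440)
  start.call
  begin
    launch.call
    max_checks.times do
      return :finished if finished.call
      sleep(interval)
    end
    :timed_out
  ensure
    # Runs once start has succeeded, even if launch or a check raises,
    # so a failed task does not leave expensive instances running.
    stop.call
  end
end

# Usage (the lambdas are placeholders for real start/launch/check/stop code):
#   run_managed_task(
#     start:    -> { puts 'starting instances' },
#     launch:   -> { puts 'submitting job' },
#     finished: -> { true },
#     stop:     -> { puts 'stopping instances' }
#   )
```

Putting the stop step in `ensure` reflects the cost concern that motivated the tool: whatever happens to the task, the instances are not left up and idle.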