AWS S3 to Google BigQuery

Wego uses the cloud extensively, be it server architecture or cloud storage. We live in the cloud, leveraging its power to provide the best service to our users.

Every cloud service has solutions that cater to all or most needs of any company, from storage to servers to analytics tools. It's all about locking your client down in to their architecture. At Wego, the Dev team uses AWS Cloud computing extensively. We have all our server in EC2 inside a VPC, with all data stored in S3. Where as our Analytics team loves their tools , foremost the BigQuery to perform all analysis from finding trends to identifying patterns.

For our teams to work harmoniously, devs need to make sure the data geeks get their data in BigQuery from S3, S3 allows you to easily download your data and BigQuery has an easy interface to add that in to a table under a dataset. Sounds simple, but not when you need to do it from around 2 dozen S3 buckets and for few hundred gig of data each day. Looking at that situation feels like a union not meant to last, should we move our servers and storage to Google or our analytics team to start using Amazon tools. Using the tools we use is because we don't want to make any compromise with the quality of work that we do and deliver.

There are some commercial solutions available that would import your S3 data to BigQuery table, but the amount of data meant huge bills only to transfer data from one source to another. Hence we decided to take up this challenge and write our own service that would transfer data from S3 to BigQuery.

To make sure our data was standardised to be imported into BigQuery, we used Protos, to define Schemas, according to which we log data and then use the same to generate BiqQuery table schema, so that data can be imported seamlessly. Google allows you to import data into BigQuery from their Cloud Storage through jobs that you can schedule.

Fluentd is our main tool to collate logs and upload them as compressed archives to S3. Once they are up in S3, then our service downloads it using S3 tools to our server after which they are then uploaded to Google Cloud Storage using GsUtil. Once the data is in Google Cloud Storage, we are able to schedule jobs to import the uploaded data to BigQuery tables under a dataset.

To cater to different needs and multiple developers who might need to upload data from the servers they are working on, the service needed to be flexible enough for anyone to add their task.The final service was flexible enough to import data into BigQuery from S3, Local machine or Google Cloud Storage. for any date or hour of the day from timestamped datasets. All manageable by specific config files and no code to add for each new dataset in the cloud.

We scheduled the service to run hourly allowing us to push data at the end of each hour instead of waiting for the day to end, previously we use to put things up for processing at day's end, cutting the wait our data geeks had to bare before they could start crunching on all that data.

View Comments