When I joined Quby three years ago, we were serving 25,000 connected devices from a couple of servers in two 19" racks in a hosted data center. Soon after that, the number had grown to 200,000 and we slowly started to experience difficulties with our IT landscape.
The explosive growth had resulted in an overcomplicated collection of snowflake systems. A growing number of outages, including in the server-side application behind our mobile app, were hard to investigate, and every fix seemed to introduce new complications.
Then the heating season of 2016 came knocking at our door. It caused an enormous increase in the usage of our back-end systems. Our technical debt had piled up. Given this instability, it was only a matter of time before our systems would crumble under the ever-increasing load. It was time for drastic measures. That's why we decided to prepare our services for a migration to Amazon's cloud, AWS.
A simple "lift and shift" migration wasn't going to cut it, since it would just move our problems along. The ever-increasing number of connected devices, now up to 300,000, meant we needed to prepare for one of the largest-scale operations in the history of Quby.
To ensure the success of the migration, we defined a set of architecture principles that would govern the infrastructure in AWS and we used a test-driven migration approach.
The first principle was "infrastructure as code".
This meant that any configuration of the services provided by AWS is scripted, mainly using CloudFormation, and put under version control. Test, acceptance and production environments are provisioned using the same scripts, ensuring that changes are properly tested and that the environments are identical. A huge improvement over our old data center.
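To make the idea concrete, here is a minimal sketch of "one template, three environments". It is an illustration only: the resource, the tag name and the helper functions build_template and stack_parameters are hypothetical, and this is not the template we run in production. The point is that the template itself never changes; only the Environment parameter does.

```python
import json

# The three environments named above; all are provisioned from one template.
ENVIRONMENTS = ("test", "acceptance", "production")


def build_template():
    """Return a single CloudFormation template shared by every environment.

    The security group is a placeholder resource chosen for illustration.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical example stack",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": list(ENVIRONMENTS),
            }
        },
        "Resources": {
            "ServiceSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Example group for a back-end service",
                    # The only environment-specific value is injected
                    # through the template parameter.
                    "Tags": [
                        {"Key": "environment", "Value": {"Ref": "Environment"}}
                    ],
                },
            }
        },
    }


def stack_parameters(environment):
    """Return the per-environment parameter list for the shared template."""
    if environment not in ENVIRONMENTS:
        raise ValueError("unknown environment: " + environment)
    return [{"ParameterKey": "Environment", "ParameterValue": environment}]


if __name__ == "__main__":
    # The serialized template is what would be committed to version control.
    template_body = json.dumps(build_template(), indent=2, sort_keys=True)
    for env in ENVIRONMENTS:
        print(env, len(template_body), stack_parameters(env))
```

Because the template body is identical for every stack, a change that passes in test and acceptance is the same change that reaches production.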
Within Quby we have a long-standing tradition of using Linux-based open source software, and we intend to continue that after the migration. We also value our freedom and try to avoid too much lock-in to our public cloud provider. Most of our software is written in Java and uses MySQL to persist data. Since the operation as a whole was already far-reaching, we decided to keep the changes to our applications to a minimum.
Based on these requirements, we designed an infrastructure around Docker containers running on Apache Mesos. Container orchestration is taken care of by Marathon, while service discovery is implemented using Consul from HashiCorp. All these components run on AWS EC2, using Elastic Load Balancers to provide access to the services and AWS RDS as the relational database back-end. The combination of Mesos and Marathon, automatically deployed using scripts, yields a basic infrastructure for our software applications that scales horizontally and is self-healing, in the sense that Marathon kills and restarts any misbehaving or crashed application container without human intervention. Another major improvement.
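The self-healing behaviour is driven by the health checks declared for each application. As an illustration only, an app definition with a health check might look roughly like the sketch below. The app id, image name, port, health path and all numbers are made-up placeholders, and the field names follow the Marathon 1.x JSON format as I understand it, so check them against the documentation of the version you run.

```python
import json


def marathon_app(app_id, image, instances):
    """Build a hypothetical Marathon app definition as a plain dict."""
    return {
        "id": app_id,
        # Horizontal scaling: Marathon keeps this many instances running.
        "instances": instances,
        "cpus": 0.5,
        "mem": 512,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks for a dynamically assigned host port.
                "portMappings": [{"containerPort": 8080, "hostPort": 0}],
            },
        },
        # Self-healing: after this many consecutive failed checks the task
        # is killed and a replacement is started.
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": "/health",
                "portIndex": 0,
                "gracePeriodSeconds": 120,
                "intervalSeconds": 30,
                "timeoutSeconds": 10,
                "maxConsecutiveFailures": 3,
            }
        ],
    }


if __name__ == "__main__":
    app = marathon_app("/example/backend", "example/backend:1.0.0", 3)
    # This JSON document is what would be submitted to Marathon.
    print(json.dumps(app, indent=2))
```

Scaling out is then a matter of raising the instances value, while the health check tells the orchestrator when a container counts as misbehaving.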
The second architecture principle was "Build and Run": the ability of our software developers to independently deploy software, monitor its performance and intervene when necessary.
This was accomplished by configuring a solid Continuous Integration / Continuous Deployment (CI/CD) pipeline based on Jenkins build servers and requiring all deployments to go through it. The transparency that developers need in order to be responsible for operations is provided by monitoring and log analysis services. For our monitoring, we use a standard solution offered by Sysdig. This Software as a Service offering only requires a kernel module to be installed on each of the hosts running Mesos. The logging service posed a bit more of a challenge. Since our software applications were never written with deployment on a public cloud in mind, several of them log personal data in their error logs. As we don't want customer data to leave the EU, that restricted our choice of logging solution. Alas, we have not been able to find a SaaS solution that can guarantee this. Until we do, we run our own ELK stack.
Architecture principle number three was our "Test Driven Migration" approach.
Under this approach, each application underwent, besides the usual functional testing, an elaborate set of performance and resilience tests. Our data center based acceptance environment was not fit for this kind of testing and, since it shared some key network components with production, such tests were never without risk either. Our new cloud-based acceptance environment doesn't suffer from these limitations, and testing there resulted in major improvements in our software stability.
Last month, after just under five months of preparation, we redirected the DNS endpoints of our services from our legacy systems to the new and improved infrastructure in AWS. After a mere 20 minutes, 300,000 devices were using the newly deployed services. Since then, things have never run more smoothly. We have happier customers because of the improved availability, and we have happier developers because of the improved transparency and the ability to perform rolling zero-downtime upgrades, daily if necessary.
Our engineers worked around the clock to make all of this happen, and continue to work hard on delivering the best we can, while learning along the way. I am thrilled and proud to be a part of such a dedicated team.
If you want to be a part of our challenging environment by solving technical puzzles like we do every day, check out our career page: http://careers.quby.com/
P.S. If anyone is interested in a couple of decent second-hand Dell servers, drop me a mail: email@example.com