Immerse is a VR startup based in London, UK. Immerse provides an enterprise virtual reality CMS and Unity SDK. I joined the company in 2017 as their Lead DevOps Engineer to drive the company’s effort to become a global VR platform.
When I arrived at Immerse, their deployments were a bi-monthly event that would sometimes take an entire day or more. Their Virtual Reality CMS was a simple AngularJS application with a NodeJS backend. They were not writing or running any automated tests, and relied entirely on manual QA before releasing to their customers.
Deployment Frequency
The first thing I wanted to address was their deployment frequency, but more specifically, the size of their releases. Because the team was batching massive changes into a single release, it was difficult to identify the root cause of any issues that occurred (and there were plenty!). It made deployment day something we dreaded instead of something we looked forward to, and put unfair stress on our QA team.
The original approach used Hashicorp Packer to rebuild an entire EC2 image from scratch for every new release. This meant developers had to wait for a complete Ubuntu OS install just to ship a one-line bug fix. It also meant our build artifacts were ~3 GB in size! We were installing all of our build tools and development dependencies onto the EC2 instances, and all of that ended up in our build artifacts.
I audited our EC2 instances, identified which packages and services were needed to build our application and which were needed to run it, and separated the two. I then designed a multi-stage Docker build (multi-stage builds were a new Docker feature at the time), and used CircleCI to build and push our Docker images to our private Docker registry.
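To give a flavour of the approach, here is a minimal sketch of a multi-stage Dockerfile for a NodeJS application. The base images, paths and npm scripts are illustrative placeholders rather than our actual build:

```dockerfile
# Build stage: compilers, dev dependencies and build tooling live here only.
FROM node:8 AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm install
COPY . .
# "npm run build" stands in for whatever script compiles the frontend assets
RUN npm run build

# Runtime stage: only the built output and production dependencies ship.
FROM node:8-slim
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm install --production
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]
```

Because only the final stage becomes the published image, none of the build tooling ships to production, which is what brought the artifact size down.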
I then provisioned an AWS ECS cluster with Terraform and deployed our application onto it.
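The Terraform side was conceptually simple; a minimal sketch of the kind of resources involved (names, sizes and the registry URL here are placeholders, not our production configuration):

```hcl
# Illustrative only: a minimal ECS cluster, task definition and service.
resource "aws_ecs_cluster" "platform" {
  name = "immerse-platform"
}

resource "aws_ecs_task_definition" "web" {
  family = "immerse-web"

  container_definitions = jsonencode([
    {
      name         = "web"
      image        = "registry.example.com/immerse/web:latest"
      memory       = 512
      essential    = true
      portMappings = [{ containerPort = 3000, hostPort = 3000 }]
    }
  ])
}

resource "aws_ecs_service" "web" {
  name            = "immerse-web"
  cluster         = aws_ecs_cluster.platform.id
  task_definition = aws_ecs_task_definition.web.arn
  desired_count   = 2
}
```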
Our 45-minute platform deployment pipeline now took less than 4 minutes! Not only that, it was fully automated, instead of requiring a complicated runbook and manual AWS Console access.
The result? We could release multiple times per week, and our QA team could work with a better release cadence.
Testing & QA
The team was struggling with the stability of their releases, and cross-browser QA in particular was a major pain point. So the next thing I did was introduce unit tests. These were added to the CI/CD pipeline and ran automatically for every Pull Request.
I then set up a Selenium Grid to run a series of simple acceptance tests in parallel across multiple browsers. Every major page of our platform had a screenshot taken, and if any browser differed too much from the others, the screenshots were posted and flagged on the PR. Our QA team could also review these screenshots without having to run the application themselves.
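The capture step itself was straightforward; here is a minimal TypeScript sketch using selenium-webdriver against a remote grid (the grid URL, page URL, browser list and output paths are placeholders, and the image-diffing step is omitted):

```typescript
// Illustrative sketch: capture the same page in several browsers on a remote
// Selenium Grid. The grid URL, target URL and browser list are placeholders.
import { Builder } from "selenium-webdriver";
import { promises as fs } from "fs";

const GRID_URL = "http://selenium-grid.internal:4444/wd/hub";

async function capture(browser: string, url: string): Promise<void> {
  const driver = await new Builder().usingServer(GRID_URL).forBrowser(browser).build();
  try {
    await driver.get(url);
    const png = await driver.takeScreenshot(); // base64-encoded PNG
    await fs.writeFile(`screenshots/${browser}.png`, png, "base64");
  } finally {
    await driver.quit();
  }
}

async function main(): Promise<void> {
  await fs.mkdir("screenshots", { recursive: true });
  // Each browser session runs on a different grid node, in parallel.
  await Promise.all(
    ["chrome", "firefox", "MicrosoftEdge"].map((browser) =>
      capture(browser, "https://platform.example.com/login")
    )
  );
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The resulting PNGs are what got compared across browsers and attached to the PR.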
Finally, I worked with the QA team to introduce Capybara testing to their toolkit. After a series of workshops, they were able to codify their own QA test flows into Cucumber features that ran automatically as part of our CI/CD pipeline.
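A hypothetical feature of the sort the QA team ended up writing (the scenario and step wording are made up for illustration, not taken from our real suite):

```gherkin
Feature: Signing in to the platform
  Scenario: A user with valid credentials signs in
    Given I am on the login page
    When I enter a valid email address and password
    And I press "Sign in"
    Then I should see my projects dashboard
```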
Unity3D SDK Builds & CI
The Unity team was struggling with version control of the large assets in their Unity3D projects, and with the complexity of automated builds in their TeamCity setup.
I wrote about some of our specific challenges here: Continuous Integration Woes With Unity 3d.
I decided to enable Git LFS to version control the large assets in their Unity3D Projects, but retroactively migrating assets and rewriting massive Git repositories was a daunting task. Not only this, but our developers were not familiar with advanced Git commands and some were not super comfortable with the command line.
We needed a way to audit, check and repair Git repositories before a commit even took place, and that is not something you can easily do with traditional CI/CD pipelines.
Team CLI With OCLIF
Instead, I decided to go with a tool called OCLIF. OCLIF is a CLI framework for NodeJS written by the fantastic engineers at Heroku. What’s great about OCLIF is that it allows the author to embed a NodeJS runtime, and create a multi-platform executable that can be installed and run on Windows, MacOS and Linux, all from a single source. This means I could provide our developers with a GUI installer, and never have to worry about what software versions they had installed on their machines.
What’s even better is the OCLIF plugin system, and the fantastic update plugin in particular. With this plugin enabled, I could deploy our Immerse CLI to an S3 bucket automatically whenever we merged to master. The CLI would then routinely “phone home”, check for the newest version, and prompt our developers to update. This meant our team always had the latest version of the CLI, and we never had to nag people with update messages in Slack.
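Most of that behaviour is configuration rather than code; a hypothetical package.json excerpt showing the relevant OCLIF settings (the binary name and S3 bucket are placeholders):

```json
{
  "oclif": {
    "bin": "immerse",
    "commands": "./lib/commands",
    "plugins": ["@oclif/plugin-update"],
    "update": {
      "s3": {
        "bucket": "immerse-cli-releases"
      }
    }
  }
}
```

With @oclif/plugin-update enabled, the packaged CLI knows where to look for newer releases when it checks for updates.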
The result was a CLI tool that unified our MacOS-based web developers, our Windows-based Unity developers, and our Linux-based Docker containers and servers.
OCLIF allowed me to develop a company-specific CLI tool which could do all of the following (one such command is sketched after this list):
- install Git LFS and configure it for a project
- safely rewrite an entire Git repository, removing large assets and rebuilding using Git LFS
- check a Git project for correct LFS usage
- fix corrupted assets and project files
- run a series of tests to ensure a Unity3D project was safe to commit
- provide templates for initializing new Unity3D projects using our Unity3D SDK and cloud infrastructure
- create and lint our platform-specific config files
- securely onboard developers into our AWS infrastructure
- provide simple shortcut commands for common tasks
- provide debugging tools for our cloud infrastructure and tooling
- provide CLI access to platform logs and metrics
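To give a sense of what those commands looked like, here is a minimal, hypothetical OCLIF command that checks a repository’s Git LFS configuration. The command name, the checked file pattern and the messages are all illustrative rather than our actual source:

```typescript
import { Command, flags } from "@oclif/command";
import { execSync } from "child_process";

// Hypothetical example: verify that Git LFS is installed and that the
// current repository tracks large binary asset types before a commit.
export default class LfsCheck extends Command {
  static description = "check that this repository is configured for Git LFS";

  static flags = {
    fix: flags.boolean({ description: "attempt to repair the configuration" }),
  };

  async run() {
    const { flags } = this.parse(LfsCheck);

    try {
      execSync("git lfs version", { stdio: "ignore" });
    } catch {
      this.error("Git LFS is not installed. Run `immerse lfs:install` first.");
    }

    // `git lfs track` with no arguments lists the currently tracked patterns.
    const tracked = execSync("git lfs track", { encoding: "utf8" });
    if (!tracked.includes("*.fbx")) {
      if (flags.fix) {
        execSync('git lfs track "*.fbx"');
        this.log("Added *.fbx to Git LFS tracking.");
      } else {
        this.warn("*.fbx files are not tracked by Git LFS (use --fix to add).");
      }
    } else {
      this.log("Git LFS configuration looks good.");
    }
  }
}
```

A developer would run something like `immerse lfs:check --fix` before committing, and the same command could be reused verbatim inside our Linux-based CI containers.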
Our developers loved it, and it quickly became a staple of their workflow.
Platform Logs & Metrics
I set up the ELK stack (Elasticsearch, Logstash, Kibana) to collect logs and metrics from our platform, then built a simple Grafana dashboard to visualize the data. This allowed us to see the health of our platform at a glance, and to alert on any issues.
Global Low Latency Infrastructure
Multiplayer Virtual Reality requires extremely low latency infrastructure. If you have too much latency, players can actually get physically sick and throw up (we learned this the hard way). We needed global infrastructure, not just because we had a diverse customer base, but because our sales staff would often fly internationally to meet with customers and do demos, and we needed the ability to create and destroy a new region on the fly.
Not only this, we were ISO 27001 certified, and needed to ensure our infrastructure was compliant with the standard.
To achieve this, I designed a global hub-and-spoke architecture in AWS. The idea was this: we would maintain a single “core” cluster in London, where our headquarters was, and all of the heavyweight, stateful services would run in that core region. This included our Redis cache, our Postgres databases, and our NodeJS application and authentication systems. All of our metrics, logging and monitoring would run there too.
Then we would have a series of disposable, stateless satellite clusters in each of our supported AWS regions that would connect to our core via a peered VPC. We would deploy a lightweight Golang message broker at the edge, allowing our game clients to send and route UDP and TCP packets via their nearest satellite cluster. Satellites would communicate with the core cluster via a secure lightweight NATS connection. This approach meant our memory footprint was kept to a minimum, and we could scale up and down as needed within a matter of seconds.
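The real edge broker was written in Go, but the satellite-to-core pattern is easy to sketch with a NATS client; here it is in TypeScript, with the server address, subject name and payload all made up for illustration:

```typescript
// Sketch of the satellite -> core pattern over NATS (the production broker
// was Go; this TypeScript version just illustrates the request/reply shape).
import { connect, JSONCodec } from "nats";

const jc = JSONCodec();

async function main() {
  // Each satellite holds a single lightweight connection back to the core.
  const core = await connect({ servers: "nats://core.internal:4222" });

  // Ask the core to authorise a session, then keep routing that client's
  // real-time traffic locally at the edge.
  const reply = await core.request(
    "core.sessions.authorise",
    jc.encode({ clientId: "player-123", region: "ap-southeast-2" }),
    { timeout: 2000 }
  );
  console.log("core replied:", jc.decode(reply.data));

  await core.drain();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```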
Using a Terraform module that configured everything automatically, we could spin up an entirely new region in under 15 minutes with no input beyond the region name. The connection back to the core was secured using AWS VPC Peering and AWS Transit Gateways.
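From the caller’s side, adding a region looked roughly like a single module call per region (the module path and names here are hypothetical):

```hcl
# Illustrative only: one module instantiation per satellite region.
module "satellite_sydney" {
  source = "./modules/satellite"
  region = "ap-southeast-2"
}

module "satellite_tokyo" {
  source = "./modules/satellite"
  region = "ap-northeast-1"
}
```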
One question remained: how would we discover satellites when they could join and leave the network at arbitrary times? Did we need some form of service discovery?
After some prototyping and testing we settled on a rather low-tech (and, I think, underrated) solution: latency-based routing with AWS Route 53. It is a simple, elegant approach that fit our needs perfectly. As its final provisioning step, each satellite registered its presence by adding a nearest.immerse.io record to our Route 53 DNS zone, and our game clients used that record to connect to a satellite. When multiple latency-based records exist for the same name, Route 53 automatically answers with the record from the region that has the lowest latency to the caller.
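In Terraform terms, each satellite’s final provisioning step amounted to something like this (the zone, IP and identifiers are placeholders):

```hcl
# Illustrative only: each satellite registers a latency-based record for the
# shared name as its last provisioning step.
variable "region" {}
variable "zone_id" {}
variable "satellite_ip" {}

resource "aws_route53_record" "nearest" {
  zone_id        = var.zone_id
  name           = "nearest.immerse.io"
  type           = "A"
  ttl            = 60
  records        = [var.satellite_ip]
  set_identifier = "satellite-${var.region}"

  # Route 53 answers queries with the record whose region has the lowest
  # measured latency to the caller.
  latency_routing_policy {
    region = var.region
  }
}
```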
Finally, my work went through an official AWS Well-Architected Review, where our application and infrastructure were carefully reviewed by AWS-approved consultants and auditors. They gave us a detailed report with a series of recommendations, and we were able to address all of them. We also pen-tested the system for ISO 27001 compliance, and passed with flying colours.