Global Entertainment Company

Restructuring & Cost Reduction

I began consulting for a global entertainment company in 2021. I led the project to restructure and dismantle a Kafka-based event-sourcing system that was costing the business tens of thousands of pounds in infrastructure every quarter.

When I arrived, the team was struggling to maintain the system and was constantly firefighting issues. It was a sprawling CQRS/ES microservice architecture: 25+ services, a Kafka cluster, a Redis cluster, a Postgres cluster, a legacy MySQL cluster, a custom logging system, a custom configuration system, and a custom CI/CD pipeline. The whole thing was extremely fragile and prone to bugs and outages.

Logging System

The first thing I addressed was their logging system. It was a custom microservice that would hang at arbitrary times and cause Kubernetes nodes to suddenly go quiet with no alerts or errors. We were losing almost half of our logs and we didn’t even know it!

The fix was simple: I deployed a Datadog DaemonSet into the Kubernetes cluster and configured it to collect logs from every container. This gave us all of our logs in one place and let us alert on any issues.


I can confidently say that he was a transformative presence to the team and system, leaving a legacy of modern Elixir practices, CI/CD excellence and empowered developers.

Brett joined the team at a difficult time, with the system being a sprawling Frankenstein of a naïve CQRS/ES architecture bolted onto a barely functional legacy database fractured across tens of services, with next to no documentation on how it all fit together. [...]

Despite that, within his first month he had not only rebuilt the entire stack but turned his efforts into a standardised CLI which was quickly adopted by the rest of the team, and is still a central part of our developer experience today.

Backend Software Engineer

CI/CD Pipeline

The second thing I addressed was their CI/CD pipeline. It would sometimes test and deploy the wrong commit (!!!) to production, causing confusion and chaos. It was also overly rigid: it prescribed one structure for every service, so developers couldn't add custom steps or logic to an individual microservice without changing the CI/CD pipeline for the entire team.

To fix this, I replicated the existing functionality with a new, more flexible system. This gave power back to the developers, who could now add tests to a specific service without affecting the others, and it let me put additional acceptance tests in place before our critical refactoring work.

Mass Event Processing with Elixir Broadway

The team had terabytes of encrypted event data in cold storage. Unfortunately, when the system had originally been designed, core account data was not properly segregated from unimportant telemetry events.

Storage costs were sky-high and growing rapidly, but the team had no viable way to filter these events in a realistic timeframe and, given the historical instability of the system, had become too terrified to touch it.

To address this, I built a prototype Elixir Broadway pipeline to process the events. We were able to run more than 10 data streams in parallel with backpressure, and a data-processing task that would have taken ~30 days could now be completed in about 3.
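The heart of the prototype was just a Broadway topology: a producer streaming encrypted events out of cold storage, a pool of processors decrypting and classifying them, and batchers for the keep/discard paths. The sketch below shows the rough shape; the module name, helper functions, concurrency numbers, and the dummy producer are illustrative stand-ins, not the production code.

```elixir
defmodule EventFilter.Pipeline do
  @moduledoc """
  Illustrative sketch of the event-filtering pipeline. Names, helpers and
  the producer are stand-ins for the real implementation.
  """
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # The real producer streamed batches of encrypted events out of
        # cold storage; a dummy producer stands in for it here.
        module: {Broadway.DummyProducer, []},
        concurrency: 1
      ],
      processors: [
        # Multiple decrypt/classify workers run in parallel, with Broadway
        # managing demand (backpressure) between the stages for us.
        default: [concurrency: 10]
      ],
      batchers: [
        keep: [batch_size: 500, batch_timeout: 5_000],
        discard: [batch_size: 1_000, batch_timeout: 5_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    message = Message.update_data(message, &decrypt_and_decode/1)

    if core_account_event?(message.data) do
      Message.put_batcher(message, :keep)
    else
      Message.put_batcher(message, :discard)
    end
  end

  @impl true
  def handle_batch(:keep, messages, _batch_info, _context) do
    # Core account events are written back to (much smaller) storage.
    persist_batch(messages)
    messages
  end

  def handle_batch(:discard, messages, _batch_info, _context) do
    # Telemetry events are simply acknowledged and dropped.
    messages
  end

  # Hypothetical helpers standing in for the real decryption,
  # classification and persistence logic.
  defp decrypt_and_decode(data), do: data
  defp core_account_event?(_event), do: true
  defp persist_batch(_messages), do: :ok
end
```

The concurrency and batch sizes were tuned against the real workload; the key point is that GenStage demand, rather than a hand-rolled worker pool, controlled how quickly events were pulled from storage.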

After significant testing, the team decided to proceed, and I built a simple Phoenix LiveView application to monitor the pipeline's progress.
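The monitoring app was little more than a LiveView subscribed to progress broadcasts from the pipeline. A minimal sketch, assuming the pipeline publishes counts over Phoenix.PubSub (the module, topic, and message shape are illustrative):

```elixir
defmodule MonitorWeb.PipelineProgressLive do
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    if connected?(socket) do
      # Assumes the Broadway pipeline broadcasts its counters on this topic.
      Phoenix.PubSub.subscribe(Monitor.PubSub, "pipeline:progress")
    end

    {:ok, assign(socket, processed: 0, discarded: 0, failed: 0)}
  end

  @impl true
  def handle_info({:progress, %{processed: p, discarded: d, failed: f}}, socket) do
    {:noreply, assign(socket, processed: p, discarded: d, failed: f)}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <h1>Event migration progress</h1>
    <ul>
      <li>Processed: <%= @processed %></li>
      <li>Discarded: <%= @discarded %></li>
      <li>Failed: <%= @failed %></li>
    </ul>
    """
  end
end
```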

We processed all the events without so much as a hiccup and reduced our cold storage usage by ~90%.

Removing the Event Sourcing System

The event-sourcing system was Kafka-based and extremely fragile: it was a major source of bugs, outages, and downtime for the team. On top of that, the Kafka infrastructure was costing the business an enormous amount of money across Kubernetes persistent volumes, Kafka cluster nodes, high-performance Redis instances, and more.

I began my approach with an Acceptance Test Suite.

I deployed a local replica of the system as a Docker Compose project and wrote acceptance tests against it. Thanks to the new, more flexible CI/CD pipeline, these tests ran as part of our team's CI process via GitLab's remote-project-trigger feature: whenever a microservice was updated, the acceptance test project was triggered against the latest code.
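Each acceptance test drove the system through its public API only, so it stayed valid while the internals were refactored underneath it. A rough sketch of one such test against the Docker Compose replica; the endpoints, port, payloads, and use of the Req HTTP client are assumptions for illustration:

```elixir
defmodule Acceptance.RewardsFlowTest do
  # Runs against the docker-compose replica of the stack, not in isolation.
  use ExUnit.Case, async: false

  # Placeholder for the replica's public API address.
  @base_url "http://localhost:4000/api"

  test "granting a reward shows up on the user's account" do
    user_id = "acceptance-user-#{System.unique_integer([:positive])}"

    # Drive the system through its public API only, so the test keeps
    # passing as the internals move away from CQRS/ES.
    resp =
      Req.post!("#{@base_url}/rewards/grants",
        json: %{user_id: user_id, reward_id: "welcome-pack"}
      )

    assert resp.status in 200..299

    # Projections in the legacy system were updated asynchronously,
    # so poll briefly rather than asserting immediately.
    assert eventually(fn ->
             case Req.get!("#{@base_url}/users/#{user_id}/rewards") do
               %{status: 200, body: %{"rewards" => rewards}} ->
                 Enum.any?(rewards, &(&1["reward_id"] == "welcome-pack"))

               _ ->
                 false
             end
           end)
  end

  defp eventually(fun, attempts \\ 20) do
    cond do
      fun.() ->
        true

      attempts > 0 ->
        Process.sleep(500)
        eventually(fun, attempts - 1)

      true ->
        false
    end
  end
end
```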

Once I verified the behaviour of the system with the rest of the team and stakeholders, I began to refactor it one event at a time.

We started with the simplest service: the Rewards microservice, which processed events and updated the user rewards database. It lets users earn digital items and other rewards that can be redeemed either in-app or in their digital store.

Over the course of two months, I replaced each CQRS event with a simple database transaction, with GCP PubSub acting as an input buffer wherever high volume was expected. By leveraging Elixir Broadway (which the team now loved), we could consume those PubSub streams with backpressure, so spikes in traffic never overwhelmed the system and we avoided the eye-watering cost of maintaining a highly-available Kafka cluster. I believe that by the time we were done, our PubSub bill was under £60 a month, a saving of more than 99%!
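Each refactored handler ended up looking roughly like the sketch below: a small Broadway pipeline fed by a GCP PubSub subscription, applying one ordinary database transaction per message. The schema, module names, subscription name, and helper functions are illustrative assumptions rather than the production code.

```elixir
defmodule Rewards.GrantPipeline do
  use Broadway

  alias Broadway.Message
  alias Ecto.Multi

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # BroadwayCloudPubSub pulls from a PubSub subscription with
        # demand-driven backpressure, so traffic spikes queue up in PubSub
        # rather than overwhelming the database.
        module:
          {BroadwayCloudPubSub.Producer,
           subscription: "projects/my-project/subscriptions/reward-grants"},
        concurrency: 2
      ],
      processors: [default: [concurrency: 10]]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: data} = message, _context) do
    %{"user_id" => user_id, "reward_id" => reward_id} = Jason.decode!(data)

    # One plain transaction replaces the old event -> projection -> read-model
    # chain. Rewards.Grant.changeset/1 and Rewards.balance_query/1 are
    # hypothetical helpers.
    multi =
      Multi.new()
      |> Multi.insert(:grant, Rewards.Grant.changeset(%{user_id: user_id, reward_id: reward_id}))
      |> Multi.update_all(:balance, Rewards.balance_query(user_id), inc: [reward_count: 1])

    case Rewards.Repo.transaction(multi) do
      {:ok, _changes} -> message
      {:error, _step, reason, _changes} -> Message.failed(message, reason)
    end
  end
end
```

Because Broadway only demands more messages as the transaction workers free up, PubSub itself acts as the buffer during spikes rather than any bespoke queueing inside the service.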

Once the approach had been proven with the Rewards microservice, my colleague began to replicate the approach across the other core services.

While he did this, I turned my attention to the infrastructure costs.

Infrastructure Cost Reduction

The team was overspending heavily on cloud infrastructure. I spent a week or so analysing usage and costs and identified a number of areas where we could save money.

We combined some of our non-essential Postgres databases into a single cluster, and downsized to a single high-performance Redis cluster.

As the CQRS system was removed piece by piece, individual projection microservices and event handlers were no longer needed. Eventually, our suite of 25+ microservices was reduced to a cohesive set of 5 Elixir mini-services.

This rearchitecture significantly reduced our Elixir VM overheads and let us manage our Kubernetes cluster resources far more effectively. Most of the cluster's memory pressure had come from idling CQRS projection services; removing them cut the cluster's memory usage by ~70%, with a corresponding reduction in cost.

As this project concluded, I left knowing I had saved them multiple salaries' worth of infrastructure costs, savings that would continue every month for years to come.

[...] Brett is also a great personality to have in the team, and someone I consider myself very lucky to have had as an engineer senior to me. He has a curious mind and a demonstrable passion for software engineering. This together with his personable and collaborative nature made him an excellent mentor, and through our regular discussion and dissection of what we were doing in Elixir, we were always improving and refining our skills and processes.

During his time, Brett was responsible for massive sweeping improvements as to how we do things at [company]. A non-exhaustive list of these includes:

  • completely rebuilding our extremely high risk CI deployment pipeline
  • creating a cross-system acceptance test suite (which is now our gold standard[...])
  • creating a standard Elixir Ecto migration process
  • migrating our Rewards system from a CQRS architecture to a RESTful API
  • migrating half a billion telemetry events and in doing so introducing Broadway as a tool to the team

Backend Software Engineer