At a previous startup, we had a few hundred thousand users. Not all that many, but enough to give us some interesting data and put us in a position to use it to optimize the business. We had built a recommendation engine in Elasticsearch which all our users were piped through, and we used it to provide daily recommendations based on our most recent data. It wasn’t all that complex at first, and it served us really well for a while.
Eventually we decided we needed to rank users not just on a simple “score” but on a “relative score” that took into account the scores and compatibility of similar and nearby users. When the idea was presented, the whole development team understood the possible weight of the operation and how much headache might be involved. It had the potential to balloon out of control, given the number of nested operations we wanted to perform.
Even if an in-depth comparison of two users’ compatibility takes just 100 milliseconds, repeating it for every possible pair quickly adds up to hours of CPU time. A simple proof of concept was built in a few hours, and it took ages to run: about an hour or so. That was good enough for us to ship the first iteration of the feature, and it appeared to work reasonably well, so we kept it.
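To make the arithmetic concrete, here’s a back-of-the-envelope sketch. The 100 millisecond figure and the user counts are illustrative assumptions rather than our actual numbers; the point is that the cost grows quadratically with the number of users.

```python
# Rough cost of comparing every possible pair of users.
# The comparison time and user counts are illustrative assumptions.
SECONDS_PER_COMPARISON = 0.1  # ~100 ms per in-depth comparison

for n_users in (1_000, 10_000, 100_000):
    n_pairs = n_users * (n_users - 1) // 2  # every unordered pair
    cpu_hours = n_pairs * SECONDS_PER_COMPARISON / 3600
    print(f"{n_users:>7,} users -> {n_pairs:>14,} pairs -> {cpu_hours:>12,.0f} CPU-hours")
```

A thousand users already works out to roughly fourteen CPU-hours at that rate, and a hundred thousand users to well over a hundred thousand, which is why a naive pairwise comparison had no hope of ever running on demand.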
After a few days, though, friction started to appear. This operation sat at the base of most of our development work, and with such a heavy operation at the core of our application, development would frequently halt while we waited for the behemoth to complete.
The result was cached nightly in our production environment. The argument went: if the operation ran in the early hours of the morning, who would notice the lost CPU cycles?
That’s fine, but in the world of startups your production environment should be the least of your worries. In fact, until you have found product-market fit, your production environment is often just another test environment. It’s not testing your code, it’s testing your customers.
Far more important than a fast and responsive website or service is a fast and responsive development team.
Software and the businesses built around it have the odd property of being relatively cheap to run and maintain as-is.
In the startup world this is compounded by the fact that your current business model is very likely the wrong one, and you are still a couple of pivots away from profitability.
Development teams, on the other hand, are NOT cheap to maintain. What this leads to, and what I’m trying to communicate here, is the unintuitive disposability of your production environment. If your business is not yet profitable, your production environment and your product are disposable. The free market will see to that.
So what is important?
Well, it may come as no surprise: your development and testing environments. Maybe our algorithm ran at 4am while we were all asleep, and its output was cached for the rest of the day, but we had managed to tie one hand behind our back in the process. Our development process had turned to molasses, and our tests either avoided the damned thing entirely by mocking it out, or ran terribly slowly.
In some development teams I might have heard “it’s working, don’t fix what ain’t broken”. It’s easy to move on to the next thing and pick up the next ticket… start building the next new and exciting feature… We had tested the feature and decided to keep it. I put it to you that there was a very strong reason to optimize it anyway, despite the fact that no actual end user was affected by the slowness.
Eventually we had a chance to optimize the algorithm. It turned out the “proof of concept” that had been running all this time contained some hideously unoptimized database operations. Not a surprise. After half a day of work, the hour-long operation was running in less than a second. We were finally able to iterate on the algorithm and make improvements where previously we had avoided touching it. And with the proper testing we could now do, we quickly found that the results it had been producing were actually pretty inaccurate too.
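I no longer have the original code to hand, but the shape of the problem will be familiar to anyone who has shipped a proof of concept: queries issued inside the comparison loop rather than one batched fetch up front. A minimal sketch of that kind of fix, with hypothetical names (fetch_profile, fetch_profiles, compatibility) standing in for the real thing:

```python
# Hypothetical sketch: the fetchers and compatibility() stand in for the
# real code; only the shape of the change matters.

def compatibility(profile_a, profile_b):
    """Stand-in for the real scoring logic."""
    return len(profile_a["interests"] & profile_b["interests"])

def score_pairs_slow(user_ids, fetch_profile):
    """A database round-trip for every profile of every pair (the N+1 pattern)."""
    scores = {}
    for a in user_ids:
        for b in user_ids:
            if a < b:
                scores[(a, b)] = compatibility(fetch_profile(a), fetch_profile(b))
    return scores

def score_pairs_fast(user_ids, fetch_profiles):
    """Fetch every profile once, then score the pairs in memory."""
    profiles = fetch_profiles(user_ids)  # single batched query
    return {
        (a, b): compatibility(profiles[a], profiles[b])
        for a in user_ids
        for b in user_ids
        if a < b
    }
```

Whatever the actual culprit was in our case, the fix was of this flavour: nothing clever, just getting the obvious waste out of the hot path.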
The cost of development overhead had been overshadowed by the antiquated idea that the final product was all that mattered. We were paying with our time and our patience. It’s a common issue, and one I see often in various guises.
Another common example of this pattern is allowing databases to balloon with redundant data. Data is notoriously hard to manage these days as we move towards containerized applications running in the cloud. Everything you build is disposable and instant. Everything except your data. It hangs around like a bulky ball and chain, so the damage you’re doing when you let it bloat isn’t always apparent.
Relational databases in particular are sensitive to this issue. In general, storing non-relational data in relational databases is something I advise against, so keeping historical data, logs, metrics, caches, sessions and other non-relational or volatile data there is an anti-pattern. So is keeping incomplete or bogus placeholder objects, or allowing numerous duplicates, empty rows and God knows what else to nest and procreate in the recesses of your tables and columns.
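If you suspect this has happened to you, a quick look at table sizes is usually revealing; the biggest tables often turn out to be logs, sessions or an audit trail nobody reads. A minimal sketch, assuming Postgres and psycopg2 (nothing in this story depends on either):

```python
# List the ten largest tables in the public schema by total size on disk.
# Assumes Postgres and psycopg2; the connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
        FROM pg_class
        WHERE relkind = 'r'
          AND relnamespace = 'public'::regnamespace
        ORDER BY pg_total_relation_size(oid) DESC
        LIMIT 10
    """)
    for table, size in cur.fetchall():
        print(f"{table:40} {size}")
```

If sessions, logs or metrics dominate that list, expiring them or moving them somewhere more appropriate buys back a surprising amount of the development speed described above.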
It’s development overhead. That migration to a new server now takes 5 hours when it should have taken you 30 minutes. Corrupted data? It now takes you hours to pull down a local copy to debug the issue.
Good hygiene practices like running migrations locally against production data slowly die as the overhead grows. The weight of the project gets heavier, and the development friction earns interest. But it doesn’t matter, because we can always buy a thousand production servers, each with a terabyte of RAM. The end user gets their webpage in less than a second, and that’s all that matters, right?