The biggest mess in my professional career - Part 2

If only I could press F5

Dec 27, 2023

This is a follow-up to last week's post, which introduced the context of the biggest mess in my career. If you haven't read it yet, I recommend you go back and check it out before reading further.

In that article I told you how we set out on a major challenge: rebuilding the web stack while radically revamping the product and its UX, and introducing a new brand identity at the same time.

The Beginning: A Promising Start

As is often the case, the initial phase of the project looked quite promising.

👍🏼 The UX team was making very good progress with user testing and coming up with new designs. The team was really excited and full of energy at the prospect of working on something radically better than what we had had to deal with until then.

👍🏼 The Product team was coming up with a list of all those annoying things we finally wanted to fix in the new and improved product. It was like going through a bucket list of sorts that included all the improvements we had constantly put off until later.

👍🏼 The Marketing team had some very good concepts for the new logo and identity, and a lot of creative people were hard at work on different concepts and ideas.

👍🏼 The Engineering team was setting up the new stack with the excitement that often characterises a phase of exploration. Software engineers were finally getting a chance to experiment with new technologies and have a strong impact on how the platform would operate in the future.

💪🏼 In other words, there was good momentum… until there wasn't any longer.

The Unraveling: When Challenges Emerged

A few weeks into the execution, we started to see the first challenges emerging:

👎🏼 The new tech stack turned out to be more complicated to implement than originally estimated. That alone should surprise no one. The real problem was that the technical refactoring was bundled together with major product and branding improvements, so any delay had massive implications and generated a lot of tension and frustration.

👎🏼 As more delays accumulated, the pressure on the team grew, leading us to accept more and more shortcuts as a way to reduce further delays. We entered a vicious cycle of incurring short-term technical debt to build new capabilities.

👎🏼 At some point we discovered major performance issues in the API layer, which had never been tested with the amount of traffic we expected the website to handle. We needed to spend time addressing the worst performance bottlenecks before going live, which delayed the launch even further.

This was turning into a textbook example of a big-bang relaunch going wrong. We could have scrapped the original plan in favour of something more realistic.

Instead we decided to stick with the original plan, applying only minor adjustments to the scope to shave a few weeks from the launch schedule.

We finally launched the first version and allowed our users to try it out, while keeping the old version available as the default option.

The initial feedback was not very encouraging, yet the cost of maintaining the two stacks in parallel forced us to plough ahead with the original plan for the rollout, while dealing with bug fixes and tons of other adjustments in the user flow as they were emerging.

We started believing our own narrative that it was just a matter of users getting used to it. As with a new pair of shoes, it would only take a few days.

In hindsight there was a lot of sunk cost fallacy at play. This project had become proverbially too big to fail, and any major reconsideration at this point in time would have been seen as an admission of a costly mistake.

Nobody ever said that out loud, but when I reflect on those weeks and months, it looks like a good example of groupthink.

As we shifted more and more traffic to the new version, engagement metrics were tanking.

This is where the worst-case scenario became an undeniable reality: we had changed so many things at the branding, product, UX and technical levels that we had no idea where to look for the root causes.

Were users leaving us due to performance issues? Or was it due to bugs? Or maybe they were just lost navigating the new interface? Or was it simply that they found the old design more familiar? 🤷🏼

We did not have a way to reliably attribute symptoms to root causes.

So we did what everybody does in such cases: we started shooting in the dark.

The teams went into a long marathon during which they frantically juggled improving performance, fixing bugs, evolving the product and making adjustments to the user flow. This marathon lasted for months, during which the key metrics kept declining.

Even when most of the outstanding issues with the new stack had been addressed, the sad reality was that the business was still performing worse than prior to the migration.

Given how long it took to go through the entire journey, we must factor in changes in the overall market and competitive landscape. We can't know for sure that the business would have continued to perform better without the Phoenix project.

What we do know, though, is that the project that was supposed to breathe new life into the business might have been one of the proverbial last nails in its coffin.

Not something I'm proud of, but not all is lost as I've learned a lot of valuable lessons from this experience.

If I could do it again

The main thing I regret is not pushing back as hard as I could have on the idea of bundling a massive technical refactoring together with all the product and branding changes.

If there were only one thing I could do differently, it would be this.

I can see several ways to approach this:

1. Start with the technical refactoring.
In an ideal world, this would be my favourite approach.
The refactoring was well justified, as the existing stack was undeniably making the team very unproductive. Since these changes would have been transparent to end users, we could have released them incrementally rather than in one large release.
This would have allowed us to spot performance problems early, reducing their impact and the time to resolution.
With an improved stack, subsequent product enhancements could have been implemented more quickly than on the old one.
The cost of future changes would have gone down drastically, justifying the decision to delay them.

2. Start with product improvements, but no technical refactoring.
This approach is often what the reality of your business calls for.
If you urgently need changes in the product, a situation that is quite common in the industry, you might want to take this approach.
Making changes to the product will take a relatively long time, as you'll be doing it on the legacy stack, but you'll get the benefit of reduced uncertainty. Most likely your team is very familiar with the quirks and limitations of your current platform, so they will know how to work around them.
You will keep complexity down and confine the uncertainty to the product level. If key metrics don't trend in the right direction after deploying product improvements, it'll be clearer where to look for solutions.

3. Bundle tech and product changes, but be very deliberate about shipping them incrementally, page by page, and measure each change religiously before moving on to the next one.
If you don't succeed in convincing your peers of the benefits of decoupling major changes, then you'll need to do your best to reduce risk and build confidence as you make progress.
You will need to fight for one thing, though: getting your stakeholders to accept that you will make a series of incremental launches.
Each of them should be self-contained and limited to a single app or page. This way you'll be able to narrow down the source of any problem or surprise very quickly, improving your chances of solving it in a reasonable amount of time.
You will want to set up strong controls on your deployment capabilities, ideally with canary and blue/green deployment strategies, and fine-grained observability.
Last but not least, you will want to recruit loyal users and give them access to early versions of the new product.
This will force you to think “production first” from the early stages of development, while also providing valuable, low-risk feedback on the changes.
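
To make the page-by-page idea a bit more concrete, here is a minimal Python sketch of a canary rollout gate. It is only an illustration under my own assumptions, not what we actually ran: the page names, the percentages, the function names and the 5% rollback threshold are all hypothetical. Each user is hashed into a stable bucket per page, so the same person always sees the same version, and a key metric is compared between the old and new versions before the rollout is widened.

```python
import hashlib

# Hypothetical rollout table: page name -> share of users (0-100) on the new version.
ROLLOUT_PERCENT = {"home": 5, "search": 0, "checkout": 0}


def bucket(user_id: str, page: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given page."""
    digest = hashlib.sha256(f"{page}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_version(user_id: str, page: str, rollout=ROLLOUT_PERCENT) -> bool:
    """True if this user should see the new version of this page."""
    return bucket(user_id, page) < rollout.get(page, 0)


def relative_change(old_rate: float, new_rate: float) -> float:
    """Relative change of a metric (e.g. conversion rate) from old to new."""
    if old_rate == 0:
        raise ValueError("old_rate must be non-zero")
    return (new_rate - old_rate) / old_rate


def should_roll_back(old_rate: float, new_rate: float, max_drop: float = 0.05) -> bool:
    """Flag a rollback when the new version underperforms by more than max_drop.

    The 5% default is an arbitrary placeholder, not a recommendation.
    """
    return relative_change(old_rate, new_rate) < -max_drop
```

Because only one page is in the canary at any time, a drop in its metric points at that page's changes rather than at everything at once. In practice you would also want a proper statistical test on top of the raw comparison, since small samples are noisy.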

Conclusions

In this article I shared with you the story of the biggest mess in my professional career.

The story is nothing fancy, and probably quite common. That is precisely why I wanted to share it broadly, as I'm sure many fellow practitioners have dealt with, or are dealing with, similar situations more often than they'd like.

As I said, being lucky a few times might increase your confidence, which in turn increases the likelihood of going through the really painful experience at some point.

What has been your Phoenix project?

What are the key lessons you've drawn from it?

I'd be delighted to hear more stories in the comments section!