I stood in front of the board, asking unpopular questions. “When was the last time we stopped offering a service that was losing money? Do we even know which ones make money?” I could see confusion and anger on the faces of the board members. It’s not a pleasant feeling standing in a hostile room telling people that they should stop offering services that now, because of me, no longer exist.

My name is Monika, and to understand what I was talking about and how I ended up in the boardroom, I have to go back two years to the very start of project Reboot. The project was an $84 million effort to rewrite our core systems. The original system had a lot of bugs in it and there were no automated tests. We had over 300 systems covering generations of technology. Like ancient rock, we had layers: mainframes, Visual Basic, .NET 1, 2, 3 and 4 – all integrated through a tangle of service buses and queues.

Within a month we had serious problems. Each release into test resulted in regressions. We’d spend more time retesting than we would testing. We were going backwards. I was not sleeping.

We knew we had to bring in test automation. But the environment was too brittle. We’d get a new build on Thursday and the automation would break. Then we’d waste time getting it fixed before we could start testing any of the new work. The problem, it turned out, was twofold: our tests relied on data that didn’t always exist, and of course the neighbouring systems had their own bugs.

My days were spent reacting to the latest testing problem. My nights were spent reading books and blogs trying to solve the bigger problem.

I was ragged. I met my friend Maggie for a coffee.

“Mon, you need to sleep. You spend your nights trying to solve these problems. Hire a consultant to do that during the day so you can sleep.” Maggie was right. I just needed some of her decisiveness and a new perspective.

The consultant told me the tests would be fixed by creating golden test data and by building fake versions of the neighbouring systems. As consultants do, she solved one problem by slicing it into lots of little problems. Golden data took time to curate, and each fake service could only serve a handful of tests.

This worked well until we had hundreds of these micro fake services. They were growing like a garden of weeds. We couldn’t keep this up. “I don’t want any more single-use fakes,” I yelled across the development floor.

My team devised a façade that would proxy messages to the correct micro fake. This collapsed our smorgasbord of fakes down to one per system. We wanted to do things properly, so we wrote a fake consumer to test our façade.
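The story doesn’t show the façade itself, but the routing idea is simple enough to sketch. Here is a minimal version in Python, assuming each message carries a type field that identifies which micro fake should handle it; the class, method and message names are all invented for the example:

```python
# Minimal sketch of a facade fronting one neighbouring system's micro fakes.
# Assumes every message is a dict with a "type" field; names are illustrative.

class FakeFacade:
    def __init__(self):
        self._handlers = {}  # message type -> micro-fake handler

    def register(self, message_type, handler):
        """Each micro fake registers the message types it understands."""
        self._handlers[message_type] = handler

    def handle(self, message):
        """Proxy the message to the right micro fake, failing loudly otherwise."""
        handler = self._handlers.get(message["type"])
        if handler is None:
            raise LookupError(f"no fake registered for {message['type']!r}")
        return handler(message)


# One facade per system replaces hundreds of single-use fakes.
billing = FakeFacade()
billing.register("invoice.create", lambda msg: {"status": "created", "id": 42})

print(billing.handle({"type": "invoice.create", "amount": 100}))
```

The fake consumer then amounts to a handful of known messages pushed through the façade to prove the routing holds.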

Things slowly started to improve and I even got a few weekends to myself. The tests were robust and they found bugs – lots of them. The issue I had now was the rework required to fix the bugs. What a waste! We were still behind schedule, but we were maintaining pace. I wanted to – needed to – claw back the lost month. To reduce rework I asked the developers to write the tests first.

They refused.

“It’s too hard. All the fakes we’ve written only work in test.” And on and on. The fake services relied on the golden dataset, which at a whopping 40 gigabytes was too big to put onto every developer’s workstation. They also needed to behave differently depending on which environment they were in: in development they needed to be simpler; in test, fakes talked to other fakes.

I refused to be refused. But I had to find a solution. We started making the fake services environment-aware. It was tricky, but it worked. Fixing the data problem was more effort: we wrote fakes that would do the work of creating the data for us. This allowed us to get rid of the golden data. It was so successful that we almost missed a bug when one of the developers used a fake for creating data rather than the more cumbersome ‘real’ services.
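As a rough illustration of both fixes, here is a sketch of an environment-aware, data-building fake. The environment variable, the service names and the message shapes are all assumptions for the example:

```python
import os

# One fake, two behaviours: simple canned responses on a developer
# workstation, fake-to-fake calls in the shared test environment.
# FAKE_ENV and all the names below are invented for this sketch.

ENV = os.environ.get("FAKE_ENV", "dev")

def make_customer(customer_id):
    """Data-building fake: generates the data a test needs on demand,
    replacing the 40 GB golden dataset."""
    return {"id": customer_id, "name": f"Customer {customer_id}", "accounts": []}

def accounts_fake(message):
    return {"customer_id": message["customer_id"], "accounts": ["ACC-001"]}

def customer_fake(message):
    if ENV == "dev":
        # Workstation: keep it simple, no neighbouring systems required.
        return make_customer(message["customer_id"])
    # Test environment: fakes talk to other fakes, like the real topology.
    accounts = accounts_fake({"type": "accounts.lookup",
                              "customer_id": message["customer_id"]})
    customer = make_customer(message["customer_id"])
    customer["accounts"] = accounts["accounts"]
    return customer

print(customer_fake({"type": "customer.get", "customer_id": 7}))
```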

Now that developers were also writing tests, they were missing opportunities to share test logic because they didn’t always know what existing tests did. Each test worked out which fakes to use and which validation to run. Often the problem had already been solved elsewhere, leading to massive duplication.

So we moved all of the logic out of the test scripts and put it into the fake services. The mantra “simple tests” was printed on a banner hung at the entrance to the development floor. Each fake would know which environment it was in, and it would validate every message coming through. The tests became input data and nothing else.
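To make the shape of that concrete, here is a sketch of a fake that owns its own validation, so a test shrinks to a row of input data. The message types and required fields are invented:

```python
# The fake owns the validation; a test is just a row of input data.

REQUIRED = {"invoice.create": ("customer_id", "amount")}

def billing_fake(message):
    """Validates every message passing through and names itself on failure."""
    missing = [field for field in REQUIRED.get(message["type"], ())
               if field not in message]
    if missing:
        raise AssertionError(f"billing fake: {message['type']} missing {missing}")
    return {"status": "ok"}

# The test suite collapses to input data and nothing else.
TEST_CASES = [
    {"type": "invoice.create", "customer_id": 7, "amount": 100},
    {"type": "invoice.create", "customer_id": 8},  # should fail validation
]

for case in TEST_CASES:
    try:
        billing_fake(case)
        print("PASS", case)
    except AssertionError as error:
        print("FAIL", error)
```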

Failures were decoupled from tests, too. We correlated which fakes were reporting which errors and then worked on getting the errors down to zero. We added error handling to make the fakes as robust as possible.

The build light glowed green. And I started to question the value of having a test suite that was always green. You may be thinking “isn’t a green build enough?” but I wanted the team to be better. I wanted to go beyond green.

We were still finding bugs during manual testing. Just not as many. I realised our test coverage was too low. Our automation could only do so much. The manual testers were plugging gaps in our coverage, but we didn’t know the size of what we were covering. I knew there was one more source of test ideas we could use: we cloned the traffic from production into our test environment. We were now testing live.
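The mechanics of a traffic tap can be sketched in a few lines, assuming messages flow through a broker we can publish to; the queue name and the publish callback are stand-ins:

```python
# A tap that copies each production message onto a test queue.

def clone_to_test(message, publish):
    """Mirror live traffic into test, fire-and-forget: a failure to
    mirror must never disturb the production flow."""
    try:
        publish("test.mirror", dict(message))
    except Exception:
        pass  # mirroring is best-effort by design

# In-memory stand-in for the broker, just to show the wiring.
mirrored = []
clone_to_test({"type": "payment", "amount": 12.50},
              lambda queue, msg: mirrored.append((queue, msg)))
print(mirrored)
```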

This was scandalous. We discovered whole flows through the system that were unknown to us. We had covered the common and high risk parts, but edge case combinations were numerous. Our fake services were not up to scratch. Each fake failure taught us something new. I was riding a wave, walking around the office high-fiving anyone and everyone. I felt confident we would do this.

The next week was the worst of my life. The deployment into production was a debacle. The systems we had built: incomprehensible. The messages they sent were gibberish and garbage. It didn’t make any sense. We worked all day and night piecing together the wreckage. We didn’t understand why production failed when the same traffic, copied into test, was working.

The problem, we would learn, was that the new systems had never spoken to each other. We’d used fake services in so many places that the first real integration was production. There was no disaster recovery plan; that was built into the new system.

Production was still down. I heard we were losing money every day, and we couldn’t roll back.

A penny dropped. I had read about a new technique called ‘blue-green deployment’, where you cut over from old to new by rerouting traffic. “Why don’t we cut over production to the test environment? We know test works.” It was crazy. We turned off the real mess and our ‘fake’ services became production. The King is dead! Long live the King!
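Blue-green in miniature, assuming all traffic already flows through a router we control: every request follows one pointer, so the cut-over is a single switch. The backend addresses are invented for the sketch:

```python
# All traffic follows one pointer; the cut-over is one atomic flip.

BACKENDS = {
    "blue": "https://legacy.internal",   # the broken production systems
    "green": "https://reboot.internal",  # the environment we trusted
}
active = "blue"

def route(request):
    return f"{request} -> {BACKENDS[active]}"

print(route("GET /invoices"))  # served by blue
active = "green"               # the cut-over: one reroute, no redeploy
print(route("GET /invoices"))  # served by green
```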

With that switch we had consolidated the platforms, upgrading all the existing technology. We had a test suite, services that self-validated their input, comprehensive application monitoring and a modern development culture. All that was missing was a few tiny edge cases of business functionality.

And that’s what I pitched to the board.

“It would be cheaper to tell our call centre people that we were no longer offering the missing services than to try and recover. I don’t even know if we’re making money on them.” I waited in the office for the board’s verdict. They didn’t like what I had done. But they would accept it if the team delivered the missing features before the end of the financial year. I would not be in the team. Or the company.

I heard later that their ESB wasn’t even being used. They had sold it to one of their competitors and made up for the revenue lost during the production deployment.