A new method for fast refactoring of legacy code

27 Aug 2019 / Alex Bolboaca

In this article, I will present a method that I’ve tried in a few codebases in compiled languages for safely and quickly refactoring untested code. First, we will discuss the main problem we are trying to solve, then quickly introduce the techniques described by Michael Feathers and discuss some of their shortcomings, and finally describe the proposed technique.

Briefly, the technique I’ve been experimenting with involves three steps: first refactoring towards pure functions using safe, mechanical refactoring steps, then testing the pure functions by quickly writing data-driven tests, property-based tests or by using the golden master technique, and finally refactoring the pure functions towards the desired design.

Please feel free to skip to the section “A New Method” if you are familiar with the legacy code problem and techniques.

The Problem

Many software projects go through a cycle like the following:

In the beginning: “we’ll write the best code we can, and as fast as we can”
A few months later: “we need to ship now, let’s make things work”
A year or so later: “we can’t touch that part of the code anymore because it might blow up”
A few years later: “it’s very slow to add more features because of the existing code”

You probably recognized the technical debt / legacy code problem. Once you’ve reached the final point, the options are limited. You can either

rewrite the project (extremely risky)
live with the low productivity (bad for business)
try to throw more people at the problem (and prepare to fail due to Brooks’s law)
refactor the existing code to allow faster development

But here’s the rub: refactoring involves changing code, and changing code can create even more problems. Also, given that you’ve written code that’s hard to change until now, what makes you think that you can suddenly create code that’s easier to change?

Existing Solutions

Fortunately, Michael Feathers did the heavy lifting for us and found ways to safely refactor existing code. If you don’t know how, go and read his book “Working Effectively with Legacy Code”. Then practice the techniques, at a workshop or at a Legacy Coderetreat. Only then can you start trying to apply the techniques to your own code.

The basic technique goes as follows:

Pick a section of the code
Write characterization tests (i.e. tests that define how the code works now). You may need to make some small changes in the code to allow writing the tests using seams.
Refactor the code while using the characterization tests to preserve existing behavior.

This solution works very well once you master the techniques. However, it has one problem: it’s quite slow and tedious. Sure, that should be expected: cleaning up a mess is rarely fast or easy. But the business often doesn’t have time to invest in the cleanup.

Maybe there’s another way to do the same thing?

A New Method

I’ve been pondering this problem for a long time. At the same time, I learned more and more about functional programming. Therefore, I’m proposing a new method that involves pure functions and passing functions as arguments, and that I believe to be faster.

Before we move on, I’d like to make it clear that, while I’ve played with this method in various code samples, I can’t claim that it’s fully studied and perfect. I plan to try it out with more people, and see what I can learn from it. I believe however that it’s promising enough to be described.

The method has three steps:

Refactor the selected code towards pure functions, using mechanical, safe refactorings and no tests
Write data-driven tests or property-based tests for the pure functions
Refactor the pure functions towards your desired design paradigm using the tests.

Let’s define a few terms, and move on to describe each step.

What is a pure function?

A pure function is a function that returns the same output values when receiving the same input values, and changes nothing in the program state. For example, the following function is pure:

int add(int first, int second){
    return first + second;
}

while the following function is not pure:

int add(int& first, int second){
    first += second; // this changes the caller's variable through the reference
    return first;
}

More generally, pure functions cannot depend on I/O or on time, and they cannot change the parameters they receive. This makes them very predictable.
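The dependency on time can be illustrated with a minimal sketch. The function names below are hypothetical and chosen only for this illustration: the first version reads the system clock, while the second receives the current time as data.

```cpp
#include <cassert>
#include <ctime>

// Impure: the result depends on the system clock, so two calls with the
// same argument can return different values.
bool isExpired(std::time_t expiresAt) {
    return std::time(nullptr) > expiresAt;
}

// Pure: the current time arrives as data, so the same inputs always
// produce the same output.
bool isExpiredAt(std::time_t now, std::time_t expiresAt) {
    return now > expiresAt;
}

// The pure version can be checked with fixed values, no clock involved.
void checkIsExpiredAt() {
    assert(isExpiredAt(200, 100));
    assert(!isExpiredAt(100, 200));
}
```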

So how can we take advantage of pure functions?

It’s Pure Functions All the Way Down

I will postulate that any non-trivial program can be written as a set of pure functions combined with a few mutable (that is, impure) functions.

For example, if your program is a web application, all the code that writes or reads from the database, all the code that creates the response and reads the request, and all the code that writes log files can be encapsulated in a few mutable functions. Everything else is easily written with pure functions.

If your program is a game, all the code that interacts with the graphics card, all the code that reads the player’s actions, and all the code that saves or loads the game is mutable. Everything else can be easily written with pure functions.

A more interesting effect is that this rule applies at different levels. It can apply to a class, it can apply to a module, to a set of classes, or to a method. The only time it fails is if we try to apply it to a very simple I/O method.

That’s very powerful: it means that the pure functions representation of the program can be used no matter where we start or how large the code is. This is the first part of the puzzle.

The second part of the puzzle is: how can we safely refactor any code towards pure functions?

Phase 1: Refactor towards pure functions

The basic technique is the following:

select one or more lines of code
extract method from them
make the method static
use the compiler to identify mutable code (NB: so far, I’ve tried this only with compiled languages)
for each piece of mutable code, either extract a new function that you pass as a parameter, or extract data that you pass as a parameter. When in doubt, prefer functions.

The end result is a pure function that receives a lot of parameters, either other functions or data parameters. Either way, we have refactored the function towards a pure function with all dependencies injected. This makes the function testable.
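The steps above can be sketched on a small, hypothetical class (Greeter, greetingFor and readCurrentHour are names invented for this illustration, not taken from a real codebase). The original method read a member variable and the system clock. After extraction, the static function receives the name as data and the clock as a function. I am assuming here that making the method static is what lets the compiler flag the member access, since a static method cannot touch non-static members.

```cpp
#include <cassert>
#include <ctime>
#include <functional>
#include <string>
#include <utility>

class Greeter {
public:
    explicit Greeter(std::string name) : name_(std::move(name)) {}

    // Original shape, kept as a comment for comparison:
    //   std::string greeting() {
    //       std::time_t now = std::time(nullptr);
    //       int hour = std::localtime(&now)->tm_hour;
    //       return (hour < 12 ? "Good morning, " : "Hello, ") + name_;
    //   }

    // The public behavior stays the same: it delegates to the extracted function.
    std::string greeting() const {
        return greetingFor(name_, readCurrentHour);
    }

    // Extracted and made static: the member became a data parameter,
    // the clock access became a function parameter.
    static std::string greetingFor(const std::string& name,
                                   const std::function<int()>& currentHour) {
        return (currentHour() < 12 ? "Good morning, " : "Hello, ") + name;
    }

private:
    // The remaining impure piece, isolated in one small function.
    static int readCurrentHour() {
        std::time_t now = std::time(nullptr);
        return std::localtime(&now)->tm_hour;
    }

    std::string name_;
};

// With the dependencies injected, the function can be called with fixed inputs.
void checkGreetingFor() {
    assert(Greeter::greetingFor("Ana", [] { return 9; }) == "Good morning, Ana");
}
```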

At this point, you can either move to phase 2, or refactor the function to reduce its parameters.

In my experience, each of these steps is mechanical and safe to do with modern IDEs (or with vim).

It’s time to write some tests.

Phase 2: Write tests for the pure function

At this point, we have at least one pure function we can test. Since the function doesn’t change its parameters, and returns the same outputs for the same inputs, we can use data-driven tests. In fact, the function is equivalent to a very large data table whose last column represents the output.
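A data-driven test can be sketched as a table of rows and a loop, without any test framework. The function shippingCost and its rules are hypothetical, invented only to have something to test. For legacy code, I would expect the values in the last column to be recorded from the current behavior of the code rather than derived from a specification.

```cpp
#include <cassert>
#include <vector>

// Hypothetical pure function under test.
int shippingCost(int weightKg, bool express) {
    int base = weightKg <= 5 ? 10 : 10 + (weightKg - 5) * 2;
    return express ? base * 2 : base;
}

// One row of the data table: the inputs, then the expected output.
struct Row {
    int weightKg;
    bool express;
    int expected;
};

// Returns the number of rows whose actual output differs from the expected one.
int countFailures(const std::vector<Row>& rows) {
    int failures = 0;
    for (const Row& row : rows) {
        if (shippingCost(row.weightKg, row.express) != row.expected) {
            ++failures;
        }
    }
    return failures;
}

// The table itself: adding a test case means adding a row.
const std::vector<Row> kShippingTable = {
    {1, false, 10},
    {5, false, 10},
    {6, false, 12},
    {6, true, 24},
    {10, true, 40},
};

void checkShippingTable() {
    assert(countFailures(kShippingTable) == 0);
}
```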

Moreover, the input data can be generated. We can use property-based testing and/or the golden master technique to take advantage of input data generators, thus making the process faster.
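Both ideas can be sketched with the standard library alone, assuming a seeded random generator is an acceptable input generator. The function discountPercent and its rules are hypothetical. recordOutputs produces the text that a golden master test would save once and compare against on later runs, and loyaltyAddsTwoPoints checks a single property over generated inputs.

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>

// Hypothetical pure function under test.
int discountPercent(int quantity, bool loyalCustomer) {
    int discount = quantity >= 100 ? 10 : (quantity >= 10 ? 5 : 0);
    return loyalCustomer ? discount + 2 : discount;
}

// Golden master: run the function over generated inputs and record every
// input/output pair as text. A fixed seed makes the run repeatable with the
// same standard library implementation.
std::string recordOutputs(unsigned seed, int samples) {
    std::mt19937 generator(seed);
    std::uniform_int_distribution<int> quantity(0, 200);
    std::bernoulli_distribution loyal(0.5);
    std::ostringstream record;
    for (int i = 0; i < samples; ++i) {
        int q = quantity(generator);
        bool l = loyal(generator);
        record << q << ' ' << l << " -> " << discountPercent(q, l) << '\n';
    }
    return record.str();
}

// Property: for any quantity, a loyal customer gets exactly 2 points more
// than a non-loyal customer.
bool loyaltyAddsTwoPoints(unsigned seed, int samples) {
    std::mt19937 generator(seed);
    std::uniform_int_distribution<int> quantity(0, 200);
    for (int i = 0; i < samples; ++i) {
        int q = quantity(generator);
        if (discountPercent(q, true) != discountPercent(q, false) + 2) {
            return false;
        }
    }
    return true;
}

void checkGeneratedInputs() {
    // Same seed, same inputs, same pure function: the records must match.
    assert(recordOutputs(42, 50) == recordOutputs(42, 50));
    assert(loyaltyAddsTwoPoints(7, 1000));
}
```

In a real golden master test, the recorded text would be stored in a file after the first run and compared with the output of every later run.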

As for the functions injected as parameters, our tests can replace them with stubs or, for some of the I/O code, with mocks.
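Here is a minimal sketch of both, using lambdas instead of a mocking library. The function orderTotal and its parameters are hypothetical: the stub returns canned prices where the real code might query a database, and the hand-rolled mock records the log calls so the test can check them.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of phase 1: the price lookup and the logger are
// injected instead of being called directly.
int orderTotal(const std::vector<std::string>& items,
               const std::function<int(const std::string&)>& priceOf,
               const std::function<void(const std::string&)>& log) {
    int total = 0;
    for (const std::string& item : items) {
        total += priceOf(item);
    }
    log("total=" + std::to_string(total));
    return total;
}

void checkOrderTotal() {
    const std::vector<std::string> items = {"book", "pen"};

    // Stub: returns canned answers and verifies nothing.
    auto stubPrice = [](const std::string& item) { return item == "book" ? 12 : 3; };

    // Hand-rolled mock: records each call so the interaction can be verified.
    std::vector<std::string> logged;
    auto mockLog = [&logged](const std::string& line) { logged.push_back(line); };

    assert(orderTotal(items, stubPrice, mockLog) == 15);
    assert(logged.size() == 1);
    assert(logged[0] == "total=15");
}
```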

This leads us to phase 3.

Phase 3: Refactor the pure function

Once covered by tests, our pure function can be easily refactored. The functions passed as parameters can be turned into interfaces that are injected. The function can be split, and the resulting functions moved into their own classes. Some of the input parameters can be passed to a constructor, while others remain function parameters, and so on.
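One possible end state, sketched with hypothetical names: a price lookup that used to be a function parameter becomes the PriceCatalog interface, it is passed once to the constructor, and the list of items remains a method parameter. This is only one of many designs the tests would allow.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumed starting point (result of phase 1), shown as a comment:
//   static int invoiceTotal(const std::vector<std::string>& items,
//                           const std::function<int(const std::string&)>& priceOf);

// The function parameter becomes an interface.
class PriceCatalog {
public:
    virtual ~PriceCatalog() = default;
    virtual int priceOf(const std::string& item) const = 0;
};

// The dependency moves to the constructor, the data stays a method parameter.
class InvoiceCalculator {
public:
    explicit InvoiceCalculator(const PriceCatalog& catalog) : catalog_(catalog) {}

    int total(const std::vector<std::string>& items) const {
        int sum = 0;
        for (const std::string& item : items) {
            sum += catalog_.priceOf(item);
        }
        return sum;
    }

private:
    const PriceCatalog& catalog_;
};

// Test double: every item costs the same fixed price.
class FixedPriceCatalog : public PriceCatalog {
public:
    explicit FixedPriceCatalog(int price) : price_(price) {}
    int priceOf(const std::string&) const override { return price_; }

private:
    int price_;
};

void checkInvoiceCalculator() {
    FixedPriceCatalog catalog(4);
    InvoiceCalculator calculator(catalog);
    assert(calculator.total({"a", "b", "c"}) == 12);
}
```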

Or, we can go all in functional, extracting lambdas, composing functions, and using partial application to remove duplication. That’s up to you.
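For the functional route, a minimal sketch of partial application and composition with lambdas follows. The names priceWithTax, withTaxRate and compose are hypothetical, and the sketch is restricted to functions from int to int to keep it short.

```cpp
#include <cassert>
#include <functional>

// Hypothetical pure function whose first parameter rarely changes.
int priceWithTax(int taxPercent, int netPrice) {
    return netPrice + netPrice * taxPercent / 100;
}

// Partial application: fix taxPercent once and get back a one-argument function.
std::function<int(int)> withTaxRate(int taxPercent) {
    return [taxPercent](int netPrice) { return priceWithTax(taxPercent, netPrice); };
}

// Composition: apply f first, then g.
std::function<int(int)> compose(std::function<int(int)> f, std::function<int(int)> g) {
    return [f, g](int x) { return g(f(x)); };
}

void checkFunctionalStyle() {
    auto withVat = withTaxRate(20);
    auto addShipping = [](int price) { return price + 5; };
    assert(withVat(100) == 120);
    assert(compose(withVat, addShipping)(100) == 125);
}
```

Callers that always use the same tax rate no longer repeat it at every call site, which is the kind of duplication partial application removes.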

This takes us to a conclusion.

Conclusion

In this article, I have briefly presented a new method to refactor existing, untested code: first refactoring towards pure functions, then covering them with data-driven or property-based tests, and finally refactoring the pure functions towards the end goal. There are many more techniques that I found while trying this method, but for brevity I decided not to include them.

I have also made a few claims that require more investigation and experiments. For example: can an average programmer, once taught the basic techniques, apply them without changing the behavior of the code? Does this work for any type of code?

I can only hope that you find this method interesting, and decide to try it out or ask questions about it.

An Example

For brevity, I have avoided a particular example in this article. If enough people are interested, I will create a few examples to showcase the technique. Until then, an example in C++ is detailed in my latest book “Hands-on Functional Programming with C++”.