Reading source code is more difficult than writing it. Inherently. Anyone who has worked with a legacy system and needed to understand its authors’ intent knows what I mean. Browsing through thousands of lines of function definitions and variables, trying to figure out what the point is, is a daunting task.
But why, actually?
Is reading a recipe difficult?
Is it more difficult than writing it?
Is reading a hundred-year-old recipe more difficult than reading a thirty-day-old one?
I don’t think so.
Why, then, does reading source code get more difficult as the code matures? What is the reason for software decay and the common hatred of legacy systems?
Let us think of the action of reading source code. What do we do when we try to understand the flow of the computer program?
We are a bit like computers, but less effective. We go through the code and read the variables and functions. Because we are not as fast as machines, we can’t remember the value of each variable, and we are unable to memorize the body of each function or subroutine. Our “understanding” is based on approximation. We read the name of a variable and try to guess its usage, its meaning. We read the name of a function and, without reading its body, try to figure out what the function does.
It’s exactly the same as in the real world. When we read about “a carrot” in a recipe, we imagine a carrot. We can understand the concept of a carrot without receiving a lengthy list of details about it, such as its genetic code, mass, temperature, color and so on.
But source code, despite all the efforts of the object-oriented paradigm, is not like the real world.
In the real world we have a huge but limited number of words in our vocabularies. Natural language is processed by human brains in a completely different manner than source code is processed by a compiler. For example, “a chair” to a human being is not a particular chair but the idea of a chair. In most cases, we don’t need detailed definitions to process natural language. On the contrary, it is only in rare cases, such as the law, that definitions cause us big problems.
A very good example of what would happen to natural language if we processed it as literally as computers process source code is this video:
Let’s go back to the process of reading source code by software engineers. We take the name (of a function or a variable) and guess. As long as the code is “clean” and young enough that its names still match our instinctive understanding of them, our guess is more or less correct. In this situation, the process of reading is smooth and painless.
The issues pop up when we can’t guess properly.
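To make a wrong guess concrete, here is a small sketch. The function `remove_inactive` and the sample data are made up for illustration. Reading only the name, I would guess it returns a filtered copy of the list. The body does something else.

```python
def remove_inactive(users):
    """Hypothetical helper. The name suggests it returns a filtered copy."""
    # What the body actually does: it edits the caller's list in place
    # and returns how many entries were dropped.
    before = len(users)
    users[:] = [u for u in users if u["active"]]
    return before - len(users)


# Made-up sample data, only for illustration.
users = [
    {"nick": "ann", "active": True},
    {"nick": "bob", "active": False},
]

result = remove_inactive(users)

# A reader who trusted the name would expect `result` to be a list of users
# and `users` to be untouched. Instead `result` is a count and `users` shrank.
print(result, users)
```

Nothing here is a bug from the computer's point of view. The gap is only between the name and what we imagined behind it.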
But the real issue is far greater. It’s one of the fundamental problems of software development. It is partly captured by a quote from Phil Karlton:
There are only two hard things in Computer Science: cache invalidation and naming things
Phil Karlton
I mean naming things.
Naming things is the point of incompatibility between the world of humans and the world of computers. In the real world, language is processed by human minds, which are able to work with ideas and classes of objects, so we rarely invent new words. In the realm of a computer program, we do it constantly.
What happens when we invent a new word in a natural language? We learn it. Along comes the new word “computer”, and everyone learns that it’s a kind of computing machine. It doesn’t matter if it’s a MacBook Pro, a new Dell XPS, ENIAC or a PC. We don’t have a new word describing computers slightly differently in every company, or even in every department of a company.
But in the software world, we do. “User” means something different in every single program written so far. In one, it’s just a name and surname. In another, it’s also the date of birth. In yet another, it’s only sex and nickname.
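A rough sketch of that situation, in Python. The three "programs" (`billing`, `registry`, `forum`) are hypothetical and are modelled here as plain classes used as namespaces, so that each can hold its own `User`. The field names are my assumptions, taken from the three cases described above.

```python
from dataclasses import dataclass, fields
from datetime import date


class billing:  # stands in for one hypothetical program
    @dataclass
    class User:
        name: str
        surname: str


class registry:  # a second hypothetical program
    @dataclass
    class User:
        name: str
        surname: str
        date_of_birth: date


class forum:  # a third hypothetical program
    @dataclass
    class User:
        sex: str
        nickname: str


def field_names(cls):
    """Return the set of field names declared on a dataclass."""
    return {f.name for f in fields(cls)}


# Same word, "User", in all three places, but a different definition each time.
for program in (billing, registry, forum):
    print(program.__name__, program.User.__name__, sorted(field_names(program.User)))
```

The word is identical in all three programs, and the only way to learn what it means in any of them is to open the definition.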
We simply can’t name anything properly in a computer program. No construct in the code quite matches the real-world meaning of the word used to name it.
We can be too vague, say naming the user “user”, or too precise: “userWithNameAndSurnameAndSexAndDateOfBirth”. We will almost never be able to fully express the object by its name. That’s why reading source code is so difficult. The words, the names of variables and functions, never mean what we believe they mean. We always need to go to the definition and check. Every time we check, we learn a little more of the new language of a particular software project. Learning thousands of new words is difficult. Therefore reading source code is difficult…