The naming problem in programming

Reading source code is more difficult than writing it. Inherently. Everybody who has worked with a legacy system and needed to understand the authors’ intent knows what I mean. Browsing through thousands of lines of function definitions and variables, trying to figure out the point, is a daunting task.

But why, actually?

Is reading a recipe difficult?

Is it more difficult than writing it?

Is reading a hundred-year-old recipe more difficult than understanding a thirty-day-old one?

I don’t think so.

Why, then, does reading source code get more difficult as the code matures? What can be the reason for software decay and the common hatred towards legacy systems?

Let us think about the act of reading source code. What do we do when we try to understand the flow of a computer program?

We are a bit like computers, but less efficient. We go through the code and read the variables and functions. Because we are not as fast as machines, we can’t remember the value of each variable, and we are unable to memorize the body of each function or subroutine. Our “understanding” is based on approximation. We read the name of a variable and try to guess its usage, its meaning. We read the name of a function and, without reading its body, try to figure out what the function can do.

It’s exactly the same as in the real world. When we read about “a carrot” in a recipe, we imagine a carrot. We can understand the concept of a carrot without receiving a lengthy list of details about it, such as its genetic code, mass, temperature, color and so on.

But source code – even with the entire effort of the object-oriented paradigm – is not like the real world.

In the real world we have a huge but limited number of words in our vocabularies. Natural language is processed by human brains in a completely different manner than source code is processed by the compiler. For example, “a chair” to a human being is not a particular chair but the idea of a chair. In most cases, we don’t need detailed definitions to process natural language. On the contrary – in the rare cases where we do, such as the law, we have big problems with definitions.

A very good example of what would happen to natural language if we processed it as literally as computers process source code is this video:

Let’s go back to the process of reading source code by software engineers. We take the name (of a function or variable) and guess. As long as the code is “clean” and young enough that the names still match our instinctive understanding of their definitions, our guess is roughly correct. In this situation, the process of reading is smooth and painless.

The issues pop up when we can’t guess properly.

But the real issue is far greater. It’s one of the fundamental problems of software development, partly captured by a quote from Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

Phil Karlton

I mean naming things.

Naming things is the point of incompatibility between the world of humans and the world of computers. In the real world, where language is processed by human minds – minds able to handle ideas and classes of objects – we rarely invent new words. In the realm of a computer program, we do it constantly.

What happens when we invent a new word in a natural language? We learn it. The new word “computer” arrives, and everyone learns that it denotes a kind of computing machine. It doesn’t matter whether it’s a MacBook Pro, a new Dell XPS, an ENIAC or a PC. We don’t have a separate word describing computers slightly differently in every company, or even in every department of a company.

But in the software world, we do. “User” means something different in every single program written so far. In one it’s just the user’s name and surname. In another, it also includes the date of birth. In yet another, it’s only sex and nickname.
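A minimal sketch of that divergence, using three hypothetical TypeScript types (none of them taken from any real project):

```typescript
// Three hypothetical programs, three incompatible meanings of "User".

// Program A: a user is just a name and a surname.
interface UserA {
  name: string;
  surname: string;
}

// Program B: a user additionally carries a date of birth.
interface UserB {
  name: string;
  surname: string;
  dateOfBirth: Date;
}

// Program C: a user is only a sex and a nickname.
interface UserC {
  sex: "female" | "male";
  nickname: string;
}
```

The word is identical; the concept behind it is not, so a reader can never trust the name alone.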

We simply can’t name anything properly in a computer program. No construct in code fully matches the real-world meaning of the word that names it.

We can be too vague – say, naming the user “user” – or too precise – “userWithNameAndSurnameAndSexAndDateOfBirth”. We will almost never be able to fully express an object by its name. That’s why reading source code is so difficult. The words – the names of variables and functions – never mean what we believe they mean. We always need to go to the definition and check. Every time we check, we learn the new language of a particular software project. Learning thousands of new words is difficult. Therefore reading source code is difficult…
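The trade-off fits in a few lines (a made-up sketch; loadUser and its record shape are invented purely for illustration):

```typescript
// A hypothetical record, only so that there is something to name.
function loadUser() {
  return {
    name: "Ada",
    surname: "Lovelace",
    sex: "female",
    dateOfBirth: new Date(1815, 11, 10),
  };
}

// Too vague: the name reveals nothing about the shape of the value.
const user = loadUser();

// Too precise: the name tries to be the whole definition and becomes unusable.
const userWithNameAndSurnameAndSexAndDateOfBirth = loadUser();
```

Neither name spares the reader a trip to the definition.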

DRY is dead

The DRY principle, together with YAGNI, SOLID and KISS, is one of the most popular acronyms that have shaped our way of thinking about developing software. It is simple, intuitive and easy to learn, even during the early stages of education. However, the principle was born in completely different circumstances from the ones we are dealing with today.

Simple idea

I’m not a historian of software development and I’m not sure exactly how the DRY principle was born (the term itself was popularized by Hunt and Thomas in The Pragmatic Programmer), but I guess the underlying idea comes from the procedural programming age. It reeks of a procedural way of thinking anyway.

The idea is simple. We have some code. The code should be organized. Whenever some part of the code repeats here and there, we should create a procedure – extract that block of code, give it a name and reuse it.
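In code, the original move looks roughly like this (a made-up logging example):

```typescript
// Before: the same timestamp-formatting block repeats here and there.
function logOrderCreated(id: string): void {
  console.log(`[${new Date().toISOString()}] order ${id}: created`);
}

function logOrderShipped(id: string): void {
  console.log(`[${new Date().toISOString()}] order ${id}: shipped`);
}

// After: the repeated block is extracted, named and reused.
function logOrderEvent(id: string, event: string): void {
  console.log(`[${new Date().toISOString()}] order ${id}: ${event}`);
}
```

For straight-line procedural code this works beautifully – which is exactly why the principle became so popular.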

Time flows

Since the days of procedural programming, things have changed. First of all, the object-oriented paradigm explosion happened. The complexity of software kept growing. The systems for accounting, for summing long rows of numbers, for generating reports had already been created. The new frontier was internet browsers, instant messaging apps, trading systems for companies and Snake for the Nokia 3310. Except for the last one – that was quite a challenge.

The DRY principle doesn’t fit OOP as well as it fit the procedural paradigm. Actually, if you think about it, it doesn’t fit at all.

Let’s think for a while – what happens when we try to avoid repetition in object-oriented code? The first thing that comes to mind is probably inheritance – that beautiful, useless idea. The dog has a name, the cat has a name, so let’s create a class Animal with a property Name. But wait a second – wild animals don’t have names. Let’s create WildAnimal and DomesticAnimal. Damn! – almost nobody gives names to fish…
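A sketch of that dead end (all classes hypothetical):

```typescript
// First attempt: dogs and cats have names, so all animals get one.
class Animal {
  constructor(public name: string) {}
}

class Dog extends Animal {}
class Cat extends Animal {}

// But wild animals don't have names, so the hierarchy splits...
class WildAnimal {}
class DomesticAnimal {
  constructor(public name: string) {}
}

// ...and then fish arrive: domestic, yet almost never named.
class Fish extends DomesticAnimal {
  constructor() {
    super(""); // a nameless "named" animal - the abstraction has already failed
  }
}
```

Every new case forces another split, and each split exists only to avoid repeating a single property.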

The second popular solution to the repetition problem is utils or commons.

There’s a secret rule in the software industry – every sufficiently complex project has a utils directory or class. Some of them run to 8k or 16k lines of code.
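The shape is always the same – a grab-bag of helpers whose only common trait is that nobody knew where else to put them (a condensed, made-up example):

```typescript
// Utils.ts - three unrelated domains in one file, and counting.
export function formatDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

export function capitalize(s: string): string {
  return s.charAt(0).toUpperCase() + s.slice(1);
}

export function clamp(n: number, min: number, max: number): number {
  return Math.min(Math.max(n, min), max);
}
// ...and 8k lines later, it is still growing.
```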

It’s avoidable – it’s possible to properly design object-oriented software without these cancer cells of utils and disastrous inheritance. Keep in mind, though, that both of them are the result of applying the DRY principle. We tried, in the easiest, cheapest way, not to repeat ourselves.

Microservices – a nail in the coffin

Once upon a time I asked a colleague who worked at Amazon – the company which is a role model and pioneer of microservice architecture – how they organize the common parts of a project and how they manage reusability. He answered:

We don’t. We do repeat. It’s cheaper and quicker at that scale of the project. 

The enormous size of the systems we are developing nowadays entails a new approach and different rules. The most visible tendency recently is breaking down problems and systems into smaller ones. Actually, it has been one of the main techniques since the beginning of software development, but recently it has become more important than ever before. We can spot this trend in front-end frameworks (Angular, React – componentization) as well as in the back-end (microservices architecture).

To some extent, we can think of it as a proper way of doing object orientation – more proper than inheritance. Organisms are similar but not the same. Strictly speaking, they do not share features. The human eye is not the same as a dog’s or a hawk’s eye; they are the same only seemingly, on the level of naming. The implementation details differ greatly. I’m not a geneticist, but I bet that if we cut out of human DNA the parts we don’t share with monkeys, the result would not be a monkey. I guess there are many subtle differences – small parts of the genetic code, a few “lines” – which make a difference even if most of the code is the same.
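In code, that means letting similarly named components own their similar-looking details instead of forcing them under one shared parent (hypothetical classes again):

```typescript
// Two "eyes": the same name, deliberately no common base class.
class HumanEye {
  see(): string {
    return "three kinds of colour receptors, sharp central vision";
  }
}

class HawkEye {
  see(): string {
    return "four kinds of colour receptors, extreme long-range acuity";
  }
}
```

The duplicated structure is the price of letting each component evolve independently – exactly what componentization and microservices do at a larger scale.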

What to do?

It seems that the DRY principle has become harmful. Should we stop using it? Maybe. We should certainly use it more carefully. In many scenarios, it may bring more harm than good. In some cases repeated code is a signature of failure; in others, it may be the best possible solution.

Is it bad when we repeat identical code twice? If we repeat within the same class – I guess it is; within the same module – maybe; if it’s repeated in different modules of a project that consists of 100k LoC – maybe not.

Is partial code repetition bad? Well, maybe it’s not bad by default – it depends. It depends on whether a proper abstraction can be created to avoid it. Quite often we apply principles very strictly. Don’t. Don’t follow these rules blindly, because they’re merely suggestions.