Posts Tagged ‘clean code’

Tests are dangerous

Saturday, October 27, 2012 @ 05:10 PM Author: athos

Do you write tests because you are expected or told to do so, or do you write them to help all those (including yourself) who will sooner or later work with your code? Tests are code as well, so they should be written an maintained with the same amount of care as the production code. Here I share reproductions of some badly written tests I happened to come across here and there. I’m not going to show the original ones since the aim of this post is not to blame the authors of them but to show how easy it is for tests to become harmful when insufficient amount of thought are put into them.

TDD vs. PHP gotcha’s: TDD wins!

How many problems do you see in the following test?

<?php

class ThreeLetterWordCollectorTest extends PHPUnit_Framework_TestCase
{
    public function testThreeLetterWordCollector()
    {
        $cut = new ThreeLetterWordCollector();
        $actual = $cut->collectThreeLetterWords("bar hello foo world");
        $expected = array("foo", "bar");
        $this->assertSame(sort($expected), sort($actual));
    }
}

It’s just a minor issue that the name testThreeLetterWordCollector does not provide any new information about what behavior it is testing. It’s just the words of the tester class in different order. There are probably better names for a variable than $cut as well. (Don’t make me think! It’s an abbreviation of the term “Class Under Test”.) But the real problem here is the dangerous one: this test does NOTHING! The intent here is likely to be to compare the arrays ignoring the order, but PHP’s sort() function does not work the way one would assume knowing most other languages. Actually, it is one of PHP’s worst library functions. Have you read PHP: a fractal of bad design? It’s a must-read. sort() does the work in-place and returns a boolean depending on the success of sorting. (And to make things worse, it compares elements in a crazy undeterministic type-juggling way by default. I’ve written about it a couple of months ago.) So in the end, given that ThreeLetterWordCollector::collectThreeLetterWords() returns an array, the assertion will always pass, regardless of the contents of the array.

If that test and the production code it validates had been written with TDD, then they would have been attempted to be seen failing and once the test passed when it should have failed, then the problem would have been noticed immediately. Don’t trust tests you have never seen fail!

Comparing strings the Unix-y way

Another test written in PHP, but this one actually could have been written equally badly in any popular language:

public function test1()
{
    $expected = __DIR__ . "/fixtures/expected.conf";
    $template = __DIR__ . "/fixtures/template.tpl";
    $actual = __DIR__ . "/fixtures/actual.conf";

    $generator = new ConfigFileGenerator();
    $generator->generateFromTemplate($template, $actual);

    $result = 1;
    system("cmp $expected $actual 2>/dev/null", $result);

    $this->assertTrue($result == 0);
}

Let’s ignore the fact that the naming of this test is even worse than in the first example. But the way it compares the expected and the actual output is gotta be a joke. A surprisingly bad one. I didn’t even know that this cmp command existed! Oh how I miss you, good old Assembly! Now seriously, what does this test print when it fails?

There was 1 failure:

1) ConfigFileGeneratorTest::test1
Failed asserting that false is true.

Thank you, dear test, now I know what requirement is broken and what is the exact problem with it. If this test was written with keeping not just the happy case in mind but the possibility of failing as well, then it would at least read those two files and compare their contents as strings using the assertions provided by the test framework, producing usable output when going red. Maybe all that would be done by a method named assertFileContents(). Or bringing it to the next level: taking SRP into consideration, the config file generator logic could be decoupled from reading and writing files on the disk, so the latter could be easily mocked in the test harness. (It would run faster as well.) And no fixtures in various files would be required, the test could be fully self-contained!

Obfuscating tests using mutable test data

You can’t say I’m picking on tests written in PHP, here’s a Python one:

class FooFinderTest(object):
    def test_foo_finder(self):
        expected_foo_ids = [ 1, 2, 3, 4 ]
        foo_finder = FooFinder()
        self.assert_foo_ids(expected_foo_ids, foo_finder.find(bar=True))
        self.assert_foo_ids(expected_foo_ids, foo_finder.find(bar=False))

    def assert_foo_ids(self, expected_foo_ids, actual_foo_ids):
        if len(expected_foo_ids) != len(actual_foo_ids):
            print "Unexpected number of foo ids: %s" % str(actual_foo_ids)
            sys.exit(1)
        while len(expected_foo_ids) > 0:
            if expected_foo_ids[0] not in actual_foo_ids:
                print "Expected foo id %d to be found" % expected_foo_ids[0]
                sys.exit(1)
            del expected_foo_ids[0]

The question is: what do we expect in the bar=False case? Those knowing Python well can tell the correct answer: this test expects an empty list when bar is False. Those knowing Python well can also tell that whoever writes a test like this, does not know Python, since it has a built-in test framework so no in-house invented wheels are necessary. E.g. the unittest module and its assertEqual() method could have been used instead of this make-shift assertion that removes all the elements from the list which is passed by reference so it becomes empty before assert_foo_ids() is called for the second time.

Tests are first class citizens too!

In theory, tests encourage and help you to refactor. But a test like the next one does the opposite of that. (You know the Game of Life kata, don’t you?)

class GameOfLifeTest(unittest.TestCase):
    def test_any_live_cell_with_less_than_two_neighbors_dies(self):
        world = World()
        world.put_cell(0, 0)
        world.put_cell(1, 0)
        game = GameOfLife(world)
        game.evolve()
        self.assertFalse(game.get_world().is_alive(0, 0))
        self.assertFalse(game.get_world().is_alive(1, 0))

    def test_any_live_cell_with_two_or_three_neighbors_survives(self):
        world = World()
        world.put_cell(1, 1)
        world.put_cell(2, 1)
        world.put_cell(0, 2)
        world.put_cell(1, 2)
        game = GameOfLife(world)
        game.evolve()
        self.assertTrue(game.get_world().is_alive(0, 2))
        self.assertTrue(game.get_world().is_alive(1, 2))

    def test_any_live_cell_with_more_than_three_neighbors_dies(self):
        world = World()
        world.put_cell(0, 1)
        world.put_cell(1, 1)
        world.put_cell(2, 1)
        world.put_cell(0, 2)
        world.put_cell(1, 2)
        world.put_cell(2, 2)
        game = GameOfLife(world)
        game.evolve()
        self.assertFalse(game.get_world().is_alive(1, 1))
        self.assertFalse(game.get_world().is_alive(1, 2))

    def test_any_dead_cell_with_three_neighbors_becomes_alive(self):
        world = World()
        world.put_cell(1, 1)
        world.put_cell(2, 1)
        world.put_cell(0, 2)
        game = GameOfLife(world)
        game.evolve()
        self.assertTrue(game.get_world().is_alive(1, 2))

What would happen if we needed a Cell object instead of pairs of coordinates? How can a new coordinate system be introduced in this code? How many places do we need to touch the test code just to change the signature of GameOfLife‘s constructor? Imagine any feature request for the game along with the necessary refactorings needed to add the feature cleanly and you will see that the test above makes most of them harder to implement since it needs a lot of changes in several places for basically any change that might occur in the interface of the production code. This test is too coupled to the interface of the production code and its collaborators, in other words, it defines classes and methods multiple times but the behavior it aims to validate is lost inside the noise.

Conclusion

Be careful when writing your tests. Having tests is certainly better than not having them, but to make them effective and helpful, one should never forget the goals the tests are written to reach. Tests are code, so standards, metrics, design principles and readability should be applied on them as well.

Game of life – never getting boring

Thursday, May 31, 2012 @ 09:05 PM Author: athos

Do you think it’s funny to implement Conway’s Game of Life for the 20th time? According to my experiences, the answer is hell yeah!

Since the Global Day of CodeRetreat event (organized by Marton Meszaros in Hungary), I’m a big fan of the CodeRetreat format of coding dojos. I’ve been attending all the events of the CodeRetreat Budapest Community, and at the last one, I even helped coaching during some of the sessions. Nucc and me decided to introduce this event to our fellow developers at BalaBit as well and fortunately the management was kind enough to support us by dedicating some working hours to learn about TDD and other modern software development methodologies.

At a CodeRetreat event, you always implement the four rules of Game of Life (without UI and such fancy stuff). Doing the same task again and again makes the problem well-known and understood so you can focus on the style and design of your tests and code, which is the goal of a CodeRetreat. You don’t have to finish the implementation, though, it’s not about Getting Things Done or competing which pair is the first to finish. More like Not Getting Things Done. Often it’s impossible as well to finish for you have only 45 minutes to work. After that, well, code is deleted, a short retrospective meeting is held then a new session begins. Pair programming is mandatory, and by changing the pairs after every session you’re guaranteed to learn a lot from others. To make sure that you won’t be able to train yourself during the day to finish new constraints are introduced in every session, like infinite grid, immutability of states, 3 line methods with at most 2 parameters, no-mouse (did you notice you go faster when you use only keyboard shortcuts in your IDE?), etc..

During the first few sessions, there are no constraints. The goal is getting the problem understood. In the implementation, you just do whatever you want. That seems to be the hardest constraint though, since people tend to not finish during the first sessions. Instead they do Big Design Up Front and forget to care about the software to be implemented. These are the sessions where nice UML diagrams about Cells and Neighborhood relations and Worlds and Games and Rules and such nicely designed stuff are drawn, then people realize after 45 minutes that they spent almost an hour working on getters and setters but not a single rule of the game got implemented. Yep, there’s a reason for the Four Rules of Simple Design having been invented. ;-)

Turns out if you’d started the very first testcase with the first rule (“Any live cell with fewer than two live neighbours dies, as if caused by under-population”), you might have had a working implementation of all four of them after 15-20 minutes or so. And nothing more than that since you wouldn’t have written code that is never used except for making those UMLs feel more complete. YAGNI!

Of course there’s a chance that such a quick and dirty implementation will be, well, dirty. It may contain at least one thing you consider ugly. But you have more than half of the session left and the safety net of a bunch of tests to refactor the hell out of it!

You spoiled the public interface of your Game with methods that should belong to a class named Board or World or Map instead? Are you feeling the need to test a method that is declared private? After extracting such methods into a separate class, you might end up with a Game class that has no state at all and contains only one public method that just creates new generations of cells based on a previous one.

On the other hand, after realizing how over-engineering slows them down, many people try to continue with sloppy design, skipping the refactorings for example to get rid of language primitives in the most abstract levels of the business logic or to conform to the SOLID principles, etc. They can be forced to do so by introducing requirement change surprises. For example, to handle hexagon maps along with square grids, a Topology or similar interface has to be extracted from the class representing the board, which can encapsulate the details of coordinate systems and neighborhood relations between coordinates. (Did you notice that the rules of the game do not depend on the placement and shape of the neighbors nor the neighbors themselves, but only their count is what matters?) To handle colored cells which know how old they are, cells have to be represented by a separate class instead of bare coordinate-tuples or boolean flags.

Such refactorings can be a pain in the ass when tests contain copy&paste, duplication, redundancy and duplication, so they must be cleaned up as well: by extracting common fixtures, creating abstract assertions based on the repeating patterns of similar assertions, etc. (Abstract assertions even help you to conform to such rigorous principles like “1 assertion per test.”)

Imagine your day to day software development work boiling down to such simple business requirements implemented by a couple of short but well named methods in a few simple and minimalistic classes, all of them thoroughly unit tested (by unit testing I mean unit testing), all of those tests proven to be useful (since you’ve seen them fail and turning into green a couple of seconds later) and providing 100% coverage of assertions. Wouldn’t it be easier than digging yourself through stack traces in mindblowingly complex codes full of nested infinite-but-conditionally-breaking-somewhere-in-the-middle-do-while loops depending on side effects of getter functions and badly named global variables coming from a dozen layers away? If you know this feeling I’m talking about, I’d recommend you to visit the next CodeRetreat near your location (look around in Twitter or Facebook) – to try something new!

GitHub breach: Rails mass assignments vs. Architecture

Tuesday, March 6, 2012 @ 02:03 PM Author: athos

You may have heard about what happened to GitHub the other day. (BalaBit source codes like Zorp-GPL or syslog-ng codes mirrored at GitHub were not affected.)

In short: any user could upload her public key to the account of any other user by exploiting that GitHub uses unprotected mass assignments to instantiate objects responsible for storing public key data.

But how the hell can something like that happen in a well-structured, clean architecture? Here’s a code snippet using mass assignment:

class User < ActiveRecord::Base
    belongs_to :group
    # ...
end

user = User.new(params[:user])

A User object is created and its properties are set immediately to hold arbitrary values, possibly controlled by an attacker.

Now let me show you some dirty PHP code (hey kids, this is dangerous, don’t try this at home and never use such a code in production):

// $_POST = array("name" => "John", "group_id" => 5);
$columns = implode("`, `", array_keys($_POST));
$values = implode("', '", array_values($_POST));
mysql_query(
    "INSERT INTO `users` (`$columns`)"
    . " VALUES ('$values');"
);

You’d never do that, right? Not just because of its vulnerability to SQLi, but because this code mixes several responsibilities, several layers of a web application: input processing and data persistency for instance. It’s not just the code that’s crappy in the example above, it’s the behavior of the code! Blindly writing anything into the database coming from the user is surely a bad idea. No matter what a beautifully chaotic dance of how many classes and objects and layers and Routers and Controllers and Models and Whistles and Bells do the same thing as the PHP code above, it still remains a mess. From a behavioral point of view, there’s not much difference between the above two code snippets.

No, blacklisting/whitelisting model attributes is not the OO way of solving this issue, especially if you think of different possible use cases. A truly object oriented, clean architecture would instead separate business logic itself (ie. implementation of use cases), collecting information required for the business logic (and nothing more) from the input, passing that information to the business logic and sending the output of the business logic into some data persistency layer, etc.

Web is just a UI, a delivery mechanism, an ugly detail! says Uncle Bob Martin in this talk. When business logic is truly separated from HTTP request processing and other web (thus, UI) related tasks the application does (all the layers with their own unit tests), such accidents like setting a field of a business logic data structure that should not contain user supplied value, would not happen.

Separate your business logic from both the UI (ie. web) and from data persistency, so you can write unit tests to verify its operation with security in mind, independently from the framework and third parties your application relies on.

IOCCC vs Clean Code

Sunday, November 27, 2011 @ 10:11 PM Author: athos

I found this piece of code at the website of the International Obfuscated C Code Contest. (Did you know the contest is open for this year?) Since I read tons of books and papers about clean code nowadays, I couldn’t resist refactoring it, just to see how much a heavy code cleanup can improve a source code that was intentionally written to be obscrure as possible, and of course, to abuse both IOCCC and Clean Code as hard as can be. :-)

The original code

main(n,i,a,m){while(i=++n)
for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;);}

To compile it using GCC, the following command can be used:

$ gcc -ansi -o makarios makarios.c

The only thing it does is slowly printing a bunch of numbers to stdout in an infinite loop. Note that this is pretty much all the information that can be extracted from the code at first blink, either by running it or by reading it.

Refactoring without tests is suicide

Before refactoring it, I needed to make it testable. Since this is a very simple program, I decided to test it end-to-end, by saving a reference output to a textfile (say, for example, the first 150 numbers it prints) and by breaking the infinite loop (of course it would terminate once n overflows, but that’s way too slow):

main(n,i,a,m){while(i=++n)
#define STOP_AT 522233

main(n,i,a,m){while((i=++n) <= STOP_AT)
for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;);}

Note that it prints numbers in octal base, and 522233 is the 150th number in the reference output in decimal base. It could’ve been written in octal, but in the end I’d like to get rid of this hard-coded constant, so it will do the job well for now. Let’s see if it still generates the expected output:

$ gcc -ansi -Wall -o ./makarios makarios.c \
      && ./makarios 2>&1 \
      | diff expected_output.txt - \
      && echo OK

After each and every change, I ran the test and created a patch only if the test passed. (Find the individual patches at GitHub.)

Recovering structure

So do you have any idea how this code works? Me neither. But there are a couple of things to note:

  • Some indentation might come handy in the future.
  • Local variables are defined as arguments to main(), OMG!
  • The output depends on n (which will always be the number of command line arguments to the program), but all the other variables are overwritten sooner or later, before used. (a and i are trivial to see, and since i is always positive and a is 0, a<i will always be true in the beginning of the inner loop, so the positive branch of the ternary is executed first, which means m gets overwritten as well).
  • Easy to see by experiment, this program does not print numbers smaller than the number of command line arguments (note that the output is in octal base), so if one would want to see only numbers greater than eg. 64 (100 in octal base), the program should be invoked like this:
    ./makarios `seq 1 64`

    This is not very useful, so I decided to drop this feature. Variable n will start from 1, like when the program is invoked without arguments (making the first parameter of main() be 1).

  • There are a couple of compiler warnings, these are better cleaned up so we can see if we mess up something.

After a couple of minutes, my code looked like this:

#include <stdio.h>

#define STOP_AT 522233

main(n,i,a,m){while((i=++n) <= STOP_AT)
for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;);}
int main(int argc, char* argv[])
{
    int n = 1,
        i, a, m;

    while((i=++n) <= STOP_AT)
    {
        for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;)
        {
        }
    }

    return 0;
}

Next I converted the for loop into a while loop and moved a and m inside the outer loop since they are not used outside of it:

#include <stdio.h>

#define STOP_AT 522233

int main(int argc, char* argv[])
{
    int n = 1,
        i, a, m;
        i;

    while((i=++n) <= STOP_AT)
    {
        for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;)
        int a = 0,
            m;

        while (a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m)
        {
        }
    }

    return 0;
}

Now if that ugly expression in the condition of the loop would be inside the block, it could be easily broken up:


int a = 0,
    m;
    m, is_not_ready;

while (a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m)
}
do
{
    is_not_ready = a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;
} while (is_not_ready);

First, replace the ternary with an if statement:

do
{
    is_not_ready = a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;
    if (a < i)
    {
        is_not_ready = a=a*8+i%8,i/=8,m=a==i|a/8==i,1;
    }
    else
    {
        is_not_ready = (n-++m||printf("%o\n",n))&&n%m;
    }
} while (is_not_ready);

Notice that I did not even try to understand the hows and whys. The only thing I did so far was to analyze the code, breaking up expressions, replacing them to more readable alternatives, but I have no ghost of an idea what this code does and why it does it so. Let’s continue the work by breaking up the assignment in the true branch of the if statement (from now on, a good cheat sheet of operator precedence will be a life saver):

if (a < i)
{
    is_not_ready = a=a*8+i%8,i/=8,m=a==i|a/8==i,1;
    a = a*8 + i%8;
    i /= 8;
    m = (a == i) | (a/8 == i);
    is_not_ready = 1;
}
else
{
    is_not_ready = (n-++m||printf("%o\n",n))&&n%m;
}

Notice the value assigned to m? Those are two booleans with a binary OR in the middle. That’s ugly, using a logical operator instead will change nothing:

m = (a == i) | (a/8 == i);
m = (a == i) || (a/8 == i);

Now breaking up the assignments inside the else block: printf() will always return non-zero making the expression inside parenthesis true, so is_not_ready depends only from n%m. Since it’s a boolean expression, I added a comparison to zero:

is_not_ready = (n-++m||printf("%o\n",n))&&n%m;
(n-++m||printf("%o\n",n));
is_not_ready = 0 != n%m;

Due to lazy evaluation, printf() is only executed when n-++m equals to zero. To make that expressed in code, an if statement can be used instead of the ignored boolean expression:

(n-++m||printf("%o\n",n));
if (0 == n-++m)
    printf("%o\n", n);

is_not_ready = 0 != n%m;

To make the condition more readable:

if (0 == n-++m)
if (++m == n)
    printf("%o\n", n);

So far, so good:


#include <stdio.h>

#define STOP_AT 522233

int main(int argc, char* argv[])
{
    int n = 1,
        i;

    while((i=++n) <= STOP_AT)
    {
        int a = 0,
            m, is_not_ready;

        do
        {
            if (a < i)
            {
                a = a*8 + i%8;
                i /= 8;
                m = (a == i) || (a/8 == i);
                is_not_ready = 1;
            }
            else
            {
                if (++m == n)
                    printf("%o\n", n);

                is_not_ready = 0 != n%m;
            }
        } while (is_not_ready);
    }

    return 0;
}

This is kinda readable, isn’t it? No! This is not much better than it was in its 105 bytes size.

Let’s do some Command-query separation here by extracting state-changing operations from conditionals.

if (++m == n)
++m;
if (m == n)
    printf("%o\n", n);

Pretty straightforward. Now ++n and the assignment to i require slightly more work:

int main(int argc, char* argv[])
{
    int n = 1,
        i;
    int n = 2;

    while((i=++n) <= STOP_AT)
    while(n <= STOP_AT)
    {
        int a = 0,
            i = n,
            m, is_not_ready;

        do
        {
            if (a < i)
            {
                a = a*8 + i%8;
                i /= 8;
                m = (a == i) || (a/8 == i);
                is_not_ready = 1;
            }
            else
            {
                ++m;
                if (m == n)
                    printf("%o\n", n);

                is_not_ready = 0 != n%m;
            }
        } while (is_not_ready);

        ++n;
    }

    return 0;
}

Actually the outer loop can be written shorter as a for loop:

int main(int argc, char* argv[])
{
    int n = 2;
    int n;

    while(n <= STOP_AT)
    for (n = 2; n <= STOP_AT; ++n)
    {
        int a = 0,
            i = n,
            m, is_not_ready;

        do
        {
            if (a < i)
            {
                a = a*8 + i%8;
                i /= 8;
                m = (a == i) || (a/8 == i);
                is_not_ready = 1;
            }
            else
            {
                ++m;
                if (m == n)
                    printf("%o\n", n);

                is_not_ready = 0 != n%m;
            }
        } while (is_not_ready);
        ++n;
    }

    return 0;
}

Now we can get rid of is_not_ready, it was just a temporary helper to break up some magical expressions:


for (n = 2; n <= STOP_AT; ++n)
{
    int a = 0,
        i = n,
        m, is_not_ready;
        m;

    do
    {
        if (a < i)
        {
            a = a*8 + i%8;
            i /= 8;
            m = (a == i) || (a/8 == i);
            is_not_ready = 1;
        }
        else
        {
            ++m;
            if (m == n)
                printf("%o\n", n);

            is_not_ready = 0 != n%m;
            if (0 == n%m)
                break;
        }
    } while (is_not_ready);
    } while (1);
}

For an infinite loop, it does not make much difference if it’s a post-test or a pre-test:

do
while (1)
{
    if (a < i)
    {
        a = a*8 + i%8;
        i /= 8;
        m = (a == i) || (a/8 == i);
    }
    else
    {
        ++m;
        if (m == n)
            printf("%o\n", n);

        if (0 == n%m)
            break;
    }
} while (1);
}

If you look at carefully, that’s two inner loops for the price of one:

  • Variable a is starting from 0 and is compared to i which is always positive. This implies that in every iteration of the outer loop the positive branch of the if statement is executed first.
  • The negative branch of the if statement changes neither variables read by the positive branch, nor those used in the condition.
  • In other words, once the negative branch is executed during an iteration of the outer loop, the positive one is never executed again in that iteration.
  • These together imply that in every iteration of the outer loop, the positive branch of the if statement is executed some times, then the negative branch, which finally exits from the inner loop. I.e. during execution, practically there are two separate loops.
  • Separating them in code as well makes understanding the behavior of the program easier.
while (1)
{
    if (a < i)
    while (a < i)
    {
        a = a*8 + i%8;
        i /= 8;
        m = (a == i) || (a/8 == i);
    }
    else
    while (1)
    {
        ++m;
        if (m == n)
            printf("%o\n", n);

        if (0 == n%m)
            break;
    }
}

Now the last compiler warning can be easily eliminated once we notice that m is not used inside the first inner loop, but it’s always overwritten before evaluating the loop condition again, which means it can be extracted to be after the loop (knowing that for every possible values of i and a the loop iterates at least one, so m is never uninitialized before the second inner loop).

while (a < i)
{
    a = a*8 + i%8;
    i /= 8;
    m = (a == i) || (a/8 == i);
}
m = (a == i) || (a/8 == i);

Now let’s take care of the second inner loop! Reverting back to a post-test loop (getting rid of always true conditional):

while (1)
do
{
    ++m;
    if (m == n)
        printf("%o\n", n);

    if (0 == n%m)
        break;
}
} while (0 != n%m);

Notice that conditional printing can be extracted because the loop will either exit before the condition of printing could become true or increment m until it reaches n, but when the latter happens, the loop will break anyways.

do
{
    ++m;
    if (m == n)
        printf("%o\n", n);
} while (0 != n%m);

if (m == n)
    printf("%o\n", n);

Now m is used in two roles: sometimes it holds a boolean value, sometimes it’s used in arithmetical expressions. This violates Single Responsibility Principle, so let’s get rid of one of the roles. Notice that when m is 0, the do-while loop will iterate only once, and then no printing will happen since n is always greater than 1 and m will be equal to 1, and 1 is a divisor of any n. Which means we can skip the do-while loop when the magical expression after the first inner loop is false, and start m from 1 when it’s not:

m = (a == i) || (a/8 == i);
if (!((a == i) || (a/8 == i)))
    continue;
m = 1;

There’s one more tiny thing disturbing me: I don’t really like post-test loops when they’re used unnecessarily, so let’s get rid of this one:

m = 1;
do
m = 2;
while (0 != n%m)
{
    ++m;
} while (0 != n%m);
}

The code so far:

#include <stdio.h>

#define STOP_AT 522233

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        int a = 0,
            i = n,
            m;

        while (a < i)
        {
            a = a*8 + i%8;
            i /= 8;
        }

        if (!((a == i) || (a/8 == i)))
            continue;

        m = 2;
        while (0 != n%m)
        {
            ++m;
        }

        if (m == n)
            printf("%o\n", n);
    }

    return 0;
}

Cleaning the code

Now that’s clean, isn’t it? Of course not! For instance, nothing in it has a nice, explanatory name. It’s just a messy mix of one-letter variables, magical expressions and a spaghetty of loops.

Notice that m is only used in the second half of the outer loop starting from the assignment and ending at the conditional of the if statement. If those lines would be in a separate function, m could be a local variable and n would be a parameter. Similar can be done for the first part starting from the declaration of a and i and ending at the first if statement. Note that the two loops have no variables in common, but both can be parametrized by n. Let’s move those loops into functions in the hope they will be easier to understand without their disturbing contexts:

#include <stdio.h>

#define STOP_AT 522233

static int magic_algorithm(int number);
static int other_magic_algorithm(int number);
static void print_octal(int number);

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        int a = 0,
            i = n,
            m;
        while (a < i)
        {
            a = a*8 + i%8;
            i /= 8;
        }

        if (!((a == i) || (a/8 == i)))
        if (!magic_algorithm(n))
            continue;

        m = 2;
        while (0 != n%m)
        {
            ++m;
        }

        if (m == n)
            printf("%o\n", n);
        if (other_magic_algorithm(n))
            print_octal(n);
    }
    return 0;
}

static int
magic_algorithm(int number)
{
    int a = 0,
        i = number;
    while (a < i)
    {
        a = a*8 + i%8;
        i /= 8;
    }

    return (a == i) || (a/8 == i);
}

static int
other_magic_algorithm(int number)
{
    int m;

    m = 2;
    while (0 != number%m)
    {
        ++m;
    }

    return m == number;
}

static void
print_octal(int number)
{
    printf("%o\n", number);
}

That printf() with the format string are ugly implementation details that shoul be buried inside a small, well named function somewhere at the bottom.

The first loop is called magic_algorithm() and the second is other_magic_algorithm() from now. I’m sure they could be given more rational names, but coming up with better names would require understanding the code, which I don’t want to do without making it obvious at the same time, so let’s continue refactoring.

The easiest thing to do is simplifying main() by joining the two if statements:

for (n = 2; n <= STOP_AT; ++n)
{
    if (!magic_algorithm(n))
        continue;

    if (other_magic_algorithm(n))
    if (magic_algorithm(n) && other_magic_algorithm(n))
        print_octal(n);
}

Now main() is pretty obvious to understand: it prints numbers that are accepted by both of our magic algorithms in octal format. The question is, what and how do these magic algorithms do?

The second is simplier: by extracting the condition of the while loop into a well named function, it becomes almost obvious:

static int is_divisable(int dividend, int divisor);

static int
other_magic_algorithm(int number)
{
    int m;

    m = 2;
    while (0 != number%m)
    while (!is_divisable(number, m))
    {
        ++m;
    }

    return m == number;
}

static int
is_divisable(int dividend, int divisor)
{
    /* FIXME: handle division by zero */
    return 0 == dividend % divisor;
}

Note that there’s a reason this function is static: it’s not reusable, becuase it does not handle unexpected input, that’s why a FIXME comment is added. Now let’s focus on refactoring.

After extracting is_divisable(), the role of m becomes clear, so let’s find a good name for it:

static int
other_magic_algorithm(int number)
{
    int m;
    int smallest_divisor = 2;

    m = 2;
    while (!is_divisable(number, m))
    while (!is_divisable(number, smallest_divisor))
    {
        ++m;
        ++smallest_divisor;
    }

    return m == number;
    return smallest_divisor == number;
}

What do you call a number that equals to its smallest divisor greater than 1?

#include <stdio.h>

#define STOP_AT 522233

static int magic_algorithm(int number);
static int other_magic_algorithm(int number);
static int is_prime(int number);
static void print_octal(int number);

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        if (magic_algorithm(n) && other_magic_algorithm(n))
        if (magic_algorithm(n) && is_prime(n))
            print_octal(n);
    }

    return 0;
}

static int is_divisable(int dividend, int divisor);

static int
other_magic_algorithm(int number)
is_prime(int number)
{
    int smallest_divisor = 2;

    while (!is_divisable(number, smallest_divisor))
    {
        ++smallest_divisor;
    }

    return smallest_divisor == number;
}

Do you see Single Responsibility Principle being violated? I do:

static int is_divisable(int dividend, int divisor);
static int find_smallest_divisor_greater_than_one(int number);

static int
is_prime(int number)
{
    /* FIXME: handle 0 and 1 */
    return find_smallest_divisor_greater_than_one(number) == number;
}

static int is_divisable(int dividend, int divisor);

static int
find_smallest_divisor_greater_than_one(int number)
{
    /* FIXME: handle negative numbers */

    int smallest_divisor = 2;

    while (!is_divisable(number, smallest_divisor))
    {
        ++smallest_divisor;
    }

    return smallest_divisor == number;
    return smallest_divisor;
}

Finding the smallest divisor and deciding if a number is prime are at different abstraction levels. Besides, it was never mentioned in the code that 1 is not considered a divisor.

Don’t worry about function call overhead, the first thing compilers will do when encountering this code will be to make is_divisable() and find_smallest_divisor_greater_than_one() inline. Then why did we extract them? Because this code is being written for humans, not compilers!

Now to the other magic function.

What is 42 % 10? It’s 2 of course. What is X % 10? Yes, it’s the last digit of X in base 10. Now replace 10 with 8, and you’re doing the same in octal base. This may help a lot to understand what to extract from the other magic algorithm and how to name the functions extracted:

static int get_last_octal_digit(int number);
static int remove_last_octal_digit(int number);

static int
magic_algorithm(int number)
{
    int a = 0,
        i = number;
        i = number,
        digit;

    while (a < i)
    {
        a = a*8 + i%8;
        i /= 8;
        digit = get_last_octal_digit(i);
        a = a*8 + digit;
        i = remove_last_octal_digit(i);
    }

    return (a == i) || (a/8 == i);
    return (a == i) || (remove_last_octal_digit(a) == i);
}

static int
get_last_octal_digit(int number)
{
    return number % 8;
}

static int
remove_last_octal_digit(int number)
{
    return number / 8;
}

This starts to make sense, so let’s continue with this octal digit theory:


static int get_last_octal_digit(int number);
static int append_octal_digit(int number, int digit);
static int remove_last_octal_digit(int number);

static int
magic_algorithm(int number)
{
    int a = 0,
        i = number,
        digit;

    while (a < i)
    {
        digit = get_last_octal_digit(i);
        a = a*8 + digit;
        a = append_octal_digit(a, digit);
        i = remove_last_octal_digit(i);
    }

    return (a == i) || (remove_last_octal_digit(a) == i);
}

static int
get_last_octal_digit(int number)
{
    return number % 8;
}

static int
append_octal_digit(int number, int digit)
{
    /* FIXME: handle invalid digits */
    return number * 8 + digit;
}

static int
remove_last_octal_digit(int number)
{
    return number / 8;
}

(Sorry, that’s another FIXME added.)

Now the roles of those variables are getting clearer: in every iteration, a digit is removed from the end of i and is appended to the end of a. This will make a contain the digits of the given number in reversed order, though the algorithm is stopped when reaching the digit in the middle.

static int get_last_octal_digit(int number);
static int append_octal_digit(int number, int digit);
static int remove_last_octal_digit(int number);

static int
magic_algorithm(int number)
{
    int a = 0,
        i = number,
        digit;
    int reversed_digits = 0,
        remaining_digits = number;

    while (a < i)
    while (reversed_digits < remaining_digits)
    {
        digit = get_last_octal_digit(i);
        a = append_octal_digit(a, digit);
        i = remove_last_octal_digit(i);
        int digit = get_last_octal_digit(remaining_digits);
        reversed_digits = append_octal_digit(reversed_digits, digit);
        remaining_digits = remove_last_octal_digit(remaining_digits);
    }

    return (a == i) || (remove_last_octal_digit(a) == i);
    return (reversed_digits == remaining_digits)
        || (remove_last_octal_digit(reversed_digits) == remaining_digits);
}

When the digits of a number in a base remain the same when the order is reversed, then the number is called palindromic in that base. This algorithm is a test for the octal base. Note that the algorithm stops moving the digits when it reaches the one in the middle. That’s why the expression in the return statement is built from two parts: one is for an odd number of digits, the other is for the even case. This is enough to rename magic_algorithm() (oops, I noticed a new FIXME again):

#include <stdio.h>

#define STOP_AT 522233

static int magic_algorithm(int number);
static int is_palindromic_in_octal_base(int number);
static int is_prime(int number);
static void print_octal(int number);

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        if (magic_algorithm(n) && is_prime(n))
        if (is_palindromic_in_octal_base(n) && is_prime(n))
            print_octal(n);
    }

    return 0;
}

static int get_last_octal_digit(int number);
static int append_octal_digit(int number, int digit);
static int remove_last_octal_digit(int number);

static int
magic_algorithm(int number)
is_palindromic_in_octal_base(int number)
{
    /* FIXME: handle negative numbers */

    int reversed_digits = 0,
        remaining_digits = number;

    while (reversed_digits < remaining_digits)
    {
        int digit = get_last_octal_digit(remaining_digits);
        reversed_digits = append_octal_digit(reversed_digits, digit);
        remaining_digits = remove_last_octal_digit(remaining_digits);
    }

    return (reversed_digits == remaining_digits)
        || (remove_last_octal_digit(reversed_digits) == remaining_digits);
}

static int
get_last_octal_digit(int number)
{
    return number % 8;
}

static int
append_octal_digit(int number, int digit)
{
    /* FIXME: handle invalid digits */
    return number * 8 + digit;
}

static int
remove_last_octal_digit(int number)
{
    return number / 8;
}

Now the while loop could be extracted from is_palindromic_in_octal_base() (remember SRP?), but that would require:

  • output parameters: ugly
  • iterating through the whole number to calculate reversed order to the end: might be slow, notice that the original author might have done it this way to save half of the steps
  • structs and pointers: overkill
  • macro magic: can be confusing

In short: the code is not clean, but we cannot refactor it better. Let’s admit this failure by adding some explanatory comments to this function:

static int
is_palindromic_in_octal_base(int number)
{
    /* FIXME: handle negative numbers */

    int reversed_digits = 0,
        remaining_digits = number;

    /* NOTE: The loop will stop when the digit in the middle is reached.  */
    while (reversed_digits < remaining_digits)
    {
        int digit = get_last_octal_digit(remaining_digits);
        reversed_digits = append_octal_digit(reversed_digits, digit);
        remaining_digits = remove_last_octal_digit(remaining_digits);
    }

    /* NOTE: For an odd number of digits, reversed_digits contains
             an extra digit (coming from the middle). */
    return (reversed_digits == remaining_digits)
        || (remove_last_octal_digit(reversed_digits) == remaining_digits);
}

After these steps, I realized the need for a Makefile so I could extract reusable code to new compilation units (in addition, I created some quick and dirty unit tests which you can find at GitHub):

lib/primes.h:

#ifndef _PRIMES_H
#define _PRIMES_H

int is_prime(int number);

#endif

lib/primes.c:

#include "primes.h"

static int find_smallest_divisor_greater_than_one(int number);

int
is_prime(int number)
{
    /* FIXME: handle 0 and 1 */
    return find_smallest_divisor_greater_than_one(number) == number;
}

static int is_divisable(int dividend, int divisor);

static int
find_smallest_divisor_greater_than_one(int number)
{
    /* FIXME: handle negative numbers */

    int smallest_divisor = 2;

    while (!is_divisable(number, smallest_divisor))
    {
        ++smallest_divisor;
    }

    return smallest_divisor;
}

static int
is_divisable(int dividend, int divisor)
{
    /* FIXME: handle division by zero */
    return 0 == dividend % divisor;
}

lib/palindromes.h:

#ifndef _PALINDROMES_H
#define _PALINDROMES_H

int is_palindromic_in_octal_base(int number);

#endif

lib/palindromes.c:

#include "palindromes.h"

static int get_last_octal_digit(int number);
static int append_octal_digit(int number, int digit);
static int remove_last_octal_digit(int number);

int
is_palindromic_in_octal_base(int number)
{
    /* FIXME: handle negative numbers */

    int reversed_digits = 0,
        remaining_digits = number;

    /* NOTE: The loop will stop when the digit in the middle is reached.  */
    while (reversed_digits < remaining_digits)
    {
        int digit = get_last_octal_digit(remaining_digits);
        reversed_digits = append_octal_digit(reversed_digits, digit);
        remaining_digits = remove_last_octal_digit(remaining_digits);
    }

    /* NOTE: For an odd number of digits, reversed_digits contains
             an extra digit (coming from the middle). */
    return (reversed_digits == remaining_digits)
        || (remove_last_octal_digit(reversed_digits) == remaining_digits);
}

static int
get_last_octal_digit(int number)
{
    return number % 8;
}

static int
append_octal_digit(int number, int digit)
{
    /* FIXME: handle invalid digits */
    return number * 8 + digit;
}

static int
remove_last_octal_digit(int number)
{
    return number / 8;
}

Reviewing primes.c, it’s obvious it could perform better, but premature optimization is the root of all evil, so let’s just add a comment for now:

static int
find_smallest_divisor_greater_than_one(int number)
{
    /* FIXME: handle negative numbers */

    /* OPTIMIZATION FIXME: The smallest divisor is always less than
                           the square root of the number. */

    int smallest_divisor = 2;

    while (!is_divisable(number, smallest_divisor))
    {
        ++smallest_divisor;
    }

    return smallest_divisor;
}

The most interesting part of the refactoring process ends here.

After the above, I removed all the FIXME comments one by one. Some of them were bad idea, others got converted to unit tests. All of them were easy to fix by adding some parameter checks to various functions.

Next I generalized palindromes.c to work with other bases than octal, then I created a function named find_octal_palindromic_prime_bigger_than(int number), so I could write this:

void
print_first_150_octal_palindromic_primes()
{
    int n = 0, i;
    for (i = 0; i != 150; ++i)
    {
        n = find_octal_palindromic_prime_bigger_than(n);
        print_octal(n);
    }
}

Of course, main() is pretty understandable now:

int main(int argc, char* argv[])
{
    print_first_150_octal_palindromic_primes();
    return 0;
}

The details are not very entertaining, though you can find individual commits at GitHub.

Performance

I changed both the refactored and the original version to print all the numbers 10 times. Here are the results (ran on my eeePC, compiled to 64 bit with GCC using -O3):

Obfuscated version:

real    0m15.230s
user    0m15.180s
sys     0m0.010s


Clean version:

real    0m15.401s
user    0m15.130s
sys     0m0.040s

The clean version is slightly slower than the original. Sad. But luckily we have a couple of unit tests so we can change the way calculations are done anytime we want, and we already have a good guess on what can be slow: testing each and every number as divisor in primes.c!

After rewriting is_prime() to test only odd numbers and to stop when reaching the square root of the given number, the refactored version gets nearly 14 times faster than the original! The results for 40 iterations of 150 numbers on the same machine:

Obfuscated version:

real    0m59.723s
user    0m59.610s
sys     0m0.000s


Clean version:

real    0m4.432s
user    0m4.410s
sys     0m0.000s

Maybe it could be improved further if it didn’t test numbers whether they are palindromic or not, but it generated palindromic numbers at the first place instead.

How much information did the cleanup give?

I mean literally.

That’s the question I wanted to answer when I began writing this post. Lucky we have some mathematical theory to apply here: entropy, ie. the expected value of information (measured in bits). Though Shannon’s entropy could easily be calculated for these files, I’ll be happy with an approximation. By compressing the complete source code (including unit tests but excluding the textfile containing expected output and performance tests) and looking at compressed sizes, theoretical entropy can be approximated well enough. I will compress the source code from four stages: the original, the structure recovered, the function extracted and the final version:

Original makarios.c (commit 0b966486f8)

main(n,i,a,m){while(i=++n)
for(a=0;a<i?a=a*8+i%8,i/=8,m=a==i|a/8==i,1:(n-++m||printf("%o\n",n))&&n%m;);}

Somewhat readable makarios.c (commit bcf7f4046b)

#include <stdio.h>

#define STOP_AT 522233

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        int a = 0,
            i = n,
            m;

        while (a < i)
        {
            a = a*8 + i%8;
            i /= 8;
        }

        if (!((a == i) || (a/8 == i)))
            continue;

        m = 2;
        while (0 != n%m)
        {
            ++m;
        }

        if (m == n)
            printf("%o\n", n);
    }

    return 0;
}

Function extracted makarios.c (commit 18b94d3210)

#include <stdio.h>

#define STOP_AT 522233

static int is_palindromic_in_octal_base(int number);
static int is_prime(int number);
static void print_octal(int number);

int main(int argc, char* argv[])
{
    int n;

    for (n = 2; n <= STOP_AT; ++n)
    {
        if (is_palindromic_in_octal_base(n) && is_prime(n))
            print_octal(n);
    }

    return 0;
}

static int get_last_octal_digit(int number);
static int append_octal_digit(int number, int digit);
static int remove_last_octal_digit(int number);

static int
is_palindromic_in_octal_base(int number)
{
    /* FIXME: handle negative numbers */

    int reversed_digits = 0,
        remaining_digits = number;

    /* NOTE: The loop will stop when the digit in the middle is reached.  */
    while (reversed_digits < remaining_digits)
    {
        int digit = get_last_octal_digit(remaining_digits);
        reversed_digits = append_octal_digit(reversed_digits, digit);
        remaining_digits = remove_last_octal_digit(remaining_digits);
    }

    /* NOTE: For an odd number of digits, reversed_digits contains
             an extra digit (coming from the middle). */
    return (reversed_digits == remaining_digits)
        || (remove_last_octal_digit(reversed_digits) == remaining_digits);
}

static int
get_last_octal_digit(int number)
{
    return number % 8;
}

static int
append_octal_digit(int number, int digit)
{
    /* FIXME: handle invalid digits */
    return number * 8 + digit;
}

static int
remove_last_octal_digit(int number)
{
    return number / 8;
}

static int find_smallest_divisor_greater_than_one(int number);

static int
is_prime(int number)
{
    /* FIXME: handle 0 and 1 */
    return find_smallest_divisor_greater_than_one(number) == number;
}

static int is_divisable(int dividend, int divisor);

static int
find_smallest_divisor_greater_than_one(int number)
{
    /* FIXME: handle negative numbers */

    int smallest_divisor = 2;

    while (!is_divisable(number, smallest_divisor))
    {
        ++smallest_divisor;
    }

    return smallest_divisor;
}

static int
is_divisable(int dividend, int divisor)
{
    /* FIXME: handle division by zero */
    return 0 == dividend % divisor;
}

static void
print_octal(int number)
{
    printf("%o\n", number);
}

Modularized version tarball (commit c6c01681e3)

lib/octal_palindromic_primes.c
lib/octal_palindromic_primes.h
lib/palindromes.c
lib/palindromes.h
lib/primes.c
lib/primes.h
makarios.c
Makefile
tests/test.h
tests/test_octal_palindromic_primes.c
tests/test_palindromes.c
tests/test_primes.c

The amount of information in the source code from each stages:

  • Original: 128 bytes (note that the uncompressed C source is smaller, 105 bytes!)
  • After recovering structure: 265 bytes
  • After extracting functions: 689 bytes
  • After organizing into modules with unit tests: 2872 bytes

To summarize: the “compressed” version of the original code contains more information (the gzip header I guess) than the uncompressed file. That’s how you rule at IOCCC! Adding indentation and using the appropriate control structure for each task doubles the information in the code. Extracting some well named functions, adding some comments and naming variables well introduced 2.6 times more information into the source (6.5 times the original 105 bytes), and organizing everything into standalone units with standalone tests multiplied that amount by another 4.1. (That’s 27.3 times more than the original 105 bytes!)

How much work?

According to commit logs, I made visible progress during the following time intervals:

  • Nov 23 18:50:43 2011 – Nov 23 20:54:56 2011: ~2 hours
  • Nov 23 22:56:29 2011 – Nov 24 01:01:11 2011: ~2 hours (with some breaks)
  • Nov 24 23:04:42 2011 – Nov 25 02:09:22 2011: ~3 hours (with many breaks)
  • Nov 26 21:26:47 2011 – Nov 26 23:21:41 2011: ~2 hours

That’s 9 hours of work which includes writing more detailed commit messages than I usually do, taking notes for this blog post, and committing changes way more frequently than I usually do. Also consider that I’m not an experienced C developer, and to make that even worse, most of the work was done in the night, after getting home from real-life work.

Conclusion

I’m not going to suggest anything, since this was just a funny experiment and what’s more, I don’t even think this coding style matches with the C world. The numbers are interesting though.

As an advice, I can say only one thing: when you’re writing code, please think for a moment how long it will take others to understand it. If your code is not crystal clear just by reading through it (without the need for stopping to think for a millisecond, without scrolling back and forth taking notes etc.), then everytime others look at that code, you’re spending your co-worker’s time (thus your company’s money) on something you could’ve got for a one time prize, or even for free if you’d took that three seconds to extract that loop into a separate function or name that variable something better than a and m

A few words on Clean code

Friday, April 22, 2011 @ 12:04 AM Author: athos

Today I gave a talk at BalaBit Meetup regarding clean code principles and related techniques. I had to squeeze the stuff into 25-30 minutes, so a lot of principles and tricks had to be left out (LoD, KISS, vertical formatting, to mention just a few of them). In the hope that it may be useful, I uploaded my slides to slideshare.net.