Posts Tagged ‘php’

Sorry PHP, I tried…
A couple of days ago I came across a blog post titled PHP: a fractal of bad design. I honestly don’t like programming language debates, but when the flaming begins, I usually try to show that languages other than PHP have their flaws as well; none of them is perfect. But hey, this post was really shocking! I’ve been coding in PHP for ages. I’ve seen many gotchas and WTFs along the way, but I never stopped to count how many times I had to work around PHP anti-features, poorly designed APIs and bugs just to get my tasks done. Now I did, and I’m not happy with the result.
My most memorable facepalms not mentioned in the article referred to above include:
Database abstraction flaws
PHP provides a common interface for many kinds of database connections called PDO (PHP Data Objects). It’s a very convenient and powerful tool to build your database handling logic upon (of course you will still need an additional layer above PDO to hide syntactic and semantic differences between SQL implementations, i.e. PDO is nothing more than a layer that hides the very low-level functions used to talk to the SQL server). At least it is, as long as you don’t plan on using persistent database connections when running PHP over FastCGI.
<?php
function create_pdo()
{
return new PDO("mysql:host=localhost;dbname=test",
"test", "my_password",
array(PDO::ATTR_PERSISTENT => true));
}
$pdo1 = create_pdo();
$pdo2 = create_pdo();
$pdo1->query("SELECT i_can_has_fail() ;");
var_dump($pdo2->errorInfo());
?>
The error that belongs to $pdo1 is returned by $pdo2. This might not be surprising when using persistent connections, but the use-after-free segfault in your server logs that can be tracked down to the same issue causing this behavior certainly is.
What is the empty string a substring of?
Any programming language and library I can think of right now considers the empty string to be a trivial substring, more specifically a trivial prefix of any string. Except for PHP:
<?php
error_reporting(E_ALL);
var_dump(strpos("hello", "hell"));
var_dump(strpos("", "world"));
var_dump(strpos(false, null));
var_dump(strpos("world", ""));
var_dump(strpos("", ""));
?>
In PHP, false is not a substring of null (without any warnings, even when E_STRICT is turned on), but attempting to find the empty string inside any string throws a warning, even when $haystack is empty as well.
PHP Warning: strpos(): Empty delimiter in...
The same holds true for strstr(), which is clearly based on the C function of the same name, except that in C it works fine with empty strings. And by the way, we have haystacks and needles here, but what on earth do they have to do with delimiters?
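One way to sidestep the warning is to never hand an empty needle to strpos() at all. The following is only a minimal sketch; has_substring() is a hypothetical helper name made up for this example, not something PHP provides.

```php
<?php
// Hypothetical helper: treats the empty string as a substring of every string.
function has_substring($haystack, $needle)
{
    // Short-circuit so that strpos() is never called with an empty needle.
    return $needle === '' || strpos($haystack, $needle) !== false;
}

var_dump(has_substring("hello", "hell")); // bool(true)
var_dump(has_substring("world", ""));     // bool(true)
var_dump(has_substring("", ""));          // bool(true)
var_dump(has_substring("", "world"));     // bool(false)
```

The strict !== comparison matters here, because strpos() returns integer 0 for a match at the very beginning of the haystack.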
Substring of an empty string
Another gotcha somewhat related to the previous one is how substr() handles empty strings. To demonstrate it, try the following script:
<?php
var_dump(substr("hello", 0, 0));
var_dump(substr("", 0, 0));
?>
As you’d expect, a zero-length prefix of "hello" is of course the empty string. But a zero-length prefix of the empty string itself is boolean false which indicates failure according to the manual. Yes, it is stated there clearly that substr() won’t accept the empty string, but I can’t understand how such an artificial limitation could make sense.
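If the false return value gets in the way, a thin wrapper can normalize it. This is a sketch under the assumption that an empty result is acceptable for every failed call; safe_substr() is a hypothetical name, and the wrapper also hides genuine failures such as a start offset beyond the end of the string.

```php
<?php
// Hypothetical wrapper: always returns a string, never boolean false.
function safe_substr($string, $start, $length)
{
    $result = substr($string, $start, $length);
    // Map the failure value to the empty string.
    return $result === false ? '' : $result;
}

var_dump(safe_substr("hello", 0, 0)); // string(0) ""
var_dump(safe_substr("", 0, 0));      // string(0) ""
var_dump(safe_substr("hello", 1, 3)); // string(3) "ell"
```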
Sorting
<?php
$array = array("1", 1, "a");
sort($array); var_dump($array);
sort($array); var_dump($array);
?>
Sorting an array twice in a row is a waste of time. Once the array is sorted, sorting it again shouldn’t make any difference, right? Wrong! The above script will produce array('a', 1, '1') after the first sort() call and array('1', 'a', 1) after the second. Of course the manual warns about unpredictable behavior when sorting mixed type arrays, so this example is cheating. Let’s try again with an array containing only strings:
<?php
$array = array("a", "2", "3e-1");
sort($array); var_dump($array);
?>
Now you would expect predictable behavior, wouldn’t you? Please don’t be too disappointed when the above code prints the following array:
array('3e-1', '2', 'a')
If it were sorted lexically, then '2' would precede '3e-1', so it must have been sorted numerically. But as a number, 'a' should be converted to 0 which then should precede 0.3, which leads to the conclusion that some elements in this array were compared as numbers, others were compared as strings. Of course sort() can be told to always compare elements as numbers (SORT_NUMERIC) or as strings (SORT_STRING), but the default behavior is neither of those. The default comparison is called SORT_REGULAR, about which the manual says:
SORT_REGULAR - compare items normally (don’t change types)
Did you read it carefully? Don’t change types! Of course, this seems to be related to the string equality bug, which PHP developers closed with the status “Not a bug”, advising users to call strcmp() or use === to compare strings.
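To make the difference concrete, here is a small sketch that compares two numeric-looking strings in the three ways mentioned above and then sorts the example array with an explicit SORT_STRING flag. The variable names are made up for this illustration.

```php
<?php
// Two different strings that both look like the number 1000.
$loose = ("1e3" == "1000");            // compared as numbers
$strict = ("1e3" === "1000");          // compared as strings
$same_bytes = (strcmp("1e3", "1000") === 0);

var_dump($loose);      // bool(true)
var_dump($strict);     // bool(false)
var_dump($same_bytes); // bool(false)

// Forcing string comparison gives a plain lexical order.
$sorted = array("a", "2", "3e-1");
sort($sorted, SORT_STRING);
var_dump($sorted === array("2", "3e-1", "a")); // bool(true)
```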
Conclusion?
Yes, if you want to write safe and predictable code in PHP, you must be very careful, both when writing your tests and when writing your production code. Or, if you can, you had better avoid using PHP at all. I’m sad to say this, because there are a couple of features in PHP that I actually like. For example, in PHP, private methods are really private, not just magically renamed; interfaces and type-hinting come in very handy when you want to write self-explanatory method headers; and there are many great application and testing frameworks out there that make development as powerful and simple as Rails does in Ruby or Django does in Python. But still, with so many flaws of this kind in the core, I don’t know what to say. Sorry PHP, I tried.

Test-driving a regular expression
“When you have a problem to solve, use regular expressions! Now you have two problems!”
Regular expressions are very useful tools in the inventory of a programmer: they can be used to validate a string according to a given format, to find and extract parts of a string, or even to change substrings matching a given pattern. But when you write some hellishly tricky regular expression, do you stop for a moment to think about the spaghetti of while loops and switch-case statements that your regexp will translate to? I mean, just a few characters of regexp may specify a state machine that would take tens or maybe hundreds of lines of loops and conditionals to implement. All that code gets executed whenever you use your regexp, so you would need a large number of testcases to cover that state machine if you happen to practise Test-driven development (TDD). In this experiment I’m going to blindly apply the three well-known steps of TDD to create a regular expression for a given task, and see how far I can get and how much time it takes.
These three steps of the TDD cycle are:
- Write a failing unit test (compilation errors are considered a failure).
- Write just enough production code and not a single line more to pass the failing test.
- Clean up the mess created during the first two steps.
Or in short:
- Red
- Green
- Refactor
My kata
My kata for this experiment will be validating a simple time specification that consists of two-digit hour and minute parts separated by a colon, followed by either AM or PM in case of 12 hour notation. The rules are going to be:
- Valid time specifications should always contain a two digit number (with a leading zero if necessary) describing the hour of the day followed by a colon followed by a two digit number (with a leading zero as well) to describe the minute of the hour.
- The HH:MM part may be followed by a single space followed by two alphabetic characters, either AM or PM. In this case the time specification is considered to be in 12 hour format, thus the hour part should not be more than 12.
- When there is no AM or PM at the end of the string, the time specification is considered to be 24 hour format, in which case the hour part should be between 00-23 (inclusive).
- In any case, the minute part must be between 00-59 (inclusive).
The job is to decide if a string matches all the above rules or not.
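To get a feeling for how much logic a regexp for this task has to encode, the rules above can be sketched as plain conditionals. This is only an illustration, not the implementation the kata is heading towards: is_valid_time_by_hand() is a hypothetical name, and the sketch assumes that hours in 12 hour notation run from 01 to 12 (the rules above only state the upper bound).

```php
<?php
// Hypothetical hand-written validator, for comparison with a regexp.
function is_valid_time_by_hand($time)
{
    $length = strlen($time);
    if ($length !== 5 && $length !== 8) {
        return false;
    }
    // Positions 0, 1, 3 and 4 must hold digits.
    foreach (array(0, 1, 3, 4) as $position) {
        if (strpos('0123456789', $time[$position]) === false) {
            return false;
        }
    }
    if ($time[2] !== ':') {
        return false;
    }
    $hour = (int) substr($time, 0, 2);
    $minute = (int) substr($time, 3, 2);
    if ($minute > 59) {
        return false;
    }
    if ($length === 5) {
        // 24 hour format: 00-23.
        return $hour <= 23;
    }
    $suffix = substr($time, 5);
    if ($suffix !== ' AM' && $suffix !== ' PM') {
        return false;
    }
    // Assumption: 12 hour format allows 01-12 only.
    return $hour >= 1 && $hour <= 12;
}

var_dump(is_valid_time_by_hand('10:42'));    // bool(true)
var_dump(is_valid_time_by_hand('10:42 AM')); // bool(true)
var_dump(is_valid_time_by_hand('24:42'));    // bool(false)
var_dump(is_valid_time_by_hand('13:00 PM')); // bool(false)
```

Even for such a small format this takes a few dozen lines, which is roughly the amount of behavior the tests would have to pin down.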
There are a million possible implementations, but for now, let’s evolve this one by applying the TDD development cycle to meet the requirements. The language is going to be PHP, the testing framework will be PHPUnit, and the implementation will rely on PHP’s preg_match() function. First, let’s start from very small, trivial steps. Going forward, we may speed things up, just like in Kent Beck’s wonderful book.
The first testcase
In TDD, tests come first, so let’s create a simple testcase:
<?php
require_once 'TimeValidator.php';
class TimeValidatorTest extends PHPUnit_Framework_TestCase
{
public function testTimeValidatorCanBeCreated()
{
new TimeValidator();
}
}
?>
This will fail badly, obviously, but let’s run it, just to be sure!
PHPUnit 3.5.15 by Sebastian Bergmann.
PHP Fatal error: Class 'TimeValidator' not found in /home/athos/projects/TimeValidator/TimeValidatorTest.php on line 7
PHP Stack trace:
PHP 1. {main}() /usr/bin/phpunit:0
PHP 2. PHPUnit_TextUI_Command::main() /usr/bin/phpunit:49
PHP 3. PHPUnit_TextUI_Command->run() /usr/share/php/PHPUnit/TextUI/Command.php:129
PHP 4. PHPUnit_TextUI_TestRunner->doRun() /usr/share/php/PHPUnit/TextUI/Command.php:188
PHP 5. PHPUnit_Framework_TestSuite->run() /usr/share/php/PHPUnit/TextUI/TestRunner.php:305
PHP 6. PHPUnit_Framework_TestSuite->runTest() /usr/share/php/PHPUnit/Framework/TestSuite.php:733
PHP 7. PHPUnit_Framework_TestCase->run() /usr/share/php/PHPUnit/Framework/TestSuite.php:757
PHP 8. PHPUnit_Framework_TestResult->run() /usr/share/php/PHPUnit/Framework/TestCase.php:576
PHP 9. PHPUnit_Framework_TestCase->runBare() /usr/share/php/PHPUnit/Framework/TestResult.php:666
PHP 10. PHPUnit_Framework_TestCase->runTest() /usr/share/php/PHPUnit/Framework/TestCase.php:628
PHP 11. ReflectionMethod->invokeArgs() /usr/share/php/PHPUnit/Framework/TestCase.php:738
PHP 12. TimeValidatorTest->testTimeValidatorCanBeCreated() /home/athos/projects/TimeValidator/TimeValidatorTest.php:0
Yeah, we are in Red! Now we can write a tiny bit of production code in the hope we will get into the Green phase:
<?php
class TimeValidator
{
}
?>
And now the test passes, we’re in Green.
Is there anything to refactor here? We have 0 ELoC, so let’s move on. I’m going to ignore Tell, don’t ask! for this kata, so I’m going to call the one and only public method of my TimeValidator class isValid(). Maybe later I’ll come up with a cleaner name for both the class and the method, but it will do for now. But before we can write any more lines of production code, we need a failing testcase first. So let’s do some actual work in the testcase:
public function testTimeValidatorCanBeCreated()
{
$validator = new TimeValidator();
$this->assertTrue($validator->isValid('10:42'));
}
There’s no such method, so this will fail. Let’s implement the method with an empty body in TimeValidator.php:
public function isValid($time_specification)
{
}
Again, the test fails because NULL is not equal to true. We’re still in Red, so we can write some more production code, because the lines we have written so far are not enough to get us into Green. What is the simplest algorithm that will make our failing testcase pass if implemented in the production code? Yes, it’s a big facepalm:
public function isValid($time_specification)
{
return true;
}
And now we’re in Green. Let’s do some refactoring!
The name of our testcase lies: it says it checks the construction of our newly defined class, but actually the return value of a function is being checked. So let’s rename the testcase as a refactoring:
public function test24HourFormatIsAccepted()
{
$validator = new TimeValidator();
$this->assertTrue($validator->isValid('10:42'));
}
I decided to consider 10:42 to be in 24 hour format, because that is the simpler one.
Is there any more opportunity to refactor? Though our validator is quite useless at the moment, we don’t have any tests that force us to implement some real functionality in that method, so we’re not allowed to change that code to something more complex. We were required to write just enough production code to pass the failing test, and we’ve done exactly that and not a single line more.
Now what about creating a failing testcase to let us move forward?
A negative test
After a few seconds of thinking, I decided to test if our method does care about colons. Any kinds of time specifications mentioned in the requirements must contain a colon character, so it gives us a fairly simple testcase:
public function testTimeSpecificationMustContainOneColon()
{
$validator = new TimeValidator();
$this->assertFalse($validator->isValid('1042'));
}
Hooray, a failure! We’re back in Red, let’s write some more production code! Sadly, a fairly simple change to our isValid() can make this test pass:
public function isValid($time_specification)
{
return $time_specification != '1042';
}
Back in Green, but far from any meaningful code. At least we can refactor; luckily we have some code duplication:
- There’s a repeating pattern in our tests: instantiation of the class under test (CUT).
- The string constant '1042' appears in both the test and the code!
In the Refactoring phase, we’re allowed to remove code duplication while keeping the tests passing and the functionality of the code unchanged. Let’s deal with the first problem; PHPUnit offers the setUp() method for that:
class TimeValidatorTest extends PHPUnit_Framework_TestCase
{
public function setUp()
{
$this->validator = new TimeValidator();
}
public function test24HourFormatIsAccepted()
{
$this->assertTrue($this->validator->isValid('10:42'));
}
public function testTimeSpecificationMustContainOneColon()
{
$this->assertFalse($this->validator->isValid('1042'));
}
private $validator;
}
That was a piece of cake, now let’s clean the production code! The simplest PHP line that came to my mind to make both testcases pass, without the dirty hack of repeating the tests’ constants in the production code, was a strpos() call. But knowing that in the end I want to see a regular expression doing the work, I used an equivalent regexp match instead:
class TimeValidator
{
public function isValid($time_specification)
{
return preg_match('/:/', $time_specification) == 1;
}
}
The tests still pass, so it’s time to look for more difficult cases. (And speed things up a little bit.)
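For comparison, the strpos() variant mentioned above could look roughly like the sketch below. The two stand-alone function names are made up for this side-by-side check and are not part of the kata’s code.

```php
<?php
// Hypothetical stand-alone functions, named only for this comparison.
function has_colon_strpos($time_specification)
{
    // strpos() returns integer 0 for a match at the start, hence !==.
    return strpos($time_specification, ':') !== false;
}

function has_colon_regexp($time_specification)
{
    return preg_match('/:/', $time_specification) == 1;
}

$all_agree = true;
foreach (array('10:42', '1042', ':', '', '10::42') as $input) {
    if (has_colon_strpos($input) !== has_colon_regexp($input)) {
        $all_agree = false;
    }
}
var_dump($all_agree); // bool(true)
```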
Only one colon
The next test that came to my mind was to see if the regexp can deal with two colons:
public function testTimeSpecificationMustContainOneColon()
{
$this->assertFalse($this->validator->isValid('1042'));
$this->assertFalse($this->validator->isValid('10::42'));
}
Red phase again, but it’s not very difficult to get into Green:
public function isValid($time_specification)
{
return preg_match('/^[^:]*:[^:]*$/', $time_specification) == 1;
}
Now we look for exactly one colon from the beginning of the string to the very end. And we’re in Green! Anything to refactor here? Yes, duplicated non-trivial code appears in the regular expression! Let’s get rid of it:
public function isValid($time_specification)
{
$not_colon = '[^:]*';
return preg_match(
"/^$not_colon:$not_colon\$/",
$time_specification
) == 1;
}
The tests still pass. Now there is still repeated code in the regular expression and, to make things worse, it got longer, but this will move us forward, so knowing what is to come, it’s a good idea to extract that little piece. Besides, the variable explains pretty well what the expression does. Anything else to refactor? Maybe a dataProvider would help eliminate the two almost identical lines from the tests, but I’m lazy now. I promise that the next time I write a similar assertion, I’ll clean up the tests.
Hours must be numeric
The next simplest test that came to my mind was:
public function testHoursMustBeNumeric()
{
$this->assertFalse($this->validator->isValid('ab:42'));
}
Now we’re in Red, but it seems the next Refactoring phase will be a nice opportunity to get back to that promise.
How to fix that? I know I want to validate hours, so I’m going to create a new variable to be substituted into the regular expression:
public function isValid($time_specification)
{
$hours = '[0-9]*';
$not_colon = '[^:]*';
return preg_match(
"/^$hours:$not_colon\$/",
$time_specification
) == 1;
}
The new variable is called $hours and it matches any sequence of digits. Yeah, Green again! That was just a matter of seconds. So anything to refactor here? Yes, remember what I promised a few minutes ago:
/**
* @dataProvider provideInvalidTimeSpecifications
*/
public function testInvalidTimeSpecificationsAreNotAccepted($time_specification)
{
$this->assertFalse($this->validator->isValid($time_specification));
}
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Hours must be numeric' => array('ab:42'),
);
}
Hours must be less than 24
Regardless of 12 hour time format, the hours part must be less than 24. This assertion is simple enough to create a new testcase from it:
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
);
}
And the code that passes all the tests so far is not a big deal either:
public function isValid($time_specification)
{
$hours = '([0-1]?[0-9]|2[0-3])';
$not_colon = '[^:]*';
return preg_match(
"/^$hours:$not_colon\$/",
$time_specification
) == 1;
}
Anything to refactor? Nope, I guess.
Minutes must be numeric
One more simple testcase, just to see how well minutes are validated:
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
);
}
Getting from Red to Green is child’s play as well: I did the same as in the case of the hours. Once minutes are validated to be numeric, a new testcase makes sure they are less than 60:
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
'Minutes must be less than 60' => array('10:60'),
);
}
And the code:
public function isValid($time_specification)
{
$hours = '([0-1]?[0-9]|2[0-3])';
$minutes = '([0-5]?[0-9])';
return preg_match(
"/^$hours:$minutes\$/",
$time_specification
) == 1;
}
Anything to refactor? Not a single bit! Notice how nicely that ugly $not_colon variable disappeared? The regular expression is quite self-explanatory; it doesn’t even need a comment to be understandable.
Two digits
Both hours and minutes should be expected to be two digits:
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
'Minutes must be less than 60' => array('10:60'),
'Hours must be two-digit' => array('1:42'),
'Minutes must be two-digit' => array('10:4'),
);
}
Okay, those are two tests at the same time, but TDD allows you to go at your own pace as long as you can clearly see the small, trivial steps you join together to take a bigger leap. In this case, the two tests are nearly identical, and the fix for them is similar as well: we only have to remove two question marks (why the hell did I write them in the first place?):
public function isValid($time_specification)
{
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
return preg_match(
"/^$hours:$minutes\$/",
$time_specification
) == 1;
}
There’s no doubt that the trivial steps joined together here are visible to everyone, so let’s move on. What to refactor now? I can’t see anything.
The good news is that we are now validating 24 hour format time specifications and our code is still readable, even the regular expression.
The half of 24 is 12
But the work to validate it is going to be at least double. Let’s start with some positive tests:
public function test12HourFormatIsAccepted()
{
$this->assertTrue($this->validator->isValid('10:42 AM'));
}
That’s very similar to test24HourFormatIsAccepted(), but we’re now in Red, so we have to write even more code to get into Green!
What is the simplest regular expression that can make our new testcase pass? Yes, it’s only two characters, but a nice facepalm:
public function isValid($time_specification)
{
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
return preg_match(
"/^$hours:$minutes.*\$/",
$time_specification
) == 1;
}
Now we accept any bullshit following a 24 hour format time specification. But we can refactor now:
/**
* @dataProvider provideValidTimeSpecifications
*/
public function testValidTimeSpecificationsAreAccepted($time_specification)
{
$this->assertTrue($this->validator->isValid($time_specification));
}
public function provideValidTimeSpecifications()
{
return array(
'Simple 24 hour time specification' => array('10:42'),
'Simple 12 hour time specification' => array('10:42 AM'),
);
}
Now we’re green, but we’ve just invalidated our assertion about the exactly one colon.
Exactly one colon is allowed
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Time specification must not contain colon after minutes' => array('10:42:'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
'Minutes must be less than 60' => array('10:60'),
'Hours must be two-digit' => array('1:42'),
'Minutes must be two-digit' => array('10:4'),
);
}
The not so beautiful solution can be something like this:
public function isValid($time_specification)
{
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
return preg_match(
"/^$hours:{$minutes}[^:]*\$/",
$time_specification
) == 1;
}
PHP thought I wanted to refer to $minutes as an array, so I put curly braces around the variable.
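The interpolation issue can be reproduced in isolation. The sketch below shows two spellings that avoid it; the variable names $with_braces and $with_concatenation are made up for this example.

```php
<?php
$minutes = '([0-5][0-9])';

// Writing "$minutes[^:]*" does not work: right after a plain "$name",
// PHP reads the "[" as the start of an array index.

// Curly braces end the variable name explicitly.
$with_braces = "{$minutes}[^:]*";

// Plain concatenation avoids interpolation altogether.
$with_concatenation = $minutes . '[^:]*';

var_dump($with_braces === $with_concatenation);  // bool(true)
var_dump($with_braces === '([0-5][0-9])[^:]*');  // bool(true)
```

Which spelling to prefer is a matter of taste; the braces keep the whole pattern in one string literal, which is what the code above does.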
Now we’re in Green again, so we can refactor the code. First the tests:
public function provideValidTimeSpecifications()
{
return array(
'24 hour time specification' => array('10:42'),
'12 hour time specification in the morning' => array('10:42 AM'),
'12 hour time specification in the afternoon' => array('10:42 PM'),
);
}
Okay, I’m cheating. I added another testcase that I know will be helpful later on, but it still passes with our current production code, so it won’t hurt. Apropos the production code: a minor refactoring might be to reintroduce the variable we deleted a few minutes ago.
public function isValid($time_specification)
{
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
$not_colon = '[^:]*';
return preg_match(
"/^$hours:$minutes$not_colon\$/",
$time_specification
) == 1;
}
At least it’s readable. Let’s move on, that $not_colon will not haunt us too long.
Bad idea
I want to get rid of $not_colon as soon as possible, so let’s force the evolution of the regexp into a direction that does not involve $not_colon existing any longer. I want a space after the minutes part of the time specification:
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Time specification must not contain colon after minutes' => array('10:42:'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
'Minutes must be less than 60' => array('10:60'),
'Hours must be two-digit' => array('1:42'),
'Minutes must be two-digit' => array('10:4'),
'12 hour time specification must contain a space after minutes' => array('10:42AM'),
);
}
The first idea that came to my mind to make that pass was to insert a space before $not_colon in the regexp:
"/^$hours:$minutes $not_colon\$/"
That makes the 24 hour format tests fail, so I deleted both the space and the new testcase. Though the new testcase was a teeny-tiny little step, it would have required bigger changes in the regular expression, namely separating the two kinds of hour format. Instead I introduced a stricter testcase in provideInvalidTimeSpecifications(), one whose size is comparable to the size of the changes needed to make it pass without breaking all the other tests:
'12 hour format must end with AM or PM' => array('10:42 XY'),
The production code to pass that test:
public function isValid($time_specification)
{
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
$ampm = '( [AP]M)';
return preg_match(
"/^$hours:$minutes$ampm?\$/",
$time_specification
) == 1;
}
Reviewing that now, I can see I could have made the original testcase pass by applying the same changes I made for this case, but the test I deleted did not suggest it. It seems to me that TDD is not about coding skills but about testing skills. Writing good code is one thing, but writing good tests that enforce building up good code step by step can be at least as important.
Differentiating 12 hour and 24 hour formats
Time for a bigger change. A new testcase for the negative tests:
'24 hour format must be exactly 5 characters long' => array('14:42 PM'),
To make that pass, the regular expression must recognize the difference between the two kinds of time formats. It requires matching numbers less than 13, but most of the code needed already exists.
public function isValid($time_specification)
{
$minutes = '([0-5][0-9])';
$hours_less_than_24 = '([0-1][0-9]|2[0-3])';
$time_spec_24 = "$hours_less_than_24:$minutes";
$hours_less_than13 = '(0[0-9]|1[012])';
$ampm = '( [AP]M)';
$time_spec_12 = "$hours_less_than13:$minutes$ampm";
return preg_match(
"/^($time_spec_24|$time_spec_12)\$/",
$time_specification
) == 1;
}
It seems to be a big change, but actually it’s quite simple: the question mark has been replaced with the two alternatives it specified. One is the hour specification without $ampm, the other is the same, but with $ampm. Of course the second alternative should not match hours greater than 12, so it required a second type of hour matching regular expression. When everything worked, some extractions and variable renames were done.
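The first half of that step, replacing the question mark with two explicit alternatives before tightening the hours, can be checked on its own. The following sketch assumes the intermediate form in which both alternatives still share the same $hours pattern; $optional and $alternatives are names made up for this example.

```php
<?php
$hours = '([0-1][0-9]|2[0-3])';
$minutes = '([0-5][0-9])';
$ampm = '( [AP]M)';

// The previous pattern with an optional suffix...
$optional = "/^$hours:$minutes$ampm?\$/";
// ...and the same thing spelled out as two alternatives.
$alternatives = "/^($hours:$minutes|$hours:$minutes$ampm)\$/";

$agree = true;
foreach (array('10:42', '10:42 AM', '14:42 PM', '10:42 XY', '24:42') as $input) {
    if (preg_match($optional, $input) !== preg_match($alternatives, $input)) {
        $agree = false;
    }
}
var_dump($agree); // bool(true)
```

Only after this behavior-preserving rewrite does swapping the hour pattern in the second alternative change what is accepted.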
Attempt to write more failing tests
Are we ready? The code in isValid() seems to contain all the information appearing in the requirements, so there must be only a few small steps remaining. Let’s try adding some new tests. These should not be considered valid:
'Hour part of 12 hour format must be less than 13' => array('13:00 PM'),
'12 hour format must be exactly 8 characters long' => array('12:45  AM'), // two spaces, so 9 characters
They pass without touching the code; we couldn’t manage to get into Red. This means there’s some room for further refactoring. All the dirty details of building our regular expressions can go to a private method buried at the bottom of the class:
class TimeValidator
{
public function isValid($time_specification)
{
return preg_match($this->buildRegExp(), $time_specification) == 1;
}
private function buildRegExp()
{
$minutes = '([0-5][0-9])';
$hours_less_than_24 = '([0-1][0-9]|2[0-3])';
$time_spec_24 = "$hours_less_than_24:$minutes";
$hours_less_than13 = '(0[0-9]|1[012])';
$ampm = '( [AP]M)';
$time_spec_12 = "$hours_less_than13:$minutes$ampm";
return "/^($time_spec_24|$time_spec_12)\$/";
}
}
Maybe isValid() is easier to read like this:
public function isValid($time_specification)
{
return 1 == preg_match($this->buildRegExp(), $time_specification);
}
So is there anything to test? Well, I’m sure PHP has a gotcha for us:
'Multiline strings are not accepted' => array("10:42\n"),
Bang, it fails! The fix is very simple:
return "/^($time_spec_24|$time_spec_12)\$/D";
Sometimes it’s worth reading about PCRE modifiers in the PHP documentation, just to make time pass faster.
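The gotcha is easy to reproduce with a fixed pattern: without the D modifier, a dollar sign also matches just before a newline that ends the subject string. A minimal sketch follows; the \A and \z variant is shown only as an alternative spelling and is not what the code above uses.

```php
<?php
$subject = "10:42\n";

// Default behavior: "$" also matches right before a final newline.
$without_d = preg_match('/^10:42$/', $subject);

// With D (PCRE_DOLLAR_ENDONLY), "$" matches only at the very end.
$with_d = preg_match('/^10:42$/D', $subject);

// \A and \z anchor to the absolute start and end of the subject.
$with_z = preg_match('/\A10:42\z/', $subject);

var_dump($without_d); // int(1)
var_dump($with_d);    // int(0)
var_dump($with_z);    // int(0)
```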
Leading zeros
One more thing I miss: though the code cares about leading zeros, there’s no testcase mentioning them. So these should be valid:
'12 hour time specification with leading zero' => array('00:42 PM'),
'24 hour time specification with leading zero' => array('00:42'),
Doesn’t that 00:42 PM look weird? I’m not used to 12 hour time format, so I had to think about it for a few seconds: in 12 hour notation there’s no such thing as 00:00. Hours can be between 01 and 12, which means that our two new passing tests should look like:
'12 hour time specification with leading zero' => array('01:42 PM'),
'24 hour time specification with leading zero' => array('00:42'),
(Yes, they pass as expected. Note that I forgot this peculiarity of 12 hour notation when writing the requirements!)
And the bug can be easily triggered by this negative test:
'12 hour time specification does not allow hours to be zero' => array('00:42 PM'),
And the code fixing it:
$hours_less_than13 = '(0[1-9]|1[012])';
The whole code
After adding some corner cases to the list of positive tests, the final results are:
TimeValidator.php
class TimeValidator
{
public function isValid($time_specification)
{
return 1 == preg_match($this->buildRegExp(), $time_specification);
}
private function buildRegExp()
{
$minutes = '([0-5][0-9])';
$hours_less_than_24 = '([0-1][0-9]|2[0-3])';
$time_spec_24 = "$hours_less_than_24:$minutes";
$hours_less_than13 = '(0[1-9]|1[012])';
$ampm = '( [AP]M)';
$time_spec_12 = "$hours_less_than13:$minutes$ampm";
return "/^($time_spec_24|$time_spec_12)\$/D";
}
}
TimeValidatorTest.php
class TimeValidatorTest extends PHPUnit_Framework_TestCase
{
public function setUp()
{
$this->validator = new TimeValidator();
}
/**
* @dataProvider provideValidTimeSpecifications
*/
public function testValidTimeSpecificationsAreAccepted($time_specification)
{
$this->assertTrue($this->validator->isValid($time_specification));
}
public function provideValidTimeSpecifications()
{
return array(
'24 hour time specification' => array('10:42'),
'12 hour time specification in the morning' => array('10:42 AM'),
'12 hour time specification in the afternoon' => array('10:42 PM'),
'12 hour time specification with leading zero' => array('01:42 PM'),
'24 hour time specification with leading zero' => array('00:42'),
'First minute of a 24 hour specification' => array('00:00'),
'First minute of a 12 hour specification' => array('01:00 AM'),
'Last 12 hour time specification in the morning' => array('12:59 AM'),
'Last 12 hour time specification in the afternoon' => array('12:59 PM'),
'Last 24 hour time specification' => array('23:59'),
);
}
/**
* @dataProvider provideInvalidTimeSpecifications
*/
public function testInvalidTimeSpecificationsAreNotAccepted($time_specification)
{
$this->assertFalse($this->validator->isValid($time_specification));
}
public function provideInvalidTimeSpecifications()
{
return array(
'Time specification must contain at least one colon' => array('1042'),
'Time specification must contain at most one colon' => array('10::42'),
'Time specification must not contain colon after minutes' => array('10:42:'),
'Hours must be numeric' => array('ab:42'),
'Hours must be less than 24' => array('24:42'),
'Minutes must be numeric' => array('10:ab'),
'Minutes must be less than 60' => array('10:60'),
'Hours must be two-digit' => array('1:42'),
'Minutes must be two-digit' => array('10:4'),
'12 hour format must end with AM or PM' => array('10:42 XY'),
'24 hour format must be exactly 5 characters long' => array('14:42 PM'),
'Hour part of 12 hour format must be less than 13' => array('13:00 PM'),
'12 hour format must be exactly 8 characters long' => array('12:45  AM'),
'Multiline strings are not accepted' => array("10:42\n"),
'12 hour time specification does not allow hours to be zero' => array('00:42 PM'),
);
}
private $validator;
}
You can check them out at GitHub.
Conclusion
The whole code copied above took about 45-60 minutes to write (including taking notes for this blog post), during which I learnt a lot about practising TDD. I’m sure that if I reproduced this experiment from scratch, it would take several fewer steps while still moving forward in teeny-tiny steps.
Just like any other piece of code, regular expressions are worth covering with a bunch of unit tests, which make sure that the code can be changed at any time later, either to fix bugs or to extend functionality. The big advantage of unit tests (besides architectural flexibility, documentation, etc.) is not that they expose every bug imaginable, but that they allow the code to be changed without worrying much about breaking existing, well-tested functionality.
But there are some gotchas in the field as well:
- There’s no tool to measure coverage for regular expressions that I’m aware of at the moment. The most useful hit I found with Google doesn’t help much either.
- It’s very easy to overlook bugs and miss testcases during test-driving regular expressions. (When the regexp engine that comes bundled with your language has some gotchas to offer, well, that doesn’t make the situation much better.)
- One character change in the regular expression may require a bunch of new testcases to be written.
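One engine gotcha of this kind is the reason for the "Multiline strings are not accepted" test case and the D modifier in the final pattern: without D, PCRE lets "$" match just before a trailing newline. A minimal sketch of the difference, using a simplified pattern rather than the full validator (the variable names are only for illustration):

```php
<?php
// Simplified HH:MM pattern, once without and once with the D modifier
// (PCRE_DOLLAR_ENDONLY). This is not the full TimeValidator pattern.
$naive  = '/^\d{2}:\d{2}$/';
$strict = '/^\d{2}:\d{2}$/D';

$input = "10:42\n";

// Without D, "$" also matches right before a final newline.
$naive_result = preg_match($naive, $input);

// With D, "$" matches only at the very end of the subject string.
$strict_result = preg_match($strict, $input);

var_dump($naive_result, $strict_result);
```

A single extra character in the pattern changes which inputs are accepted, which is why a one-character change can call for new test cases.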

Some thoughts on getter-setter methods
It was an interesting post to read, but I don’t like the conclusion. Basically the post is about how tons of getter-setter methods affect the overall performance of a PHP application, and how it can be improved by using public properties and some magic methods in specific cases instead.
First of all, one can gain a much bigger performance boost with less effort by using various means of caching (e.g. compiled templates, caching HTML output, caching PHP bytecode, or simply storing some values instead of recalculating them over and over again while processing a request), or by optimizing SQL queries, changing the locking strategy and synchronization, or maybe refactoring the heaviest parts of the architecture (moving things that go together closer to each other, shortening call chains while maintaining the overall logical concept of the code, etc.).
Actually I’d consider the need for a huge amount of boilerplate getter-setter methods a code smell that needs to be refactored. It can be acceptable for implementations of the active record pattern or things like that (actually I’m not a fan of getter-setters in such cases either), but generally speaking getter-setters can be a sign of bad OO design, and in that case complicating the code with magic methods is a means of working around a bad design while making the code harder to maintain.
Keeping encapsulation, the open/closed principle, KISS and similar guidelines in mind, I think an object (of a class) can be viewed as an entity that has some inner state, has some operations to manipulate that state, and has some functions (queries) that produce information according to that state. Operations or commands represent actions that can be done during the life of the object and that have effects on the future of the system. (That’s why a big majority of good method names contain a verb.) Neither queries nor commands in the public interface of an object are expected to be trivial. As most OOP coders have agreed to encapsulate the inner state of their objects, exposing parts of that inner state via trivial getter-setters just doesn’t make sense. (Forget about active records for a moment; they are intentionally designed to expose parts of their inner state.)
Why would someone want to call a method named setSomething()? Is it to set up some resource or dependency of the object? I guess a name like useSomething() is a better choice then, and if that’s the case, the method will likely check whether the dependency satisfies the object’s expectations (most likely by validating interface implementations or inheritance from a specific base class, e.g. by using type hinting in the method’s signature). Is it to change a public property (like a caption)? In that case both changeSomething() and setSomething() can be a reasonable name for the method, but it should still ensure that nothing unexpected can be injected into the state of the object, simply because nasty values can cause hellish bugs later on. In other words, I cannot find a reasonable example where a bare, naive setter is justified.
Getters are similar to that. In some cases no getter is needed at all. E.g. in the former case above, dependencies tend to be injected into the object, not queried. (I mean once I’ve given an SQL adapter object for example to some table row reader, I’m not likely to be interested later in knowing what adapter the reader is using, what I want it to return is the rows found in that table.) In other situations it might be reasonable to expose a property directly to the world outside a specific object class, but maybe only for reading. E.g. I can have a protected setter and a public getter, or to be more restrictive, when it makes sense, a private way of setting, but a protected getter method.
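To make the last two paragraphs more concrete, here is a minimal sketch. All names in it (SqlAdapter, ArrayBackedAdapter, TableRowReader, Button) are hypothetical and exist only for illustration: the dependency is handed over through a type-hinted useAdapter() with no matching getter, and the caption can only be changed through a method that validates its input.

```php
<?php
// All class and method names below are hypothetical, for illustration only.
interface SqlAdapter
{
    public function fetchAll($table);
}

// Stand-in adapter that serves rows from an array instead of a real database.
class ArrayBackedAdapter implements SqlAdapter
{
    private $tables;

    public function __construct(array $tables)
    {
        $this->tables = $tables;
    }

    public function fetchAll($table)
    {
        return isset($this->tables[$table]) ? $this->tables[$table] : array();
    }
}

class TableRowReader
{
    private $adapter;
    private $table;

    public function __construct($table)
    {
        $this->table = $table;
    }

    // The type hint rejects anything that does not implement SqlAdapter.
    public function useAdapter(SqlAdapter $adapter)
    {
        $this->adapter = $adapter;
    }

    // There is no getAdapter(): callers want the rows, not the dependency.
    public function readRows()
    {
        if ($this->adapter === null) {
            throw new LogicException('No SQL adapter has been provided');
        }
        return $this->adapter->fetchAll($this->table);
    }
}

class Button
{
    private $caption = '';

    // Not a naive setter: unexpected values never reach the inner state.
    public function changeCaption($caption)
    {
        if (!is_string($caption) || trim($caption) === '') {
            throw new InvalidArgumentException('Caption must be a non-empty string');
        }
        $this->caption = $caption;
    }

    public function render()
    {
        return '<button>' . htmlspecialchars($this->caption) . '</button>';
    }
}

$reader = new TableRowReader('users');
$reader->useAdapter(new ArrayBackedAdapter(array('users' => array(array('id' => 1)))));
$rows = $reader->readRows();

$button = new Button();
$button->changeCaption('Save & close');
$html = $button->render();
```

Neither class exposes its inner state directly, so there is nothing here that a magic __get() or __set() could speed up.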
To put it in one sentence: in a well-designed OOP architecture (which has huge benefits for maintaining and extending existing code), the 5-95% ratio mentioned in the referenced blog post is very far from being true, and the real causes of possibly bad performance cannot be solved with such magic methods.

PHP autoload performance
Ever wondered how autoloading classes affects performance of your PHP application? There are several discussions and benchmarks out there, but I had some questions that a quick google-ing did not answer, so I did some research on the topic.
I examined three class loading strategies:
- No autoload at all, every class is require_once-d manually (require_once).
- Guessing the path of the defining PHP source from the class name. Similar to PSR-0 but without namespaces (PSR-0).
- Finding path of the defining PHP source in an associative array (class map).
Each strategy was tested with both values of two settings:
- File paths:
  - absolute,
  - relative.
- Include path:
  - default,
  - extended with dummy paths.
To see how much overhead the benchmarking code itself added, I included a dummy loading strategy that simply did nothing, and thus didn’t load anything (skip).
I generated some 5000 class definitions (with some 50 lines of dummy PHP code each) with unique names and paths. Every test loaded all of them using one of the loading strategies combined with each variant. To trigger autoloading, the class_exists() function was called.
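The two autoloading strategies can be sketched roughly as follows. This is not the benchmark code itself: the helper names (psr0_like_loader, class_map_loader), the Dummy_Pkg_* classes and the temporary directory are assumptions made up for this illustration.

```php
<?php
// Throwaway directory with two dummy class files (hypothetical names).
$base = sys_get_temp_dir() . '/autoload_sketch_' . uniqid();
mkdir($base . '/Dummy/Pkg', 0777, true);
file_put_contents($base . '/Dummy/Pkg/Alpha.php', '<?php class Dummy_Pkg_Alpha {}');
file_put_contents($base . '/Dummy/Pkg/Beta.php', '<?php class Dummy_Pkg_Beta {}');

// PSR-0-like strategy without namespaces: the path is guessed from the
// class name by turning underscores into directory separators.
function psr0_like_loader($base)
{
    return function ($class) use ($base) {
        $path = $base . '/' . str_replace('_', '/', $class) . '.php';
        if (is_file($path)) {
            require $path;
        }
    };
}

// Class map strategy: the path is looked up in an associative array.
function class_map_loader(array $map)
{
    return function ($class) use ($map) {
        if (isset($map[$class])) {
            require $map[$class];
        }
    };
}

// class_exists() triggers the registered autoloader for unknown classes.
$psr0 = psr0_like_loader($base);
spl_autoload_register($psr0);
$alpha_loaded = class_exists('Dummy_Pkg_Alpha');
spl_autoload_unregister($psr0);

$map_loader = class_map_loader(array(
    'Dummy_Pkg_Beta' => $base . '/Dummy/Pkg/Beta.php',
));
spl_autoload_register($map_loader);
$beta_loaded = class_exists('Dummy_Pkg_Beta');
$gamma_loaded = class_exists('Dummy_Pkg_Gamma'); // not in the map, so not loaded
spl_autoload_unregister($map_loader);

var_dump($alpha_loaded, $beta_loaded, $gamma_loaded);
```

In both cases the loader itself does very little work; the file system access behind require is where most of the time is likely to go.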
Tests were run 10 times each, here are the average runtimes:
| | Relative paths | Relative paths with extended include_path | Absolute paths | Absolute paths with extended include_path |
|---|---|---|---|---|
| skip | 0.04 | 0.042 | 0.04 | 0.42 |
| PSR-0 | 2.643 | 3.676 | 2.595 | 2.608 |
| class map | 2.575 | 3.623 | 2.558 | 2.616 |
| require_once | 2.409 | 3.411 | 2.36 | 2.404 |
| require | 2.266 | 3.284 | 2.267 | 2.261 |
Conclusion:
- Autoloading does not significantly degrade performance. Looking up files on the include_path, then reading and parsing PHP scripts from disk, takes much longer than the bare autoloading logic itself.
- Although it can’t be seen in the above table, autoloading can improve performance in real-world situations. Consider what would happen if not all 5000 classes were triggered to be autoloaded. The require_once-ing tests would still load all of them, while autoload strategies would load only those that are really needed.
- When loading a huge number of classes, a PHP implementation of PSR-0 is slightly slower than a hash-based (class map) strategy.
The source code of the tests is available at GitHub.
Update: Adam Backstrom suggested adding require to the tests, as require_once implies many more stat() calls under the hood (exactly twice as many as require; check it out with strace).

Betrayed by PHP :-(
Nucc shared a blog post on Twitter yesterday titled PHP vs Python array memory allocation. It states that storing ten thousands of arrays of strings in an array costs only 1.39 MB of memory for Python and 19 MB for PHP. I couldn’t believe PHP could be that stupid, so I tried it for myself.
I ran the test scripts and saw that the results on a 64 bit machine with Python 2.6.6 and PHP 5.3.3 are not much different: 3.5 MB for Python, 22 MB for PHP. Then I realized what’s wrong with the Python test script: it does not emulate a real world situation. You know, Python stores references for strings, which means basically all the test strings were represented as pointers in the memory. I guess the author of the original blog post used a 32 bit environment, that’s why my test showed different results for Python.
In the real world you don’t want to store thousands of references to the same string, so I decided to modify the tests to be more realistic. My tests put a string of ~50 characters suffixed with a unique number into an array. I also created tests that put the same string into the array every time, just for comparison. The results still make Python the winner:
As you can see it does not make a difference in PHP if you store the same or different strings, but in the worst scenario, when using different strings, PHP uses nearly twice as much memory as Python. The difference might be caused by the internal character encoding PHP uses. Just for the record, Python can be made to eat tons of memory by prefixing the lorem-ipsum string with a “u” in the source (to use Unicode strings).
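For reference, a measurement of this kind can be sketched in PHP as below. This is not the original test script: the helper name measure_array_memory() and the 10,000 x 10 dimensions are assumptions, and the numbers it prints depend heavily on the PHP version and platform, so they should not be expected to match the figures quoted above.

```php
<?php
// Hypothetical helper: builds $rows arrays of $cols strings each and returns
// how many bytes memory_get_usage() grew by while the data was alive.
function measure_array_memory($rows, $cols, $unique)
{
    $before = memory_get_usage();
    $data = array();
    for ($i = 0; $i < $rows; ++$i) {
        $row = array();
        for ($j = 0; $j < $cols; ++$j) {
            $text = 'Lorem ipsum dolor sit amet, consectetur adipiscing ';
            // Either reuse the same string or make it unique with a numeric suffix.
            $row[] = $unique ? $text . ($i * $cols + $j) : $text;
        }
        $data[] = $row;
    }
    $after = memory_get_usage();
    return $after - $before;
}

// Assumed dimensions: 10,000 rows of 10 strings each.
$same_bytes = measure_array_memory(10000, 10, false);
$unique_bytes = measure_array_memory(10000, 10, true);

printf(
    "same string: %.2f MB, unique strings: %.2f MB\n",
    $same_bytes / 1048576,
    $unique_bytes / 1048576
);
```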
Okay, to be honest, at this point I still like PHP, it has some nice OOP features that I miss from Python. But!
As an additional test, I wanted to see the results using an SplFixedArray, as it might perform better (or worse), so I created a test script and… wait for it… it crashed PHP with a segmentation fault! It was the default PHP package provided by Ubuntu 10.10, and I could not reproduce it with the latest snapshot from php.net compiled manually, but it still made me feel sad and betrayed.
P.S.: regardless of the language, I don’t like huge amounts of data stored in memory (the original motivation for the tests was some Excel importing problem). For almost every task some smart iterators will do the job way better.
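As a rough illustration of the iterator approach, the sketch below uses a generator, which needs PHP 5.5 or newer (a class implementing Iterator would do the same job on older versions). The function generate_rows() is a hypothetical stand-in for whatever actually produces the rows, e.g. a spreadsheet reader.

```php
<?php
// Hypothetical row source: yields one row at a time instead of building
// one huge array of all rows up front.
function generate_rows($count)
{
    for ($i = 0; $i < $count; ++$i) {
        yield array('id' => $i, 'label' => 'row ' . $i);
    }
}

$sum = 0;
foreach (generate_rows(100000) as $row) {
    // Only the current row is alive here; earlier rows can be freed.
    $sum += $row['id'];
}

echo $sum, "\n";
```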
Update: Algernon implemented the tests for Perl and it turned out Perl beats the hell out of both Python and PHP. Look at the results of the dynamic test:

Quines
When I saw this 11 level quine, I was like OMG, how on earth can someone do this? I knew that programs that reproduce their own source code can be written in every Turing-complete language, but man, if I had that amount of free time!
At first sight, writing quines seems to be impossible (of course, it would be trivial if you could cheat, e.g. read the source from a file). For example, look at this simple PHP code:
<?php

echo "";

?>
If you want to write a PHP script that outputs the above code, you may start with something like this:
<?php

echo "<?php\n\necho \"\";\n\n?>\n";

?>
And end up with infinitely long code:
<?php

echo "<?php\n\necho \"<?php\n\necho \\\"<?php\\n\\n...

?>
The trick is simple: use a string variable that contains all the code except for the string itself. First, output the first part of this string, then output the string escaped, and then output the rest. Additionally, this way any program can be extended with quine functionality.
The following PHP-CLI code outputs some Fibonacci numbers (note: the dwim() function could be anything), but when invoked with the first command line argument being “quine”, it outputs its own source:
<?php

function qescape($str)
{
    return str_replace("\n", "\\n", preg_replace('/([$\\\\"])/', '\\\\$1', $str));
}

$s = "<?php\n\nfunction qescape(\$str)\n{\n    return str_replace(\"\\n\", \"\\\\n\", preg_replace('/([\$\\\\\\\\\"])/', '\\\\\\\\\$1', \$str));\n}\n\n\$s = \"\";\n\nfunction dwim()\n{\n    // payload\n    \$a = 0;\n    \$b = 1;\n    for (\$i = 0; \$i != 20; ++\$i)\n    {\n        echo \"\$a \";\n        \$b = \$a + \$b;\n        \$a = \$b - \$a;\n    }\n    echo \"\\n\";\n}\n\nif (isset(\$argv[1]) && \$argv[1] == 'quine')\n    echo substr(\$s, 0, 124) . qescape(\$s) . substr(\$s, 124);\nelse\n    dwim();\n\n?>\n";

function dwim()
{
    // payload
    $a = 0;
    $b = 1;
    for ($i = 0; $i != 20; ++$i)
    {
        echo "$a ";
        $b = $a + $b;
        $a = $b - $a;
    }
    echo "\n";
}

if (isset($argv[1]) && $argv[1] == 'quine')
    echo substr($s, 0, 124) . qescape($s) . substr($s, 124);
else
    dwim();

?>
The same idea can be used to create an iterating quine: a program that outputs another program that produces the source of the original program again (or, if you have more free time, this sequence can be even longer). I tried this in PHP and Python: I extended my PHP quine to output its source Python-quoted, with a print and quotes added around it.
<?php

function php_escape($str)
{
    return str_replace("\n", "\\n", preg_replace('/([$\\\\"])/', '\\\\$1', $str));
}

function python_escape($str)
{
    return str_replace("\n", "\\n", preg_replace('/([\\\\"])/', '\\\\$1', $str));
}

$s = "print \"<?php\n\nfunction php_escape(\$str)\n{\n    return str_replace(\"\\n\", \"\\\\n\", preg_replace('/([\$\\\\\\\\\"])/', '\\\\\\\\\$1', \$str));\n}\n\nfunction python_escape(\$str)\n{\n    return str_replace(\"\\n\", \"\\\\n\", preg_replace('/([\\\\\\\\\"])/', '\\\\\\\\\$1', \$str));\n}\n\n\$s = \"\";\n\necho substr(\$s, 0, 7) . python_escape(substr(\$s, 7, 243) . php_escape(\$s) . substr(\$s, 250, 125)) . substr(\$s, 375);\n\n?>\n\"\n";

echo substr($s, 0, 7) . python_escape(substr($s, 7, 243) . php_escape($s) . substr($s, 250, 125)) . substr($s, 375);

?>


