Improve PHP Captcha with Optical Character Recognition Tests

If you’re working on a captcha system for your PHP-based website, you may be faced with an interesting challenge. How do you make your system too hard for spam bots to read, but not too hard for humans? This is especially worrying in the wake of bots that can harness OCR for reading captchas. This article explains how to increase the difficulty of a captcha system and test it to make sure it meets your requirements.

There are many standard captcha solutions available that will work with PHP, including reCaptcha and Asirra captcha.

However, some PHP developers might need to work on their own captcha solution. They can gain overall flexibility and total control of the captcha solution. It also helps the PHP developer to learn the overall operating principle and development behind captcha (Completely Automated Public Turing test to tell Computers and Humans Apart). Gaining this level of understanding means that, if the captcha produces an undesirable result (such as being too difficult for humans), the developer can adjust it accordingly, without defeating the overall objective of separating bots from humans.

This article focuses on making a PHP-based captcha system harder to crack and on testing it against optical character recognition (OCR). Most spammers and anti-captcha spam bots use OCR technology to crack captchas. If OCR technology can read your captcha, the captcha is effectively worthless.

Presentation of the Problem

Consider the existing PHP captcha script (antibot.php) below:

 

//Start PHP session
session_start();

//Generate random number

//Store generated random number in a session

//Create image 50 x 50 pixels

//Initial background and text color of the captcha image

//Write the string on the image

//Output the image

<form>
<!-- Display the captcha image on the browser -->
<img src="antibot.php" />
<br />
Type the anti-bot code above:
<br /> <br />
<input type="text" name="captcha" size="10">
</form>

 

 

 

How easy is this captcha for a bot to defeat? I tested it with Tesseract, an excellent open source optical character recognition engine that is also used by an online OCR tool. I took three image samples of this captcha's output and uploaded each image to the OCR tool. I obtained the following result:

 

A very good OCR engine can break this captcha perfectly. If you use this captcha on your website, you risk being compromised by a spam bot using this OCR engine.
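The test above went through an online OCR tool. For repeated testing, the same engine could be run locally. The sketch below assumes the `tesseract` command-line program is installed and on the PATH of a Unix-like system. The function names `tesseract_command` and `ocr_read_digits` are invented for this illustration, and the `--psm 7` page segmentation mode and the digit whitelist are tuning choices that were not part of the original test (some Tesseract versions may ignore the whitelist).

```php
<?php
// Hypothetical helper: builds the command line for a local Tesseract run.
// Assumes the "tesseract" binary is installed and on the PATH.
function tesseract_command(string $imagePath): string
{
    // "stdout" makes Tesseract print the recognized text instead of writing a file.
    // --psm 7 treats the image as a single line of text.
    // The whitelist asks the engine to consider digits only (the captcha is numeric).
    return 'tesseract ' . escapeshellarg($imagePath)
        . ' stdout --psm 7 -c tessedit_char_whitelist=0123456789';
}

// Hypothetical helper: runs the OCR and returns only the digits it reported.
function ocr_read_digits(string $imagePath): string
{
    // Discard diagnostic messages on stderr (Unix-style redirection).
    $output = shell_exec(tesseract_command($imagePath) . ' 2>/dev/null');
    return (string) preg_replace('/\D/', '', (string) $output);
}
```

Saving a few captcha images to disk and passing each path to `ocr_read_digits` gives a reading that can be compared with the code stored for that image.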

Adding Background Noise to the Captcha

The original captcha is perfectly clean: it has none of the background noise that you can see in other captcha systems. Let’s add background noise to the existing captcha script above.

Step 1. Start the session. Generate a random number and assign the number to the session variable. This session variable is used when passing the correct answer to the web form, and then to the PHP script that will validate the answer in the actual web application:

<?php
session_start();
$stringgen = mt_rand(1000, 9999);
$_SESSION['answer'] = $stringgen;
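Step 1 stores the answer in `$_SESSION['answer']` so that a separate script can check what the visitor typed. A minimal sketch of that check follows. The function name `captcha_matches` is hypothetical, the field name `captcha` comes from the form shown earlier, and the use of `$_POST` assumes the form is submitted with the POST method.

```php
<?php
// Hypothetical helper: compares the submitted code with the stored answer.
function captcha_matches($expected, $submitted): bool
{
    // No stored answer means the image was never requested in this session.
    if ($expected === null || $expected === '') {
        return false;
    }
    // Cast to string: mt_rand() stored an integer, the form sends a string.
    return hash_equals((string) $expected, trim((string) $submitted));
}

// Usage in the validating script (sketch):
// session_start();
// $ok = captcha_matches($_SESSION['answer'] ?? null, $_POST['captcha'] ?? '');
// unset($_SESSION['answer']); // allow one attempt per generated image
```

Clearing the stored answer after each check is a design choice in this sketch: it prevents a bot from retrying many guesses against the same image.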

 

Step 2. Create a 50 x 50 pixel image, and then declare the background and text color of the captcha:

$imagecreate = imagecreate(50, 50);

// blue background and white text
$background = imagecolorallocate($imagecreate, 0, 0, 255);
$textcolor = imagecolorallocate($imagecreate, 255, 255, 255);

Step 3. Add background noise by sprinkling pixel dots over the image created above.

You can sprinkle the dots randomly on the image using the for loop below and the imagesetpixel function, which draws a single pixel on the image at the random $x and $y coordinates:

for ($c = 0; $c < 40; $c++) {
    $x = rand(0, 50 - 1);
    $y = rand(0, 50 - 1);
    imagesetpixel($imagecreate, $x, $y, $textcolor);
}

 

The pixel coordinates depend on the size of the image (50 x 50 pixels). This is why the random numbers are bounded to the range 0 to 49, giving a maximum possible coordinate of (49, 49).

The for loop condition $c < 40 draws 40 white dots on the 50 x 50 pixel image. Because random coordinates can repeat, slightly fewer than 40 distinct pixels may end up set.

To increase the difficulty, the noise dots are drawn in the same color as the captcha text. The text is white (defined by $textcolor = imagecolorallocate($imagecreate, 255, 255, 255);), so the dots are also white (set by imagesetpixel($imagecreate, $x, $y, $textcolor);).
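Because the coordinates depend on the image size, the bounds can be read from the image itself instead of being hard-coded as 50. The sketch below wraps the noise loop in a function under that assumption. The name `add_pixel_noise` is invented here, and `mt_rand` is used in place of `rand` only for consistency with Step 1.

```php
<?php
// Hypothetical helper: scatters $count dots of $color over a GD image.
function add_pixel_noise($image, int $count, int $color): void
{
    $maxX = imagesx($image) - 1; // largest valid x coordinate
    $maxY = imagesy($image) - 1; // largest valid y coordinate
    for ($c = 0; $c < $count; $c++) {
        imagesetpixel($image, mt_rand(0, $maxX), mt_rand(0, $maxY), $color);
    }
}

// Usage (requires the GD extension):
if (extension_loaded('gd')) {
    $img   = imagecreate(50, 50);
    $blue  = imagecolorallocate($img, 0, 0, 255);     // first allocation becomes the background
    $white = imagecolorallocate($img, 255, 255, 255); // noise and text color
    add_pixel_noise($img, 40, $white);
}
```

With this form, changing the image dimensions or the number of dots does not require touching the loop.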

Step 4. Draw the captcha text horizontally on the image, which now carries the background noise added in the previous step:

$xlocation = rand(1, 10);
$ylocation = rand(1, 10);
imagestring($imagecreate, 5, $xlocation, $ylocation, $stringgen, $textcolor);

 

The string is also positioned randomly within the image, as set by the $xlocation and $ylocation variables.
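The fixed range of 1 to 10 was chosen for a four-digit code in a 50 x 50 image. If the image size, font, or code length changes, a safe range could be derived instead. The sketch below is not part of the original script: `max_text_offset` is a hypothetical helper, and the example relies on GD's `imagefontwidth` and `imagefontheight` to measure the built-in font.

```php
<?php
// Hypothetical helper: largest offset that keeps text of the given pixel
// extent inside one image dimension (never negative).
function max_text_offset(int $imageSize, int $textSize): int
{
    return max(0, $imageSize - $textSize);
}

// Usage (requires the GD extension):
if (extension_loaded('gd')) {
    $font = 5;      // built-in GD font used in the article
    $code = '1234'; // example four-digit code
    $textWidth  = imagefontwidth($font) * strlen($code);
    $textHeight = imagefontheight($font);
    $xlocation = mt_rand(0, max_text_offset(50, $textWidth));
    $ylocation = mt_rand(0, max_text_offset(50, $textHeight));
}
```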

Step 5. Finally, output the generated image as PNG in the web browser:

header("Content-type: image/png");
imagepng($imagecreate);
?>

The complete script at this stage is:

<?php
// Start the session and store a random four-digit answer
session_start();
$stringgen = mt_rand(1000, 9999);
$_SESSION['answer'] = $stringgen;

// Create a 50 x 50 pixel image with a blue background and white text
$imagecreate = imagecreate(50, 50);
$background = imagecolorallocate($imagecreate, 0, 0, 255);
$textcolor = imagecolorallocate($imagecreate, 255, 255, 255);

// Background noise: 40 white dots at random coordinates
for ($c = 0; $c < 40; $c++) {
    $x = rand(0, 50 - 1);
    $y = rand(0, 50 - 1);
    imagesetpixel($imagecreate, $x, $y, $textcolor);
}

// Draw the code at a random position
$xlocation = rand(1, 10);
$ylocation = rand(1, 10);
imagestring($imagecreate, 5, $xlocation, $ylocation, $stringgen, $textcolor);

// Output the image as PNG
header("Content-type: image/png");
imagepng($imagecreate);
?>

 

Below are the captchas generated using the first solution: http://www.php-developer.org/firstcaptchasolution.php

The following screenshot shows the evaluation results from the Tesseract optical character recognition engine:

 

According to the results, the OCR still answers the captcha challenges correctly. This means that the background noise level set by $c < 40 does not add enough difficulty to the original challenge.

The level of background noise should be increased, while keeping the security code text readable by humans.

Increasing Captcha Difficulty

To raise the noise level to five times the original, change the loop condition from $c < 40 to $c < 40*5, that is, $c < 200. The for loop now looks like this:

for ($c = 0; $c < 200; $c++) {
    $x = rand(0, 50 - 1);
    $y = rand(0, 50 - 1);
    imagesetpixel($imagecreate, $x, $y, $textcolor);
}
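To put the two settings in perspective, the share of the image covered by noise can be estimated. The arithmetic below is an illustration added here, not a measurement from the tests. It ignores the pixels later covered by the text, and `expected_noise_pixels` is a hypothetical name. Because random coordinates can repeat, the expected number of distinct dots is slightly below the loop count.

```php
<?php
// Expected number of distinct pixels hit by $dots uniformly random draws
// over an image of $width x $height pixels (draws may repeat).
function expected_noise_pixels(int $width, int $height, int $dots): float
{
    $total = $width * $height;
    return $total * (1 - pow(1 - 1 / $total, $dots));
}

foreach ([40, 200] as $dots) {
    $pixels = expected_noise_pixels(50, 50, $dots);
    printf(
        "%d dots: about %.1f distinct pixels (%.1f%% of the image)\n",
        $dots,
        $pixels,
        100 * $pixels / (50 * 50)
    );
}
```

Under these assumptions, 40 dots cover roughly 1.6% of the 2,500 pixels, while 200 dots cover a little under 8%.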

To see how the captcha looks at this setting, go to this link: http://www.php-developer.org/antibot.php

Below is the evaluation result of this captcha against the OCR engine:

 

Ten samples were evaluated, and the OCR did not read a single one correctly.

This means that increasing the background noise to five times the original level makes the captcha difficult enough that the optical character recognition engine could no longer read it accurately.
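A result such as "no correct readings out of ten" can be tallied automatically if the expected code is kept next to each OCR reading. The sketch below shows one way to do that. `ocr_accuracy` is a hypothetical helper, and the sample values are invented for illustration; they are not the test data from this article.

```php
<?php
// Hypothetical helper: fraction of samples the OCR read exactly right.
// $samples is a list of [expected code, OCR output] pairs.
function ocr_accuracy(array $samples): float
{
    if (count($samples) === 0) {
        return 0.0;
    }
    $hits = 0;
    foreach ($samples as [$expected, $read]) {
        // Ignore whitespace the OCR may add around or inside the digits.
        if ((string) $expected === preg_replace('/\s+/', '', (string) $read)) {
            $hits++;
        }
    }
    return $hits / count($samples);
}

// Invented sample data: expected code and what an OCR engine returned.
$samples = [
    ['4821', "4821\n"], // correct reading
    ['7305', '73 05'],  // correct once whitespace is stripped
    ['1196', '1l9G'],   // misread
    ['5540', ''],       // nothing recognized
];

printf("OCR accuracy: %.0f%%\n", 100 * ocr_accuracy($samples));
```

Running the same tally at several noise levels makes it easier to find the lowest setting at which the OCR accuracy drops to zero while the code stays readable to people.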

The Final Script

The final script is shown below:

<?php
// Start the session and store a random four-digit answer
session_start();
$stringgen = mt_rand(1000, 9999);
$_SESSION['answer'] = $stringgen;

// Create a 50 x 50 pixel image with a blue background and white text
$imagecreate = imagecreate(50, 50);
$background = imagecolorallocate($imagecreate, 0, 0, 255);
$textcolor = imagecolorallocate($imagecreate, 255, 255, 255);

// Background noise: 200 white dots at random coordinates
for ($c = 0; $c < 200; $c++) {
    $x = rand(0, 50 - 1);
    $y = rand(0, 50 - 1);
    imagesetpixel($imagecreate, $x, $y, $textcolor);
}

// Draw the code at a random position
$xlocation = rand(1, 10);
$ylocation = rand(1, 10);
imagestring($imagecreate, 5, $xlocation, $ylocation, $stringgen, $textcolor);

// Output the image as PNG
header("Content-type: image/png");
imagepng($imagecreate);
?>

Conclusions and Recommendations

There are many other ways to increase the difficulty of a captcha, such as distorting the text or making the characters appear connected to each other. However, these techniques can also substantially increase the difficulty for humans, who may then make frequent mistakes when answering this type of challenge.

As shown in this tutorial, simply increasing the background noise can make the OCR unable to solve the challenge properly without necessarily adding too much difficulty for humans.
