Creating 50,000 unique values

A client at work wants to run a promotion where a customer will receive a card with a unique 8 digit code on it when they buy something. They will then be able to visit the website to find out if they are a winner or to get a chance to enter a prize draw. I have to put all of the code together to manage this and thought I would run a few experiments to try out a few ideas, the first of which was to generate the 50,000 unique codes needed for the competition. I didn’t think this would be a particularly difficult task (and in reality it wasn’t) but I hit a number of problems while doing this that revealed some interesting things about the variations of running PHP on different operating systems and issues with handling large datasets.

Development environment

My development environment is Windows 7 with Apache (compiled using VC9), PHP 5.3.2 and MySQL 5.1.47. For coding I use NuSphere’s PHPED. All of my timings were made using the built in profiler that comes with PHPED and the DBG PHP debugging extension. Why is this information important? Hopefully that will become clear later.

My first attempt

My first attempt used a while loop to iterate 50,000 times to create the values. On each iteration a code was generated and a prepared statement executed to insert the code into the database. The column holding the code has a unique index on it, causing an exception to be thrown if PHP generates a duplicate 8 digit code. This is then caught in the inner catch block, which causes the loop to go through another iteration. If the code is inserted successfully the counter is incremented and the loop carries on. Code for this is below:


<?php
 //Up the execution time to 15 minutes as this takes ages to run
 ini_set('max_execution_time', 900);
 try {
 //Create a PDO connection to the database
 $db = new PDO('mysql:dbname=test;host=localhost', 'root', 'PASSWORD');
 //Set the PDO error mode to exceptions
 $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
 //Delete all entries from the test table
 $db->exec('TRUNCATE test');
 //Prepare the statement
 $st = $db->prepare('INSERT INTO test(code) VALUES(?)');
 //Bind the parameter to the $code variable.
 //When the statement is executed whatever value is in $code will be used as the value for the parameter
 $st->bindParam(1, $code);
 //Set the counter for the main loop
 $i = 0;
 //Possible characters for the code string
 $possible = "0123456789bcdfghjkmnpqrstvwxyz";
 //Length of the possible characters (0 based)
 $len = strlen($possible) - 1;
 //Loop 50,000 times!
 while ($i < 50000) {
 try {
 //Reset the $code variable
 $code = "";
 //Create a counter for the inner loop
 $j = 0;
 //add random characters to $code until it is 8 characters long
 do {
 //Get a random character
 $char = $possible[mt_rand(0, $len)];
 //we don't want this character if it's already in the code
 if (! strpos($code, $char)) {
 $code .= $char;
 $j++;
 }
 } while ($j < 8);
 //Execute the statement. The value of $code will be used as the parameter
 $st->execute();
 $i++;
 }
 catch (PDOException $e) {
 //If an exception is caught here it's probably a duplicate value in the code column. Continue to get a new value.
 continue;
 }
 }
 }
 catch (PDOException $e) {
 //Any other exceptions caught here.
 echo $e->getMessage();
 }
?>

The major problem with this was the time taken to execute it: after 15 minutes I still didn’t have 50,000 rows in the database but only something like 38,000! The code worked but was amazingly slow. After running the code through a profiler I found that 14 mins 54 seconds were spent on the line ‘$st->execute()’ to insert the row. I was quite amazed by this as I didn’t expect the database to add quite such an overhead. The other important point to mention here is that I used the mt_rand() function. Why that’s important will become clear in a moment.

At this point I had two main questions:

  1. What’s the quickest way to create 50,000 unique 8 digit codes?
  2. What’s the quickest way to insert these rows into the database?

I should add that the need to find answers to both of these questions was somewhat academic and more to satisfy my own curiosity. This code would never be run on a live server but would be used on my development box to generate a database table. This could then be dumped into a SQL file and used to create the table on a live server. As a result performance was not the number one consideration in this case. I still wanted to run this in less than 15 minutes though!

The ‘Hamlet’ solution

At this point I turned to the Open University web development forums for help. I learnt web development at the OU and their forums are open to students, tutors and ex-students like myself. Some very clever people hang out there and I knew I would get some good suggestions. Two of the tutors of the course on open source web development tools, Michelle Hoyle and Keith Evetts, which focuses on PHP as a server side scripting language engaged in a discussion with me about how to generate codes quickly. From the beginning I noticed that ideas they were posting were not working for me. They were using PHP functions like str_shuffle(), array_rand() and rand() and producing 50,00 unique values quickly. I had independently tried str_shuffle() and array_rand() too but had given up on them as they generated huge numbers of duplicate codes after a while. At this point I couldn’t understand why Keith and Michelle could produce solutions using these functions that I couldn’t get to work.

Keith Evetts came up with an idea which seemed crazy at the time but which generates random values very well and quickly. His idea was to take a large piece of text, use mcrypt() to encrypt it, discard all non alpha-numeric characters from the encrypted text and use this to generate the 8 digit codes. He used the the third act of Hamlet as the text (hence my calling this the Hamlet solution). This worked extremely well except for the fact that even the third act of Hamlet couldn’t be used to produce 50,000 codes. The solution was to loop over the code generation, encrypting the text using different initialisation vectors and keystrings, until 50,00 uniques values had been produced. I also worked out at this time that the fastest way to insert this many values quickly into a MySQL table was to use the ‘LOAD DATA INFILE’ command. I used Keith’s function to generate the codes, writing these to a text file before using ‘LOAD DATA INFILE’ to insert the records into the database. The PHP code ran in 1.9 seconds while the database insert took around 6 seconds. Here’s the PHP code:


<?php
 /* ---------  array function generate_codes ( int $number, string $plaintext , string $keystring, $length ) -----------
Keith Evetts 3 July 2010 license: LGPL 2.  This notice and author name must remain intact.
Args: number of codes to be generated,
 plaintext string from which to generate them, minimum 100 chars (optimal length is 4000 chars)
 keystring in plain text; minimum 12 chars
 length of code strings to be generated
Returns: enumerated array of  unique alphanumeric codes of length $length
Requires: PHP mcrypt lib with Rijndael 256 (v. 2.4 + of mcrypt)
-------------------------------------------------------------------------------------------------------------------------------------- */
 function generate_codes ( $number, $plaintext, $keystring, $length = 8 ) {
 switch (true) {
 case ( ! is_int($number) )                                                :
 case ( ! is_int($length) )                                                  :
 case ( ! is_string($plaintext)  || strlen($plaintext ) < 100)   :
 case ( ! is_string($keystring) || strlen($keystring) < 12 )    :
 throw new Exception (' function generate_codes called with incorrect params ');
 break;
 // default is proceed
 }
 // use the same text and randomly different keys to reach desired number of codes
 // for e.g. 50000 codes this will take several passes
 $unique_array = array();
 do {
 // get a new key
 $key =  substr ( sha1 ( str_shuffle ( $keystring )  ) , 0 , 32 );
 // get a new initialisation vector
 $iv = substr ( sha1 ( str_shuffle ( "the slings and arrows of outrageous fortune" )  ) , 0, 32 );
 $ciphertext = mcrypt_encrypt (   MCRYPT_RIJNDAEL_256,
 $key,
 $plaintext,
 MCRYPT_MODE_CBC,
 $iv
 );
 /* clean out non-alphanumeric chars at some cost to code redundancy */
 $ciphertext = preg_replace( '/[^2346789abdefhjkmnprtwxyz]/' ,  "" , strtolower($ciphertext) );
 $codearray = str_split ( $ciphertext, $length );
 // dump leftover element at end
 array_pop( $codearray );
 $size = sizeof($codearray);
 for ($i = 0; $i < $size; $i++) $unique_array[] = $codearray[$i];
 /*somewhat amazingly, it is far quicker to enlarge the array by adding elements one at a time in a for loop, than to use array_merge() ! */
 } while ( sizeof ( $unique_array ) <= ( $number  + 1 ) ) ;
 // now remove any duplicates at end of whole process
 $unique_array = array_unique ( $unique_array );
 return array_slice($unique_array, 0, $number);
 }
 try {
 $codes = generate_codes(50000, file_get_contents('plaintext_Hamlet_Act3.txt'), 'This is the keystring');
 }
 catch (Exception $e) {
 echo $e->getMessage();
 }
 $file = fopen('codes.txt', 'w');
 foreach($codes as $code) {
 fwrite($file, "$code\r\n");
 }
 fclose($file);
?>

This was clearly the winner for me on Windows but it still didn’t explain why PHP random functions couldn’t be used for me on Windows.

‘Random’ functions on Windows-a theory

At this point I was left wondering why the various ‘random’ functions on Windows had performed so badly for me. I knew that Michelle Hoyle uses a Mac and that Keith Evetts scripts executed without a problem on a web server. I began to think that the problem was PHP on Windows. To test this out I uploaded one of the scripts that generated huge numbers of duplicates for me to a live server running CentOS Linux and everything worked without any problems. So what’s going on? I have a theory and once again I need to thank Keith Evetts for pointing me on the way to this. It seems that the PHP function rand() is simply a wrapper for the operating systems native random function (see here for more details). The PHP manual states:

Note: On some platforms (such as Windows), getrandmax() is only 32768. If you require a range larger than 32768, specifying min and max will allow you to create a range larger than this, or consider using mt_rand() instead.

My theory is that functions such as str_shuffle() or array_rand() are also using the native operating system random function. It just happens to be that the function on the Linux/Unix platform is far better than the one available under Windows, which would explain why the scripts run so differently under different platforms. Normally this wouldn’t create any problems but when you’re dealing with very large datasets where randomness is a necessity it becomes a problem. This would also explain why my initial attempt had no problems generating 50,000 unique values as I was using the mt_rand() function. This is based on mathematics known as the Mersenne Twister, which generates better random numbers, and does not rely on the operating systems random number generator. I don’t know any C or the PHP source code well enough to confirm this but this is my strong hunch. Is someone able to confirm this?

Conclusion

As I said at the beginning what should have been a fairly simple exercise turned into something a little more involved and ultimately informative. I would suggest that if you need to make a large number of random values that these are the guidelines you might want to follow:

  • If you’re going to be running the script on a Linux/Unix box or you’re developing on such a system go ahead and use whichever PHP functions you like. As the underlying OS random number generation is better than on Windows you won’t run into any problems and will almost certainly get better performance this way.
  • If you’re running on Windows your options are more limited. Here I would suggest that you use either mt_rand() or the ‘Hamlet’ approach if you require more than 32,768 random values.

I know that all programs rely on the OS they’re executing on but this discrepancy seems quite big to me. Is there any way that PHP’s ‘random’ functions could be re-written to take advantage of the Mersenne Twister? With my limited knowledge of the PHP core code that would seem to me to offer the combination of good random generation and consistency across platforms. Of course, if it were that simple I’m sure someone else would already have done it.

2 thoughts on “Creating 50,000 unique values

  1. I should add that on Apache, the following method is fifty times quicker than encrypting Hamlet (you need to write an Exception class to go with it):

    function get_codes( $numcodes, $codelength , $wastage = 25 ) {
    if ( is_int($numcodes)
    && is_int($codelength)
    && $codelength >= 5
    && $codelength <= 40
    && is_int($wastage) )
    {

    // acceptable characters
    $goodchars = '2346789AaBbDdEeFfHhJjKkMmNnPpRrTtWwXxYyZz';
    // build array allowing for possible wastage through duplicate values
    do { $unique_array[] = substr ( str_shuffle( $goodchars ) , 0 , $codelength );
    } while ( sizeof ( $unique_array ) $numcodes + 1 ) {
    return array_slice ( $unique_array, 0, ($numcodes + 1) );
    } else throw new CodeGenException ( $numcodes – $size.’ fewer codes produced than requested: wastage’, 1, $numcodes, $codelength );
    } else throw new CodeGenException ( ‘function get_codes called with wrong params’, 2, $numcodes, $codelength );
    } // end func
    ___________________________
    Notes:
    – the string $goodchars comprises alphanumeric characters optimised to reduce visual confusions
    – $goodchars must be longer than max codelength. Currently 41 chars.
    – $wastage is tuned for default of 25 for a 5-character code
    – average time to execute for an array of 50000 codes of 8-character length is 0.25 seconds
    – you can try…catch CodeGenException # 1 and try the function again with a greater allowance for wastage

    1. Thanks Keith, that one is faster than ‘Hamlet’ but it didn’t work for me under Apache on Windows. If my theory is correct the problem is the use of str_shuffle(), which generated huge numbers of duplicates for me after just 4 iterations.

Leave a Reply