# Performing Inferential Statistical Analysis with PHP

This is the second part of a two-part article. The first part covered using PHP to calculate descriptive statistics. In this part we go deeper and use PHP for inferential statistics, a different area of statistics that deals with comparing data sets at a chosen level of statistical significance.

In other words, if there are at least two data sets that are being analyzed and compared with each other, we are dealing with the inferential side of statistics. What is the importance of doing this in PHP? After all, we can do this using Excel, right?

The short answer is online interactivity. A PHP script runs on the most common web servers (Apache, for example), so anyone with a browser can submit data and get a result.

It is not simple to upload an Excel worksheet and make it completely interactive and compatible with Apache. PHP, on the other hand, is the main server-side scripting language for Apache.

Of course, I emphasized this already in the first part. You can run Excel online, but doing so brings a little bit of complexity to the equation, and may cause security issues.

## Overview of Statistical Comparison

In strict statistical literature, all samples must be gathered at random to avoid bias, which can pollute the results. For example, say you are conducting a study of whether your campaign efforts have improved your website’s traffic. A suitable comparison could be to take the daily traffic of your website on randomly chosen days (say 15 days) during the month before your campaign, and do the same for the month after you’ve implemented the campaign (also 15 random days). Then you can use inferential statistical techniques to analyze your data and arrive at a conclusion.

One of the most popular of these techniques is Student’s t-test. It is a method of comparing the means of two data samples obtained by random sampling. The T-value is defined as:

T = (average of x – average of y) / (Square root ((Variance of x/Nx) + (Variance of y/Ny)))

It looks complex, doesn’t it? This is why we need to find simple ways to automate the comparison of data online. Without PHP, things can get pretty complicated.

In short, once the T-value has been computed, it is compared with a T-critical value. If the computed T-value exceeds the critical value, the two samples are considered statistically different; otherwise they are not.

In the above formula, the x and y variables signify the two samples. Nx is the number of data points in sample x, and the same idea applies to Ny.

Variance is the square of the standard deviation, based on classic statistical literature.

The basic strategy is to compute the descriptive statistics (average and variance) first. Once these data are available, the T-value can be computed, which will then be compared with a T-critical value.
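As a rough illustration of this strategy, the sketch below computes the mean and variance of each sample and then the T-value from the formula above. The function names (`mean`, `sampleVariance`, `tValue`) and the sample numbers are made up for this example and are not part of the script presented later.

```php
<?php
// Illustrative helpers only; these names are assumptions for this sketch.

// Arithmetic mean of a list of numbers.
function mean(array $data): float
{
    return array_sum($data) / count($data);
}

// Sample variance (divides by N - 1), i.e. the squared standard deviation.
function sampleVariance(array $data): float
{
    $avg = mean($data);
    $sumSquares = 0.0;
    foreach ($data as $value) {
        $sumSquares += ($value - $avg) ** 2;
    }
    return $sumSquares / (count($data) - 1);
}

// T = (average of x - average of y) / sqrt(variance of x / Nx + variance of y / Ny)
function tValue(array $x, array $y): float
{
    $standardError = sqrt(
        sampleVariance($x) / count($x) + sampleVariance($y) / count($y)
    );
    return (mean($x) - mean($y)) / $standardError;
}

// Made-up data, for illustration only.
$before = [2, 4, 6, 8];
$after  = [1, 2, 3, 4];
echo tValue($before, $after), "\n"; // roughly 1.732
```

Keeping the mean and variance in separate functions mirrors the "descriptive statistics first" order described above, so each piece can be checked on its own.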

The T-critical value depends on the confidence level and the degrees of freedom. The confidence level measures how much confidence you can place in the result. A 95% confidence level (the industry standard) means you accept a 5% chance of declaring a difference when there is none.

Degrees of freedom is equal to: Nx + Ny - 2
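The decision step can be sketched as follows. The critical values listed are standard two-tailed t-table entries at the 95% level; the array, the function names, and the computed t-value used here are assumptions chosen for illustration.

```php
<?php
// A few two-tailed critical values at the 95% confidence level, keyed by
// degrees of freedom (standard t-table entries). Names are hypothetical.
$criticalValues = [
    4  => 2.776,
    6  => 2.447,
    8  => 2.306,
    10 => 2.228,
];

// Degrees of freedom for two samples: Nx + Ny - 2.
function degreesOfFreedom(int $nx, int $ny): int
{
    return $nx + $ny - 2;
}

// True when the absolute computed T-value exceeds the critical value.
function isSignificant(float $t, float $critical): bool
{
    return abs($t) > $critical;
}

$df = degreesOfFreedom(4, 4); // two samples of 4 values each
$t  = 1.732;                  // hypothetical computed T-value
echo isSignificant($t, $criticalValues[$df])
    ? "statistically different\n"
    : "not statistically different\n";
```

With two samples of four values each, the degrees of freedom are 6, so the T-value is compared against 2.447.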

## Programming Strategy in PHP

It can be difficult to implement the above concepts in code, which is why you’ll find very few PHP developers online who are willing to develop this kind of application.

With the background provided above, we can start with a web form that accepts two data sets. It should have two text areas, so that users can copy and paste their data instead of entering it one small piece at a time, which is tedious.

The script should then accept the numerical data and start computing the descriptive statistics (average, standard deviation, degrees of freedom). With these variables computed, we can compute the T value.
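One way to turn the pasted text into numbers before computing anything is sketched below. The helper name `parseSample` is hypothetical, and skipping blank or non-numeric lines is a design choice of this sketch rather than something the script in this article does.

```php
<?php
// Hypothetical helper: turn pasted textarea content (one number per line)
// into an array of floats, skipping blank and non-numeric lines.
function parseSample(string $raw): array
{
    $values = [];
    // \R matches any line ending (\n, \r\n or \r), so pasted data from
    // Windows, macOS or Linux is split the same way.
    foreach (preg_split('/\R/', trim($raw)) as $line) {
        $line = trim($line);
        if ($line !== '' && is_numeric($line)) {
            $values[] = (float) $line;
        }
    }
    return $values;
}

// Browsers usually submit textarea line breaks as \r\n.
print_r(parseSample("120\r\n135\r\n\r\nabc\r\n98.5"));
```

Filtering at this stage keeps stray blank lines from being counted as data points, which would otherwise distort both the average and the degrees of freedom.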

So how will the T-critical value be obtained? A practical method is to generate the T-critical values in MS Excel, save them as a .csv file, and then import that file into the MySQL database. To follow the industry standard, we can fix the table to the 95% confidence level.

In Excel, the T-critical value can be computed with the function TINV(% error, degrees of freedom); for a 95% confidence level, the % error is 5% (0.05). In a spreadsheet, a table can be generated with this formula: one column lists the degrees of freedom (1, 2, 3, and so on) and the next column holds the corresponding TINV result.

Complete the table for as many rows as you need (you can go beyond 100 degrees of freedom to make sure your application can accept a large number of samples).
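Before importing the file, it can help to check that the exported CSV reads back correctly. The sketch below assumes a two-column layout with a `degrees,critical` header row; that layout and the function name `loadCriticalTable` are assumptions made for this example. The three sample rows are standard TINV(0.05, df) results.

```php
<?php
// Hypothetical sketch: read "degrees,critical" rows exported from Excel
// into a lookup array keyed by degrees of freedom.
function loadCriticalTable(string $csv): array
{
    $table = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        // Delimiter, enclosure and escape are passed explicitly.
        $fields = str_getcsv($line, ',', '"', '\\');
        // Skip the header row and any malformed line.
        if (count($fields) < 2 || !is_numeric($fields[0]) || !is_numeric($fields[1])) {
            continue;
        }
        $table[(int) $fields[0]] = (float) $fields[1];
    }
    return $table;
}

$csv = "degrees,critical\n1,12.706\n2,4.303\n3,3.182\n";
print_r(loadCriticalTable($csv));
```

For a small table, the same array could also serve as the lookup source directly, without a database.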

If you have trouble converting your Excel worksheet to a MySQL database, a tutorial on ASP Free can help you.

Once the T-critical values are in place in the MySQL database, we can query them by degrees of freedom and use PHP to compare the result with the computed T-value.
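On current PHP versions, this lookup could be written with PDO and a prepared statement, roughly as sketched here. The table name `ttest` and the columns `degrees` and `critical` match the query used in this article's script; the function name `lookupCriticalT` and the connection details in the comments are assumptions.

```php
<?php
// Sketch only: fetch the critical value for a given number of degrees of
// freedom. Returns null when the table has no row for that value.
function lookupCriticalT(PDO $pdo, int $df): ?float
{
    $stmt = $pdo->prepare('SELECT critical FROM ttest WHERE degrees = :df');
    $stmt->execute([':df' => $df]);
    $value = $stmt->fetchColumn();
    // fetchColumn() returns false when no row matches.
    return $value === false ? null : (float) $value;
}

// Example wiring (placeholder credentials, adjust to your setup):
// $pdo = new PDO('mysql:host=localhost;dbname=ttest', 'user', 'password');
// $critical = lookupCriticalT($pdo, 28);
```

Returning null for a missing row lets the calling code report that the table does not cover that many degrees of freedom instead of comparing against an empty value.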

## The PHP Code

Below is the complete PHP script required to implement the above strategy.

<html>
<head>
<title>Statistical T-Test Online using PHP by www.php-developer.org</title>
</head>
<body>
<?php
// Check whether the form has been submitted
if (!isset($_POST['submit']))
{
// Form not submitted: display the form
?>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
Compare data using the statistical t-test below at the 95% confidence level.<br />
Copy and paste numerical data for analysis into each of the text areas below (one number per line):<br /><br />
DATA1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DATA2<p><textarea name="firstsample" rows="35" cols="5"></textarea>&nbsp;&nbsp;<textarea name="secondsample" rows="35" cols="5"></textarea></p>
<br />
<input type="submit" name="submit" value="Compare if the two data sets are statistically different">
</form>
<?php
}
else
{

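    // Form submitted: connect to MySQL, then grab the data from POST.
    // Note: the mysql_* functions used in this script were removed in PHP 7;
    // on newer versions use mysqli or PDO instead.
    $hostname = "localhost";
    $database = "ttest";
    // Placeholder credentials (assumed values); replace with your own
    $username = "your_username";
    $password = "your_password";
    $dbhandle = mysql_connect($hostname, $username, $password)
        or die("Unable to connect to MySQL");
    // Select a database to work with
    $selected = mysql_select_db($database, $dbhandle)
        or die("Could not select $database");
    $firstsample = isset($_POST['firstsample']) ? trim($_POST['firstsample']) : '';
    $secondsample = isset($_POST['secondsample']) ? trim($_POST['secondsample']) : '';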

    // Test whether both samples contain some data
    if ((!isset($firstsample) || trim($firstsample) == "") || (!isset($secondsample) || trim($secondsample) == ""))
    {
        // Tell the user that the form is incomplete
        die('ERROR: Incomplete Form. <a href="/ttest.php">Click here to proceed with the analysis</a>');
    }
    else
    {
        // Split the data on line breaks and trim each entry
        // (trim also removes the "\r" that browsers send with each line)
        $data1 = array_map('trim', explode("\n", $firstsample));
        $data2 = array_map('trim', explode("\n", $secondsample));

        ////////////////////////////////
        // SHARED HELPER FUNCTIONS    //
        ////////////////////////////////
        // Function to compute the statistical mean
        function average($data) {
            return array_sum($data) / count($data);
        }
        // Function to compute the sample standard deviation (divides by N - 1)
        function stdev($data) {
            $average = average($data);
            $squares = array();
            foreach ($data as $value) {
                $squares[] = pow($value - $average, 2);
            }
            return sqrt(array_sum($squares) / (count($data) - 1));
        }

        ////////////////////////////////
        // ANALYSIS FOR 1ST DATA SET  //
        ////////////////////////////////
        $dataaverage1 = average($data1);
        $datastandarddeviation1 = stdev($data1);
        // Variance of data set 1
        $datavariance1 = $datastandarddeviation1 * $datastandarddeviation1;
        // Number of data points in the 1st data set
        $count1 = count($data1);
        // Variance over number of data points
        $sterror1 = $datavariance1 / $count1;

        ////////////////////////////////
        // ANALYSIS FOR 2ND DATA SET  //
        ////////////////////////////////
        $dataaverage2 = average($data2);
        $datastandarddeviation2 = stdev($data2);
        // Variance of data set 2
        $datavariance2 = $datastandarddeviation2 * $datastandarddeviation2;
        // Number of data points in the 2nd data set
        $count2 = count($data2);
        // Variance over number of data points
        $sterror2 = $datavariance2 / $count2;

        ////////////////////////////
        // COMPUTE STANDARD ERROR //
        ////////////////////////////
        $sumerror = $sterror1 + $sterror2;
        $standarderror = sqrt($sumerror);

        /////////////////////////////////////
        // COMPUTE DIFFERENCE OF TWO MEANS //
        /////////////////////////////////////
        $difference = $dataaverage1 - $dataaverage2;
        $meandifference = abs($difference);

        /////////////////////
        // COMPUTE T-VALUE //
        /////////////////////
        $tvalue = $meandifference / $standarderror;

        ////////////////////////////////
        // COMPUTE DEGREES OF FREEDOM //
        ////////////////////////////////
        $df = $count1 + $count2 - 2;

        ////////////////////////////////////////////////
        // EXTRACT CRITICAL T VALUE FROM THE DATABASE //
        ////////////////////////////////////////////////
        // The degrees of freedom value is computed above; cast it to an integer for the query
        $df = (int) $df;
        $result = mysql_query("SELECT `critical` FROM `ttest` WHERE `degrees`='$df'")
            or die(mysql_error());
        // Store the matching record of the ttest table in $row
        $row = mysql_fetch_array($result)
            or die("No critical t-value found for $df degrees of freedom.");
        // Read the critical t-value from the record
        $criticaltvalue = $row['critical'];

        ///////////////////////////////////////////
        // COMPARE CRITICAL AND COMPUTED T VALUE //
        ///////////////////////////////////////////
        if ($tvalue > $criticaltvalue)
        {
            // They are statistically different
            $verdict = 'The two data sets are <b>statistically DIFFERENT at the 95% confidence level. This indicates that the two samples are NOT the same.</b>';
        }
        else
        {
            // They are not statistically different
            $verdict = 'The two data sets are <b>NOT statistically different at the 95% confidence level. This indicates no significant difference between the two samples.</b>';
        }
        echo '<h2>T-Test Results of the Analyzed Samples:</h2>';
        echo '<br />';
        echo $verdict . '<br />';
        echo 'The computed t-value is:&nbsp;' . $tvalue;
        echo '<br />';
        echo 'And the critical t-value is:&nbsp;' . $criticaltvalue;
        echo '<br />';
        echo 'The computed degrees of freedom is:&nbsp;' . $df;
        echo '<br />';
        echo '<br /><br />';
        echo 'Below is the submitted/analyzed data for your reference';
        echo '<br /><br />';
        // Escape the submitted values before echoing them back to the page
        $display1 = implode("<br />\n", array_map('htmlspecialchars', $data1));
        $display2 = implode("<br />\n", array_map('htmlspecialchars', $data2));
        echo '<table border="1">';
        echo '<tr>';
        echo '<th>Data</th>';
        echo '</tr>';
        echo '<tr><td>' . $display1 . '</td></tr>';
        echo '<tr><td>' . $display2 . '</td></tr>';
        echo '</table>';
    }
}
?>
</body>
</html>

You can test a full, working version of this script to see it in action. You can also download (copy and paste) the script.