Implementing Bayesian Inference Using PHP: Part 2
Beta distribution sampling model - PHP
While the first article in this series discussed building intelligent Web applications through conditional probability, this Bayesian inference article examines how you can use Bayes methods to solve parameter estimation problems. Relevant concepts are explained in the context of Web survey analysis using PHP and JPGraph. (This intermediate-level article was first published by IBM developerWorks on April 12, 2004 at http://www.ibm.com/developerWorks).
A random variable is said to have the standard beta distribution with parameters a and b if its probability density function is:
$$f(\theta) = \frac{\theta^{a-1}\,(1-\theta)^{b-1}}{B(a, b)}$$
Rather than explain the formula by resorting to more mathematics, I will discuss PHP code that you can use to compute f(θ) for various values of a and b. Toward this end, I created a class called BetaDistribution.php and added it to a probability distributions package I developed for a previous article (see Resources). This class supplies methods that accept the a, b, and θ parameters.
The class constructor is first called with the a and b parameters as follows:
Listing 4. Instantiating the BetaDistribution class
// Demonstration of how to instantiate the BetaDistribution class.
$a = 1; // number of successes
$b = 4; // number of failures
$beta = new BetaDistribution($a, $b);
In this example, the number of success events (previously k) is denoted by a. The number of failure events is equal to n - k and is denoted by b. The a and b parameters jointly control the shape and location of the beta distribution curves.
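To make the roles of a and b concrete, here is a minimal sketch of the density calculation. It assumes a and b are positive integers (counts of successes and failures), so that B(a, b) can be computed from factorials as (a-1)!(b-1)!/(a+b-1)!. The functions `factorial()`, `betaFunction()`, and `betaPdf()` are hypothetical standalone helpers written for illustration; they are not the methods of the BetaDistribution class.

```php
<?php
// Hypothetical standalone helpers; not the BetaDistribution class itself.

/** Factorial of a non-negative integer, returned as a float. */
function factorial(int $n): float
{
    $result = 1.0;
    for ($i = 2; $i <= $n; $i++) {
        $result *= $i;
    }
    return $result;
}

/** Beta function B(a, b) for positive integers a and b. */
function betaFunction(int $a, int $b): float
{
    return factorial($a - 1) * factorial($b - 1) / factorial($a + $b - 1);
}

/** Standard beta density at $theta, where 0 <= $theta <= 1. */
function betaPdf(float $theta, int $a, int $b): float
{
    return pow($theta, $a - 1) * pow(1.0 - $theta, $b - 1)
        / betaFunction($a, $b);
}

// One success and four failures.
printf("%.3f\n", betaPdf(0.2, 1, 4)); // prints 2.048
```

With a = 1 and b = 4 the density reduces to 4(1 - θ)^3, which is why the value at θ = 0.2 is 4 × 0.512 = 2.048. Changing a and b changes both the normalizing constant and the exponents, and therefore the shape of the curve.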
Graphically speaking, the beta distribution refers to a large family of curves that can differ substantially from one another depending upon the a and b parameter values. As you shall see, the a and b parameters can be used to represent the prior probability distribution that you feel is most appropriate for representing P(θ).
Suppose you test your simple binary survey before going live. Select five people who you think are representative of the target sampling population and ask them to fill out the survey online; then observe the following results:
One participant responds "yes" (a "success" event).
Four participants respond "no" ("failure" events).
The code in Listing 5 invokes the BetaDistribution constructor with the parameter values a=1 and b=4 to represent the results of this survey. Once you instantiate the BetaDistribution class with the appropriate a and b parameter values, you can use the other methods in this class (shown in the following listing) to compute standard probability distribution functions:
Listing 5. Using other methods to compute probability distribution functions
This produces the output displayed in this table:
Table 4. Output of beta probability distribution methods
If the test has no glitches, you can go into your main experiment with BetaDistribution(1, 4) representing your prior distribution P(θ). Note that the mean value (p = k/n = a / (a + b) = 0.20) reported in the table is what you would expect from common-sense considerations (such as the expected value of θ being equal to the proportion of successes observed to date, k/n).
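The arithmetic behind the mean can be checked directly. The sketch below uses the standard closed-form expressions for the mean and variance of a beta distribution; the variable names are illustrative assumptions and do not come from the BetaDistribution class.

```php
<?php
// Illustrative only: variable names are assumptions, not the article's API.
$a = 1; // successes ("yes" responses)
$b = 4; // failures ("no" responses)

$n = $a + $b;

// Mean of a beta distribution: a / (a + b).
$mean = $a / $n;

// Variance of a beta distribution: ab / ((a + b)^2 (a + b + 1)).
$variance = ($a * $b) / ($n * $n * ($n + 1));

printf("mean = %.2f, variance = %.4f\n", $mean, $variance);
// prints: mean = 0.20, variance = 0.0267
```

The mean matches the observed proportion k/n = 1/5. The variance is large relative to the mean, which reflects how little five observations tell you about the parameter.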
To visualize your prior probability distribution, you can use the following code to obtain the x and y coordinates to plot. The probability density function PDF() returns a "probability" value associated with a particular θ value (for instance, P[p = .20]). Given a contiguous range of p values, the PDF() method gives you a corresponding range of probability values f(p) that you can use to graph the shape of the probability distribution for fixed a, b parameters and for a range of possible θ values:
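A minimal sketch of how such coordinates might be generated is shown here. It assumes a hypothetical standalone helper `betaPdf()` that stands in for the class's PDF() method and only handles positive integer a and b. The grid size `$steps` is likewise an assumption chosen for illustration.

```php
<?php
// Hypothetical stand-in for a PDF() method; positive integer a and b only.
function betaPdf(float $p, int $a, int $b): float
{
    // B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers.
    $factorial = function (int $n): float {
        $result = 1.0;
        for ($i = 2; $i <= $n; $i++) {
            $result *= $i;
        }
        return $result;
    };
    $beta = $factorial($a - 1) * $factorial($b - 1) / $factorial($a + $b - 1);

    return pow($p, $a - 1) * pow(1.0 - $p, $b - 1) / $beta;
}

$a = 1;      // successes
$b = 4;      // failures
$steps = 20; // assumed grid resolution: p = 0.00, 0.05, ..., 1.00

$parameters = []; // x coordinates (p values)
$pdf_vals = [];   // y coordinates (f(p) values)

for ($i = 0; $i <= $steps; $i++) {
    $p = $i / $steps;
    $parameters[$i] = $p;
    $pdf_vals[$i] = betaPdf($p, $a, $b);
}

// Print the coordinate pairs, one per line.
foreach ($parameters as $i => $p) {
    printf("%.2f\t%.4f\n", $p, $pdf_vals[$i]);
}
```

The two arrays can then be handed to a plotting library as the x and y data series.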
In the following graph, p = $parameters[$i], and f(p) = $pdf_vals[$i].
Figure 4. Prior distribution is not well defined; too few observations
The exact values of f(p) are of less concern than the overall shape and center of gravity for the prior distribution. What this graph shows is that your prior distribution is still not very well defined because it does not peak around a particular parameter estimate. This is as it should be when you only have a few observations to work with.