Implement Bayesian inference using PHP, Part 1

# Frequency versus probability format

Have you ever wanted to build an intelligent Web application? Paul Meagher shows how to do it using conditional probability. (This intermediate-level article was first published by IBM developerWorks, March 16, 2004, at http://www.ibm.com/developerWorks).

January 05, 2005


The getConditionalProbability function you've developed operates on counts and frequencies rather than on probabilities. In reading the literature on Bayesian reasoning, you will notice that the enumeration method for computing P(A | B) is only briefly discussed. Most authors quickly move on to describing how P(A | B) can be formulated using terms that denote probability values rather than frequency counts. For example, you can recast the formula for computing P(A | B) in probability terms as:

P(A | B) = P(A & B) / P(B)

The advantage of recasting the formula in terms of probabilities instead of frequency counts is that, in practice, you often don't have access to a data set from which you can derive conditional probability estimates by enumerating cases. Instead, you often have access to higher-level summary information from past studies in the form of percentages and probabilities. The challenge then becomes finding a way to use these probability estimates to compute the conditional probabilities you are interested in. Recasting the conditional probability formula in terms of probabilities allows you to make inferences from related probability information that is more readily accessible.
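The probability-format formula can be sketched as a small PHP function. This is only an illustration: the name `conditionalFromProbabilities`, its parameters, and the sample figures are assumptions and not part of the article's own code, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original article): computes P(A | B)
// from the joint probability P(A & B) and the marginal probability P(B).
function conditionalFromProbabilities(float $pAandB, float $pB): float
{
    if ($pB <= 0.0 || $pB > 1.0) {
        throw new InvalidArgumentException('P(B) must be greater than 0 and at most 1.');
    }
    if ($pAandB < 0.0 || $pAandB > $pB) {
        throw new InvalidArgumentException('P(A & B) must lie between 0 and P(B).');
    }
    return $pAandB / $pB;
}

// Made-up summary figures: 5% of cases show both A and B, 20% show B.
printf("P(A | B) = %.2f\n", conditionalFromProbabilities(0.05, 0.20));
```

The guard on `$pB` reflects that P(A | B) is undefined when B never occurs, and the second check reflects that a joint probability cannot exceed either of its marginals.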

The enumeration method might still be regarded as the most basic and intuitive method for computing a conditional probability. In "An Essay towards solving a Problem in the Doctrine of Chances," Thomas Bayes uses enumeration to arrive at the conclusion that P(2nd Event = b | 1st Event = a) is equal to [P / N] / [a / N], which is equal to P / a, which one can also denote as {a & b} / {a}:

Figure 1. Graphical representation of relations

Another reason to be aware of frequency versus probability format issues is that Gerd Gigerenzer (and others) have demonstrated that people reason more closely in accordance with prescriptive Bayesian rules of inference when background information is presented as frequencies of cases (1 in 10 cases) rather than as probabilities (10 percent probability). A practical application of this research is that medical students are now being taught to communicate risk information in terms of frequencies of cases instead of probabilities, which makes it easier for patients to make better-informed judgements about what actions are warranted given their test results.
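As a rough illustration of the frequency format, a probability can be restated as an expected number of cases out of a reference group. The helper name `toFrequencyFormat` and its wording are hypothetical and chosen only for this sketch.

```php
<?php
// Hypothetical helper: restates a probability as "X in N cases".
// The count is rounded to the nearest whole case.
function toFrequencyFormat(float $probability, int $outOf = 100): string
{
    if ($probability < 0.0 || $probability > 1.0) {
        throw new InvalidArgumentException('Probability must lie between 0 and 1.');
    }
    if ($outOf <= 0) {
        throw new InvalidArgumentException('Reference group size must be positive.');
    }
    $cases = (int) round($probability * $outOf);
    return sprintf('%d in %d cases', $cases, $outOf);
}

echo toFrequencyFormat(0.10, 10), "\n"; // prints "1 in 10 cases"
```

Rounding loses precision for small reference groups, so a larger `$outOf` may be more appropriate when the probability is small.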

## Joint probability

The most basic method for computing a conditional probability using a probability format is:

P(A | B) = P(A & B) / P(B)

This probability format is identical to the frequency format, except for the probability operator P( ) surrounding the numerator and denominator terms. The P(A & B) term denotes the joint probability of A and B occurring together. To understand how the joint probability P(A & B) can be computed from cross-tabulated data, consider the following hypothetical data (taken from pp. 147-48 of Grinstead and Snell's online textbook):

| | -Smokes | +Smokes | Totals |
|---|---|---|---|
| -Cancer | 40 | 10 | 50 |
| +Cancer | 7 | 3 | 10 |
| Totals | 47 | 13 | 60 |

To convert this table of frequencies to a table of probabilities, divide each cell frequency by the total frequency (60). Dividing by the total also ensures that the four Cancer x Smokes cell probabilities sum to 1, which lets you refer to those inner cells of the table below (shaded silver in the original) as the joint probability distribution of Cancer and Smoking.

| | -Smokes | +Smokes | Totals |
|---|---|---|---|
| -Cancer | 40/60 | 10/60 | 50/60 |
| +Cancer | 7/60 | 3/60 | 10/60 |
| Totals | 47/60 | 13/60 | 60/60 |
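The conversion from cell frequencies to a joint probability distribution can be sketched in PHP as follows. The array layout and the function name `toJointDistribution` are assumptions made for this example; only the four cell counts come from the table.

```php
<?php
// Cell counts from the cross-tabulation (rows: cancer, columns: smoking).
$counts = [
    '-Cancer' => ['-Smokes' => 40, '+Smokes' => 10],
    '+Cancer' => ['-Smokes' => 7,  '+Smokes' => 3],
];

// Hypothetical helper: divides every cell count by the grand total so that
// the resulting cell probabilities sum to 1.
function toJointDistribution(array $counts): array
{
    $total = 0;
    foreach ($counts as $row) {
        $total += array_sum($row);
    }
    if ($total <= 0) {
        throw new InvalidArgumentException('The table must contain at least one case.');
    }

    $joint = [];
    foreach ($counts as $rowLabel => $row) {
        foreach ($row as $columnLabel => $count) {
            $joint[$rowLabel][$columnLabel] = $count / $total;
        }
    }
    return $joint;
}

$joint = toJointDistribution($counts);
printf("P(+Cancer & +Smokes) = %.3f\n", $joint['+Cancer']['+Smokes']); // prints 0.050
```

Computing the grand total from the cells, rather than hard-coding 60, keeps the sketch usable for other tables of the same shape.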

To compute the probability of cancer given that a person smokes P(+Cancer | +Smokes), you can simply substitute the values from this table into the above formula as follows:

P(+Cancer | +Smokes) = (3 / 60) / (13 / 60) = 0.05 / 0.217 = 0.23

Note that you could have derived this value from the table of frequencies as well:

P(+Cancer | +Smokes) = 3 / 13 = 0.23

How do you interpret this result? Using the recommended approach of communicating risk in terms of frequencies, you might say that of the next 100 smokers you encounter, you can expect 23 of them to develop cancer in their lifetime. What is the probability of getting cancer if you do not smoke?

P(+Cancer | -Smokes) = (7 / 60) / (47 / 60) = 0.117 / 0.783 = 0.15
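A short check, using only the counts from the frequency table, suggests why the two formats agree: the total frequency appears in both the numerator and the denominator and cancels. The variable names are illustrative.

```php
<?php
// Counts taken from the frequency table in this section.
$total = 60;
$cancerAndSmokes = 3;
$smokers = 13;
$cancerAndNonSmokes = 7;
$nonSmokers = 47;

// Probability format: P(A & B) / P(B).
$pGivenSmokes = ($cancerAndSmokes / $total) / ($smokers / $total);
$pGivenNonSmokes = ($cancerAndNonSmokes / $total) / ($nonSmokers / $total);

// Frequency format: {A & B} / {B}. The division by $total cancels out.
$fGivenSmokes = $cancerAndSmokes / $smokers;
$fGivenNonSmokes = $cancerAndNonSmokes / $nonSmokers;

printf("P(+Cancer | +Smokes) = %.2f\n", $pGivenSmokes);    // prints 0.23
printf("P(+Cancer | -Smokes) = %.2f\n", $pGivenNonSmokes); // prints 0.15
```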

So it appears that you are more likely to get cancer if you smoke than if you do not smoke, even though the tallies appearing in the table might not have initially given you that impression. It is interesting to speculate on what the true conditional probabilities might be for various types of cancer given various criteria for defining someone as a smoker.

A "cohort" research methodology would also require you to equate smokers and non-smokers on other variables like age, gender, and weight so that smoking, and not these other co-variates, can be isolated as the root cause of the different cancer rates.

To summarize, you can compute a conditional probability P(+Cancer | +Smokes) from joint distribution data by dividing the relevant joint probability P(+Cancer & +Smokes) by the relevant marginal probability P(+Smokes). As you might imagine, it is often easier and more feasible to derive estimates of a conditional probability from summary tables like this than to apply more data-intensive enumeration methods.
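The summary can be expressed as one function that derives the marginal probability by summing a column of the joint distribution and then divides the joint cell by it. This is a minimal sketch: the function name `conditionalGivenColumn` and the nested-array layout are assumptions, not the article's own API.

```php
<?php
// Joint probability distribution of Cancer and Smoking (cell count / 60).
$joint = [
    '-Cancer' => ['-Smokes' => 40 / 60, '+Smokes' => 10 / 60],
    '+Cancer' => ['-Smokes' => 7 / 60,  '+Smokes' => 3 / 60],
];

// Hypothetical helper: P(row | column) = P(row & column) / P(column),
// where P(column) is the marginal obtained by summing that column.
function conditionalGivenColumn(array $joint, string $row, string $column): float
{
    if (!isset($joint[$row][$column])) {
        throw new InvalidArgumentException('Unknown row or column label.');
    }
    $marginal = array_sum(array_column($joint, $column));
    if ($marginal <= 0.0) {
        throw new InvalidArgumentException('The marginal probability must be positive.');
    }
    return $joint[$row][$column] / $marginal;
}

printf("P(+Cancer | +Smokes) = %.2f\n", conditionalGivenColumn($joint, '+Cancer', '+Smokes')); // prints 0.23
printf("P(+Cancer | -Smokes) = %.2f\n", conditionalGivenColumn($joint, '+Cancer', '-Smokes')); // prints 0.15
```

Deriving the marginal from the joint cells means the Totals row of the table never has to be stored separately.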
