# What are biostatistics?


**The Methodology of Statistics**

The aim of every study in the field of biostatistics is to discover something about a population—all patients with a particular disease, for example. Populations are usually much too large to be studied in their entirety: It would be impractical to round up all the patients with diabetes in the world for a study, and the researcher would still miss those who lived in the past or who have yet to be born. For this reason, most research focuses on a sample drawn from the population of interest—for example, those diabetics who were studied at a particular hospital over a two-year period. Ideally, the sample would be a random sample, one in which each member of the population has an equal chance of being included. In practice, a random sample is hard to achieve: The patients with diabetes at a hospital in Chicago, for example, will differ in various ways from those at hospitals in London, Hong Kong, or rural Mexico.

Descriptive statistics are statistics that describe samples. The most commonly used statistics are the mean, median, mode, and standard deviation. The mean of *n* values is simply the sum of all the values divided by *n*, or, symbolically, x̄ = Σ*x*/*n*, where x̄ (pronounced “*x*-bar”) stands for the mean, *n* stands for the number of observations, and Σ*x* (pronounced “sum of *x*”) stands for the sum obtained when all the different values of *x* (*n* of them) are added together. The median of a series of values is the middle value when the values are arranged in order (or the average of the two middle values when *n* is even); half of the values lie at or above the median and half lie at or below it. The mode is simply the most common value, the value that occurs most often.

The mean, median, and mode are all “typical” values that characterize the center of the distribution of all the values. The standard deviation, which is never negative, is a measure of variation that describes how closely the values cluster about a central value.

In the formula for the standard deviation, *s* = √[Σ(*x* − x̄)² ÷ (*n* − 1)], the numerator Σ(*x* − x̄)² is calculated by subtracting the mean (x̄) from each value, squaring the difference, and adding together all these squared differences. After the sum is divided by *n* − 1, the square root of the entire quantity is taken to determine the standard deviation. A small standard deviation indicates that the values differ very little from one another; a large standard deviation indicates greater variability among the values.
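The four descriptive statistics described above can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are hypothetical and were chosen only so that the arithmetic is easy to check by hand.

```python
import statistics

# Hypothetical sample values (not from the article), chosen for easy arithmetic.
values = [2, 3, 3, 5, 7]

mean = statistics.mean(values)      # sum of the values divided by n
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # value that occurs most often
sd = statistics.stdev(values)       # sample standard deviation (divides by n - 1)

print(mean, median, mode, sd)
```

Note that `statistics.stdev` uses the *n* − 1 divisor described above, whereas `statistics.pstdev` divides by *n*.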

When two quantities vary, such as height and weight, correlation and regression coefficients are also calculated. A correlation coefficient is a number between -1 and +1. A correlation near +1 shows a very strong relationship between the two variables: When either one increases, the other also increases. A correlation near -1 is also strong, but when either variable increases, the other decreases. A correlation of 0 shows no linear relationship between the variables: An increase in one has no average effect on the value of the other. Correlations midway between 0 and 1 show that an increase in one variable corresponds only to an average increase in the other, but not a dependable increase in each value. For example, taller people are generally heavier, but this is not true in every case.

Regression is a statistical technique for finding an equation that allows one variable to be predicted from the value of the other. For example, a regression of weight on height allows a researcher to predict the average weight for persons of a given height.
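As a rough illustration of both ideas, the sketch below computes a Pearson correlation coefficient and a least-squares regression line from paired measurements. The function names and the height and weight figures are hypothetical, introduced only for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def regression_line(xs, ys):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical paired data: heights in centimeters, weights in kilograms.
heights = [150, 160, 170, 180, 190]
weights = [52, 61, 65, 80, 82]

r = pearson_r(heights, weights)
intercept, slope = regression_line(heights, weights)
print(round(r, 3), round(intercept, 2), round(slope, 3))
```

The regression of weight on height treats height as the predictor, so swapping the two arguments would give a different line even though the correlation is unchanged.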

Inferential statistics, the statistical study of populations, begins with the study of probability. A probability is a number between 0 and 1 that indicates the certainty with which a particular event will occur, where 0 indicates an impossible occurrence and 1 indicates a certain occurrence. If a certain disease affects 1 percent of the population, then each person sampled at random has a .01 probability of having the disease. For a sample of more than one individual, the so-called binomial distribution describes the probability that the sample will include no one with the disease, one person with the disease, two people with the disease, and so on.
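The binomial probabilities mentioned above can be sketched in a few lines. The 1 percent prevalence comes from the text; the sample size of 100 and the function name `binomial_pmf` are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k affected people in a random sample of n,
    when each person independently has probability p of being affected."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = 0.01   # disease prevalence from the text
n = 100    # hypothetical sample size

# Probability of finding 0, 1, 2, or 3 people with the disease in the sample.
for k in range(4):
    print(k, round(binomial_pmf(k, n, p), 4))
```

Summing `binomial_pmf(k, n, p)` over every possible `k` from 0 to `n` gives 1, since some number of affected people must turn up in the sample.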

A variable such as height is subject to so many influences, both genetic and environmental, that it can be treated mathematically as if it were the sum of thousands of small, random variations. Characteristics such as height usually follow a bell-shaped curve, or normal distribution (see figure), at least approximately. This means that very few people are unusually tall or unusually short; most have heights near the middle of the distribution.
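A small simulation can make this idea concrete. The sketch below builds each simulated measurement as the sum of many small random influences and then checks how many results fall within one standard deviation of the mean, which is roughly 68 percent for a normal distribution. The number of influences, their range, and the sample count are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so that repeated runs give the same numbers

def simulated_trait(n_influences=200):
    """Sum of many small, independent random variations (hypothetical model)."""
    return sum(random.uniform(-1, 1) for _ in range(n_influences))

samples = [simulated_trait() for _ in range(5000)]

mean = statistics.mean(samples)
sd = statistics.stdev(samples)

# Fraction of simulated values lying within one standard deviation of the mean.
within_one_sd = sum(abs(x - mean) <= sd for x in samples) / len(samples)
print(round(mean, 2), round(sd, 2), round(within_one_sd, 3))
```

Although each individual influence is drawn from a flat (uniform) distribution, the sums pile up near the middle, which is the behavior the bell-shaped curve describes.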

Each study using inferential statistics includes four major steps. First, certain assumptions are made about the populations under study. A common assumption, seldom tested, is that the variable in question is normally distributed within the population. Second, one or more samples are then drawn from the population, and each person or animal in the sample is measured and tested in some way. Most studies assume that the samples are randomly drawn from the populations in question, even though true randomness is extremely difficult to achieve. Third, a particular assumption, called the null hypothesis, is chosen for testing. The null hypothesis is always an assumption that can be expressed numerically and that can be tested by known statistical tests. For example, one might assume that the mean height of the population of all diabetics is 170 centimeters (about 5 feet 7 inches). Fourth, each statistical procedure allows the calculation of a theoretical probability, often by looking up a calculated value in a table. If the calculated probability is moderate or high, the null hypothesis is consistent with the observed results. If the calculated probability is very small, the observed results are very unlikely to occur if the null hypothesis is true; in these cases, the null hypothesis is rejected.

Some common types of inferential statistics are *t*-tests, chi-square tests, and the analysis of variance (also called ANOVA).

**The Application of Statistics to Medicine**

Nearly all medical research studies include biostatistics. Descriptive statistics are often presented in tables of data or in graphic form.

In the following example, height and weight data were gathered from a sample of six diabetic patients, and height was also measured in a sample of six nondiabetic patients (see table 1). The mean height of these diabetic patients is (166 + 171 + 157 + 161 + 179 + 186) ÷ 6 = 1,020 ÷ 6 = 170 centimeters. In order to calculate the standard deviation, this mean is subtracted from each of the six values, to yield differences of -4, 1, -13, -9, 9, and 16. Squaring these differences and adding them up gives a numerator of 16 + 1 + 169 + 81 + 81 + 256 = 604. Dividing this by *n* - 1 = 5 (*n* is 6) and taking the square root reveals that the standard deviation is 11.0 centimeters (rounded to the nearest tenth). Similar calculations show that the mean height of the nondiabetic sample is 177 centimeters, with a standard deviation of 8.0 centimeters.
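The arithmetic for the diabetic sample can be retraced step by step. The six heights are the ones quoted in the text; the variable names are simply illustrative.

```python
import math

# Heights (in centimeters) of the six diabetic patients, as given in the text.
heights = [166, 171, 157, 161, 179, 186]

n = len(heights)
mean = sum(heights) / n                             # 1,020 / 6
differences = [h - mean for h in heights]           # each value minus the mean
sum_of_squares = sum(d ** 2 for d in differences)   # numerator of the formula
sd = math.sqrt(sum_of_squares / (n - 1))            # divide by n - 1, take the root

print(mean, differences, sum_of_squares, round(sd, 1))
```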

In this example, the nondiabetic sample averaged 7 centimeters taller than the diabetic sample. Inferential statistics can be used to find out if this difference is a meaningful one. In this case, a technique known as a *t*-test can compare the means of two samples to determine whether they might have come from the same population (or from populations with the same mean value). The assumptions of this test (not always stated explicitly) are that the six diabetic patients were randomly drawn from a normally distributed population and that the six nondiabetic patients were randomly drawn from another normally distributed population. The null hypothesis in this case is that the populations from which the two samples were drawn have the same mean height. The value of *t* calculated in this test is 1.141; this value is looked up in a table to reveal that the probability is larger than .1 (or 10 percent). In other words, if the null hypothesis is true, then a value of *t* this large (or larger) is expected to arise by chance alone more than 10 percent of the time. Under the usual criterion of a test at the 5 percent level of significance, one would retain (that is, fail to reject) the null hypothesis. This means that the difference between the above sample means is not large enough to demonstrate a difference between the two populations from which these samples are derived. There may in fact be an average difference in height between diabetic and nondiabetic populations, but samples this small cannot detect such a difference reliably. In general, when differences between populations are small, larger sample sizes are required to demonstrate their existence.
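The mechanics of a two-sample *t*-test can be sketched from summary statistics alone. This sketch assumes the pooled-variance (equal-variance) form of the test, which the text does not specify, and it works from the rounded means and standard deviations quoted earlier, so its result need not reproduce the exact *t* value reported above. The function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# Rounded summary figures from the text: nondiabetic sample, then diabetic sample.
t, df = pooled_t(177, 8.0, 6, 170, 11.0, 6)

# Two-tailed 5 percent critical value for 10 degrees of freedom (standard t table).
CRITICAL_T_DF10 = 2.228

reject_null = abs(t) > CRITICAL_T_DF10
print(round(t, 2), df, reject_null)
```

Because the computed statistic falls short of the critical value, the sketch reaches the same decision as the text: the null hypothesis of equal mean heights is not rejected.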

A *t*-test similar to the one above can also be used in drug testing. A drug is given to one set of patients, and a placebo (a fake medicine lacking the essential drug ingredient being tested) is given to a second group. Some relevant measurement (such as the level of an important chemical that the patients normally lack) is then compared between the two groups. The null hypothesis would be that the groups are the same and that the drug makes no difference. Rejection of the null hypothesis would be taken as evidence that the drug has an effect.

For the diabetic patients in the above sample, a correlation coefficient of .60 can be calculated. This moderate level of correlation shows that, on the average, the taller among these patients are also heavier and thus the shorter patients are also lighter. Despite this average effect, however, individual exceptions are likely to occur. The square of the correlation coefficient, .36, indicates that about 36 percent of the variation in weight can be predicted from height. The equation for making the prediction in this case is “weight = 13.99 + .404 (height),” where the numbers 13.99 and .404 are called regression coefficients. The variable being predicted (weight in this example) is sometimes called the dependent variable; the variable used to make the prediction (in this case, height) is called the independent variable. A high positive correlation signifies that the regression equation offers a very reliable prediction of the dependent variable; a correlation near zero signifies that the regression equation is hardly better than assigning the mean value of the dependent variable to every prediction.
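The regression equation quoted above can be applied directly. In this sketch the coefficients and the correlation of .60 come from the text; the function name is hypothetical, heights are taken to be in centimeters as in the earlier example, and the unit of weight is not stated in the text.

```python
# Regression coefficients and correlation coefficient quoted in the text.
INTERCEPT = 13.99
SLOPE = 0.404
R = 0.60

def predict_weight(height_cm):
    """Predicted average weight for a given height, using the quoted equation."""
    return INTERCEPT + SLOPE * height_cm

# Share of the variation in weight that can be predicted from height.
r_squared = R ** 2

for height in (160, 170, 180):
    print(height, round(predict_weight(height), 2))
print(round(r_squared, 2))
```

Each extra centimeter of height raises the predicted weight by the slope, .404, regardless of the starting height.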

To illustrate another type of inferential statistics, consider the following data on the diseases present among elderly patients who own pets, compared to a comparable group who do not. This type of table is called a contingency table.

As illustrated in table 2, 90 ÷ 240, or three-eighths of the patients sampled, are pet owners. Thus, if there were no relationship between the diseases and pet ownership (which is the null hypothesis), one would expect three-eighths of the eighty arthritis patients (or 30) to be pet owners and five-eighths of the eighty (or 50) to be nonowners. Calculating all the expected frequencies in this way, one can compare them with the actual observations shown above.

The result is a statistic called chi-square, which in this problem has a value of 18.88. A table of chi-square values shows that, for a 2 × 4 contingency table, a chi-square value this high occurs by chance alone much less than 1 percent of the time. Thus, something has been observed (a chi-square value of 18.88) that is extremely unlikely under the null hypothesis of no relationship between disease and pet ownership, so the null hypothesis is rejected and one must conclude that there is a relationship. This conclusion applies only to the population from which the sample was drawn, however, and it does not reveal the nature of the relationship. Further investigation would be needed to discover whether pet ownership protected people from arthritis, whether people who already had arthritis were less inclined to take on the responsibility of pet ownership, or whether people who had pets gave them up when they became arthritic. All these possibilities (and more) are consistent with the findings.
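The chi-square procedure can be sketched as follows. The observed counts below are hypothetical: they keep the totals mentioned in the text (240 patients, 90 pet owners, 80 arthritis patients) but are not the article's actual data, so the resulting statistic differs from 18.88. The function name and the three unnamed disease columns are likewise assumptions.

```python
def chi_square(observed):
    """Chi-square statistic, degrees of freedom, and expected counts
    for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under the null hypothesis: row total x column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical 2 x 4 table. Rows: pet owners, nonowners.
# Columns: arthritis, then three other (unnamed) disease categories.
observed = [
    [20, 25, 25, 20],   # pet owners (total 90)
    [60, 35, 35, 20],   # nonowners (total 150)
]

statistic, df, expected = chi_square(observed)
print(round(statistic, 2), df, expected[0][0], expected[1][0])
```

A 2 × 4 table has (2 − 1) × (4 − 1) = 3 degrees of freedom, for which standard tables give critical values of 7.815 at the 5 percent level and 11.345 at the 1 percent level.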

The analysis of variance (also called ANOVA) is a very powerful technique for comparing many samples at once. Suppose that overweight patients were put on three or four different diets; one diet might be better than another, but there is also much individual variation in the amount of weight lost. Analysis of variance is a statistical technique that allows researchers to compare the variation between diets (or other treatments) with the individual variation among the people following each diet. The null hypothesis would be that all the diets are the same and that individual variation can account for all the observed differences. Rejecting the null hypothesis would demonstrate that a consistent difference existed and that at least one diet was better than another. Further tests would be required to pinpoint which diet was best and why.
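The comparison at the heart of a one-way analysis of variance is the *F* ratio: the variation between group means divided by the variation within groups. The sketch below computes it for three hypothetical diets; the weight-loss figures and the function name are invented for illustration.

```python
def one_way_anova_f(groups):
    """Return (F ratio, between-groups df, within-groups df) for a list of samples."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation between treatments: how far each group mean lies from the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Individual variation: how far each person lies from his or her own group mean.
    ss_within = sum(
        (v - m) ** 2 for group, m in zip(groups, group_means) for v in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical weight losses (in kilograms) for patients on three diets.
diet_a = [1, 2, 3]
diet_b = [2, 3, 4]
diet_c = [3, 4, 5]

f_ratio, df_between, df_within = one_way_anova_f([diet_a, diet_b, diet_c])
print(f_ratio, df_between, df_within)
```

An *F* ratio near 1 suggests that the differences between diets are no larger than ordinary individual variation; the ratio is compared with a table of the *F* distribution, using both degrees of freedom, to decide whether to reject the null hypothesis.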

**Perspective and Prospects**

The historical foundations of biostatistics go back as far as the development of probability theory by Blaise Pascal (1623–62). Carl Friedrich Gauss (1777–1855) first outlined the characteristics of the normal (also called Gaussian) distribution. The chi-square test was introduced by Karl Pearson (1857–1936). The greatest statistician of the twentieth century was Ronald A. Fisher (1890–1962), who clearly distinguished descriptive from inferential statistics and who developed many important statistical techniques, including the analysis of variance.

Although medicine existed long before statistics, it has become a nearly universal practice for every research study to use statistics in some important way. New medical procedures and new drugs are constantly being evaluated by using statistical techniques to compare them to other procedures or older drugs.
