## Correlation

In statistics, correlation is the degree to which two events or variables are consistently related. A correlation coefficient indicates both the strength and the direction of the relationship between variables. However, it yields no information concerning the cause of the relationship. Correlation techniques are available for both parametric and nonparametric data. The most common of these techniques, the Pearson Product Moment Correlation, is also used in other statistical techniques such as regression analysis and factor analysis to help researchers and theorists build models that reflect the complex relationships observed in the real world.

Keywords Correlation; Data; Demographic Data; Dependent Variable; Distribution; Factor Analysis; Independent Variable; Inferential Statistics; Model; Nonparametric Statistics; Parametric Statistics; Regression Analysis; Reliability; Variable

### Overview

Every day we make assumptions about the relationship of one event to another in both our personal and professional lives. "My alarm clock failed to go off this morning, so I will be late for work." "The cat ate an entire can of cat food so she must be feeling better." "I received a polite e-mail from Mr. Jones, so he must not be angry that my report was not submitted on time." Sociologists attempt to express the relationship between variables in the same way on a broader scale. "Advertisements induce previous purchasers to buy additional lottery tickets." "People tend to act more openly with strangers who outwardly appear to be similar to themselves." "Younger males tend to be less prejudiced towards women in the workplace."

From a statistical point of view, the mathematical expression of such relationships is called correlation: the degree to which two events or variables are consistently related. Correlation may be positive (i.e., as the value of one variable increases, the value of the other variable increases), negative (i.e., as the value of one variable increases, the value of the other variable decreases), or zero (i.e., the values of the two variables are unrelated). However, correlation gives no information about what caused the relationship between the two variables. Properly used, knowing the correlation between variables can still give one useful information about behavior. For example, if I know that my cat gets sick when I feed her "Happy Kitty" brand cat food, I am unlikely to feed her "Happy Kitty" in the future. Of course, knowing that she gets sick after eating "Happy Kitty" does not explain why she gets sick. It may be that she is sensitive to one of the ingredients in "Happy Kitty" or it may be that "Happy Kitty" inadvertently released a batch of tainted food. However, my cat's digestive problems might not have anything to do with "Happy Kitty" at all. The neighborhood stray may be eating all her "Happy Kitty" food, leaving her to eat something else that makes her sick, or I may have switched her to "Happy Kitty" at a time when she was already sick from an unrelated cause. All I know is that when I feed her "Happy Kitty" she gets sick. Although I do not know why, this is still useful information. The same is true for the larger problems of sociology.

There are a number of ways to statistically determine the correlation between two variables. The most common of these is the technique referred to as the Pearson Product Moment Coefficient of Correlation, or Pearson *r*. This statistical technique allows researchers to determine whether the two variables are positively correlated (e.g., my cat gets sick more often when she eats "Happy Kitty"), negatively correlated (e.g., my cat gets sick less often when she eats "Happy Kitty"), or not correlated at all (e.g., there is no change in my cat's health when she eats "Happy Kitty").
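To make the three cases concrete, the following is a minimal sketch of how Pearson *r* can be computed by hand: the sum of the products of each pair's deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and all of the scores are hypothetical and were invented for this illustration; they do not come from any study.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson r for two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of paired deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical scores, invented for illustration only.
x = [1, 2, 3, 4, 5]
rises_with_x = [1, 3, 2, 5, 4]
falls_with_x = [5, 3, 4, 1, 2]
unrelated_to_x = [1, 3, 2, 3, 1]

print(pearson_r(x, rises_with_x))    # positive, about 0.8
print(pearson_r(x, falls_with_x))    # negative, about -0.8
print(pearson_r(x, unrelated_to_x))  # zero
```

The coefficient always falls between -1 and +1. Its sign gives the direction of the relationship and its absolute size gives the strength, but nothing in the calculation refers to what produced the scores.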

### Correlation vs. Causation

However, as mentioned above, knowing that two variables are correlated does not tell us whether one variable caused the other or whether both were caused by some other, unknown, third factor. Unlike inferential techniques that examine the influence of an independent variable on a dependent variable, or that draw conclusions about a population from a sample, correlation does not imply causation. For example, if I have two clocks that keep perfect time in my house, I may observe that the alarm clock in my bedroom goes off every morning at seven o'clock just as the grandfather clock in the hallway chimes. This does not mean that the alarm clock caused the grandfather clock to chime or that the grandfather clock caused the alarm clock to go off. In fact, both of these events were caused by the same event: the passage of 24 hours since the last time they did this. Although it is easy to see in this simple example that a third factor must have caused both clocks to go off, the causative factor for two related variables is not always so easy to spot. To act on unfounded assumptions about causation inferred from correlation is part of the cycle of superstitious behavior. Many ancient peoples, for example, included some sort of sun god in their pantheon of deities. They noticed that when they made offerings to their sun god, the sun arose the next morning, bringing with it heat and light. So, they made offerings. From our modern perspective, however, we know that the faithful practice of making offerings to a sun god was not the cause of the sun coming up the next morning. Rather, the apparent rising of the sun is caused by the daily rotation of the earth on its axis.

The classic example showing the absurdity of inferring causation from correlation was published in the mid-20th century in a paper reporting the results of an analysis of fictional data. Neyman (1952) used an illustration of the correlation between the number of storks and the number of human births in various European countries. The correlation between the sightings of storks and the number of births was both high and positive. Without understanding how to interpret the correlation coefficient, someone might conclude from this evidence that storks bring babies. The truth, however, was that the data were analyzed without regard to country size. Since larger northern European countries tend to have both more women and more storks, the observed correlation was due to country size. The correlation was incidental and not causal: correlation tells one nothing about causation. Although this example was originally meant to make people laugh, it was also meant as a warning: as absurd as these examples may sound, correlation coefficients are frequently misinterpreted to imply causation.
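The effect of an ignored third factor such as country size can be sketched with a few lines of code. The numbers below are not Neyman's data; they are hypothetical values constructed for this illustration so that stork density and birth rate per unit of country size are completely unrelated. The helper `pearson_r` is likewise only an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson r for two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical countries (arbitrary units, invented for illustration).
country_size = [1, 2, 4, 8, 16, 32]
storks_per_unit = [3, 1, 2, 3, 1, 2]  # stork density
births_per_unit = [1, 1, 2, 3, 3, 2]  # birth rate

# Raw counts grow with country size, whatever the underlying rates are.
stork_count = [s * d for s, d in zip(country_size, storks_per_unit)]
birth_count = [s * b for s, b in zip(country_size, births_per_unit)]

r_counts = pearson_r(stork_count, birth_count)         # high and positive
r_rates = pearson_r(storks_per_unit, births_per_unit)  # zero by construction

print(f"raw counts: r = {r_counts:.2f}")
print(f"per-unit rates: r = {r_rates:.2f}")
```

In this constructed example the raw counts are strongly correlated only because both are multiplied by the same country size. Once size is taken into account, the relationship disappears.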

### Pearson Product Moment Correlation

The Pearson Product Moment Correlation is a parametric test that makes several assumptions concerning the data that are being analyzed. First, it assumes that the data have been randomly selected from a population that has a normal distribution. In addition, it assumes that the data are interval or ratio in nature. This means that not only do the rank orders of the data have meaning (e.g., a value of 6 is greater than a value of 5) but the intervals between the values also have meaning. For example, weight is measured on a ratio scale. It is clear that the difference between 1 gram of a chemical compound and 2 grams of a chemical compound is the same as the difference between 100 grams of the compound and 101 grams of the compound. These measurements have meaning because the weight scale has a true zero (i.e., we know what it means to have 0 grams of the compound) and the intervals between values are equal. On the other hand, in attitude surveys and other data collection instruments used by sociologists, it may not be quite as clear that the difference between 0 and 1 on a 100-point rating scale of the quality of a widget is the same as the difference between 50 and 51 or between 98 and 99. These are value judgments and the scale may not have a true zero. Even if the scale does start at 0, it may be difficult to define what this value means. It is also difficult to know whether a score of 0 differs meaningfully from a score of 1 on an attitude scale. In both cases, the rater had a severe negative reaction to the item being discussed. Since ratings are subjective, even if numerical values are assigned to them, they do not necessarily meet the requirement of parametric statistics that the data be at the interval or ratio level.
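A short sketch may help show why the interval assumption matters. Pearson *r* is computed from the distances between values, so changing the spacing of a scale changes the coefficient even when the rank order of the observations is untouched. The ratings below are hypothetical and invented for this illustration, as is the helper name `pearson_r`.

```python
import math


def pearson_r(xs, ys):
    """Pearson r for two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical ratings of five items, invented for illustration.
rater_a = [1, 2, 3, 4, 5]

# Rater B ranks the items in the same order, with evenly spaced scores.
rater_b_even = [10, 20, 30, 40, 50]

# Same rank order again, but the top score sits far from the rest,
# as if the numeric labels did not mark equal steps.
rater_b_uneven = [10, 20, 30, 40, 500]

r_even = pearson_r(rater_a, rater_b_even)
r_uneven = pearson_r(rater_a, rater_b_uneven)

print(f"evenly spaced scores: r = {r_even:.2f}")
print(f"unevenly spaced scores: r = {r_uneven:.2f}")
```

Both sets of scores agree perfectly with rater A about which item is best, second best, and so on, yet the second coefficient is noticeably lower. If the spacing of the numbers on a rating scale is arbitrary, so is part of the resulting value of *r*.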

### Spearman Rank Correlation Coefficient

Fortunately, the Pearson product moment correlation is not the only...
