## Regression Analysis

(Research Starters)

Regression analysis is a family of statistical tools that can help sociologists better understand and predict the way that people act and interact. Regression analysis is used to build mathematical models to predict the value of one variable from knowledge of another. Although statistical methods of correlation offer researchers techniques to help them better understand the degree to which two variables are consistently related, such knowledge alone is typically insufficient to predict behavior. Simple linear regression allows the value of one dependent variable to be predicted from the knowledge of one independent variable. Multiple linear regression can be used to develop models to predict the value of a dependent variable from the knowledge of the value of more than one independent variable.

### Overview

Regression analysis is a family of statistical tools that can help sociologists better understand the way that people act and interact in groups and society. Regression analysis allows researchers to build mathematical models that can be used to predict the value of one variable from knowledge of another. There are a number of specific regression techniques that can be used by sociologists to model real-world behavior. These include:

* Simple linear regression analysis, which allows the modeling of two variables, one independent and one dependent

* Multiple linear regression analysis, which allows the modeling of two or more independent variables to predict one dependent variable

* Multiple curvilinear regression, where the relationship between variables is nonlinear (e.g., quadratic)

* Multivariate linear regression, which allows the simultaneous examination of several dependent variables

* Multivariate polynomial regression, which can be used to account for nonlinear relationships

The most commonly used of these techniques, simple linear regression and multiple linear regression, are discussed in the following sections.

### Simple Linear Regression

Statistics offers sociology researchers a number of correlation techniques to help them better understand the degree to which two variables are consistently related. For example, correlation can help one understand the relationship between educational level and income level. A correlation coefficient takes a value between -1.0 and +1.0, and its absolute value shows the degree of relationship between the two variables. A correlation of 1.0 or -1.0 shows that the variables are perfectly linearly related, so that a change in the value of one variable signifies a corresponding change in the other, while a correlation of 0.0 shows that there is no linear relationship between the two variables and that knowing the value of one variable tells us nothing about the value of the other in a linear model.

In addition to signifying the degree of relationship between two variables, a correlation coefficient also shows how the two variables are related. A positive correlation means that as the value of one variable increases, so does the value of the other variable. A negative correlation, on the other hand, means that as the value of one variable increases, the value of the other variable decreases. An example of a high positive correlation would be the relationship of weight to age for healthy children: the older the child is, the more he or she will probably weigh. An example of a high negative correlation would be the relationship between temperature and the likelihood of snow: the higher the temperature is, the less likely it is to snow.
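The two examples above can be made concrete with a minimal sketch of the Pearson correlation coefficient, one common correlation measure. The function name `pearson_r` and all of the data values are hypothetical and chosen only for illustration; they are not drawn from any study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical illustrative data (not real measurements)
ages_years = [2, 4, 6, 8, 10]
weights_kg = [12, 16, 21, 26, 32]
temps_c = [-5, 0, 5, 10, 15]
snow_chance = [0.8, 0.6, 0.3, 0.1, 0.0]

r_age_weight = pearson_r(ages_years, weights_kg)  # strongly positive
r_temp_snow = pearson_r(temps_c, snow_chance)     # strongly negative
print(round(r_age_weight, 3), round(r_temp_snow, 3))
```

The sign of each result gives the direction of the relationship, and the absolute value gives its strength.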

However, helpful as it is to know the correlation between two variables, that knowledge alone does not necessarily give us sufficient information to predict behavior. For example, although we may know that people who do their grocery shopping when they are hungry are more likely to buy impulse items than those who are not, we cannot accurately predict that a given person will purchase unneeded items at the grocery store just because he or she is hungry. Merely knowing that there is a positive correlation between these two variables is insufficient to allow us to predict whether a given person or type of person is more likely to exhibit this behavior. In situations where one needs to predict the value of one variable from knowledge of another variable based on the data, one needs to use simple linear regression.

Simple linear regression is a bivariate statistical tool that allows the value of one dependent variable to be predicted from the knowledge of one independent variable. Examples of sociological applications of simple linear regression include predicting the crime rate from population density, voting behavior in an election from voting behavior in the primary, and relative income based on gender. The pairs of data used in linear regression analysis are typically graphed on a scatter plot that shows the values of the points for two-variable numerical data. A line of best fit is superimposed on the scatter plot and used to predict the value of the dependent variable based on different values of the independent variable. A sample scatter plot with line of best fit is shown in Figure 1.

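The equation for the regression line is the statistical equivalent of the linear slope-intercept equation from basic algebra, y = mx + b. The underlying model for an observed value of y is

y = β0 + β1x + ε

and the regression line itself, which omits the error term, gives the predicted value

ŷ = β0 + β1x

where

ŷ = the predicted value of y

β0 = the population y intercept

β1 = the population slope

ε = the error term, the difference between an observed value of y and the value on the line

In practice, β0 and β1 are unknown and are estimated from sample data.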
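As a minimal sketch of how the intercept and slope might be estimated from sample data, the following uses the standard ordinary least-squares formulas (slope equals the sum of cross-products divided by the sum of squared deviations of x). The function names and the education and income figures are hypothetical and serve only to illustrate the calculation; the estimates stand in for β0 and β1.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) estimated by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when x is constant")
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted value of the dependent variable for a given x."""
    return intercept + slope * x


# Hypothetical illustrative data: years of education vs. income (thousands)
education_years = [10, 12, 14, 16, 18]
income_thousands = [30, 36, 45, 52, 62]

b0, b1 = fit_simple_linear_regression(education_years, income_thousands)
predicted_income = predict(b0, b1, 15)
print(b0, b1, predicted_income)
```

With these made-up numbers, each additional year of education corresponds to an estimated increase of 4 (thousand) in income; real data would rarely fit this neatly.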

For example, a sociologist interested in the behavior of small groups might want to determine whether or not the efficacy of the decisions made in small groups could be predicted from the number of people in the group. Although larger group size could mean that there are more ideas, more contribution to the thinking process, and a larger potential for synergistic thinking, a larger group could also mean that more time would be required to reach a decision, the competition of ideas could lead to confusion, and coalitions could form within the group and make it harder to resolve disagreements. A predictive model for group size versus efficacy of decision making could be developed by setting up an experiment that compared the efficacy of decision making on the same problem for groups of various sizes. The slope of the line of best fit passing through the data points on the scatter plot could be mathematically calculated, using these data points to determine the equation of the simple regression line. This equation could then be used by the sociologist to recommend optimal group size for similar types of decisions or projects based on the single variable of number of group members.

The problem with drawing a line of best fit through a scatter plot, of course, is that unless all the pairs of data fall on one straight line, it is possible to draw multiple lines through a data set. The question faced by the researcher is how to determine which of these possible lines will yield the best predictions of the dependent variable from the independent variable. This can be accomplished mathematically through residual analysis.

In regression analysis, a residual is defined as the difference between an actual y value and the corresponding predicted y value, or y − ŷ. The line of best fit is the one that keeps the points on the scatter plot as close to the line as possible, which is accomplished by minimizing the sum of the squares of the residuals. By looking at the residuals, a researcher can better understand how well the regression line fits past data in order to estimate how well it will predict future data.
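The least-squares criterion described above can be sketched by comparing the sum of squared residuals for two candidate lines through the same data. The data, the function names, and the alternative line are hypothetical and chosen only for illustration.

```python
def residuals(xs, ys, intercept, slope):
    """Residuals: actual y minus predicted y for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum(r ** 2 for r in residuals(xs, ys, intercept, slope))


def least_squares_line(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical illustrative data (not real measurements)
xs = [10, 12, 14, 16, 18]
ys = [30, 36, 45, 52, 62]

best_intercept, best_slope = least_squares_line(xs, ys)
sse_best = sum_squared_residuals(xs, ys, best_intercept, best_slope)

# An arbitrary alternative line drawn "by eye" through the same points
sse_other = sum_squared_residuals(xs, ys, intercept=-4.0, slope=3.5)

residual_total = sum(residuals(xs, ys, best_intercept, best_slope))
print(sse_best, sse_other, residual_total)
```

Any line other than the least-squares line yields a larger sum of squared residuals on the same data, and the residuals of the least-squares line sum to zero (up to floating-point rounding).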

Standard regression analysis techniques make several assumptions, including that the model is correct and that the data are good. Unfortunately, the types of real-world data needed by sociologists tend to be messy. As a result, these assumptions are rarely met in practice. Many factors can contribute to problems in regression analysis, including the use of an incorrect functional form for the regression function; correlation of variables; inconstant variance; sample data with outliers; and multicollinearity among subsets of the input variables such that they exhibit nearly identical linear relations. If one or more of these problems occur, the entire analysis may be invalidated. This risk is compounded by the fact that standard statistics give few indications of when these problems have occurred. Although there are other indicators and potential remedies for these situations, they must be used...
