Since the enactment of the No Child Left Behind Act of 2001, standardized testing has been at the center of a heated debate: critics claim that the tests betray unfair biases that limit minority students and have serious repercussions on their academic success as well as on the performance of teachers, school districts, and state educational efforts. Test bias can be a tricky subject because it is often difficult to determine whether different scores among test takers are caused by differences in ability or by a test bias. The times are long past when a person could easily spot the most obvious biases in testing instruments, such as referring to minorities in what are now considered derogatory terms. However, there can still be a hidden bias in a testing instrument or test item that favors majority over minority populations.
Keywords: Content Validity; Cultural Bias; Gender Bias; High-Stakes Test; Item Bias; Language Bias; No Child Left Behind Act of 2001 (NCLB); Norm-Referenced Test; Socioeconomic Bias; Standardized Tests; Test Bias; Validity
Test bias occurs when a test item or an entire test causes students of similar ability to perform differently because of their ethnic, cultural, religious, or gender differences. For a test to be valid, it must measure students' achievement regardless of their divergent backgrounds (Jorgensen, 2005). Because standardized tests have come to have long-range effects on students and school districts, it is important to recognize and avoid all forms of test bias so that standardized tests accurately measure the achievement of all test takers.
Test bias can be a tricky subject because it is often difficult to determine whether different scores among test takers are caused by differences in ability or by a test bias. The times are long past when a person could easily spot the most obvious biases in testing instruments, such as referring to minorities in what are now considered derogatory terms. However, there can still be a hidden bias in a testing instrument or test item that favors majority over minority populations.
There are many types of test bias, and a testing instrument needs to have only one to be considered biased and, therefore, invalid. Among the possible types are cultural, socioeconomic, and gender bias; item bias; construct bias; sampling bias; language bias; and examiner bias:
• Cultural, socioeconomic, and gender bias can occur when a test item favors one gender, cultural, or socioeconomic group over another, uses terms that may be derogatory toward a group, or uses terms that may be more familiar to one group than another.
• Item bias can occur when a test item requires test takers to have secondary abilities, experiences, or knowledge in order to accurately respond to the test item.
• Construct bias can occur when a test is structured in such a way that it requires test takers to have secondary abilities, experiences, or knowledge in order for a test to accurately measure their achievement. Intelligence tests have lately come under close scrutiny, with critics claiming that they do not measure the inherent aptitudes of minority populations but rather how well minorities share white, middle-class values and knowledge (Mercer, 1979, as cited in Skiba, Knesting & Bush, 2002).
• Sampling bias is a contentious issue in discussions of test bias. When subpopulations are sampled proportionally, the test will be biased against any minority population, since by definition minority populations make up a smaller share of the sample. But if minority populations are overrepresented in a testing instrument's sample, the test may end up being biased against the majority population. Therefore, when random sampling, which is considered statistically valid, is used in the development of a testing instrument, the test can be expected to favor the group that comprises the largest proportion of the defined sample.
• Language bias can occur when designated subgroups of interest within a population are not equally familiar with test vocabulary or when the meaning of a word is not the same across subgroups.
• Examiner bias can occur if the examiner is not of the same culture or race as the students being tested (Skiba et al., 2002).
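The sampling point in the list above can be made concrete with a little arithmetic. The following is a minimal sketch, not a method taken from the sources cited here: the group names, population percentages, and sample size are hypothetical values chosen only for illustration. It compares a proportional allocation of a norming sample with an equal allocation across subgroups.

```python
def proportional_allocation(population_percent, sample_size):
    """Split a sample across subgroups in proportion to their population share.

    population_percent maps each group to a whole-number percentage.
    Integer arithmetic keeps the result exact for this illustration.
    """
    return {
        group: sample_size * percent // 100
        for group, percent in population_percent.items()
    }


def equal_allocation(population_percent, sample_size):
    """Give every subgroup the same number of sampled test takers."""
    per_group = sample_size // len(population_percent)
    return {group: per_group for group in population_percent}


# Hypothetical population shares (percent), assumed for illustration only.
population_percent = {"group_a": 70, "group_b": 20, "group_c": 10}
sample_size = 1000

proportional = proportional_allocation(population_percent, sample_size)
equal = equal_allocation(population_percent, sample_size)

for group in population_percent:
    print(group, "proportional:", proportional[group], "equal:", equal[group])
```

Under these assumed numbers, proportional sampling places 700 of 1,000 sampled test takers in the largest group and only 100 in the smallest, while equal allocation gives each group 333 and so overrepresents the smaller groups relative to the population. This is the trade-off the sampling bullet describes.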
As stated above, there can be many different forms of test bias. When evaluating a test item or an entire testing instrument for bias, there are at least three issues that should be considered: fairness, bias, and stereotyping. Some questions that can be asked that address test item fairness include:
• Does the item give a positive representation of designated subgroups of interest?
• Is the test item material equally familiar to every designated subgroup of interest?
• Are designated subgroups of interest represented in relation to their presence in the general population being tested?
• Is there greater opportunity for members of one group to have prior knowledge of the vocabulary used? (A potentially unfair item might reference a regatta, a word that would be known to test takers who live or vacation near water or who own a boat but that might be unfamiliar to those who reside primarily in urban settings.)
• Is there greater opportunity for members of one group to have experience with a test item reference or to become familiar with the method that the item presents? (A potentially unfair question might reference a European or cross-country vacation, which some socioeconomic groups may not have experienced, or plowing a field, an activity with which many urban or suburban students may not be familiar) (Hambleton & Rodgers, 1995).
A test item “may be biased if it contains language that is differently familiar to subgroups of test takers, or if the item structure or format is differently difficult for subgroups of test takers” (Hambleton & Rodgers, 1995, p. 2).
A test item “may be language biased if it uses terms that are not used across the tested population or if it uses terms that have different connotations among groups of the tested population” (Hambleton & Rodgers, 1995, p. 2). An example of language bias against African American students occurred in a study in which students had to recognize and then name an object that began with the same sound as hand. One correct response would have been heart, but many African American students chose car instead because, in the slang they used, a car is known as a hog, which also begins with the same sound as hand. Therefore, although the African American students had understood the concept on which they were being tested, their answers were scored as incorrect because of language differences (Scheuneman, 1982, as cited in Hambleton & Rodgers, 1995).
A test item may be content biased if it refers to experiences or information that are not common across the tested population. A vocabulary or reading comprehension test that predominantly refers to rural experiences when knowledge of rural life is not being assessed would be an example of a test that is biased against urban students.
Some questions that can be asked to detect bias include:
• Does the item contain content that is differently familiar to designated subgroups of interest? (content bias)
• Will members of designated subgroups of interest answer a test item differently for reasons not based on the ability being measured? (content bias)
• Does a test item require the test taker to have information or skills that cannot be expected to be within the education background of all the test takers? (content bias)
• Does the test item contain words that have different or unfamiliar meanings for designated subgroups of interest? (language bias)
• Is the test item free of needlessly difficult vocabulary? (language bias)
• Is the test item free of group-specific language, vocabulary, or reference pronouns? (language bias)
• Does a test item contain any clues that would help one group and not another? (structure and format bias)
• Are there any inadequacies or ambiguities in the test instructions? (structure and format bias)
• Do the test instructions tend to differently confuse members of designated subgroups of interest? (structure and format bias)
Questions such as these can help test evaluators determine if there is content, language, and/or item structure and format bias (Hambleton & Rodgers, 1995, p. 4).
Stereotyping and inadequate representation of minorities are other forms of test bias. While this type of bias may not make a test item any harder for test takers, it can cause undue stress, which can prevent test takers from doing their best.
Examples of this type of bias might be test items that reinforce negative stereotypes, imply that a certain subgroup is inferior, or use derogatory terms such as housewife, Chinaman, colored people, red man, and lower class. For the sake of gender neutrality, job designations ending in man should also be avoided. Instead of policeman, test item writers should use “police officer,” and in place of fireman, “firefighter.” Depicting members of designated subgroups of interest as having stereotypical occupations, such as a Chinese launderer, should also be avoided (Hambleton & Rodgers, 1995).
Some questions used to detect stereotyping might include:
• Does the item represent designated subgroups of interest in a positive way?
• Are designated subgroups of interest referred to in the same way with respect to using names...