In a nation where high-stakes testing has become so prevalent in education, it is important that tests provide an accurate assessment of student progress and achievement. For test scores to support sound judgments about students' abilities, the tests that produce them must be both reliable and valid. A test must first be reliable, and then be assessed for its validity. Test validity comprises three types: construct validity, content validity, and criterion validity. Consequential validity, a recent and still debated form of test validity, and the relationship of validity and reliability are also covered. A brief history of how test validity has developed and some questions that should be asked when validating a test are also included.
Keywords Assessment; College-Level Examination Program (CLEP) Test; Consequential Validity; Construct Validity; Content Validity; Criterion Validity; High-Stakes Tests; No Child Left Behind Act of 2001 (NCLB); Reliability; Standardized Tests; Test Bias
In a nation where high-stakes testing has become so prevalent in education, it is important that tests provide an accurate assessment of student progress and achievement. In order for test scores to be used to make accurate judgments about students' abilities, they must be both reliable and valid. A test must first be reliable, and then be assessed for its validity.
• Test reliability is the extent to which a test consistently measures what it is supposed to measure. Reliability deals with the way a test is constructed. If a test is reliable, it can be counted on to report similar results when taken by similar groups under similar circumstances over a period of time. A reliable test is free from errors in its construction and measurement.
• Test validity refers to how well a test measures what it is supposed to measure. Validity is a matter of degree rather than an all-or-nothing property: a test is not completely valid or completely invalid, and as validity evidence continues to be gathered, new results can strengthen or contradict a test’s previous findings (Messick, n.d., as cited in College Board, 2007b).
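The two definitions above can be made concrete with a small numerical sketch. It assumes one common way of quantifying each idea: reliability as the correlation between two sittings of the same test (a test-retest estimate), and validity as the correlation between test scores and a later outcome the test is meant to predict. The scores, the variable names, and the helper `pearson_r` are all invented for illustration and do not come from any real test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data for six students (assumed values, illustration only).
first_sitting = [52, 61, 70, 74, 83, 90]          # scores on the first sitting
second_sitting = [54, 60, 72, 73, 85, 89]         # same test, same students, later
later_outcome = [3.1, 2.0, 3.6, 2.4, 2.9, 2.7]    # outcome the test claims to predict

# Test-retest reliability estimate: do the two sittings agree?
reliability_estimate = pearson_r(first_sitting, second_sitting)

# Validity estimate: do the scores track the outcome they are meant to reflect?
validity_estimate = pearson_r(first_sitting, later_outcome)

print(f"reliability estimate: {reliability_estimate:.2f}")
print(f"validity estimate:    {validity_estimate:.2f}")
```

With these made-up numbers the two sittings agree closely while the scores barely track the outcome, which is the pattern of a test that is consistent without measuring what it is intended to measure.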
It is also possible for tests to be reliable but not valid. Testing instruments are everywhere and can assess practically anything. In such a high-stakes testing environment, one of the most crucial considerations is how test scores are used and how they can affect students, schools, districts, and states. Tests can be used to meet the stipulations of the No Child Left Behind Act, to determine admission to schools or programs, to determine high school graduation or grade retention, and to diagnose educational deficiencies. With stakes this high, it is important that the assessment selected is appropriate for the situation.
Test validation refers to verifying the use of a test in a certain framework of circumstances, such as admission into a gifted and talented program, high school graduation, and college entrance. Therefore, one aspect of test validation is studying test scores in the setting they were used in to see if the results are adequately and appropriately measuring what the test purports to measure (College Board, 2007b).
Test validation has evolved from using a single approach to testing for validity to multiple procedures used sequentially over the development of the testing instrument (Jackson, 1970, 1973; Guion, 1983, as cited in Anastasi, 1986). Validity is part of the test from the beginning of the testing instrument's development. Validation begins with identifying construct definitions by looking at theories, prior research, observation, and/or analysis of relevant behavior. Test items are then written to suit the construct definitions. In empirical item analysis, the most valid items are selected from the pool of test items. Other analyses, including factor analyses, may follow. Once the instrument has been developed, its scores are validated through statistical analyses against other criteria (Anastasi, 1986).
Before establishing the validity of a test, decisions need to be made as to which test and test scores to validate. It is usually a good idea to consider several options. Because validity is a matter of degree and not all or nothing, several tests, or a combination of a test and other factors, may fit the need. For example, if college personnel are looking for an admissions test, they should consider what results arise if they do not use any test; if they use a combination of a test, high school grade point average, and essay; or if they use one particular test. By comparing the results of the possibilities, they will have a good indication of whether the test is valid. And if the test is as valid when used alone as when it is combined with other factors, admissions personnel can save a lot of time and effort (College Board, 2007b).
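The comparison of admissions options described above might be sketched as follows. This is only one possible approach, under stated assumptions: each option is scored by how strongly it correlates with first-year college grades, and the combined option is an equally weighted sum of standardized scores. The data, the variable names, and the helpers `pearson_r` and `z_scores` are hypothetical, and the essay is left out to keep the example short.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def z_scores(values):
    """Standardize values so different scales can be added together."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical records for eight admitted students (assumed values).
admissions_test = [480, 520, 560, 600, 640, 700, 720, 760]
high_school_gpa = [2.6, 3.4, 2.9, 3.1, 3.8, 3.0, 3.6, 3.9]
first_year_gpa = [2.3, 2.9, 2.7, 2.8, 3.4, 3.0, 3.3, 3.7]  # the criterion

# Equal-weight composite of the two predictors (an assumption of this sketch).
composite = [
    t + g for t, g in zip(z_scores(admissions_test), z_scores(high_school_gpa))
]

# Correlation of each option with the criterion.
options = {
    "test alone": pearson_r(admissions_test, first_year_gpa),
    "high school GPA alone": pearson_r(high_school_gpa, first_year_gpa),
    "test + GPA composite": pearson_r(composite, first_year_gpa),
}

for name, r in options.items():
    print(f"{name}: r = {r:.2f}")
```

If the test alone correlated with the criterion about as well as the composite did, that would support using the test by itself; a real validation study would rely on much larger samples and more careful statistical modeling than this sketch.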
History of Test Validity
One of the earliest references to test validity came in 1928 from psychologist Clark Hull, and in 1937 H. E. Garrett asserted that test validity is the "extent to which a test measures what it purports to measure" (Garrett, 1937, as cited in Geisinger, 1992, p. 199). The first formal publication to include information on test validity came in 1954 after the American Psychological Association convened a committee to develop standards for psychological tests (Gray, 1997). The 1954 handbook the committee developed provided the first set of professional test standards and was called Technical Recommendations for Psychological Tests and Diagnostic Techniques. It claimed that there were four basic categories of test validity:
• Content Validity
• Predictive Validity
• Concurrent Validity
• Construct Validity
The handbook stated that it was incumbent on the test users to validate the test for the purposes for which they were planning to use it. This made validating a test the responsibility of both the test publisher and the test user. It also now required test users to identify how they planned to use tests and then validate according to the proposed purpose (Geisinger, 1992).
The APA Handbook of Standards
From the 1950s through the 1970s test validity was considered to be dependent on use-specific and situation-specific correlations (Geisinger, 1992). In 1966 the American Psychological Association revised its 1954 handbook and renamed it Standards for Educational and Psychological Tests and Manuals (APA, 1966, as cited in Geisinger, 1992). The revised edition advocated matching the use of a test with a validation strategy for supporting the test (Messick, 1989, as cited in Geisinger, 1992). It also merged predictive and concurrent validity into a single category, criterion validity (Geisinger, 1992; Gray, 1997). The handbook also suggested that it would be a good idea to validate a test using more than one approach.
The next revision to the handbook came in 1974. This edition mentioned the social consequences of testing and stated that adverse impact and test bias should be considered whenever a test's validity is evaluated. It also recommended that content validity consider test-taker behavior. In 1985, the American Psychological Association teamed up with the American Educational Research Association and the National Council on Measurement in Education to revise the handbook; the new edition covered test qualities, validity, reliability, and test uses, and presented test validation as a unified undertaking (Geisinger, 1992). In 1999, the three entities jointly revised the handbook to reflect changes in federal law, measurement trends that influence validity, surveying students with disabilities, and testing English language learners, among other things (AERA Books, 2006).
There are now three types of validity measured in any testing instrument:
• Content Validity
• Criterion Validity
• Construct Validity