Test reliability describes the degree to which a test consistently measures the knowledge or abilities it is supposed to measure. Many factors can cause test results to be inconsistent and, therefore, render a test unreliable. A test may contain unclear directions or flawed test items; the students taking a test can be distracted, ill, or fatigued; and test scorers may misunderstand a test rubric or hold biases. Test developers and instructors can check test reliability through a number of statistically based methods, including test-retest, split-half, internal consistency, and alternate form. They can also improve test reliability by adjusting a test's length and improving test item quality.
Keywords Alternate Form Reliability; High-Stakes Tests; Internal Consistency Reliability; Item Quality; No Child Left Behind Act of 2001 (NCLB); Norm-Referenced Test; Performance-Based Assessment; Reliability; Split-Half Reliability; Standardized Tests; Test Bias; Test Length; Test-Retest Reliability; Validity
Reliability has been defined as "the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker" (Berkowitz, Wolkowitz, Fitch & Kopriva, 2000, as cited in Rudner & Schafer, 2001). Reliability is the extent to which the measurements gained from a test are derived from the knowledge or ability being measured; a test with a high degree of reliability has a small degree of random error, which should make test scores more consistent as the test is repeatedly administered. However, a test's reliability depends upon both the testing instrument and the students taking the test, which means that the reliability of any test can vary from group to group. This is why, before a school, district, or state adopts a test, the reliability of the sample and the reliability of the norming groups should be considered (Rudner & Schafer, 2001).
In order for test scores to be used to make accurate judgments about students' abilities, they must be both reliable and valid. Test validity refers to how well a test truly measures the knowledge and abilities it is designed to measure. A test must first be determined to be reliable and then be assessed for validity. It is possible for tests to be reliable but not valid. If a test is reliable, it can be counted on to report similar results when taken by similar groups under similar circumstances over a period of time (Moriarty, 2002). Reliability is vital to test development because, in the current educational culture of high-stakes testing, tests must be accurate assessments of student progress and achievement.
Test reliability also describes the consistency of students' scores as they take different forms of a test. No two test forms will produce identical results on a consistent basis, for a number of reasons. Since the forms are not identical, differences in test items may skew results. Additionally, students may make errors, feel ill, or be fatigued on a test day. External factors such as poor lighting, excessive noise, or room temperature can also interfere with testing.
However, even though all scores will not be identical, it is expected that they will be similar, which is one reason that test reliability is described in terms of degree of error. Checking for test reliability can determine the extent to which students' scores reflect random measurement errors, which fall into three categories: test factors, student factors, and scoring factors. Test factors that affect reliability can include test items, test directions, and (for multiple-choice tests) ambiguous test item responses. Student factors include lack of motivation, concentration lapses, fatigue, memory lapses, carelessness, and sheer luck. Scoring factors that affect a test's reliability include ambiguous scoring guidelines, carelessness on the part of the test scorer, and computational errors. These are all considered random errors because how they affect students' scores is unpredictable: sometimes they can help students and at other times they can hinder students (Wells & Wollack, 2003).
All tests contain some degree of error; there is no such thing as a perfect test. However, while errors may be unavoidable, the primary goal of test developers is to limit errors to a level that mirrors the purposes of the assessment. For example, a high-stakes test, such as an examination to grant a high school diploma, a license, or college admission, needs to have a small margin of error. In a low-stakes environment, however, an instructor-developed assessment can tolerate a larger margin of error, since the results can be offset by other forms of assessment (Rudner & Schafer, 2001). If students' grades will be based solely on one examination, then the examination must have a high degree of reliability; but, in general, classroom tests can have a lower degree of reliability because most instructors also consider assessments like homework, papers, projects, presentations, participation, and other tests when determining student grades (Wells & Wollack, 2003).
Checking for Reliability
The four most commonly used methods of checking test reliability are test-retest, split-half, internal consistency, and alternate form. All are statistically based and are used to evaluate the stability of a grouping of test scores (Rudner & Schafer, 2001).
Test-retest reliability is a coefficient obtained by administering the exact same exam twice and then correlating the two sets of results. In theory, this can be a good measure of score consistency, as it provides a clear, constant measurement that carries from one administration to another. However, it is not widely endorsed as a check for reliability because of the challenges and limitations that go along with it. First, it requires that the same test be given twice to the same group of students, which can be costly and time-consuming. Additionally, it is difficult to say whether the resulting coefficient reflects the test's true reliability. If the second administration is given within too short a time period, then student responses may be artificially consistent because students can remember the test questions; students may also have looked up the answers to questions they could not answer on the first administration. Alternatively, if the second administration is given at too late a date, then students' answers can be skewed by the knowledge they have acquired during the time between the tests (Rudner & Schafer, 2001).
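The test-retest coefficient is typically computed as the Pearson correlation between the two administrations. The sketch below illustrates this with hypothetical scores for six students (the data and function names are illustrative, not drawn from the sources cited here):

```python
# Test-retest reliability as the Pearson correlation between two
# administrations of the same exam, for the same group of students.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores: same six students, two administrations.
first_admin = [78, 85, 62, 90, 71, 88]
second_admin = [80, 83, 65, 92, 70, 85]

print(round(pearson(first_admin, second_admin), 3))
```

A coefficient near 1.0 indicates stable scores across administrations; the memory and learning effects described above would show up as an inflated or deflated coefficient, which the statistic itself cannot distinguish from true (un)reliability.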
Split-half reliability is a coefficient attained by halving a test and its contents and comparing the results of each half. Because longer tests are usually more reliable than shorter ones, the coefficient must often be corrected for length. Tests can be split in half by using the odd-numbered questions for one half and the even-numbered questions for the other; by randomly selecting which items go in each half; or by manually selecting which items go in each half in an attempt to balance content and level of difficulty. This method is advantageous because it requires the test to be given only once. Its disadvantage is that the coefficient will vary based on how the test was divided. The method is also not appropriate for exams on which students' scores may be affected by a time limit (Rudner & Schafer, 2001).
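An odd/even split can be sketched as follows, with the half-test correlation stepped up to full length using the Spearman-Brown correction, the standard length adjustment for this method (the 0/1 item matrix below is hypothetical):

```python
# Split-half reliability: correlate odd-item and even-item half scores,
# then apply the Spearman-Brown correction for full test length.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical data: rows are students, columns are items (1 = correct).
items = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

odd_half = [sum(row[0::2]) for row in items]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]  # items 2, 4, 6, 8

r_half = pearson(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction

print(round(r_half, 3), round(r_full, 3))
```

Note that a different split of the same items would generally yield a different coefficient, which is the method's main disadvantage as described above.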
Internal consistency looks at how consistently the individual items within a single administration of a test measure the same content. Content sampling, which internal consistency estimates, is usually the largest component of measurement error (Rudner, 1994). The purpose of the exam is more than to simply determine how many items students can correctly answer; it is also to measure students' knowledge of the content covered by the testing instrument. In order to accomplish this, the items on the test must be sampled so that they are representative of the entire domain. The expectation is that students who have mastered the content will perform well and those who have yet to master the content will not do as well, regardless of the items used on the testing instrument (Wells & Wollack, 2003). The...
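A common internal-consistency estimate is Cronbach's alpha, which compares the variance of students' total scores with the summed variances of the individual items. The following is a minimal sketch using the same kind of hypothetical 0/1 item matrix as above (the data are invented for illustration):

```python
# Cronbach's alpha: an internal-consistency estimate computed from a
# single test administration. Rows are students, columns are items.
def pvariance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical item scores (1 = correct, 0 = incorrect).
items = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

k = len(items[0])                        # number of items
totals = [sum(row) for row in items]     # each student's total score
item_columns = list(zip(*items))         # per-item score columns

item_var_sum = sum(pvariance(col) for col in item_columns)
alpha = (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

print(round(alpha, 3))
```

Higher alpha values indicate that the items hang together in measuring the same domain; a low alpha suggests the items sample heterogeneous content or contain flawed questions.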