This article focuses on computer-assisted and computer-adaptive testing. A computer-adaptive test estimates a student's ability level during the test and adjusts the difficulty of the questions accordingly. The computer program selects the test items deemed appropriate for the student's ability level, the student then selects an answer using the mouse or keyboard, and the answer is automatically scored. This paper covers comparisons with paper-and-pencil tests and effects on student achievement, along with the advantages and disadvantages of computerized testing and considerations for implementation.
Keywords Adequate Yearly Progress; Computer-Adaptive Testing; Computer-Assisted Testing; High Stakes Testing; Item Response Theory; Latent Trait; No Child Left Behind Act of 2001 (NCLB); Standardized Testing
Paper-and-pencil tests have long been the standard for testing students. A computerized test can do the same thing, but students use a mouse and keyboard to mark their responses instead of a sheet of paper and a pencil or pen. A computer-adaptive test (CAT) format is able to estimate a student's ability level during the test and can adjust the difficulty of the questions accordingly. This means that students can be taking different versions of the same test and can be answering a different number of questions based on their ability levels.
Many organizations and associations that provide certification and licensure give their tests on computers. Computers are used for college course placement testing, professional certification exams, vocational interest assessments, aptitude tests, and workplace skills assessments (Greenberg, 1998). States have also begun using computer-adaptive testing for their statewide assessments and have found that the tests not only provide results more quickly but also measure student knowledge more reliably, especially for students at the very low and very high ends of the spectrum (Davis, 2012).
Item Response Theory
Computer-adaptive testing is based on item response theory. Item response theory uses mathematical functions to predict or explain a student's test performance using a set of factors called latent traits. The relationship between student item performance and these traits can be described by an item characteristic function. The item characteristic function specifies that students with higher scores on the traits have higher expected probabilities for answering an item correctly than students with lower scores on the traits (Hambleton, 1989). Computer-adaptive testing selects test items to match each student's ability. The computer program selects the test items deemed appropriate for the student's ability level, the student then selects an answer using either the mouse or keyboard, and then the answer is automatically scored.
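The item characteristic function described above can be illustrated with the widely used two-parameter logistic (2PL) model, in which the probability of a correct answer rises with the student's latent trait. The sketch below is for illustration only; the parameter names and default values are assumptions, not taken from any particular testing program.

```python
import math

def prob_correct(theta, a=1.0, b=0.0):
    """2PL item characteristic function.

    theta: student's latent trait (ability) estimate
    a: item discrimination (how sharply the item separates ability levels)
    b: item difficulty (the ability level at which P = 0.5)
    Returns the probability that a student at ability theta
    answers this item correctly.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Consistent with the function's definition, a student whose trait score sits above an item's difficulty has a better-than-even chance of answering it correctly, and a student below it has a worse-than-even chance.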
Computer-adaptive testing consists of two steps. The first step is selecting the difficulty level of the first test question to match the student's achievement level (Wainer, 2000, as cited in Latu & Chapman, 2002). In the second step, the question is scored, and the ability level is updated using the new information. If the student correctly answers the question, then a more difficult question or a question at the same level will be asked next. If the student incorrectly answers the question, then an easier question or question at the same level will be asked next. This process allows computer-adaptive testing to adjust the difficulty level of the assessment based on each student's current achievement level (Latu & Chapman, 2002).
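The two steps above can be sketched as a simple loop: select the available item closest to the current ability estimate, score the response, and move the estimate up or down. This is a minimal Python sketch under simplified assumptions (a halving step size stands in for a true item-response-theory update); the function and field names are hypothetical.

```python
def adaptive_test(item_bank, answer, start_level=0.0, num_items=5):
    """Minimal sketch of the two-step adaptive loop.

    item_bank: list of dicts, each with a "difficulty" value
    answer: callable that returns True if the student answers
            the given item correctly
    start_level: initial ability estimate (see starting-point rules)
    """
    items = list(item_bank)   # work on a copy of the bank
    theta = start_level       # current ability estimate
    step = 1.0                # adjustment size, halved each round
    for _ in range(num_items):
        # Step 1: pick the unused item closest to the estimate.
        item = min(items, key=lambda it: abs(it["difficulty"] - theta))
        items.remove(item)
        # Step 2: score the response and update the estimate.
        if answer(item):      # correct -> harder items next
            theta += step
        else:                 # incorrect -> easier items next
            theta -= step
        step /= 2             # narrow in on the ability level
    return theta
```

For example, a simulated student who answers correctly whenever an item's difficulty is at or below 1.0 will end the test with an ability estimate near that point.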
How Does Computer-Adaptive Testing Work?
Computer-adaptive testing begins with the first question based on the student's achievement level. This can be based on previous computer-adaptive tests or school reports (Hambleton, Zaal & Pieters, 2000, as cited in Latu & Chapman, 2002). If there is no personalized starting point for students, then the testing process begins at a predefined achievement level that can be set by the test administrator (Wise & Kingsbury, 2000, as cited in Latu & Chapman, 2002). There is some debate about the importance of the starting point. Some people believe that it does not matter what level the test begins at as long as it is short (Hambleton et al., 2000, as cited in Latu & Chapman, 2002), and others believe that beginning at an inappropriate level increases test anxiety for students (Wainer, 2000, as cited in Latu & Chapman, 2002), which can affect test performance.
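The starting-point rule described above amounts to a simple decision: use a personalized estimate (such as a previous CAT score or school report) when one exists, and fall back to an administrator-set default otherwise. A hypothetical sketch, with illustrative names:

```python
def starting_level(prior_estimate=None, default_level=0.0):
    """Choose the difficulty of the first test question.

    prior_estimate: ability level from a previous computer-adaptive
        test or school report, if one is available
    default_level: predefined level set by the test administrator
    """
    if prior_estimate is not None:
        return prior_estimate
    return default_level
```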
Since test validity depends on the test items used, computer-adaptive testing relies on its bank of test questions. A large item bank can help minimize test security issues because there are so many items, and it also allows the test to be tailored to more diverse student skill levels. However, increasing the size of the item bank can also increase the possibility of including flawed test items (Hambleton et al., 2000, as cited in Latu & Chapman, 2002). Since each test item is selected based on previous student responses, flawed items can have a large impact on students' scores (Potenza & Stocking, 1994; Wainer, 2000, as cited in Latu & Chapman, 2002). Paper-and-pencil tests allow for the removal of flawed test items because the person scoring can see that there is an issue with a question. This cannot be done with computer-adaptive testing because the scores are calculated automatically and the scorer cannot see that an item may be flawed (Latu & Chapman, 2002).
Computer-adaptive testing also requires a decision on when to stop the test. Several stopping rules can be used, including testing until a target level of measurement precision is reached or administering a fixed number of test items to each student (Wainer, 2000, as cited in Latu & Chapman, 2002). The fixed-length test is considered reasonable because it ensures that all students are given the same number of items, and it also keeps testing times short. However, there is a possibility that students will fail to answer some of the test items because they are too difficult (Mills & Stocking, 1995, as cited in Latu & Chapman, 2002). Testing until a level of measurement precision is reached can also pose challenges that require the judgment of the test administrator. For example, students taking a college course placement test in mathematics have had issues with computer-adaptive testing placing them in lower-level classes when they feel they should be placed higher. Their contention was that they were asked several questions on the same concept, such as scientific notation, which they did not know, so the testing session stopped and placed them in a lower-level mathematics class. In cases like these, allowing students to retest has generally shown that they were incorrectly placed, and most students were placed in higher-level mathematics classes based on their retest results.
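The two stopping approaches discussed here, a measurement-precision target and a fixed item count, can be combined in a single check that ends the test when either condition is met. A minimal sketch; the threshold values are illustrative assumptions, not standards from any actual testing system.

```python
def should_stop(standard_error, items_given, se_target=0.3, max_items=30):
    """Combined stopping rule for a computer-adaptive test.

    standard_error: current precision of the ability estimate
        (smaller means a more consistent measurement)
    items_given: number of items administered so far
    Stops when the estimate is precise enough (variable-length rule)
    or when the fixed item limit is reached (fixed-length rule).
    """
    return standard_error <= se_target or items_given >= max_items
```

A fixed-length-only test would simply set `se_target` to zero, while a purely precision-based test would set `max_items` very high, illustrating the trade-off between equal test lengths and tailored measurement.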
Computerized testing allows tests to be “linked to the district or state administering the test. This allows standards and performance outcomes to be linked to assessment measures and can provide” the district or state with valuable information about student progress (Olson, 2001, as cited in McHenry, Griffith & McHenry, 2004, ¶ 7) and whether or not the school is meeting the educational reforms set forth in President Obama’s 2010 Blueprint for Reform as well as the state Common Core Standards. This information can also be used by instructors to help align their curriculum with mandated standards and outcomes (McHenry et al., 2004).
Computer-adaptive testing came to prominence nationally in 1992 and continues to grow. The Graduate Record Examinations (GRE) was the first national test available in computer-assisted form, and a computer-adaptive version followed in 1993. Since the late 1990s, the Graduate Management Admission Test (GMAT) has administered its multiple-choice quantitative and verbal sections in a computer-adaptive format; the test is administered on paper in areas of the world with limited computer access. Scores from the quantitative and verbal sections are combined to provide an overall score, which is what business schools tend to focus on.
Nursing licensure examinations moved to a computer-adaptive format in 1994, when they became available only in computer-based form (Bugbee Jr., 1996). Computerized testing has become big business: in 1998, online testing products generated an estimated $750 million in revenues.
Types of Testing Packages
There are many software packages available that allow instructors to create their own exams or use prepackaged tests. An instructor can create a quiz or examination using a software program that allows him or her to create and upload original questions, or the instructor can select questions from the program's test-item bank. After students complete the test, it can be scored instantaneously and downloaded to...