Classical Test Theory

(Research Starters)

Classical test theory (CTT) is a branch of psychometrics that aims to predict the outcome of entire tests, or the responses to specific test items, based on completed tests and test items used for data collection. While CTT still has support in the psychometric community, today it is generally used alongside other theories of assessment to test the reliability of various assessment instruments. It is one of the most influential theories regarding test scores in the social science field.

Keywords Attention Deficit Hyperactivity Disorder (ADHD); Classical Test Theory (CTT); Diagnostic and Statistical Manual of Mental Disorders IV (DSM-IV); Generalizability Theory; Item Response Theory (IRT); Normative Response; Psychometrics; Rasch Analysis; Reliability; Validity


Classical test theory (CTT) is a branch of psychometrics (psychological testing) used to measure, and often to predict, the outcome of various tests, the difficulty of items within a test, and/or the ability of test-takers. It is one of the most influential theories regarding test scores in the social science field. According to Allyson Lent, Research Assistant at the Neuropsychology Lab at Kessler Medical Rehabilitation Research and Education Center, the purpose of CTT is to understand and improve the reliability of psychological tests, usually by two means. The first is test-retest reliability - keeping an item when test-takers give the same response over several trials. The second is alternate test reliability - keeping an item when test-takers give the same response on an alternate version of the same test (Personal communication, October 17, 2007).
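One common way to put a number on test-retest reliability is to correlate scores from two administrations of the same test; the same calculation applies to alternate test reliability when the second list comes from the alternate version. The sketch below is illustrative only: the Pearson correlation is a standard operationalization rather than a method attributed to the source, and the function name and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five test-takers on two trials of the same test.
first_trial = [12, 15, 9, 20, 17]
second_trial = [13, 14, 10, 19, 18]

retest_reliability = pearson_r(first_trial, second_trial)
print(round(retest_reliability, 3))
```

A value close to 1 indicates that test-takers kept nearly the same rank order across trials, which is the sense in which an instrument is called reliable over time.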

Charles Spearman created the theory in 1904, and it was loosely utilized until 1966, when M. R. Novick put its use at the forefront of psychological theory (Novick, 1966). CTT can be identified as the theory of the true score: it takes into account the previous scores of a test item or a test-taking population to predict a future score for the same item or population. Using the previous scores, classical test theorists can predict which test questions will be answered correctly and which population tends to answer the questions successfully. Successful responses are then referred to as normative responses.

When considering a population, the entire population must be taken into account. For example, if all of the eleventh graders in the United States took the Advanced Placement Exam (APE) for English and the same overall score was identified trial after trial, that score would be identified as the normative score for the population of eleventh graders in the United States who took the APE for English. This is not to say that Joe, an eleventh grader on my street, will generate that score simply because he is in eleventh grade and lives within the United States; the normative score is meaningless when applied to any individual - including Joe - if that individual is not grouped within the population for which the score is identified. In this example, the population is all of the eleventh graders in the United States who took the APE for English rather than specific individuals who took the exam. Joe himself could individually score higher or lower than the normative score; however, CTT can make reliable identifications based on populations of people or on individuals, depending upon the purpose of the test itself. Its predictive value lies only in its ability to show that a test item or instrument is reliable over time, with either test-retest reliability or alternate test reliability used to determine that value.

Item Response Theory

Classical test theorists are often in conflict with item response theorists, as item response theory (IRT) focuses on a correlation of specific items or specific individuals. IRT testing models are based on "the relationship between ability (or trait) and performance for each individual item" (Reid, Kolakowsky-Hayner, Lewis & Armstrong, 2007, p. 179). In some cases, both CTT and IRT are used to identify reliability and validity in a test item or question. However, more psychologists now use IRT, as validity (testing what is supposed to be tested) can be difficult to establish across populations, even if the sample size is small. Generalizability theory encompasses CTT in that the former is also known for its true-score attributes; it was created after CTT and extends it.

Research is often based on formulas. When CTT is broken down to its simplest form, it has only one basic "condition." Using X as the observed score (the score actually recorded), T as the true score (the score a test-taker would earn on a test or test question if measurement were free of error), and e as the error introduced by faulty test design or test-taker performance, the formula is "X = T + e." This shows that the observed score is equal to the true score plus an allowance for error. From research study to research study, numbers are plugged into this equation and statisticians come up with scores for measurement purposes. One of the biggest concerns about using CTT is that while there can be several different types of error - from testing environment to tester bias - CTT only allows the estimation of one type of error at a time. Therefore, if Joe's score were misread by a Scantron machine and the administrator of the test left the room when he wasn't supposed to, CTT could not be used to profile Joe's test. Generalizability theory takes into account that there may be multiple sources of error at once, and its more complicated formula simply encompasses those multiple variables, which CTT cannot do.
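The relationship X = T + e can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: true scores and errors are drawn from normal distributions with made-up means and spreads, the error is assumed to be independent of the true score, and the function name is hypothetical. Under those assumptions, the standard CTT definition of reliability is the share of observed-score variance that comes from true scores, var(T) / var(X).

```python
import random
from statistics import pvariance


def simulate_observed_scores(n, true_mean, true_sd, error_sd, seed=0):
    """Draw n true scores T and errors e, and return (T, e, X) with X = T + e."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


# Assumed values: true-score SD of 10 and error SD of 5, so the
# theoretical reliability is 10**2 / (10**2 + 5**2) = 0.8.
true_scores, errors, observed = simulate_observed_scores(10_000, 50.0, 10.0, 5.0)

reliability = pvariance(true_scores) / pvariance(observed)
print(round(reliability, 2))
```

In practice T is never observed directly, which is why reliability has to be estimated indirectly, for example from repeated or alternate administrations. Note also that the single term e lumps every source of error together, which is the limitation the paragraph above describes.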


ADHD Information Reporting

According to the CDC, in 2003, approximately 4.4 million youth ages 4-17 were diagnosed by a healthcare professional as having ADHD. In addition, 2.5 million of those youth were receiving medication treatment for the disorder. With statistics like that, it is imperative that a medication be shown to be effective before it is used to treat the disorder. ADHD is noted when inattention, inappropriate or impulsive behavior, and/or hyperactivity has been identified, usually in a child. These behaviors are generally noticed while children are at school or at home, and often, the time frame for the ADHD "behavior" is specific (Corkum, Andreou, Schachar, Tannock & Cunningham, 2007). As such, Corkum, et al. (2007) explain that

…treatment-sensitive instruments that are feasible, yield valid and reliable scores, and measure outcome in a "time-locked" and "situation- and symptom-specific" manner [need to be created]. These instruments are needed to evaluate the outcome for which the treatment is targeted at specific settings (e.g., school), specific times of day (e.g., the late afternoon or early evening medication dose), and specific symptoms (e.g., hyperactivity) (p. 169).

Using a TIP

The Telephone Interview Probe (TIP) was developed for this purpose, and a study was conducted to measure the effects of a medication treatment based on the specifics described above. CTT as well as generalizability theory were used to evaluate the TIP over the course of the study.

In addition to reliability statistics derived from classical test theory, this study also used generalizability theory in the assessment of reliability. The basic assumption of generalizability theory is that there exist multiple potential sources of error in each observed score. In classical test theory, each form of reliability (intraobserver, interobserver, test-retest, etc.) identifies and quantifies only one source of error, whereas generalizability theory provides a means of combining all sources of variability into a single study (Corkum, et al., 2007, p. 171).
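To show what separating sources of error in one study can look like, the sketch below estimates variance components for the simplest generalizability design: every person is rated once by every rater (a fully crossed persons-by-raters design). This is a generic textbook-style calculation, not the analysis or the data of the Corkum et al. study; the function name, the design, and the ratings are all hypothetical.

```python
def g_study(scores):
    """Variance components for a crossed persons x raters design.

    scores[p][r] is the single rating of person p by rater r.
    """
    n_p = len(scores)
    n_r = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for persons, raters, and the residual
    # (person-by-rater interaction confounded with random error).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square estimates; negative estimates are set to zero.
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)

    # Relative generalizability coefficient for the mean over n_r raters.
    g_coef = var_p / (var_p + var_res / n_r)
    return {"person": var_p, "rater": var_r, "residual": var_res, "g": g_coef}


# Hypothetical ratings of four children (rows) by two raters (columns).
ratings = [[2, 3], [4, 4], [1, 2], [5, 5]]
components = g_study(ratings)
print(components)
```

Here the rater component and the residual component are reported side by side, whereas a single CTT reliability coefficient would fold all error into one term. Designs with more facets (for example, occasions as well as raters) follow the same logic with additional components.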

Behavior-rating scales are often used to measure the results of various clinical treatments (Schachar & Tannock, 1993). The TIP includes a rating scale within a semistructured interview. In the interview, impressions of a child's behavior during specific time periods (during the school day and after the school day) are identified. The reporters of the child's behavior are the child's school teacher and the child's parents. The core symptoms of ADHD are measured, along with oppositional behavior and problematic situations.

The sample for the Corkum, et al. (2007) study was ninety-one children in a large urban community in Canada who were identified under the Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R) as having pervasive ADHD (p. 171). Children were randomly divided into placebo groups and treatment groups (receiving methylphenidate, a short-acting ADHD medication) and monitored over a four-month time period.

Audiotaped interviews were conducted between psychology graduate assistants (as the interviewers) and parents and teachers (as informants). "Because the TIP was designed to be a semistructured interview rather than a questionnaire, the interviewer could discuss reasons for the informant's ratings and help the informants make their ratings" (Corkum, et al., 2007, p. 173).

The TIP allows separate ratings of each core symptom of ADHD (inattention, impulsiveness, and hyperactivity), oppositional behavior, and problem situations for both the morning and afternoon/evening of a particular day. The parent and teacher versions use similar formats but are adapted to reflect their particular settings (i.e., home and school). Respondents rated a child's behavior on a six-point scale to pinpoint the severity of behavior during routine activities. For example, parents were asked about a child's behavior before and after school (getting out of bed, getting dressed, adjusting to being back home and getting ready for bed), while teachers were asked about in-school activities such as getting materials ready for class and working individually throughout the day (Corkum, et al., 2007).

Corkum, et al. (2007) found a "statistically significant difference between the two groups at 4 months on all of the scales, with less challenging behavior reported for the children in the methylphenidate group" from the teacher interviews (p. 183). As methylphenidate has a half-life of four hours, its effects were not noticeable when the children were with their parents before or after school (p. 183).

With this type of data, the TIP was identified...
