What are observational methods in psychology research?

Quick Answer
Humans are poor observers: they omit, overemphasize, and distort various aspects of what they have seen. Observational methods in psychology have been devised to control or to eliminate this problem. These methods increase the accuracy of observations by reducing the effects of perceptual distortion and bias. The development of this methodology has been central to the evolution of scientific psychology.
Expert Answers
eNotes educator | Certified Educator

Humans have tremendous difficulty making accurate observations. Different people will perceive the same event differently; they apply their own interpretations to what they see. One’s perception or recollection of an event, although it seems accurate, may well be faulty. This fact creates problems in science, because science requires objective observation.

In large part, this problem is eliminated through the use of scientific instruments to make observations. Many situations exist, however, in which the experimenter is still the recorder. Therefore, methods must be available to prevent bias, distortion, and omission from contaminating observations. Behavior may be observed within natural settings. When using naturalistic observation, scientists only watch behavior; they do not interfere with it.

History of Research

The need for an observational methodology that ensures objective data became apparent early in the history of scientific psychology. In 1913, John B. Watson, an early American behaviorist, argued that for psychology to become a science at all, it must eliminate the influence of subjective judgment. Watson’s influence caused psychology to shift from the subjective study of mental processes to the objective study of behavior. Because behavior is tangible and observable, this shift improved the reliability of observation dramatically. In the 1920s, the operational definition (a description of behavior in terms that are unambiguous, observable, and easily measured) was introduced. Such definitions greatly improved communication among psychologists and allowed them to develop experiments that met the scientific criterion of repeatability: different researchers must be able to repeat the experiment and get similar results.

It soon became apparent, however, that operational definitions alone were not enough. The expectations of researchers biased their observations, even when those observations focused on operationally defined behavior. This led to the development of techniques to reduce or control for experimenter bias. Establishing interrater reliability, which checks whether independent observers score the same behavior in the same way, is one such technique. Using observers who are uninformed about the researchers’ expectations also reduces experimenter bias.

In 1976, Robert Rosenthal reported results showing that subject expectations can also contaminate observational data. Simply observing subjects alters their behavior, and how it changes depends on the subjects’ interpretation of the situation and their motivation. If subjects discover what the experimenters expect, they can decide to help or to hinder the progress of the research. This type of reactivity severely compromises the accuracy of observational data. Although this is a problem associated primarily with human research, animals also react to observers, which is why it is important to allow sufficient time for animals under observation to habituate to one’s presence. Efforts to refine and improve observational methodology continue. Attention is now primarily directed at developing equipment to automate the observational process. The goal is to improve objectivity by removing the experimenter from the situation altogether.

Behavioral Taxonomy

To make observation as accurate and objective as possible, researchers use a behavioral taxonomy: a set of behavioral categories that describe the behavior of the subjects under study. To develop a behavioral taxonomy, the experimenter must first spend time simply watching the population of interest. The observer’s presence will alter the subjects’ behavior at first. Organisms are reactive, so their initial behavior in the presence of an observer is not typical. Once they become accustomed to being observed, however, their behavior returns to normal. This initial observation period, called the habituation period, is important for two reasons. First, it allows the subjects time to become accustomed to the observer’s presence. Second, the researcher learns about the subjects by observing them in as many different situations as possible. During this time a diary is kept, in which behaviors and their possible functions are jotted down as they are seen. This diary will not be entirely accurate: the observer might misjudge how often a behavior occurred or overemphasize interesting behaviors. To overcome these problems, a behavioral taxonomy must be developed.

The taxonomy will include several behavioral categories. Each category describes a specific behavior. During observation, when the behavior is seen, the category is scored. Categories can be either general or specific. General categories permit very consistent, and hence reliable, scoring of behavior, but they are less precise. Specific categories are more precise but make scoring behavior more difficult and less reliable. Whether the categories are general or specific, all taxonomies must meet three criteria: the categories must be clearly defined, mutually exclusive, and exhaustive.

All categories within the behavioral taxonomy must be operationally defined. Operationally defining a category means that one will describe, in concrete terms, exactly what one means by the category name. Operational definitions are used to indicate exactly what one must see to score the category. This serves to eliminate subjective judgment when scoring observations. It also permits scientists to communicate precisely about which behaviors are being studied.

Determining Reliability and Validity

The next step is to determine whether category definitions are reliable and valid. The term “reliable” refers to whether the definitions permit one to score the behavioral category consistently. To determine whether a definition is reliable, interrater reliability is established: this shows whether two independent observers agree in scoring behavioral categories. If the rate of agreement is high, the category is reliable. For the taxonomy itself to be reliable, all its categories must be reliable. Validity is established when one can show that a category really measures what it is intended to measure. This is very important, as it is not unusual to infer the function of a behavior, only to discover later that the behavior served an entirely different purpose. One way to establish validity is to show a relationship between the category definition and independent assessments of the same behavior.

Exclusive and Exhaustive Categories

Once taxonomic categories are clearly defined, one must be sure that they are mutually exclusive. This means that each behavior one observes should fit into one, and only one, category; there should be no overlap of meaning between categories. With overlap, the observer must guess which behavioral category to score. Such a judgment is subjective, and it reduces both the reliability of the taxonomy and the objectivity of the observations.

Finally, the categories should be exhaustive. This means that the categories, as a group, must cover all the behaviors that subjects are capable of showing. Ideally, there should be no behavior that cannot be scored. If the categories are not exhaustive, one will get a distorted idea of how often a particular behavior occurs. A taxonomy must not be developed so as to overrepresent behaviors one finds interesting; mundane behaviors must be included as well. In this way one can calculate how often each behavior occurs. Although efforts to develop an exhaustive taxonomy must be made, in reality a fully exhaustive taxonomy is impossible, because new behaviors will invariably be seen throughout the course of extended observation. To control for this problem, observers include a category entitled “other.” In this way, one can score a behavior even if one has never seen it before. By examining the number of times the “other” category is scored, one can get an idea of how exhaustive the taxonomy is.
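As a rough illustration of how the “other” category can serve as a check on exhaustiveness, the following minimal Python sketch tallies scored behaviors and reports the share that fell into “other.” The category names and the sample session are invented for illustration and are not taken from any actual study.

```python
from collections import Counter

# Hypothetical taxonomy; these category names are illustrative assumptions.
TAXONOMY = {"groom", "feed", "rest", "locomote"}


def score(observed_behaviors, taxonomy=TAXONOMY):
    """Tally each observed behavior, scoring anything outside the taxonomy as 'other'."""
    return Counter(b if b in taxonomy else "other" for b in observed_behaviors)


def other_proportion(tallies):
    """Return the share of all scored behaviors that fell into 'other'."""
    total = sum(tallies.values())
    return tallies["other"] / total if total else 0.0


# Invented observation session: 'tail-chase' is not covered by the taxonomy.
session = ["feed", "groom", "rest", "rest", "tail-chase", "feed", "locomote", "rest"]
tallies = score(session)
print(dict(tallies))
print(f"'other' share: {other_proportion(tallies):.3f}")
```

A low share suggests the taxonomy covers most of what the subjects do; a high share suggests that new categories are needed.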

Taxonomy Approaches

In measuring behavior with a taxonomy, one can take several approaches. For example, one could use a clock to measure how long each behavior is observed. Using a duration approach is most useful when low-frequency, high-duration behavior is present. One could also quantify how often each behavioral category is scored. The frequency approach is most useful for scoring high-frequency, short-duration behaviors. One could use either the duration or the frequency approach separately or combine the two. Finally, the length and number of observational periods must be determined. In general, the more observational periods used, the better. With respect to length, the observational period must be long enough to permit adequate observation of behavior, but short enough so that one does not become tired and miss important behavior.
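The two approaches can be combined from a single observation log. The sketch below assumes, purely for illustration, that each scored behavior is recorded as a category with start and end times in seconds; the record format and the values are hypothetical.

```python
from collections import defaultdict

# Each record is (category, start_seconds, end_seconds).
# The values are invented for illustration.
records = [
    ("rest", 0, 120),
    ("groom", 120, 150),
    ("feed", 150, 155),
    ("groom", 155, 160),
    ("rest", 160, 300),
]


def summarize(records):
    """Return {category: (frequency, total_duration_seconds)}."""
    frequency = defaultdict(int)
    duration = defaultdict(float)
    for category, start, end in records:
        frequency[category] += 1          # frequency approach: count each scoring
        duration[category] += end - start  # duration approach: accumulate elapsed time
    return {c: (frequency[c], duration[c]) for c in frequency}


summary = summarize(records)
for category, (count, seconds) in summary.items():
    print(f"{category}: scored {count} times, {seconds:.0f} s in total")
```

In this invented log, resting and grooming are scored equally often, but resting occupies far more of the session, which is the kind of difference the duration approach is meant to reveal.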

Applied Research

An applied example of behavioral taxonomy is its use by researchers to describe monkey behavior. The first step would be to spend many days watching the monkeys’ behavior. During this time, the observers would be writing down, in diary form, the behaviors that they see. They would also indicate the function they believe that each behavior serves. The monkeys may appear disturbed or agitated during these initial observations; as time goes by, however, their behavior would become less agitated and they would pay less attention to the observers’ presence. Here one can see the importance of the habituation period. If observers had begun recording behavior from the start, they would probably have described the monkeys inaccurately in some respect.

With the information acquired during the habituation period, the researchers would begin to develop a behavioral taxonomy. They must decide how general or specific the categories in the taxonomy will be. This depends primarily on their purpose. If the categories must be very sensitive to change in behavior, they should be specific. If not, broader categories can be used. Once categories are selected, they are operationally defined. A category for aggression, for example, could be operationally defined as “grabbing and shaking the cage fence while maintaining eye contact with the experimenter.” Note that this definition is clear and concrete. That is, it is based on observable behavior.

In developing the list of behavioral categories, researchers must be sure they are mutually exclusive and exhaustive. To be mutually exclusive, categories must be defined so there is no overlap in meaning between them. To illustrate, the vocalization category might be defined as “any discernible vocal output.” It would be unlikely, however, that this category would be mutually exclusive. For example, what if a monkey showed aggression, but while doing so was also vocalizing? Would this be scored as an instance of aggression or vocalization? Because these categories are not mutually exclusive, one would not know. When this occurs, at least one of the categories must be redefined. The listing of categories must also be exhaustive. Observers must form a category for every possible behavior the monkeys might show; also, an “other” category must be included.
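The overlap problem in the aggression and vocalization example can be sketched in Python. The sketch assumes, as a simplification, that each category definition is a set of observable features that must all be present; the feature names are hypothetical, and real definitions are rarely this tidy.

```python
# Hypothetical definitions: each category lists the observable features
# that must all be present for the category to be scored.
DEFINITIONS = {
    "aggression": {"grabs_fence", "shakes_fence", "eye_contact"},
    "vocalization": {"vocal_output"},
}


def matching_categories(event_features, definitions=DEFINITIONS):
    """Return every category whose required features are all present in the event."""
    return sorted(
        name for name, required in definitions.items() if required <= event_features
    )


def find_overlaps(events, definitions=DEFINITIONS):
    """Return (event index, categories) for events that fit more than one category."""
    overlaps = []
    for index, features in enumerate(events):
        matches = matching_categories(features, definitions)
        if len(matches) > 1:
            overlaps.append((index, matches))
    return overlaps


# Invented events; the last one is a monkey shaking the fence while vocalizing.
events = [
    {"vocal_output"},
    {"grabs_fence", "shakes_fence", "eye_contact"},
    {"grabs_fence", "shakes_fence", "eye_contact", "vocal_output"},
]
overlaps = find_overlaps(events)
print(overlaps)
```

Any event flagged this way shows that the definitions are not mutually exclusive, so at least one of them needs to be redefined.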

Once category definitions have been developed, it must be determined whether they are reliable, valid, mutually exclusive, and exhaustive. This can be determined by having two observers independently score monkey behavior using the taxonomy. If interrater agreement is high (above 85 percent agreement), the definitions can be considered reliable. If it is low, researchers will revise the necessary category definitions. These observers can also determine whether categories are mutually exclusive and exhaustive. The categories are mutually exclusive if the observers were never unsure which category to score. They are exhaustive if the observers did not need to score the “other” category. Finally, to establish the validity of category definitions, researchers could ask people familiar with monkey behavior to describe what they would expect to see within each of the categories. If their descriptions agree with the researchers’ definitions, there is some evidence that the taxonomy is valid.
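Interrater agreement of this kind can be computed as the proportion of observation intervals on which two observers scored the same category. The following Python sketch uses the 85 percent cutoff mentioned above; the observers’ scores are invented, and the function name is an assumption made for illustration.

```python
def percent_agreement(scores_a, scores_b):
    """Return the proportion of intervals on which two observers scored the same category."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both observers must score the same number of intervals.")
    if not scores_a:
        raise ValueError("At least one scored interval is required.")
    agreements = sum(a == b for a, b in zip(scores_a, scores_b))
    return agreements / len(scores_a)


# Invented scores for ten observation intervals.
observer_1 = ["rest", "groom", "feed", "rest", "other", "groom", "rest", "feed", "rest", "groom"]
observer_2 = ["rest", "groom", "feed", "rest", "groom", "groom", "rest", "feed", "rest", "groom"]

THRESHOLD = 0.85  # cutoff taken from the text above

agreement = percent_agreement(observer_1, observer_2)
print(f"Agreement: {agreement:.0%}")
print("Reliable" if agreement > THRESHOLD else "Revise category definitions")
```

Here the observers disagree on one interval out of ten, so the definitions would be treated as reliable under the stated cutoff.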

With the taxonomy developed, the researchers must decide how many observational periods to use and how long each period will be. In general, the more observational periods used, the more reliable the results. Twenty observational periods are adequate to produce reliable data in most cases. In deciding how long the observational period should be, the purpose of the study must be considered. If high-frequency behavior that falls into very specific categories is being observed, a short observational period should be used. For example, if eye blinks are being counted, the observational period should be no longer than two minutes; with longer periods, observers would get tired and make inaccurate observations. On the other hand, if low-frequency behavior that is scored in broader categories (for example, tool use) is being watched, longer observational periods should be used.

Finally, researchers must decide how behavior will be quantified. They can measure how long each category of behavior is seen, how often each category of behavior is seen, or both. If they are interested in how much of the monkeys’ time is spent engaging in each behavior, they will use the duration approach. If, on the other hand, researchers are more interested in determining the likelihood that a particular behavior will occur, they will use the frequency approach.

With an appropriately developed behavioral taxonomy, the behavior of the monkeys can be described accurately and objectively. Researchers can make statements about the likelihood of various behaviors, what the behaviors mean, and how much time the monkeys spend engaged in each type of behavior. From this information, they obtain an in-depth understanding of the monkeys. For example, through the use of behavioral taxonomies it is known that rhesus monkeys have a dominance hierarchy, are very social, can show tool use and other creative adaptations of behavior when necessary, and show rudimentary forms of communication.

Implications for Other Fields

Humans simply do not record events like video cameras. At the scientific level, much care has to be used to ensure that observations are accurate and objective. Understanding how human limitations affect observational capabilities has important implications beyond the field of psychology—for example, in law. Tremendous weight is placed on eyewitness testimony in a court of law. Even though eyewitness accounts are probably biased, distorted, and imperfect, the courts recognize them as the best evidence available. Because of what has been learned about the human capacity to make accurate and objective observations, people are well advised to evaluate eyewitness testimony very carefully.

