No matter what your eventual career, you will often be asked, at least implicitly, to conduct an experiment. For example, an employer might ask, "I wonder if our product would sell better in a red instead of a blue package?" or "Do you think children learn better with computer-aided instruction?" No matter what statistics you use or how pretty the colors of your graphs, if the fundamental logic of the experiment is flawed, you cannot draw any reasonable conclusions. These notes contain a series of experiment "briefs." There is a fundamental design problem in each of the briefs. The design problem lies in the discrepancy between what the experimenter did in the experiment and the conclusion that he or she reached on the basis of the results. The purpose is to expose you to a series of "classic" flaws in the logic of experiment design and interpretation, and to have you discover some general solutions.

In criticizing the design of the experiment briefs, you should only make use of the information given in the brief. Do not criticize the design by inferring something that is not given. For example, if the experimenter used a pencil-and-paper anxiety test, you should assume that the test is valid and reliable unless information to the contrary is given. There is usually one major defect in each brief and you should concentrate your criticism on this major problem. Be specific as to the defect. For example, do not just say that the experimenter should have used a control group but point out exactly how this control group would be treated.

The following example illustrates how the briefs should be criticized:

A certain investigator hypothesized that the hippocampus (a part of the brain) is related to complex "thinking" processes but not to simple "thinking" processes. He removed the hippocampus from 20 rats. He had ten of the rats learn a very simple maze and had ten of the rats learn a very difficult and complex maze. The first group learned to run the maze without error. Based on these results, he felt his hypothesis had been confirmed: rats without a hippocampus have more trouble learning a complex task than they do learning a simple task.

In criticizing this design, it appears reasonable to assume that any rat would take more trials to learn a complex maze than it would to learn a simple maze. Thus the results found by the experimenter may have nothing to do with the removal of the hippocampus; rats with the hippocampus intact might show the same results. This criticism suggests that, in redesigning the experiment, a 2 x 2 factorial design should be used. One factor is hippocampus intact or hippocampus removed. The second factor is simple or complex maze. This is diagrammed below:

                       Simple maze    Complex maze
Hippocampus intact     10 rats        10 rats
Hippocampus removed    10 rats        10 rats

We would need a total of 40 rats with 10 rats being assigned to each of the four treatments. This revised design would allow for a more reasonable test of the experiment's hypothesis than the original design.
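
The assignment of 40 rats to the four treatments can be sketched in a few lines of Python. This is only an illustration, not part of the original example: the function name `assign_factorial`, the rat ID numbers 1 to 40, and the fixed seed are all hypothetical, and the use of random assignment is an assumption (the example above says only that 10 rats are assigned to each treatment).

```python
import random
from itertools import product


def assign_factorial(subject_ids, factor_a, factor_b, seed=None):
    """Randomly split subjects into equal groups, one per cell of a two-factor design."""
    # Every combination of one level from each factor is one treatment cell.
    cells = list(product(factor_a, factor_b))
    if len(subject_ids) % len(cells) != 0:
        raise ValueError("subjects must divide evenly among the cells")

    # Shuffle a copy so the caller's list is left untouched.
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)

    # Deal the shuffled subjects out in equal-sized consecutive blocks.
    per_cell = len(shuffled) // len(cells)
    return {
        cell: shuffled[i * per_cell:(i + 1) * per_cell]
        for i, cell in enumerate(cells)
    }


# Hypothetical rat IDs 1..40; the seed is fixed only to make the run repeatable.
rats = list(range(1, 41))
design = assign_factorial(
    rats,
    factor_a=["hippocampus intact", "hippocampus removed"],
    factor_b=["simple maze", "complex maze"],
    seed=0,
)

for cell, members in design.items():
    print(cell, len(members), sorted(members))
```

Each of the four cells receives 10 rats, and every rat appears in exactly one cell. Assigning at random, rather than by convenience, keeps pre-existing differences among the rats from lining up with either factor.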



1. An investigator attempted to ascertain the effects of hunger on aggression in cats. She took 10 cats, kept them in individual cages, and put them on a food deprivation schedule such that at the end of two weeks the cats weighed 90 percent of their normal body weight. (Note the operational definition of hunger here.) She then put the cats together in pairs for 15 minutes and watched to see if aggression or fighting would occur. In all cases the cats showed the threat posture, and in most cases fighting occurred. The experimenter concluded that hunger increases aggression in cats.

2. An experimenter wished to examine the effects of massed versus distributed practice on the learning of nonsense syllables. He used three treatment groups of subjects. Group I practiced a list of 20 nonsense syllables for 30 minutes on one day. Group II practiced the same list for 30 minutes per day for two successive days. Group III practiced the same list for 30 minutes per day for three successive days. The experimenter assessed each group's performance with a free recall test after each group had completed its designated number of sessions. The mean recall of the 20 syllables for Group I was 5.2; for Group II, 10.0; and for Group III, 14.6. These means were significantly different from one another, and the experimenter concluded that distributed practice is much superior to massed practice.

3. An investigation was undertaken to explore the hypothesis that females tend to react more to emotion whereas males tend to react more to rationality. The procedure used to test the hypothesis was an attitude change situation. The experimenter presented an emotional communication on a specific social issue (for example, capital punishment) to 50 females and 50 males. Results indicated that females changed their attitudes (in the direction advocated in the communication) to a significantly greater degree than did the males. The experimenter concluded that her hypothesis was supported.

4. An experiment was designed to examine the relationship between drive and performance on a complex discrimination task. Drive level was measured by the Taylor Manifest Anxiety Scale. The top 15 percent of 300 subjects who took the test were designated as the High Drive group and the bottom 15 percent of the subjects were designated as the Low Drive group. These two groups of subjects performed the discrimination task and, much to the experimenter's surprise, there was no significant difference in the performance of the two groups. The experimenter concluded that drive level does not affect performance on a complex discrimination task.

5. An investigator with a large grant from a professional sports team set out to test the hypothesis that fear of punishment for poor performance has a detrimental rather than a facilitative effect on motor learning. As a measure of performance, the experimenter used a Pursuit Rotor Test in which the subject's task was to keep the mouse cursor on a moving target on the computer screen. The speed of the target was varied: in some cases the task was quite easy and in some cases the task was quite difficult. The experimenter manipulated fear by threatening the subjects with electric shock if they performed poorly on the task. He strapped an electric shock apparatus to the leg of each subject before the subject performed the task; however, he never shocked the subjects regardless of performance, and he called this his Mild Fear Condition. A second group of subjects was threatened with 100 volts of electricity and he called this treatment the High Fear Condition. Contrary to his hypothesis, the High Fear subjects did not perform any worse than the Low Fear subjects; in fact, the means for both groups were approximately the same. Based on these results, the experimenter concluded that fear of punishment has little, if any, effect on motor performance.

6. A certain psychologist was looking for the cause of college failure. She took a group of former students who had flunked out and a group that had received good grades. She gave both groups a self-esteem test and found that the group that failed scored lower on the test than the college success group. She concluded that a low self-esteem person probably expects to fail and exhibits defeatist behavior in college, which eventually leads to his or her failure.

7. An experimenter took 20 subjects who said that they believed in astrology and gave them their horoscopes for the previous day and asked them how accurate the horoscope was in predicting the previous day's occurrences. The subjects indicated their opinion on a six-position scale that ranged from extremely accurate to extremely inaccurate. All 20 subjects reported their horoscopes as being accurate to some degree, and none reported his horoscope to be inaccurate. The experimenter concluded that horoscopes are accurate.

8. A 2 x 3 factorial design was used to evaluate the effect of dosage level of an experimental drug (Remoh) on the treatment of schizophrenia. Two patient classifications were used: (a) new admissions to a particular mental hospital, and (b) patients who had been institutionalized for at least two years at that hospital. Patients received one of three levels of dosage: 3 grams per day, 6 grams per day, or 9 grams per day. There were 20 patients in each of the six groups. In addition to administering the drug, the experimenters also rated each patient each week as to the presence or absence of schizophrenic symptoms. After two months, it was found that very few (5 to 10 percent) of the long-term patients had improved regardless of dosage level. It was also found that approximately 50 percent of newly arrived patients had improved in each of the three dosage level groups. The researchers concluded that (a) Remoh is effective only for new arrivals and not for chronic cases, and (b) a dosage of 3 grams per day is sufficient to maximize the effectiveness of this drug.

9. A teacher of statistics wanted to compare two methods of teaching introductory statistics. One method, the Theory Method, relied heavily on the teaching of the theory behind statistics. The other method was labeled the Cookbook Method because it consisted of teaching the student various statistical tests and informing him as to when to use each test. This researcher found that a leading engineering school was using the Theory Method in all its introductory statistics classes, and that a state teachers college was using the Cookbook Method in all of its classes. At the end of each semester he administered a standardized test on the applications of statistics to the statistics classes of both schools. The results of this testing indicated that the classes that received the Theory Method were far superior to the classes that received the Cookbook Method. The researcher concluded that the Theory Method was the superior method and should be adopted by teachers of statistics.

10. In an effort to determine the effects of the drug chlorpromazine on the performance of schizophrenics, two clinical investigators randomly selected 20 acute schizophrenics from a mental hospital population. The task used was one that tested the executive and sequencing functions of the frontal lobes. Basically, several stimuli had to be put in order along a dimension; e.g., eight stimuli had to be ordered by weight. There were several tasks of this sort. The investigators used a within-subject design in which all subjects first performed the tasks after being injected with a saline solution (placebo) and then performed the tasks again (several hours later) after having been injected with chlorpromazine. Results indicated that fewer errors were made in the chlorpromazine treatment, which suggested to the investigators that this drug facilitates more adequate cognitive functioning in this type of patient.

11. It was hypothesized that sensory deprivation inhibits the intellectual development of animals. The experimenter got the ethics committee to approve the research because the Risks/Benefits ratio was high considering the number of children growing up in deprived environments. To test this hypothesis an experimenter used two rats each of whom had just given birth to eight pups. One rat and her litter were placed in a large cage with ample space and with objects to explore. The pups of the second rat were separated from the mother and each was placed in a separate cage. These cages were quite small and the only objects they could see (or hear) were the four walls and an automatic food dispenser. After five months, both treatment groups of rats were tested in a multiple-T maze using food as a reward. After 20 trials all of the non-deprived pups were running the maze without error. On the other hand, the deprived pups were still making several errors in the maze. This latter group of rats frequently "froze" in the start box and in the maze, and had to be prodded to move. The experimenter concluded that sensory deprivation inhibits intellectual development such that deprived rats did not have the intellectual ability to learn even a simple maze.

12. During preparations for the "Gulf War" a military psychologist attempted to examine the hypothesis that punishment is more effective than reward for training pilots from the various nations to recognize friendly aircraft. The problem was the correct identification of enemy and of friendly airplanes. In his experimental situation, he had his subjects sit in front of what looked to be a cockpit display. Silhouettes of enemy and of friendly airplanes were flashed on the screen in very short exposures (one second). Each subject participated in the experiment for two hours on five successive days. On the first day, as each silhouette was flashed on the screen, the subject pressed either the "Friendly" or the "Enemy" button and then was told by the experimenter if he had been right or wrong in his identification. Starting on the second day, subjects were randomly assigned to one of two groups. The procedure was similar to the first day except that in Group A the subjects were not punished for a wrong identification. In Group B, subjects received an electric shock after every wrong identification. This same procedure was continued for days three and four. The fifth day was considered the "test" day, and the subjects followed the same procedure except that neither reward, nor punishment, nor information from the experimenter was given to the subject. The number of correct identifications for 100 silhouettes presented was considered a test of the effectiveness of each training method. As expected, there was some loss of subjects over the five-day period; about 5 percent of the Group A subjects and about 35 percent of the Group B subjects had dropped out of the experiment by the fifth day. Results indicated that on the 100 test trials given on the fifth day, the mean number of correct identifications for Group A was 80 and the mean for Group B was 92. The experimenter concluded that his hypothesis had been confirmed and suggested that all training programs be based on punishment.

13. A government official who ran a "Head Start" type program was concerned that cutbacks in social spending might cancel his program, in which area youths could volunteer for leadership training. He wanted some evidence to prove that his program was valuable in training future leaders. He went back to the records and got the names of those boys who were active members in the program 20 years ago. He also took school records and got the names of boys of the same academic ability who were not volunteers in the program. He compared the two groups as to their present occupations, salaries, etc., and observed that the Head Start group was doing significantly better. He concluded that this result was due to the influence of his program.

14. A psychologist was interested in developing a test which would predict the success of prospective lawyers. She selected a random sample of lawyers listed in Who's Who, under the assumption that they would be "successful" lawyers. She then contacted them by means of a mail questionnaire which contained several hundred questions. The results were analyzed and a profile of successful lawyers was compiled. The questionnaire was given to a group of prospective law students, and those students whose scores were significantly divergent from the successful lawyer group were advised not to pursue a law career.

15. A social psychologist had a theory that as members of a group get to know each other better, the productivity of the group will increase up to a point and then will start to decrease slightly. The decrease ("the honeymoon is over" effect) is a point at which group members stop acting in a highly cooperative manner and start jostling for power, etc. To test this theory she formed groups of individuals who were strangers and had them work on a series of tasks. There were five tasks, each taking 35 minutes to work, and she gave the groups a five-minute break between tasks. Her results indicated that group productivity increased with the number of tasks up to the fifth task, and for the fifth task there was a significant decrease in group performance. On the basis of the evidence, she considered her theory supported.

16. A clinical researcher examined whether interviews with patients or objective tests are better in the diagnosis of the patient's problems and outcomes. The experiment took place at a large mental hospital. In one group, 10 clinical psychology students each interviewed six new patients (during their internship at the hospital). Each interview was one to two hours long. Another group of 60 patients was given a battery of standardized psychological tests (e.g., MMPI) and the test results were interpreted by three clinical psychologists who had several years of experience in interpreting tests for the hospital. Each psychologist interpreted the test results for 20 patients. Both groups were asked to list the patient's major problems and to assign the patient to a diagnostic category (e.g., schizo-affective disorder, psychotic mood disorder, etc.). They were also asked to predict how long the patient would be in the hospital before he would improve to the extent that he could be released. Results indicated that the interviews were 67 percent accurate in predicting diagnostic categories and 22 percent accurate in predicting length of stay. The "test" group was about 83 percent accurate in predicting diagnostic categories and 65 percent accurate in predicting length of stay. The experimenter concluded that interviews are of questionable value in either diagnosis or prediction of outcome and should be discontinued.

17. An experimenter wanted to test the hypothesis that males are more creative than females. He also hypothesized that the male superiority in creativity would be heightened under ego-involving conditions. The design used was a 2 x 2 factorial design in which one variable was sex and the other variable was high or low ego-involvement. He manipulated ego-involvement by telling half of the subjects that this task was a measure of how intelligent they are and that he would post their scores on a bulletin board (high ego-involvement). He told the other half of the subjects that he was developing the task and wanted to check its reliability, and further told them not to put their names on the answer sheets (low ego-involvement). His test of creativity was an "Unusual Uses" test in which a person is given the name of an object (e.g., hammer) and he or she has to write down as many different unusual uses for that object as possible in five minutes. Twenty-five males and 25 females participated in each of the two ego-involvement conditions. The males were members of a senior ROTC class and the females were obtained from sorority pledge classes. Two objects were used: (1) an army compass, and (2) a monkey wrench. Subjects were given five minutes for each object. The number of unusual uses generated by males was 4.1 under low ego-involvement and 7.6 under high ego-involvement. Females generated 3.1 uses under low ego-involvement and 2.4 under high ego-involvement. Since both the main effects and the interaction effect were statistically significant using analysis of variance, the experimenter concluded that his hypotheses were supported.