As an attorney who defends academic integrity disciplinary actions at state universities, I am often asked to give an opinion on charges brought against students for cheating based on similarities in their exam answers. For example, consider the following hypothetical:
A professor suspects two students of cheating and studies their exam answers for similarities. The professor notes that both students got the same questions wrong. For example, on a multiple-choice exam on American presidents, they both incorrectly answered 1) that Theodore Roosevelt was the longest-serving president, 2) that George W. Bush initiated the invasion of Panama, and 3) that James Madison presided over the Civil War. The two students each got all other test questions correct.
To some professors, that would seem to be hard evidence of cheating. What are the chances of two unrelated students each getting the same answers wrong in the same way, and all other questions correct? Not so fast, say statistical experts. As one statistician has noted:
It is our position, echoed by courts and statisticians alike, that at no time can one accept probabilistic evidence as sufficient merely because the occurrence of some value of a test statistic is highly improbable. Reasonable competing explanations must be considered. The limitations of the mechanistic detection strategies, and the inherent variability in test design and administration reliability and validity found in all except the most rigorous of standardized tests and testing situations, preclude an automatic acceptance of probability data as prima facie demonstration of misconduct.
See Dwyer, David J. & Hecht, Jeffrey B., “Cheating Detection: Statistical, Legal, and Policy Implications” (1994), available at: http://files.eric.ed.gov/fulltext/ED382066.pdf. The authors go on to explain:
Finally, we must answer the question “is the sample of students being compared merely random or is it representative of the class as a whole?” If the class is comprised of distinct subgroups (by achievement, ethnicity, gender, etc.), then the sample from which we draw an inference must be representative of the subgroup(s) as well. It is our opinion that no mechanistic detection method currently available sufficiently addresses these concerns to an adequate degree, casting doubt as to the utility of mechanistic methods to detect wrongdoing with a known and consistent degree of accuracy.
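Before turning to the case law, it may help to see the naive calculation the hypothetical professor is implicitly making. The sketch below is purely illustrative; every number in it is an assumption of mine, not drawn from any case or from the article cited above:

```python
# A naive "independence" calculation of the kind the hypothetical professor
# might make. Every number here is an assumption chosen for illustration.

# Suppose each multiple-choice question has 4 options (1 correct, 3 wrong).
# If two students miss a question independently and pick a wrong option at
# random, the chance they pick the *same* wrong option is 1 in 3.
p_same_wrong_answer = 1 / 3

# Chance of matching on all 3 missed questions, *if* the errors were
# truly independent:
p_all_three = p_same_wrong_answer ** 3

print(f"Naive probability of 3 identical wrong answers: {p_all_three:.4f}")
# Roughly 1 in 27 -- small enough to look damning. But the figure rests
# entirely on the assumption that the students' errors are independent,
# which is exactly what a shared, mistaken understanding of the material
# would violate.
```

The number itself is not the problem; the independence assumption behind it is, which is the point the quoted passage is making.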
The problem with using statistical error analysis in academic integrity cases is that students often prepare for a test together in a study group. Studying together can give students a similar understanding of the material, including any shared errors or misconceptions, and that shared understanding can produce similar incorrect answers on an exam. There is legal precedent supporting this idea. One such case is Boehm v. Univ. of Pennsylvania Sch. of Veterinary Med., 392 Pa. Super. 502, 573 A.2d 575 (1990). In that case, a veterinary student at the University of Pennsylvania was accused of cheating and was ultimately found to have committed the offense. Yet even though the student was found guilty, the university’s panel disregarded the purported statistical analysis, ruling:
While the information raises suspicion as to the cheating charges, the panel considers the comparison [of the test answers] to be unreliable due to its lack of statistical foundation particularly since the influence on test scores of studying together is unknown. Accordingly this information was not considered in the panel’s deliberations.
Id. at 516, 573 A.2d at 582 (emphasis added). Similar statistical reasoning was also rejected in Papelino v. Albany Coll. of Pharmacy of Union Univ., 633 F.3d 81 (2d Cir. 2011). In that case, three pharmacy students were accused of cheating on a test. The court described the circumstances as follows:
In support of the charges, Nowak [a teacher] presented evidence, which consisted primarily of “statistical” charts that she had prepared based on her review of exams taken by Papelino, Basile, and Yu in various courses. Papelino, Basile, and Yu countered with (1) the lack of evidence of the means by which the three might have managed to cheat; (2) the fact that the three studied together, and therefore had similar knowledge bases; and (3) the lack of validity of the “statistical” evidence.
Papelino, 633 F.3d at 86-87. The opinion goes on to explain that the New York Supreme Court rejected such evidence, and “…concluded that the Honor Code Committee’s determinations were based ‘solely’ on a ‘statistical compilation’ that was based upon ‘false assumptions’ and did not provide ‘a rational basis to conclude that petitioners cheated.’” Id. at 87.
While statistical similarities can be a starting point for an academic integrity investigation, they should not be treated as conclusive evidence of cheating on their own.
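The study-group effect the Boehm panel alluded to can be made concrete with a small simulation. All parameters here are assumptions for illustration only: it supposes two students absorbed the same three wrong answers from a shared study session, each retaining a given misconception with some probability, and then counts how often their wrong answers match exactly.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

N_TRIALS = 10_000
P_HOLD = 0.9  # assumed chance a group member retained each shared misconception

# Assume the study group settled on the same wrong answer to 3 questions.
GROUP_WRONG_ANSWERS = ("B", "D", "A")

def student_answers():
    """One student's answers to the three questions the group got wrong."""
    return tuple(
        wrong if random.random() < P_HOLD else "correct"
        for wrong in GROUP_WRONG_ANSWERS
    )

matches = 0
for _ in range(N_TRIALS):
    a, b = student_answers(), student_answers()
    # Count trials where both students give identical, all-wrong answers.
    if a == b and "correct" not in a:
        matches += 1

print(f"Both students give identical wrong answers: {matches / N_TRIALS:.1%}")
# Under these assumed parameters this happens in roughly half of all trials,
# far more often than an independence-based calculation would predict.
```

Correlated errors, in other words, can make identical wrong answers the expected outcome rather than a suspicious one, which is precisely why the courts above demanded that the study-group explanation be ruled out before statistical similarity could count as proof.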