Paper No. 2
Presentation Time: 8:45 AM
ASSESSING GEOSCIENCE STUDENTS' RESPONSES TO SHORT ESSAY QUESTIONS USING COMPUTER GRADING ALGORITHMS
STEER, David, Department of Geology and Environmental Science, The University of Akron, Akron, OH 44325-4101, steer@uakron.edu
This research tests the efficacy of using computer algorithms to grade short essay geoscience questions. Several automated essay-grading programs are freely available to instructors seeking a quick, efficient method to assess student learning using short essays. Speed and reproducibility are the primary advantages of computer grading of such questions. Preliminary results suggest that question design is a critical factor in conditioning the algorithms to sort answers effectively into mostly correct, partly correct, and mostly incorrect bins. Two conceptual approaches for designing questions and conditioning probabilities are recommended to minimize computer-binning errors. A multiple-path question design requires predictable but independent answer paths. Instructors develop the valued words and phrases likely to be found in each answer pathway. Because the various responses generate significantly different lexical constructs, the pathways remain distinct. Those so-called trins and proxies in the student responses are compared to the probabilities in the system and used to bin the answer accordingly (a sketch of this binning step follows below). A defined-content question requires a well-defined content scope for a complete answer. The database of accepted content used to train and test the system is decremented to form the various response levels.
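To make the binning step concrete, the following is a minimal sketch of how a multiple-path design might match valued phrases against a response and sort it into a bin. The pathway names, phrase weights, and score thresholds here are illustrative assumptions, not the conditioned probabilities used in the study.

    # Hypothetical valued phrases for two answer pathways, weighted by
    # how strongly each phrase indicates a complete answer.
    PATHWAY_PHRASES = {
        "relative_dating": {"superposition": 3, "cross-cutting": 3, "older than": 1},
        "absolute_dating": {"radiometric": 3, "half-life": 3, "isotope": 2},
    }

    def score_response(text):
        """Sum the weights of valued phrases found in the response,
        keeping the best-matching pathway's score."""
        text = text.lower()
        return max(
            sum(w for phrase, w in phrases.items() if phrase in text)
            for phrases in PATHWAY_PHRASES.values()
        )

    def bin_response(text, hi=5, lo=2):
        """Sort a response into mostly correct / partly correct /
        mostly incorrect bins using assumed score thresholds."""
        score = score_response(text)
        if score >= hi:
            return "mostly correct"
        if score >= lo:
            return "partly correct"
        return "mostly incorrect"

    print(bin_response("Radiometric dating uses isotope half-life measurements."))
    # -> "mostly correct" under these assumed weights

Because each pathway is scored independently, a response that commits fully to one lexical construct is not penalized for omitting phrases from the other pathways, which is why the design depends on the pathways remaining distinct.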
Data from upper-level students answering an open-ended geologic time question were used to explore the multiple-path question-grading approach. A database was constructed that included 55 predicted high-level, 125 moderate-level, and 73 low-level trins and proxies. That database was used to generate 500 synthetic student responses at each level. High-level responses comprised ten randomly selected high-level word phrases, moderate-level responses were constructed of five middle-level trins, and mostly incorrect responses were generated using three low-level phrases (a sketch of this construction follows below). Eleven actual student essay responses were tested using the conditioned algorithm. Nine responses were correctly binned, one response was binned too low, and one was binned too high. Further testing with much larger data sets is required before drawing definitive conclusions about this assessment grading technique. Since computers cannot effectively evaluate meaning, the binned responses must be reviewed by the instructor.
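The synthetic-response construction described above can be sketched as follows; the placeholder phrase lists stand in for the study's actual 55/125/73 trins and proxies, which are not reproduced in the abstract.

    import random

    # Stand-in phrase databases matching the reported counts.
    high_trins = [f"high_phrase_{i}" for i in range(55)]
    mid_trins  = [f"mid_phrase_{i}" for i in range(125)]
    low_trins  = [f"low_phrase_{i}" for i in range(73)]

    def synthetic_responses(phrases, n_phrases, n_responses=500):
        """Build synthetic student responses by randomly sampling
        n_phrases entries from one level's phrase database."""
        return [
            " ".join(random.sample(phrases, n_phrases))
            for _ in range(n_responses)
        ]

    high_set = synthetic_responses(high_trins, 10)  # 500 high-level responses
    mid_set  = synthetic_responses(mid_trins, 5)    # 500 moderate-level responses
    low_set  = synthetic_responses(low_trins, 3)    # 500 mostly incorrect responses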