GSA Connects 2022 meeting in Denver, Colorado

Paper No. 59-7
Presentation Time: 2:00 PM-6:00 PM

GENERATING TEACHING DATASETS FOR MACHINE-LEARNING ALGORITHMS USING CITIZEN SCIENCE: HOW MANY USERS ARE NEEDED TO IDENTIFY BEDDING PLANES?


SPROSS, Erin, Earth and Environmental Sciences, Temple University, 1801 N Broad St, Philadelphia, PA 19122, DAVATZES, Alexandra, Department of Earth and Environmental Science, Temple University, Philadelphia, PA 19122 and SHIPLEY, Thomas, Department of Psychology, Temple University, 1701 North 13th Street, 6th Floor Weiss Hall, Philadelphia, PA 19122

Citizen science has many useful applications in the earth and social sciences. It can help scientists analyze large amounts of data quickly, and in doing so it also gives psychologists a way to examine thinking in large groups of non-experts. We employed citizen scientists to identify bedding planes in drone-captured images of outcrops, with the goal of building a dataset that could be used to train a machine-learning algorithm to identify bedding planes. Initial inspection of the data found that non-experts occasionally mistook linear features, such as erosional gullies, for bedding planes. To avoid including these errors in the training set, we used the citizen science website Zooniverse to collect annotations of rock outcrop images from multiple citizen scientists. Our aim was to identify the optimal number of citizen scientists who should annotate an image in order to maximize the number of accurately indicated bedding planes while minimizing the risk of users incorrectly tracing a feature that is not a bedding plane. We analyzed the results of our initial trial and found high agreement among users: numerous users correctly indicated bedding planes, with the same bedding plane often identified by more than one user. Three users per image were sufficient for multiple bedding planes to be identified by more than one user while no single erroneous feature was identified by more than one user. When more than six users annotated each image, the risk of two or more users making the same mistake increased, without an increase in the number of correct bedding planes identified. Although one could exclude data based on low agreement, these findings offer initial guidance on the minimum number of citizen scientists needed to efficiently develop an accurate dataset for training a machine-learning algorithm to identify bedding planes, which will aid earth scientists in collecting large numbers of annotations quickly.
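
The agreement-based exclusion mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes that traced lines have already been matched to shared feature labels, and the function name, the data layout, and the example labels are all hypothetical. A feature is kept only when at least `min_users` distinct users marked it.

```python
from collections import Counter


def consensus_features(annotations, min_users=2):
    """Return the feature labels marked by at least `min_users` distinct users.

    `annotations` maps a user ID to the set of feature labels that user traced
    on one image. This layout is an assumption for illustration; matching raw
    traced lines to a shared feature label is assumed to happen upstream.
    """
    counts = Counter()
    for features in annotations.values():
        # set() ensures each user is counted at most once per feature
        counts.update(set(features))
    return {feature for feature, n in counts.items() if n >= min_users}


# Made-up annotations for a single image with three users (not study data)
example = {
    "user_a": {"bed_1", "bed_2"},
    "user_b": {"bed_1", "gully_1"},
    "user_c": {"bed_1", "bed_2", "bed_3"},
}
kept = consensus_features(example)
```

In this made-up example, the gully traced by a single user falls below the two-user threshold and is dropped, as is a bedding plane traced by only one user. Raising `min_users` lowers the chance of keeping a shared mistake but discards more correct features.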