STREAMLINING FOSSIL INVENTORY DATA AND ANALYSIS ACROSS U.S. NATIONAL PARKS
To address this, I compiled these tables into a master spreadsheet, then created a word bank of over 2,600 total keywords, including 800 taxonomic terms, 200 geological time intervals, and 800 formations. Using the statistical software R, I mined these notes to generate individual fossil occurrence records and enrich them with beta taxonomic classifications, geologic ages, and specimen metadata (e.g., anatomical elements, type status, preservation).
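The keyword-mining step can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code used in this project. The word bank entries, the `mine_note` function, and the park name are hypothetical stand-ins. The sketch assumes each taxon keyword found in a note yields one occurrence record, with any matched age, formation, or anatomical element attached to it.

```python
import re

# Hypothetical word bank: lowercase keyword -> (category, standardized value).
# The real word bank described above holds over 2,600 keywords.
WORD_BANK = {
    "tyrannosaurus": ("taxon", "Tyrannosaurus"),
    "ammonite": ("taxon", "Ammonoidea"),
    "late cretaceous": ("age", "Late Cretaceous"),
    "hell creek": ("formation", "Hell Creek Formation"),
    "tooth": ("element", "tooth"),
}


def mine_note(park, note):
    """Return one occurrence record per taxon keyword found in a free-text note."""
    text = note.lower()
    hits = {}
    for keyword, (category, value) in WORD_BANK.items():
        # Whole-word match so that short keywords do not fire inside longer words.
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            hits.setdefault(category, []).append(value)

    # Non-taxon matches (age, formation, element) are shared by every record
    # generated from this note.
    shared = {cat: "; ".join(vals) for cat, vals in hits.items() if cat != "taxon"}
    return [{"park": park, "taxon": taxon, **shared} for taxon in hits.get("taxon", [])]


records = mine_note(
    "Example Park",
    "Tyrannosaurus tooth from the Hell Creek Fm., Late Cretaceous",
)
```

A note with no taxon keyword produces no records in this sketch; a production version would likely flag such notes for manual review instead of dropping them.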
To promote standardization of future data, I built a tool that augments and structures manual data entry while maintaining a change log for transparency. Because R lacks the referential integrity of a relational database, I also created a global update tool that propagates changes to indexed values across the dataset.
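A global update with a change log might look like the following minimal sketch, again in Python rather than the project's R. The `global_update` function, the record fields, and the formation names are hypothetical. The sketch assumes records are held as a list of dictionaries and that every changed row is logged with a timestamp.

```python
from datetime import datetime, timezone


def global_update(records, field, old_value, new_value, change_log):
    """Replace old_value with new_value in `field` across all records.

    Each modified row is appended to change_log so the edit can be audited.
    Returns the number of records changed.
    """
    changed = 0
    for index, record in enumerate(records):
        if record.get(field) == old_value:
            record[field] = new_value
            change_log.append({
                "row": index,
                "field": field,
                "old": old_value,
                "new": new_value,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
            changed += 1
    return changed


# Hypothetical data: a formation is renamed, and every record citing it is updated.
dataset = [
    {"taxon": "Taxon A", "formation": "Old Formation Name"},
    {"taxon": "Taxon B", "formation": "Other Formation"},
    {"taxon": "Taxon C", "formation": "Old Formation Name"},
]
log = []
n_changed = global_update(dataset, "formation", "Old Formation Name", "New Formation Name", log)
```

Logging the old value alongside the new one keeps each change reversible, which is the property a relational database would otherwise provide through its own transaction history.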
For external-facing applications, I developed an online query tool that generates customized reports with fossil statistics, taxonomic breakdowns, and diversity curves for NPS administrators. Additionally, this dataset, in conjunction with data from the Bureau of Land Management, was used to train a machine learning algorithm that rates formations on the Potential Fossil Yield Classification (PFYC) Index, aiding paleo-mitigation planning and the identification of under-prospected formations.
Future goals include leveraging large language models such as ChatGPT to automate the extraction of fossil occurrence records from publications, and integrating data from accessioned specimens to further expand this vital paleontological database.