PREDICTING ARTICLE RELEVANCE FOR DISCIPLINARY GEOSCIENCE DATABASES USING MACHINE LEARNING
Data archiving is critical for the research process; many funding agencies require researchers to archive data. Without clear direction, repository choice is left to the researcher, potentially fragmenting the data landscape, with relevant data spread across both disciplinary and generalist data resources.
We present a machine learning pipeline (Database Article Relevance Tagging; DataART) that can be used by disciplinary resources to predict whether published articles are likely to contain relevant data. DataART allows resources to reach out to contributing authors to solicit data contributions post-publication. The framework consists of three elements:
- A PostgreSQL database to store information on publications, user tags, and model metadata
- A Node.js/Express API that connects to the database and supports secure cloud deployment
- A Python library (dataART) and Quarto notebook to demonstrate the system, using five classification models (logistic regression, decision trees, Bayesian approaches and ensemble models) to predict article relevance
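The classification step can be illustrated with a minimal sketch of a logistic regression relevance classifier. This is not the DataART implementation: the bag-of-words features, the from-scratch gradient updates, the function names (`train_logistic`, `predict_relevance`), and the six toy article titles are all assumptions made for illustration, using only the Python standard library.

```python
import math
from collections import Counter


def sigmoid(z):
    # Clip z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def tokenize(text):
    # Hypothetical, very simple tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_logistic(docs, labels, epochs=200, lr=0.5):
    """Fit bag-of-words logistic regression with stochastic gradient ascent."""
    weights = {w: 0.0 for d in docs for w in tokenize(d)}
    bias = 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            counts = Counter(tokenize(doc))
            z = bias + sum(weights[w] * c for w, c in counts.items())
            err = y - sigmoid(z)  # gradient of the log-likelihood w.r.t. z
            bias += lr * err
            for w, c in counts.items():
                weights[w] += lr * err * c
    return weights, bias


def predict_relevance(text, weights, bias):
    """Return the estimated probability that an article is relevant."""
    counts = Counter(w for w in tokenize(text) if w in weights)
    z = bias + sum(weights[w] * c for w, c in counts.items())
    return sigmoid(z)


# Toy, made-up titles (1 = relevant to a paleoecology database, 0 = not).
docs = [
    "pollen record from lake sediment core",
    "fossil pollen and charcoal in holocene sediment",
    "diatom assemblages from a lake core",
    "protein folding in yeast cells",
    "gene expression in mouse cells",
    "traffic model for urban road networks",
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(docs, labels)
p_relevant = predict_relevance("holocene pollen from a sediment core", weights, bias)
p_other = predict_relevance("gene expression in yeast", weights, bias)
```

In practice a library implementation with richer text features would likely be preferable; the sketch only shows the shape of the task, mapping article text to a relevance probability.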
Using the Neotoma Paleoecology Database as a test system, model performance is high (test accuracy = 0.92; precision = 0.89; recall = 0.87) with simple logistic regression on a dataset of 2600 labeled articles. We consider Type II errors (false negatives) to be the more costly errors for the model. Type I errors (false positives) are likely to arise from articles that may not be suitable for the database but would still be of interest to researchers. One significant outcome is the discovery of relevant articles outside the set of journals commonly represented in the Neotoma database, particularly articles in more "generalist" publications, such as PNAS or PLoS ONE, where lead authors may be more likely to submit data to a general data repository such as Dryad or PANGAEA.
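The relationship between the reported metrics and the two error types can be sketched as follows. The labels, scores, and thresholds below are hypothetical values chosen for illustration, not DataART results; the helper names `confusion_counts` and `classification_metrics` are likewise assumptions. The example shows how lowering the decision threshold reduces false negatives (raising recall) at the cost of more false positives (lowering precision).

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1  # Type I error (false positive)
        elif predicted == 0 and y == 1:
            fn += 1  # Type II error (false negative)
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels (1 = relevant) and model scores for eight articles.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.35, 0.6, 0.4, 0.32, 0.1]

default = classification_metrics(*confusion_counts(y_true, scores, threshold=0.5))
lenient = classification_metrics(*confusion_counts(y_true, scores, threshold=0.3))
```

With these made-up scores, the lower threshold recovers every relevant article while flagging more irrelevant ones, which matches the stated preference for tolerating Type I errors over Type II errors.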
DataART contributes to the continued development and sustainability of disciplinary data resources by giving both researchers and data managers in the geosciences an opportunity for data resource discovery, with more general applications in the long run.