Paper No. 9
Presentation Time: 4:15 PM


TCHENG, David K., Illinois Informatics Institute, National Center for Supercomputing Applications, University of Illinois, 1205 W. Clark St., Room 1008, Urbana, IL 61801, HASELHORST, Derek, School of Integrative Biology, University of Illinois at Urbana-Champaign, 505 S. Goodwin Avenue, Urbana, IL 61801 and PUNYASENA, Surangi W., Department of Plant Biology, University of Illinois, 505 S. Goodwin Ave., Urbana, IL 61801

We present an advanced human and machine learning system (ARLO) for solving tropical pollen classification problems. We present a case study in which our human expert (Haselhorst) can identify a large number of pollen classes (n = 119). For the machine learning process, we created 13,650 training examples, where each example is a 3-D image pixel matrix representing a z-stack containing the pollen grain. The number of examples per class is naturally skewed: the most frequently observed class contains 2,708 examples, while the least frequently observed classes (12 cases) contain only a single example each.

Our prediction system, ARLO, has the following components: (1) a high-throughput automated slide scanner, (2) a virtual microscope and pollen tagger, (3) a human expert, (4) a space of image feature extraction algorithms, (5) a space of supervised learning algorithms, (6) a bias optimizer for searching these spaces for optimal system configurations, and (7) a two-tiered cross-validation framework to quantify and control overfitting.

We demonstrate that as more bias optimization is performed, the divergence between the “estimated” and “true” accuracy widens. This overfitting makes comparisons between competing systems hazardous. We show that two-tiered cross-validation yields “corrected” accuracy estimates and a more robust basis for comparing competing algorithms.
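The effect described above can be illustrated with a minimal sketch. It is not ARLO's implementation, and every name in it (make_noise_data, select_config, two_tier_cv, and so on) is hypothetical. It assumes synthetic binary data with no real signal, and it treats each feature index as one candidate "system configuration", so that choosing the best-scoring feature stands in for bias optimization. The single-tier score of the selected configuration plays the role of the "estimated" accuracy, while the outer tier, which repeats the selection inside each training split and scores only on data never used for selection, plays the role of the "corrected" accuracy.

```python
import random


def make_noise_data(n_examples, n_features, seed):
    """Assumed toy data: binary features and labels with no real relationship."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_examples)]
    y = [rng.randint(0, 1) for _ in range(n_examples)]
    return X, y


def folds(indices, k):
    """Split a list of indices into k disjoint, roughly equal folds."""
    return [indices[i::k] for i in range(k)]


def fit(X, y, train, feature):
    """Fit a one-feature rule: map each feature value to its majority training label."""
    rule = {}
    for value in (0, 1):
        labels = [y[i] for i in train if X[i][feature] == value]
        rule[value] = int(2 * sum(labels) >= len(labels)) if labels else 0
    return rule


def accuracy(X, y, test, feature, rule):
    """Fraction of test examples the rule labels correctly."""
    return sum(rule[X[i][feature]] == y[i] for i in test) / len(test)


def cv_accuracy(X, y, indices, feature, k):
    """Single-tier k-fold cross-validation accuracy of one configuration."""
    scores = []
    for held_out in folds(indices, k):
        held = set(held_out)
        train = [i for i in indices if i not in held]
        rule = fit(X, y, train, feature)
        scores.append(accuracy(X, y, held_out, feature, rule))
    return sum(scores) / len(scores)


def select_config(X, y, indices, n_configs, k):
    """Stand-in for bias optimization: return (best CV accuracy, best configuration)."""
    return max((cv_accuracy(X, y, indices, f, k), f) for f in range(n_configs))


def two_tier_cv(X, y, n_configs, k_outer, k_inner):
    """Outer tier scores a configuration chosen using only the outer training split."""
    indices = list(range(len(y)))
    scores = []
    for held_out in folds(indices, k_outer):
        held = set(held_out)
        train = [i for i in indices if i not in held]
        _, best = select_config(X, y, train, n_configs, k_inner)  # inner tier
        rule = fit(X, y, train, best)
        scores.append(accuracy(X, y, held_out, best, rule))
    return sum(scores) / len(scores)


# Demo on pure-noise data, where the true accuracy of any rule is 0.5 by construction.
X, y = make_noise_data(n_examples=200, n_features=50, seed=0)
estimated, chosen = select_config(X, y, list(range(len(y))), n_configs=50, k=5)
corrected = two_tier_cv(X, y, n_configs=50, k_outer=5, k_inner=5)
print("single-tier estimate for selected configuration:", estimated)
print("two-tier corrected estimate:", corrected)
```

Because the single-tier estimate is the maximum over all candidate configurations, it can only stay the same or rise as more candidates are searched, even when no candidate carries real signal. The two-tier estimate avoids this because its held-out examples never take part in the selection step.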