2007 GSA Denver Annual Meeting (28–31 October 2007)

Paper No. 11
Presentation Time: 8:00 AM-12:00 PM

TWO-WAY CLUSTERING OF TAXON ABUNDANCE DATA: SEARCHING FOR ENVIRONMENTAL AFFINITIES USING LATENT SEMANTIC ANALYSIS


HANDLEY, John C., Paleontological Research Institution, 1259 Trumansburg Road, Ithaca, NY 14850 and IVANY, Linda C., Department of Earth Sciences, Syracuse University, Syracuse, NY 13244, jhandley@rochester.rr.com

A typical paleoecological analysis involves taxon counts from a number of bulk samples, often arrayed across some presumed paleoenvironmental gradient. The objective is usually to use taxon distributions to elucidate the underlying environmental gradient by identifying groups of taxa that co-occur consistently and that characterize particular sets of samples, which may also share lithologic similarities. A standard approach to such data computes pairwise distance or similarity measures (e.g., similarity coefficients) and uses them in cluster analysis to group samples by taxon composition (Q-mode) and to group taxa by their occurrence in samples (R-mode). The two sets of clusters are then reconciled to display sets of similar samples together with the taxa that dominate them, in the hope that the sample-taxon associations offer insight into how biofacies are arrayed across ancient environments.
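As an illustration of the standard distance-based workflow described above, the following is a minimal sketch in plain Python. The toy abundance matrix, the choice of Bray-Curtis dissimilarity, the average-linkage rule, and the function names are all assumptions made for the example; the abstract does not specify which coefficient or clustering algorithm is used.

```python
from itertools import combinations

# Hypothetical toy abundance matrix: rows are samples, columns are taxa.
counts = [
    [30, 25, 0, 1],   # sample 0
    [28, 20, 2, 0],   # sample 1
    [1, 0, 40, 35],   # sample 2
    [0, 2, 33, 30],   # sample 3
]

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def average_linkage(rows, k):
    """Agglomerative clustering: merge the closest pair of clusters
    (mean pairwise dissimilarity) until only k clusters remain."""
    clusters = [[i] for i in range(len(rows))]

    def dist(c1, c2):
        total = sum(bray_curtis(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations() yields index pairs with a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Q-mode: group samples (rows) by their taxon composition.
q_mode = average_linkage(counts, 2)

# R-mode: group taxa (columns) by their occurrence across samples.
taxa_rows = [list(col) for col in zip(*counts)]
r_mode = average_linkage(taxa_rows, 2)

print("Q-mode sample clusters:", q_mode)
print("R-mode taxon clusters:", r_mode)
```

Note that the two clusterings are computed independently and must be reconciled afterwards, and that every sample and taxon is forced into exactly one cluster; both results also depend on the dissimilarity coefficient and linkage rule chosen.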

A new class of two-way clustering methods has been developed that holds great potential for paleoecological analysis of this sort. ‘Latent semantic analysis’, while initially developed for information retrieval, can be adapted to questions of taxon/sample associations as described above and may offer a more natural method for analyzing count data. These methods are based on probabilistic models rather than vector space representations, and are not subject to the limitations of distance-based metrics (e.g., choice of similarity coefficient, clustering algorithm, forced assignment to clusters). We explore the use of one of these models and show some encouraging results in analyzing published data sets. The key notions in this approach are that probabilities are estimated to assign both taxa and samples to ‘latent classes’ that emerge from the analysis, and that these latent classes correspond to meaningful associations of taxa and environment. Although the methods are computationally intensive, routines are available for the R statistical package already used by many paleontologists. We believe this probability-based approach will offer a valuable complement to the more traditional, distance-based clustering.
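To make the idea of latent classes concrete, here is a minimal sketch of one probabilistic model of this family, written in plain Python. It assumes an aspect model of the form P(s, t) = sum over z of P(z) P(s|z) P(t|z), fitted by expectation-maximization; the abstract does not state which model the authors used, so this form, the toy data, and all function names are assumptions for illustration only.

```python
import random

# Hypothetical toy abundance matrix: rows are samples, columns are taxa.
counts = [
    [30, 25, 0, 1],
    [28, 20, 2, 0],
    [1, 0, 40, 35],
    [0, 2, 33, 30],
]

def normalize(values, eps=1e-12):
    """Scale non-negative values to sum to 1; eps keeps every entry positive."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]

def fit_aspect_model(counts, n_classes, n_iter=200, seed=0):
    """EM for P(s, t) = sum_z P(z) P(s|z) P(t|z) on a samples-by-taxa table."""
    rng = random.Random(seed)
    n_s, n_t = len(counts), len(counts[0])
    p_z = [1.0 / n_classes] * n_classes
    p_s = [normalize([rng.random() + 0.1 for _ in range(n_s)])
           for _ in range(n_classes)]
    p_t = [normalize([rng.random() + 0.1 for _ in range(n_t)])
           for _ in range(n_classes)]
    for _ in range(n_iter):
        acc_z = [0.0] * n_classes
        acc_s = [[0.0] * n_s for _ in range(n_classes)]
        acc_t = [[0.0] * n_t for _ in range(n_classes)]
        for s in range(n_s):
            for t in range(n_t):
                n = counts[s][t]
                if n == 0:
                    continue
                # E-step: posterior over latent classes for this cell.
                joint = [p_z[z] * p_s[z][s] * p_t[z][t] for z in range(n_classes)]
                total = sum(joint)
                for z in range(n_classes):
                    weight = n * joint[z] / total
                    acc_z[z] += weight
                    acc_s[z][s] += weight
                    acc_t[z][t] += weight
        # M-step: re-estimate parameters from the expected counts.
        p_z = normalize(acc_z)
        p_s = [normalize(row) for row in acc_s]
        p_t = [normalize(row) for row in acc_t]
    return p_z, p_s, p_t

def class_posteriors(p_z, p_given_z):
    """P(z | item) for each item, proportional to P(z) * P(item | z)."""
    n_items = len(p_given_z[0])
    return [normalize([p_z[z] * p_given_z[z][i] for z in range(len(p_z))])
            for i in range(n_items)]

p_z, p_s, p_t = fit_aspect_model(counts, n_classes=2)
sample_post = class_posteriors(p_z, p_s)   # P(class | sample)
taxon_post = class_posteriors(p_z, p_t)    # P(class | taxon)

# Most probable latent class for each sample and each taxon.
sample_class = [row.index(max(row)) for row in sample_post]
taxon_class = [row.index(max(row)) for row in taxon_post]
print("sample classes:", sample_class)
print("taxon classes:", taxon_class)
```

In this sketch samples and taxa are assigned to the same set of latent classes in a single fit, and each assignment is a probability rather than a forced membership, which is the contrast with distance-based clustering drawn above. The number of latent classes is still a user choice, and EM can converge to different solutions from different starting values.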