2007 GSA Denver Annual Meeting (28–31 October 2007)
Paper No. 144-11
Presentation Time: 8:00 AM-12:00 PM

TWO-WAY CLUSTERING OF TAXON ABUNDANCE DATA: SEARCHING FOR ENVIRONMENTAL AFFINITIES USING LATENT SEMANTIC ANALYSIS

HANDLEY, John C., Paleontological Research Institution, 1259 Trumansburg Road, Ithaca, NY 14850, jhandley@rochester.rr.com and IVANY, Linda C., Department of Earth Sciences, Syracuse University, Syracuse, NY 13244

A typical paleoecological analysis involves taxon counts from a number of bulk samples, often arrayed across some presumed paleoenvironmental gradient. The objective is usually to use taxon distributions to elucidate the underlying environmental gradient by identifying groups of taxa that co-occur consistently and that characterize particular sets of samples, which may also share lithologic similarities. A standard approach to such data computes distance metrics (e.g., similarity coefficients) and uses them in cluster analysis to group samples by taxon composition (Q-mode) and to group taxa by their occurrence in samples (R-mode). The two sets of clusters are then reconciled to display sets of similar samples together with the taxa that dominate them, in the hope that the sample-taxon associations offer insight into how biofacies are arrayed across ancient environments.
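
As a rough illustration of the distance-based step described above, the following Python sketch computes Q-mode and R-mode dissimilarity matrices from a small count table. The counts are invented for illustration, and the Bray-Curtis coefficient is an assumption chosen only as one common example of a similarity coefficient; the abstract does not say which coefficient or clustering algorithm is used.

```python
# Hypothetical counts: rows are bulk samples, columns are taxa (invented data).
counts = [
    [30, 22, 0, 0],
    [25, 18, 0, 0],
    [0, 0, 15, 40],
    [0, 0, 20, 35],
]

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical, 1 = disjoint)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def dissimilarity_matrix(rows):
    """Pairwise dissimilarities between all rows."""
    n = len(rows)
    return [[bray_curtis(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Q-mode: compare samples by their taxon composition.
q_mode = dissimilarity_matrix(counts)

# R-mode: transpose the table and compare taxa by their occurrence in samples.
taxa_rows = [list(col) for col in zip(*counts)]
r_mode = dissimilarity_matrix(taxa_rows)
```

Each matrix would then be passed to a clustering algorithm of the analyst's choosing, and the two resulting dendrograms reconciled by hand, which is the step where the choices listed in the abstract (coefficient, algorithm, forced cluster assignment) enter.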

A new class of two-way clustering methods has been developed that holds great potential for paleoecological analysis of this sort. ‘Latent semantic analysis’, while initially developed for information retrieval, can be adapted to questions of taxon/sample associations as described above and may offer a more natural method for analyzing count data. These methods are based on probabilistic models rather than vector space representations, and are not subject to the limitations of distance-based metrics (e.g., choice of similarity coefficient, clustering algorithm, forced assignment to clusters). We explore the use of one of these models and show some encouraging results in analyzing published data sets. The key notions in this approach are that probabilities are estimated to assign both taxa and samples to ‘latent classes’ that emerge from the analysis, and that these latent classes correspond to meaningful associations of taxa and environment. Although the methods are computationally intensive, routines are available for the R statistical package, which many paleontologists already use. We believe this probability-based approach will offer a valuable complement to the more traditional, distance-based clustering.
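
To make the latent-class idea concrete, here is a minimal Python sketch of one probabilistic model of this family, the probabilistic latent semantic analysis (aspect) model fitted by expectation-maximization. The abstract does not name the specific model or the R routines used, so the choice of model, the function names `plsa` and `class_given_sample`, and the toy counts are all assumptions made for illustration.

```python
import random

def plsa(counts, n_classes, n_iter=200, seed=0):
    """EM fit of P(z), P(sample|z), P(taxon|z) to a samples x taxa count matrix."""
    rng = random.Random(seed)
    n_s, n_t = len(counts), len(counts[0])

    def random_dist(n):
        w = [rng.random() + 0.1 for _ in range(n)]
        total = sum(w)
        return [x / total for x in w]

    p_z = [1.0 / n_classes] * n_classes
    p_s_z = [random_dist(n_s) for _ in range(n_classes)]
    p_t_z = [random_dist(n_t) for _ in range(n_classes)]

    for _ in range(n_iter):
        new_z = [0.0] * n_classes
        new_s = [[0.0] * n_s for _ in range(n_classes)]
        new_t = [[0.0] * n_t for _ in range(n_classes)]
        for s in range(n_s):
            for t in range(n_t):
                n = counts[s][t]
                if n == 0:
                    continue
                # E-step: share this cell's count among the latent classes.
                joint = [p_z[z] * p_s_z[z][s] * p_t_z[z][t] for z in range(n_classes)]
                norm = sum(joint)
                if norm == 0.0:
                    continue
                for z in range(n_classes):
                    r = n * joint[z] / norm
                    new_z[z] += r
                    new_s[z][s] += r
                    new_t[z][t] += r
        # M-step: renormalize the expected counts into probabilities.
        total = sum(new_z)
        for z in range(n_classes):
            if new_z[z] > 0.0:
                p_s_z[z] = [v / new_z[z] for v in new_s[z]]
                p_t_z[z] = [v / new_z[z] for v in new_t[z]]
            p_z[z] = new_z[z] / total
    return p_z, p_s_z, p_t_z

def class_given_sample(p_z, p_s_z, s):
    """Posterior P(z | sample s); assumes the sample has at least one specimen."""
    w = [p_z[z] * p_s_z[z][s] for z in range(len(p_z))]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical counts: rows are bulk samples, columns are taxa (invented data).
counts = [
    [30, 22, 0, 0],
    [25, 18, 0, 0],
    [0, 0, 15, 40],
    [0, 0, 20, 35],
]
p_z, p_s_z, p_t_z = plsa(counts, n_classes=2)
membership = [class_given_sample(p_z, p_s_z, s) for s in range(len(counts))]
```

In this sketch each row of `p_t_z` describes the taxa typical of one latent class, and each entry of `membership` gives a sample's graded probability of belonging to each class, so neither taxa nor samples are forced into a single cluster. The number of latent classes still has to be chosen by the analyst.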

Session No. 144--Booth# 94
Paleontology (Posters) II: Environments, Ecosystems, and Interactions
Colorado Convention Center: Exhibit Hall E/F
8:00 AM-12:00 PM, Tuesday, 30 October 2007

Geological Society of America Abstracts with Programs, Vol. 39, No. 6, p. 398

© Copyright 2007 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.