TWO-WAY CLUSTERING OF TAXON ABUNDANCE DATA: SEARCHING FOR ENVIRONMENTAL AFFINITIES USING LATENT SEMANTIC ANALYSIS
A new class of two-way clustering methods has been developed that holds great potential for paleoecological analysis of this sort. 'Latent semantic analysis', although initially developed for information retrieval, can be adapted to questions of taxon/sample associations as described above and may offer a more natural method for analyzing count data. These methods are based on probabilistic models rather than vector space representations, and are not subject to the limitations of distance-based methods (e.g., choice of similarity coefficient, choice of clustering algorithm, forced assignment to clusters). We explore the use of one of these models and show some encouraging results from analyses of published data sets. The key notions in this approach are that probabilities are estimated to assign both taxa and samples to 'latent classes' that emerge from the analysis, and that these latent classes correspond to meaningful associations of taxa and environment. Although the methods are computationally intensive, routines are available for the R statistical environment, which many paleontologists already use. We believe this probability-based approach will offer a valuable complement to the more traditional, distance-based clustering.
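To make the idea of latent classes concrete, here is a minimal sketch of how such a model could be fitted. The abstract does not say which model was used, so this sketch assumes the aspect model usually called probabilistic latent semantic analysis, fitted by expectation-maximization. It is written in plain Python rather than R for illustration only. The function names `plsa` and `dominant_class`, the parameter choices, and the toy count matrix are all hypothetical and are not taken from the published data sets.

```python
import random


def plsa(counts, n_classes, n_iter=200, seed=0):
    """Fit P(z), P(sample|z), P(taxon|z) by EM to a samples-by-taxa count matrix.

    Assumed model (not stated in the abstract):
        P(sample, taxon) = sum_z P(z) * P(sample|z) * P(taxon|z)
    """
    rng = random.Random(seed)
    n_s, n_t = len(counts), len(counts[0])

    def random_distribution(n):
        # Strictly positive random weights, normalized to sum to 1.
        w = [rng.random() + 0.1 for _ in range(n)]
        total = sum(w)
        return [x / total for x in w]

    p_z = [1.0 / n_classes] * n_classes
    p_s_z = [random_distribution(n_s) for _ in range(n_classes)]
    p_t_z = [random_distribution(n_t) for _ in range(n_classes)]

    for _ in range(n_iter):
        new_z = [0.0] * n_classes
        new_s = [[0.0] * n_s for _ in range(n_classes)]
        new_t = [[0.0] * n_t for _ in range(n_classes)]
        for s in range(n_s):
            for t in range(n_t):
                n = counts[s][t]
                if n == 0:
                    continue
                # E-step: posterior P(z | sample, taxon) for this cell.
                joint = [p_z[z] * p_s_z[z][s] * p_t_z[z][t]
                         for z in range(n_classes)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_classes):
                    r = n * joint[z] / total  # expected count owned by class z
                    new_z[z] += r
                    new_s[z][s] += r
                    new_t[z][t] += r
        # M-step: renormalize expected counts into probabilities.
        grand = sum(new_z)
        for z in range(n_classes):
            if new_z[z] > 0.0:
                p_s_z[z] = [v / new_z[z] for v in new_s[z]]
                p_t_z[z] = [v / new_z[z] for v in new_t[z]]
            p_z[z] = new_z[z] / grand
    return p_z, p_s_z, p_t_z


def dominant_class(p_z, p_x_z, i):
    """Most probable latent class for item i (a sample or a taxon)."""
    return max(range(len(p_z)), key=lambda z: p_z[z] * p_x_z[z][i])


# Hypothetical toy data: rows are samples, columns are taxa.
counts = [
    [10, 8, 0, 0],
    [9, 12, 1, 0],
    [0, 1, 11, 7],
    [0, 0, 9, 10],
]
p_z, p_s_z, p_t_z = plsa(counts, n_classes=2)
sample_classes = [dominant_class(p_z, p_s_z, s) for s in range(len(counts))]
taxon_classes = [dominant_class(p_z, p_t_z, t) for t in range(len(counts[0]))]
print("sample classes:", sample_classes)
print("taxon classes:", taxon_classes)
```

Every sample and every taxon receives a probability under each latent class, so the hard labels printed here are only a convenient summary. The fitted probabilities can instead be inspected directly, which avoids the forced assignment to clusters mentioned above. Class numbering is arbitrary and depends on the random starting values.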