AVOIDING PITFALLS IN PCA FOR DESCRIBING TAPHONOMY: REFINING THE CHEMOSPACE APPROACH
Here we discuss three scenarios that require methodological adjustments to PCA to produce useful and interpretable results. First, proportional data, such as data produced by , require specific transformations before PCA is applied, to account for the non-negativity and constant-sum constraints inherent in the data. Second, when PCA precedes a downstream learning method (e.g., classification or clustering), samples of unknown group membership should be excluded when fitting the PCA, both to avoid including variance from out-of-sample data and to improve model generalizability. In the particular case of classification or clustering, it is advisable to use a secondary method beyond PCA to assign group membership. Lastly, certain spectroscopy methods, such as Raman, can yield noisy spectra for heterogeneous geologic samples, for example, fossil samples that contain a mixture of mineral and organic materials that produce overlapping vibrational signals. While PCA is a valid technique for characterizing such spectra, it is unclear how noise influences the resulting principal components and thus the final interpretations. Here, we test a set of known, generated spectra with varying levels of noise to evaluate the effectiveness of PCA in describing the material.
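The first two adjustments can be illustrated with a minimal sketch. It assumes the centered log-ratio (CLR) transform as the transformation for proportional data; the text above does not name a specific transformation, so CLR is only one common choice. The Dirichlet-generated compositions, the pseudocount, and the function names clr, fit_pca, and project are hypothetical and used only for illustration. The PCA is fitted on samples of known group membership, and the unknown samples are then projected into the same space.

```python
import numpy as np


def clr(compositions, pseudocount=1e-6):
    """Centered log-ratio transform for rows of non-negative proportions.

    Assumption: CLR is used here as one common transformation for
    constant-sum data; the pseudocount guards against log(0).
    """
    x = np.asarray(compositions, dtype=float) + pseudocount
    x = x / x.sum(axis=1, keepdims=True)  # re-close each row to sum to 1
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


def fit_pca(train, n_components=2):
    """Fit PCA on the training rows only; return the mean and the axes."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(data, mean, components):
    """Project rows onto axes that were fitted on other samples."""
    return (data - mean) @ components.T


# Hypothetical four-part compositions, not real measurements.
rng = np.random.default_rng(0)
known = rng.dirichlet(alpha=[4.0, 2.0, 1.0, 1.0], size=30)
unknown = rng.dirichlet(alpha=[1.0, 1.0, 2.0, 4.0], size=5)

# Fit on samples of known group membership only.
mean, components = fit_pca(clr(known))
known_scores = project(clr(known), mean, components)

# Unknown samples are projected afterwards and do not shape the axes.
unknown_scores = project(clr(unknown), mean, components)
```

Because the unknown samples never enter fit_pca, their variance cannot influence the principal axes; their scores only indicate where they fall relative to the known samples.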