GSA Annual Meeting in Denver, Colorado, USA - 2016

Paper No. 312-1
Presentation Time: 1:30 PM

EFFECTS OF RANDOMLY AND NON-RANDOMLY DISTRIBUTED MISSING DATA IN SUPPORT VALUES OF BAYESIAN AND PARSIMONY ANALYSIS (Invited Presentation)


POL, Diego and HOLLEY, Alfredo, Museo Paleontológico Egidio Feruglio, CONICET, Av Fontana 140, Trelew, 9100, Argentina, dpol@mef.org.ar

Paleontological datasets are characterized by copious amounts of missing data, whose problematic effects on phylogenetic analyses have long been noted. In parsimony analyses, recent advances in numerical methods and their efficient implementation in phylogenetic software currently allow incorporating numerous characters or taxa with large amounts of missing entries without creating problems related to large numbers of equally parsimonious trees. Furthermore, the taxa that are unstable among the most parsimonious trees can be identified and removed to obtain well-resolved reduced consensus trees. The effects that missing data have on support values, however, are much less understood.

Regarding Bayesian analyses, recent studies using both empirical and simulated data matrices have shown that missing data also affect the performance of this method, especially when the missing data are non-randomly distributed. Non-random distribution of missing data is quite common in paleontological data matrices, as missing entries are usually concentrated in highly incompletely scored taxa and highly incompletely scored characters. As in parsimony, the effects of the amount of missing data (and of its different patterns of distribution) on posterior clade probabilities are poorly understood.

Here we present a study of the effects that randomly and non-randomly distributed missing entries have on support values for both Bayesian and parsimony analyses, using a set of empirical data matrices of morphological characters. Different regimes of missing entries were artificially added to these datasets, and the support/credibility values obtained for the modified datasets were compared with those of the original matrices (without missing data). The results of these analyses show that support/credibility values are highly sensitive to the presence of non-randomly distributed missing entries, in particular in the case of highly incompletely scored taxa. A major difference between the results of the two methods is found in the frequency of high credibility values obtained for erroneous groups in the case of Bayesian analyses.
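
The distinction between random and non-random regimes of missing entries can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names, the toy matrix, the fraction of missing cells, and the number of affected taxa are all hypothetical, and the abstract does not specify how the regimes were generated. The sketch assumes a character matrix stored as a list of rows (one per taxon) and uses "?" for a missing entry. Both functions remove the same total number of cells, so only the distribution of the missing entries differs.

```python
import random

MISSING = "?"  # conventional symbol for a missing entry in a character matrix


def add_random_missing(matrix, fraction, rng):
    """Return a copy with a fraction of all cells, drawn uniformly, set to '?'."""
    n_taxa, n_chars = len(matrix), len(matrix[0])
    cells = [(t, c) for t in range(n_taxa) for c in range(n_chars)]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    out = [row[:] for row in matrix]  # leave the original matrix untouched
    for t, c in chosen:
        out[t][c] = MISSING
    return out


def add_taxon_concentrated_missing(matrix, fraction, n_taxa_hit, rng):
    """Return a copy with the same number of '?' cells confined to a few taxa."""
    n_taxa, n_chars = len(matrix), len(matrix[0])
    hit = rng.sample(range(n_taxa), n_taxa_hit)  # taxa that become incomplete
    cells = [(t, c) for t in hit for c in range(n_chars)]
    # Cap the count at the number of cells available in the selected taxa.
    n_missing = min(round(fraction * n_taxa * n_chars), len(cells))
    out = [row[:] for row in matrix]
    for t, c in rng.sample(cells, n_missing):
        out[t][c] = MISSING
    return out


def count_missing(matrix):
    """Total number of missing entries in the matrix."""
    return sum(row.count(MISSING) for row in matrix)


# Hypothetical toy matrix: 6 taxa x 10 binary characters, fully scored.
rng = random.Random(0)
original = [[rng.choice("01") for _ in range(10)] for _ in range(6)]
random_regime = add_random_missing(original, 0.2, rng)
taxon_regime = add_taxon_concentrated_missing(original, 0.2, 2, rng)
```

Each modified matrix would then be analyzed with the same parsimony and Bayesian settings as the original, so that differences in support or posterior probability can be attributed to the pattern of missing entries rather than to their amount.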