Kates, H. R., J. R. Doby, C. M. Siniscalchi, R. LaFrance, D. E. Soltis, P. S. Soltis, R. P. Guralnick, and R. A. Folk. 2021. The Effects of Herbarium Specimen Characteristics on Short-Read NGS Sequencing Success in Nearly 8000 Specimens: Old, Degraded Samples Have Lower DNA Yields but Consistent Sequencing Success. Front. Plant Sci. 12. Frontiers.

Abstract

Phylogenetic datasets are now commonly generated using short-read sequencing technologies unhampered by degraded DNA, such as that often extracted from herbarium specimens. The compatibility of these methods with herbarium specimens has precipitated an increase in broad sampling of herbarium specimens for inclusion in phylogenetic studies. Understanding which sample characteristics are predictive of sequencing success can guide researchers in the selection of tissues and specimens most likely to yield good results. Multiple recent studies have considered the relationship between sample characteristics and DNA yield and sequence capture success. Here we report an analysis of the relationship between sample characteristics and sequencing success for nearly 8,000 herbarium specimens. This study, the largest of its kind, is also the first to include a measure of specimen quality (“greenness”) as a predictor of DNA sequencing success. We found that taxonomic group and source herbarium are strong predictors of both DNA yield and sequencing success and that the most important specimen characteristics for predicting success differ for DNA yield and sequencing: greenness was the strongest predictor of DNA yield, and age was the strongest predictor of proportion-on-target reads recovered. Surprisingly, the relationship between age and proportion-on-target reads is the inverse of expectations; older specimens performed slightly better in our capture-based protocols. We also found that DNA yield itself is not a strong predictor of sequencing success. Most literature on DNA sequencing from herbarium specimens considers specimen selection for optimal DNA extraction success, which we find to be an inappropriate metric for predicting success using next-generation sequencing technologies.