Human Population Genetics and Genomics ISSN 2770-5005

Human Population Genetics and Genomics 2024;4(1):0005

Original Research Open Access

Simulation-based benchmarking of ancient haplotype inference for detecting population structure

Jazeps Medina-Tretmanis 1, Flora Jay 2,†, María C. Ávila-Arcos 3,†, Emilia Huerta-Sanchez 1,4,†

  • 1 Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA
  • 2 CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, Université Paris-Saclay, 91400, Orsay, France
  • 3 International Laboratory for Human Genome Research, UNAM, 76230, Querétaro, Mexico
  • 4 Department of Ecology, Evolution and Organismal Biology, Brown University, Providence, RI 02912, USA
  • † These authors contributed equally to this work.

Correspondence: Flora Jay; María C. Ávila-Arcos; Emilia Huerta-Sanchez

Academic Editor(s): Daniel Wegmann

Received: Sep 28, 2023 | Accepted: Feb 19, 2024 | Published: Mar 19, 2024

This article belongs to the Special Issue

Cite this article: Medina-Tretmanis J, Jay F, Ávila-Arcos M, Huerta-Sanchez E. Simulation-based benchmarking of ancient haplotype inference for detecting population structure. Hum Popul Genet Genom 2024; 4(1):0005.


Paleogenomic data has informed us about the movements, growth, and relationships of ancient populations. It has also given us context for medically relevant adaptations that appear in present-day humans due to introgression from other hominids, and it continues to help us characterize the evolutionary history of humans. However, ancient DNA (aDNA) presents several practical challenges: factors such as deamination, high fragmentation, environmental contamination, and low amounts of recoverable endogenous DNA make aDNA recovery and analysis more difficult than for modern DNA. Most studies with aDNA leverage only SNP data, and only a few have made inferences about human demographic history based on haplotype data, possibly because haplotype estimation (or phasing) has not yet been systematically evaluated in the context of aDNA. Here, we evaluate how the unique challenges of aDNA can impact phasing and imputation quality. We also present an aDNA simulation pipeline that integrates multiple existing tools, allowing users to specify features of the simulated aDNA and the evolutionary history of the simulated populations. We measured phasing error as a function of aDNA quality and demographic history, and found that low phasing error is achievable even for very ancient individuals (~400 generations in the past) as long as contamination is low and average coverage is adequate. Our results show that population splits or bottleneck events occurring between the reference and phased populations affect phasing quality, with bottlenecks resulting in the highest average error rates. Finally, we found that using estimated haplotypes, even if not completely accurate, is superior to using the simulated genotype data when reconstructing changes in population structure after population splits between present-day and ancient populations. We also found that imputing ancient data before phasing can lead to better phasing quality, even in cases where the reference individuals used for imputation are not representative of the ancient individuals.
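To make the notion of "phasing error" concrete, the sketch below computes a switch error rate, one widely used phasing-error metric. The abstract does not state which metric the study used, so treating it as the switch error rate is an assumption made here for illustration only. The function name `switch_error_rate` and the 0/1 allele encoding are likewise hypothetical choices, not part of the authors' pipeline.

```python
from typing import Sequence, Tuple

# A diploid individual as two parallel allele sequences (0 = reference, 1 = alternate).
Haplotypes = Tuple[Sequence[int], Sequence[int]]


def switch_error_rate(true_haps: Haplotypes, est_haps: Haplotypes) -> float:
    """Fraction of adjacent heterozygous-site pairs whose relative phase is flipped.

    Illustrative sketch only: assumes biallelic sites and skips sites that are
    not heterozygous in both the true and the estimated data.
    """
    true_a, true_b = true_haps
    est_a, est_b = est_haps
    n = len(true_a)
    if not (len(true_b) == len(est_a) == len(est_b) == n):
        raise ValueError("all haplotypes must cover the same sites")

    # Phase is only defined at heterozygous sites.
    het_sites = [
        i for i in range(n)
        if true_a[i] != true_b[i] and est_a[i] != est_b[i]
    ]

    # True where the first estimated haplotype carries the first true haplotype's allele.
    orientation = [est_a[i] == true_a[i] for i in het_sites]
    if len(orientation) < 2:
        return 0.0  # no adjacent pair of heterozygous sites to compare

    # A switch error is a change of orientation between consecutive heterozygous sites.
    switches = sum(
        1 for prev, cur in zip(orientation, orientation[1:]) if prev != cur
    )
    return switches / (len(orientation) - 1)


# Toy example: the last site is homozygous, so four heterozygous sites remain.
truth = ([0, 1, 0, 1, 1], [1, 0, 1, 0, 1])
estimate = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 1])
toy_rate = switch_error_rate(truth, estimate)
```

In the toy example the estimated phase flips once across the three adjacent heterozygous pairs. Swapping the two estimated haplotypes wholesale does not count as an error, since the labeling of the two haplotypes is arbitrary.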


Keywords: ancient DNA, phasing, haplotype, simulation, imputation, population structure
