Le Sy Vinh, VNU University of Engineering and Technology



Amino acid substitution model, whole genomes, maximum likelihood estimation methods, phylogenomics


Modeling the amino acid substitution process is a core task in bioinformatics. Advanced sequencing technologies have generated huge datasets, including whole genomes from a wide range of species. Estimating amino acid substitution models from whole-genome datasets provides unprecedented opportunities to investigate relationships among species accurately. In this paper, we review state-of-the-art computational methods for estimating amino acid substitution models from large datasets. We also describe a comprehensive, practical pipeline for estimating amino acid models from whole-genome datasets. Finally, we apply amino acid substitution models to build phylogenomic trees from bird and plant genome datasets, compare the newly reconstructed trees with published ones, and discuss the new findings.

