Cuong Cao Dang, Le Sy Vinh
Author affiliations


  • Cuong Cao Dang, University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, 10000 Ha Noi, Viet Nam
  • Le Sy Vinh, University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, 10000 Ha Noi, Viet Nam



Keywords: Amino acid substitution models, Bacterial protein sequences, Time-non-reversible models, Time-reversible models


Reconstructing phylogenetic trees from protein sequences normally requires empirical amino acid substitution models to calculate the likelihood of trees or the genetic distances between species. The tree of life is divided into three domains: Eukaryotes, Archaea, and Bacteria. Amino acid substitution models have been studied intensively for decades, but few are specific to Bacteria. Rooting bacterial trees remains a challenging problem in phylogenetic analysis because of the long branch separating Bacteria from the other domains. This paper has two main objectives: estimating the amino acid substitution models Q.bac and NQ.bac for bacterial evolutionary studies, and assessing the ability of the time-non-reversible model NQ.bac to root bacterial trees. Experiments showed that both the time-reversible model (Q.bac) and the time-non-reversible model (NQ.bac) were significantly better than existing models in analyzing bacterial protein sequences. Notably, NQ.bac helped reconstruct maximum likelihood bacterial trees with reliable roots for 177 (23.7%) of 748 testing alignments without requiring predefined outgroups. This outgroup-free rooting method enhances the study of bacterial evolution. We recommend that researchers employ both the Q.bac and NQ.bac models when analyzing bacterial protein sequences. The datasets and scripts used in this manuscript are available at
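The distinction between time-reversible and time-non-reversible models can be illustrated with a small sketch. It assumes the standard definition: a rate matrix Q with stationary frequencies π is time-reversible when π_i q_ij = π_j q_ji for every pair of states. The 3-state matrices below are toy values chosen for illustration, not Q.bac or NQ.bac, and all function and variable names are hypothetical.

```python
from itertools import permutations


def reversible_q(exch, freqs):
    """Build Q with q_ij = s_ij * pi_j (i != j) from symmetric exchangeabilities."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q


def is_stationary(q, freqs, tol=1e-12):
    """Check that freqs is a stationary distribution of q (pi * Q = 0)."""
    n = len(freqs)
    return all(abs(sum(freqs[i] * q[i][j] for i in range(n))) <= tol
               for j in range(n))


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all i != j."""
    n = len(freqs)
    return all(abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
               for i, j in permutations(range(n), 2))


def free_parameters(n_states, reversible):
    """Count free parameters of a general rate matrix (one rate fixed by scaling)."""
    if reversible:
        # symmetric exchangeabilities (minus scaling) plus free frequencies
        return n_states * (n_states - 1) // 2 - 1 + (n_states - 1)
    # all off-diagonal rates (minus scaling); frequencies follow from Q
    return n_states * (n_states - 1) - 1


# Toy reversible model: symmetric exchangeabilities and chosen frequencies.
REV_FREQS = [0.5, 0.3, 0.2]
REV_Q = reversible_q([[0, 1, 2], [1, 0, 3], [2, 3, 0]], REV_FREQS)

# Toy non-reversible model: faster around the cycle 0 -> 1 -> 2 -> 0 than back.
# Its columns also sum to zero, so the uniform distribution is stationary.
NONREV_FREQS = [1 / 3, 1 / 3, 1 / 3]
NONREV_Q = [[-1.5, 1.0, 0.5],
            [0.5, -1.5, 1.0],
            [1.0, 0.5, -1.5]]

if __name__ == "__main__":
    print("reversible toy model:", is_time_reversible(REV_Q, REV_FREQS))
    print("non-reversible toy model:", is_time_reversible(NONREV_Q, NONREV_FREQS))
    print("free parameters, 20 states:",
          free_parameters(20, True), free_parameters(20, False))
```

Because a non-reversible process assigns different likelihoods to different root positions on the same unrooted tree, a model of this kind can in principle place a root without an outgroup, which is the property the abstract exploits.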







