CLW_SUMO: A hybrid deep learning model for predicting protein SUMOylation sites

Thi-Xuan Tran; Thi-Thu-Huong Tran; Nguyen Quoc Khanh Le; Van Nui Nguyen

doi:10.15625/1813-9663/19626

Author affiliations

Thi-Xuan Tran, University of Economics and Business Administration, Tan Thinh Ward, Thai Nguyen City, Viet Nam
Thi-Thu-Huong Tran, Thai Binh University, Tan Binh Ward, Thai Binh City, Viet Nam
Nguyen Quoc Khanh Le, Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Yuantong Road, Zhonghe District, Taipei City, Taiwan
Van Nui Nguyen, Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Yuantong Road, Zhonghe District, Taipei City, Taiwan

Keywords:

SUMOylation, prediction, convolutional neural networks, long short-term memory, natural language processing, Word2Vec.

Abstract

Protein SUMOylation is one of the most important post-translational modifications in eukaryotic species and plays significant roles in many biological processes. The mechanism underlying the SUMOylation process is implicated in many common serious diseases, such as breast cancer, cardiac disease, Parkinson’s disease, and Alzheimer’s disease. Because SUMOylation regulates such important functions, an in-depth understanding of SUMOylation and its mechanism is an active research topic. In this study, we propose a novel approach, called CLW-SUMO, for predicting SUMOylation sites using a hybrid deep learning model that combines convolutional neural networks (CNN) and long short-term memory (LSTM), with Word2Vec as the word embedding technique. In 10-fold cross-validation, the proposed model achieves its best performance with an accuracy (ACC) of 82.33%, a Matthews correlation coefficient (MCC) of 0.589, and an area under the ROC curve (AUC) of 0.829. On independent testing, the model reaches an accuracy of 90.03%, an MCC of 0.773, and an AUC of 0.889. These independent-test ACC and MCC values are also the highest among the several existing SUMOylation predictors compared on the same independent dataset. We hope that our findings will provide useful guidance and help researchers in studies related to protein SUMOylation site identification.
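
For readers less familiar with the reported metrics, the following is a minimal sketch of how accuracy (ACC) and the Matthews correlation coefficient (MCC) are conventionally computed from a binary confusion matrix. The function names and the confusion counts are hypothetical and chosen only for illustration; they are not taken from the study and do not reproduce its evaluation pipeline.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion counts (NOT results from the paper):
# tp = SUMOylated sites predicted as SUMOylated, tn = non-sites predicted as non-sites.
tp, tn, fp, fn = 40, 45, 5, 10
print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it is less easily inflated by a large majority class, which may be why predictors of modification sites often report both.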
