• Trinh Thi Anh Loan, Hong Duc University, Viet Nam
• Pham The Anh, Hong Duc University, Viet Nam
• Le Viet Nam, Hong Duc University, Viet Nam
• Hoang Van Dung, Ho Chi Minh City University of Technology and Education, Viet Nam



Deep learning models, Classification losses, Feature pyramid network


This paper presents a deep learning model for recognizing animals and plants. The work is motivated by the need to protect rare species in Vietnam that face a serious risk of extinction, such as Panthera pardus, Dalbergia cochinchinensis, and Macaca mulatta. The proposed approach exploits the learning ability of convolutional neural networks and Inception residual structures to design a lightweight model for the classification task. We also apply transfer learning to fine-tune two state-of-the-art models, MobileNetV2 and InceptionV3, on our own dataset. Experimental results demonstrate the superiority of our object predictor (e.g., 95.8% accuracy) in comparison with other methods. In addition, the proposed model is efficient, with an inference speed of around 113 FPS on a CPU machine, which makes it suitable for deployment in mobile environments.
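To put the reported throughput in perspective, the sketch below shows one common way to measure frames per second for a classifier and to convert an FPS figure into per-image latency. It is only an illustration, not the authors' benchmarking procedure: `measure_fps`, `latency_ms`, and `dummy_predict` are hypothetical names, and the dummy predictor stands in for a real model's inference call.

```python
import time


def measure_fps(predict, inputs, warmup=5):
    """Return the average number of inputs processed per second by `predict`."""
    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for x in inputs[:warmup]:
        predict(x)
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed


def latency_ms(fps):
    """Convert throughput in FPS to mean per-image latency in milliseconds."""
    return 1000.0 / fps


def dummy_predict(image):
    # Hypothetical stand-in for a real classifier; replace with the model's
    # own inference call when benchmarking an actual network.
    return sum(image) % 3


if __name__ == "__main__":
    images = [[i, i + 1, i + 2] for i in range(1000)]
    print(f"measured: {measure_fps(dummy_predict, images):.0f} FPS")
    print(f"113 FPS is about {latency_ms(113):.2f} ms per image")
```

Under this conversion, 113 FPS corresponds to roughly 8.85 ms per image, which is the kind of budget usually considered comfortable for real-time use.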

