Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions

Phung Trung Nghia; Luong Chi Mai; Masato Akagi

doi:10.15625/1813-9663/31/1/5064

Author affiliations

Authors

Phung Trung Nghia, Thai Nguyen University of Information and Communication Technology
Luong Chi Mai, Institute of Information Technology, Hanoi, Viet Nam
Masato Akagi, Japan Advanced Institute of Science and Technology

DOI:

https://doi.org/10.15625/1813-9663/31/1/5064

Keywords:

Concatenative speech synthesis, temporal decomposition, co-articulation, tone transformation, limited data, Vietnamese speech

Abstract

Building a large speech corpus is a costly and time-consuming task. Therefore, how to build high-quality speech synthesis under limited data conditions is an important issue, specifically for under-resourced languages such as Vietnamese. Because concatenative speech synthesis (CSS) is currently the most natural-sounding form of speech synthesis, it was the synthesis method we studied in this research. CSS requires all possible units of a specific phonetic unit set. This requirement may be easy to meet for verbal languages, in which the number of units in a specific phonetic unit set, such as the phoneme set, is relatively small. In tonal languages, however, the number of tonal phonetic units is large, and it is difficult to design a small corpus covering all possible tonal phonetic units. Additionally, because all context-dependent phonetic units are required to ensure the naturalness of corpus-based CSS, concatenation needs a large database, with a size of up to dozens of gigabytes. The motivation for this work is therefore to improve the naturalness of CSS under limited data conditions, and we addressed both of these problems. First, we attempted to reduce the number of tonal units required for the CSS of tonal languages by using a method of tone transformation. Second, we attempted to reduce mismatch-context errors in concatenation regions, so that CSS remains usable when matching-context units cannot be found in the database. Temporal Decomposition (TD), an interpolation method that decomposes a spectral or prosodic sequence into its sparse event targets and corresponding temporal event functions, was used for both tasks. Previous studies have revealed that TD can be used efficiently for spectral transformation. Therefore, a TD-based transformation of fundamental frequency (F0) contours, which represent the lexical tones in tonal languages, is proposed. The concept of TD is also close to that of co-articulation in speech, which is related to the contextual effect in CSS. Therefore, TD is also used to model, select, and modify co-articulated transition regions to reduce mismatch-context errors. The experimental results obtained from a small Vietnamese corpus demonstrated that the proposed lexical tone transformation was able to transform lexical tones, and that the proposed method of reducing mismatch-context errors in the CSS of the general language was efficient. As a result, the two proposed methods are useful for improving the naturalness of Vietnamese CSS under limited data conditions.
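The TD idea described in the abstract, a contour rebuilt as a sum of sparse event targets weighted by temporal event functions, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes triangular event functions in which only two adjacent events overlap and the weights sum to one, and it samples targets directly at the event frames, where a real TD analysis would estimate both. The function names, frame counts, and F0 values are all hypothetical.

```python
def event_functions(event_frames, n_frames):
    """Build triangular event functions (an assumed, simplified shape).

    phi_k is 1 at event k and falls linearly to 0 at the neighbouring
    events, so at most two functions overlap and they sum to 1.
    """
    phis = [[0.0] * n_frames for _ in event_frames]
    for n in range(n_frames):
        if n <= event_frames[0]:
            phis[0][n] = 1.0
        elif n >= event_frames[-1]:
            phis[-1][n] = 1.0
        else:
            for k in range(len(event_frames) - 1):
                left, right = event_frames[k], event_frames[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phis[k][n] = 1.0 - w
                    phis[k + 1][n] = w
                    break
    return phis


def reconstruct(targets, phis):
    """Interpolate a contour: y(n) = sum_k targets[k] * phi_k(n)."""
    n_frames = len(phis[0])
    return [sum(a * phi[n] for a, phi in zip(targets, phis))
            for n in range(n_frames)]


def decompose(contour, event_frames):
    """Toy analysis step: take targets as the contour values at the events.

    A real TD analysis estimates targets and event functions from the data.
    """
    targets = [contour[i] for i in event_frames]
    return targets, event_functions(event_frames, len(contour))


# Hypothetical 11-frame F0 contour (Hz) of a level tone, with three events.
source_f0 = [120.0] * 11
event_frames = [0, 5, 10]
source_targets, phis = decompose(source_f0, event_frames)

# Tone transformation in this sketch: keep the event functions (timing)
# and swap in hypothetical targets for a rising tone.
rising_targets = [120.0, 130.0, 160.0]
converted_f0 = reconstruct(rising_targets, phis)
print([round(v, 1) for v in converted_f0])
```

Under these assumptions, only the few event targets change when a tone is converted, while the timing carried by the event functions is reused. This is one way to read the abstract's claim that tone transformation reduces the number of tonal units that must be recorded.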

Published

16-03-2015

How to Cite

P. T. Nghia, L. C. Mai, and M. Akagi, “Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions”, J. Comput. Sci. Cybern., vol. 31, no. 1, pp. 1–16, Mar. 2015.

Issue

Vol. 31 No. 1 (2015)

Section

Computer Science

License

1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our own result, has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements, and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder(s) to reproduce in the article any materials, including tables, diagrams, or photographs, not owned by us.
