Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions
Author affiliations
DOI: https://doi.org/10.15625/1813-9663/31/1/5064

Keywords: Concatenative speech synthesis, temporal decomposition, co-articulation, tone transformation, limited data, Vietnamese speech

Abstract
Building a large speech corpus is a costly and time-consuming task. Therefore, how to build high-quality speech synthesis under limited data conditions is an important issue, specifically for under-resourced languages such as Vietnamese. As the most natural-sounding speech synthesis is currently concatenative speech synthesis (CSS), it was the target speech synthesis we studied in this research. All possible units of a specific phonetic unit set are required for CSS. This requirement may be easy for non-tonal languages, in which the number of all units of a specific phonetic unit set such as phonemes is relatively small. However, the number of all tonal phonetic units is significant in tonal languages, and it is difficult to design a small corpus covering all possible tonal phonetic units. Additionally, as all context-dependent phonetic units are required to ensure the naturalness of corpus-based CSS, a large database of up to dozens of gigabytes in size is needed for concatenation. Therefore, the motivation for this work is to improve the naturalness of CSS under limited data conditions, and we addressed both of these problems. First, we attempted to reduce the number of tonal units required for the CSS of tonal languages by using a method of tone transformation. Second, we attempted to reduce mismatch-context errors in concatenation regions so that CSS remains usable when matching-context units cannot be found in the database. Temporal Decomposition (TD), an interpolation method that decomposes a spectral or prosodic sequence into sparse event targets and corresponding temporal event functions, was used for both tasks. Previous studies have revealed that TD can efficiently be used for spectral transformation. Therefore, a TD-based transformation of fundamental frequency (F0) contours, which represent the lexical tones in tonal languages, is proposed. The concept of TD is also close to that of co-articulation of speech, which is related to the contextual effect in CSS. Therefore, TD is also used to model, select, and modify co-articulated transition regions to reduce the mismatch-context errors. The experimental results obtained from a small Vietnamese corpus demonstrated that the proposed lexical tone transformation was able to transform lexical tones, and that the proposed method of reducing the mismatch-context errors in the CSS of the general language was efficient. As a result, the two proposed methods are useful for improving the naturalness of Vietnamese CSS under limited data conditions.
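The TD model described in the abstract, in which a parameter trajectory is approximated as a weighted sum of sparse event targets and overlapping temporal event functions, can be sketched as follows. This is a minimal illustration only: the raised-cosine event functions, the event locations, and the target values are all hypothetical choices for demonstration, not the estimation procedure used in the paper.

```python
import numpy as np

# Temporal Decomposition (TD) models a trajectory y(n) -- e.g. an F0
# contour -- as y(n) ~= sum_k a_k * phi_k(n), where a_k are sparse
# "event targets" and phi_k(n) are overlapping "temporal event functions".

def event_function(n, center, width):
    """Hypothetical raised-cosine event function, zero outside its support."""
    phi = 0.5 * (1.0 + np.cos(np.pi * (n - center) / width))
    phi[np.abs(n - center) > width] = 0.0
    return phi

n = np.arange(100)                          # frame indices
centers = [10, 40, 70]                      # assumed event locations
targets = np.array([120.0, 180.0, 140.0])   # assumed F0 event targets (Hz)

# Stack the K event functions into a (K, N) matrix and reconstruct.
Phi = np.stack([event_function(n, c, 25) for c in centers])
y = targets @ Phi                           # reconstructed F0 contour

# A TD-based tone transformation then amounts to replacing the event
# targets (and/or reshaping the event functions) while keeping the
# temporal structure of the utterance fixed.
new_targets = np.array([120.0, 100.0, 140.0])
y_transformed = new_targets @ Phi
```

The key point of this sketch is that the temporal event functions capture the co-articulated transitions, so swapping the targets changes the tone without disturbing the timing.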
Published
16-03-2015
How to Cite
[1]
P. T. Nghia, L. C. Mai, and M. Akagi, “Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions”, JCC, vol. 31, no. 1, pp. 1–16, Mar. 2015.
Issue
Section
Computer Science
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is the result of our own work and has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.