Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions

Phung Trung Nghia, Luong Chi Mai, Masato Akagi

Authors

  • Phung Trung Nghia, Thai Nguyen University of Information and Communication Technology
  • Luong Chi Mai, Institute of Information Technology, Hanoi, Viet Nam
  • Masato Akagi, Japan Advanced Institute of Science and Technology

DOI:

https://doi.org/10.15625/1813-9663/31/1/5064

Keywords:

Concatenative speech synthesis, temporal decomposition, co-articulation, tone transformation, limited data, Vietnamese speech

Abstract

Building a large speech corpus is a costly and time-consuming task. Therefore, how to build high-quality speech synthesis under limited data conditions is an important issue, specifically for under-resourced languages such as Vietnamese. As the most natural-sounding speech synthesis is currently concatenative speech synthesis (CSS), it was the target speech synthesis we studied in this research. All possible units of a specific phonetic unit set are required for CSS. This requirement may be easy to meet for verbal languages, in which the number of all units of a specific phonetic unit set, such as phonemes, is relatively small. However, the number of tonal phonetic units is significant in tonal languages, and it is difficult to design a small corpus covering all possible tonal phonetic units. Additionally, as all context-dependent phonetic units are required to ensure the naturalness of corpus-based CSS, it needs a large database with a size of up to dozens of gigabytes for concatenation. Therefore, the motivation for this work is to improve the naturalness of CSS under limited data conditions, and we address both of these problems. First, we attempted to reduce the number of tonal units required for the CSS of tonal languages by using a method of tone transformation. Second, we attempted to reduce mismatch-context errors in concatenation regions to make CSS usable when matching-context units cannot be found in the database. Temporal Decomposition (TD), which is an interpolation method that decomposes a spectral or prosodic sequence into its sparse event targets and corresponding temporal event functions, was used for both tasks. Previous studies have revealed that TD can be used efficiently for spectral transformation. Therefore, a TD-based transformation of fundamental frequency (F0) contours, which represent the lexical tones in tonal languages, is proposed. The concept of TD is also close to that of co-articulation of speech, which is related to the contextual effect in CSS. Therefore, TD is also used to model, select, and modify co-articulated transition regions to reduce the mismatch-context errors. The experimental results obtained from a small Vietnamese corpus demonstrated that the proposed lexical tone transformation was able to transform lexical tones, and that the proposed method of reducing the mismatch-context errors in the CSS of the general language was efficient. As a result, the two proposed methods are useful for improving the naturalness of Vietnamese CSS under limited data conditions.
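
As an illustration of the TD idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a simplified model in which a contour is approximated as y(n) = sum_k a_k * phi_k(n), where a_k are the event targets and phi_k are piecewise-linear event functions of which only two adjacent ones overlap and sum to one. The function names, event positions, and F0 values in Hz are hypothetical and chosen only for illustration. Under this assumption, a tone transformation can be pictured as keeping the event functions and replacing the targets.

```python
def event_functions(event_frames, num_frames):
    """Build piecewise-linear event functions (an assumed, simplified form).

    event_frames must be strictly increasing frame indices. phis[k][n] is the
    weight of event k at frame n; at every frame the weights sum to 1.
    """
    phis = [[0.0] * num_frames for _ in event_frames]
    for n in range(num_frames):
        if n <= event_frames[0]:
            phis[0][n] = 1.0          # before the first event: hold its target
        elif n >= event_frames[-1]:
            phis[-1][n] = 1.0         # after the last event: hold its target
        else:
            for k in range(len(event_frames) - 1):
                left, right = event_frames[k], event_frames[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phis[k][n] = 1.0 - w      # fading-out event
                    phis[k + 1][n] = w        # fading-in event
                    break
    return phis


def reconstruct(targets, phis):
    """Interpolate a contour from event targets and event functions."""
    num_frames = len(phis[0])
    return [sum(a * phi[n] for a, phi in zip(targets, phis))
            for n in range(num_frames)]


# Hypothetical example: three events over nine frames.
phis = event_functions([0, 4, 8], 9)

# Made-up F0 targets in Hz for a level contour and a rising contour.
level_targets = [120.0, 120.0, 120.0]
rising_targets = [120.0, 120.0, 160.0]

level_f0 = reconstruct(level_targets, phis)
# "Transforming" the tone here means swapping targets, reusing the same phis.
rising_f0 = reconstruct(rising_targets, phis)
```

In this sketch the timing of the contour lives entirely in the event functions, so changing only the sparse targets changes the contour shape while leaving the temporal structure untouched.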

Published

16-03-2015

How to Cite

[1]
P. T. Nghia, L. C. Mai, and M. Akagi, “Improving the naturalness of concatenative Vietnamese speech synthesis under limited data conditions”, JCC, vol. 31, no. 1, pp. 1–16, Mar. 2015.

Section

Computer Science