This paper presents a voice transformation algorithm that modifies the speech of a source speaker so that it is perceived as if spoken by a target speaker. A novel method based on a dynamic programming approach is proposed. The designed system obtains speaker-specific codebooks of line spectral frequencies (LSFs) for both the source and target speakers. These codebooks are used to train a mapping histogram matrix, which is then used for LSF transformation from one speaker to the other. The baseline system uses the maxima of the histogram matrix for LSF transformation. The shortcomings of this system, namely the limited target LSF space and the spectral discontinuities caused by independent mapping of subsequent frames, are overcome by applying the dynamic programming approach. The dynamic programming approach aims to model the long-term behaviour of the target speaker's LSFs while preserving the relationship between subsequent frames of the source LSFs during transformation. Both objective and subjective evaluations have been conducted, and they show that the dynamic programming approach improves the performance of the system in terms of both speech quality and speaker similarity.
The aim of voice transformation (VT) is to modify the speech of a source speaker so that it is perceived as if spoken by a target speaker. A considerable amount of effort has been dedicated to the problem of voice transformation in the last two decades (Abe et al., 1988; Valbret et al., 1992; Childers, 1995; Mizuno and Abe, 1995; Lee et al., 1995; Stylianou et al., 1998; Arslan, 1999; Kain, 2001). A VT system has various applications. In a text-to-speech system, VT technology can create new synthesis voices by transforming the voice of the existing inventory into a new speaker’s voice. A VT system would require a much smaller inventory than the original text-to-speech inventory, which saves time and disk space. Another application is developing a voice for a speech-impaired person who can provide only a limited amount of speech data. A VT system could also be used as a preliminary step to speech recognition in order to reduce speaker variability.
In general, all VT systems have two modes: training and transformation. In the training mode, the system uses the source and target speech inventories to estimate a transformation function that maps the acoustic space of the source speaker to that of the target speaker. Once training is complete, the system is ready to transform the source speaker’s speech into the target speaker’s speech. The acoustic space of the speakers can be represented by various acoustic features. Formant frequencies (Abe et al., 1988; Mizuno and Abe, 1995), LPC cepstrum coefficients (Lee et al., 1995; Stylianou et al., 1998), and line spectral frequencies (Arslan, 1999; Kain, 2001; Salor et al., 2003) have been used. The transformation function can be a continuous function applied to the features (Stylianou, 1999; Kain, 2001; Toda, 2003; Salor, 2005), or it can be a discrete mapping from the feature space of the source speaker to that of the target speaker (Abe et al., 1988; Arslan, 1999; Salor and Demirekler, 2004). The discrete mapping is in general a codebook mapping, in which a one-to-one correspondence between the spectral codebooks of the source speaker and the target speaker is developed. These methods usually face several problems, such as degradation of the speech quality, because the parameter space of the converted envelope is limited to a discrete set of envelopes. They may also produce high distortions between the LPC spectra of neighboring frames because successive frames are transformed independently, which causes audible buzzing sounds or clicks.
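As an illustration of the discrete (codebook) mapping described above, the following is a minimal sketch of how a mapping histogram could be trained and used for a one-to-one baseline transformation. It assumes that source and target frames have already been quantized to codeword indices and time-aligned; the alignment step, the function names `train_histogram` and `baseline_map`, and the toy data are all hypothetical and are not taken from the original system.

```python
def train_histogram(src_idx, tgt_idx, n_src, n_tgt):
    """Count how often source codeword i is aligned with target codeword j.

    src_idx, tgt_idx: equal-length lists of codeword indices for
    time-aligned frame pairs (the alignment itself is assumed to be given).
    """
    hist = [[0] * n_tgt for _ in range(n_src)]
    for i, j in zip(src_idx, tgt_idx):
        hist[i][j] += 1
    return hist


def baseline_map(hist):
    """One-to-one mapping: each source codeword is sent to the target
    codeword with the largest histogram count in its row."""
    return [max(range(len(row)), key=row.__getitem__) for row in hist]


# Toy training data: aligned codeword indices (hypothetical values).
src_idx = [0, 0, 1, 1, 1, 2, 2, 0]
tgt_idx = [1, 1, 0, 2, 2, 0, 0, 1]

hist = train_histogram(src_idx, tgt_idx, n_src=3, n_tgt=3)
mapping = baseline_map(hist)

# Transformation: every frame is mapped independently of its neighbours.
converted = [mapping[i] for i in [0, 1, 2, 1]]
```

Because each frame is mapped on its own and only one target codeword is reachable per source codeword, a sketch like this exhibits both limitations noted above: a restricted target space and no control over frame-to-frame continuity.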
In this work, we aimed to build a voice transformation system inside the decoder part of a MELP speech coding algorithm. The idea is that the coded parameters could be used to produce the voice of another person at the end point of the coder. Therefore, we have focused on improving the quality of a codebook-based voice transformation system. Here, we propose a dynamic programming approach to codebook-based VT methods to overcome the problems of discontinuities and high distortions in speech. The dynamic programming approach considers the spectral distance between successive frames of the source speaker during transformation, while allowing one of several target codewords to be selected at every frame instead of using a one-to-one mapping between the source and target speaker codewords. It has been observed that dynamic programming increases speech quality.
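The idea can be sketched as a Viterbi-style search over target codewords. The cost function below is an assumption made for illustration and is not the formulation used in this work: the local cost is the negative log of the row-normalized histogram count, and the transition cost penalizes the difference between the distance of two consecutive target codewords and the distance of the corresponding source frames. The names `dp_transform`, `weight`, and `eps`, and the toy data, are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_transform(src_frames, src_idx, tgt_codebook, hist, weight=1.0, eps=1e-6):
    """Pick one target codeword per frame by dynamic programming.

    Assumed costs (illustrative only):
      local      = -log P(target j | source codeword of frame t)
      transition = weight * |d(target_k, target_j) - d(src_{t-1}, src_t)|
    """
    if not src_frames:
        return []
    n_tgt = len(tgt_codebook)

    def local(t):
        row = hist[src_idx[t]]
        total = sum(row) + eps * n_tgt  # eps avoids log(0)
        return [-math.log((c + eps) / total) for c in row]

    cost = local(0)
    back = []  # back[t-1][j] = best predecessor of codeword j at frame t
    for t in range(1, len(src_frames)):
        d_src = dist(src_frames[t - 1], src_frames[t])
        loc = local(t)
        new_cost, ptr = [], []
        for j in range(n_tgt):
            cand = [cost[k] + weight * abs(dist(tgt_codebook[k], tgt_codebook[j]) - d_src)
                    for k in range(n_tgt)]
            k_best = min(range(n_tgt), key=cand.__getitem__)
            new_cost.append(cand[k_best] + loc[j])
            ptr.append(k_best)
        back.append(ptr)
        cost = new_cost

    # Backtrack from the cheapest final codeword.
    j = min(range(n_tgt), key=cost.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path


# Toy example: 1-D "LSF" vectors, two source codewords, three target codewords.
tgt_codebook = [[0.0], [1.0], [5.0]]
hist = [[10, 0, 0], [0, 4, 5]]
src_frames = [[0.0], [1.0]]
src_idx = [0, 1]

smooth_path = dp_transform(src_frames, src_idx, tgt_codebook, hist, weight=1.0)
greedy_path = dp_transform(src_frames, src_idx, tgt_codebook, hist, weight=0.0)
```

In this toy case, setting `weight=0.0` reduces the search to the histogram maxima and selects the distant codeword 2 for the second frame, whereas `weight=1.0` selects codeword 1, whose distance from the previous target codeword matches the distance between the two source frames.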
A new approach to voice transformation has been developed in this study. The algorithm is based on codebook mapping of the spectral features of the source and target speakers. Two shortcomings of codebook-mapping-based systems, namely the limited target spectral feature space and the spectral discontinuities caused by independent mapping of subsequent frames, are overcome by applying a dynamic programming approach. This approach considers all target codewords corresponding to a given source codeword, using the probabilities obtained from the histogram matrix. Dynamic programming also considers the feature distances between subsequent frames of the source speaker, which reduces residual-filter mismatches in the transformed speech when an LPC-based speech model is used. The performance of the system was assessed by objective evaluations and subjective listening tests. The objective evaluations verified that the target speaker characteristics are obtained to a large extent when the dynamic programming approach is applied to a baseline codebook-mapping-based voice transformation system. The subjective evaluations verified that the proposed approach results in convincing voice transformation in terms of speaker identity.