Download English ISI article No. 150786
Article title
Scalable algorithms for unsupervised clustering of acoustic data for speech recognition
Article code: 150786
Publication year: 2017
Length of English article: 16 pages (PDF)
Source

Publisher: Elsevier - ScienceDirect

Journal: Computer Speech & Language, Volume 46, November 2017, Pages 233-248


Abstract

In this paper, an unsupervised clustering algorithm is developed for acoustic data in the context of speech recognition tasks. One of the key features of the algorithm is its scalability to large data sets. Specifically, given unlabeled training and test sets, the class labels of the utterances are obtained automatically. The extracted labels may correspond to the speakers in the speech corpus if the data is relatively clean. The proposed scheme is attractive from an industrial perspective because it alleviates the need to store speaker labels manually, saving a considerable amount of human effort and expense. The core of the algorithm is a three-stage architecture in which the stages process the input data one after the other, each stage being designed to perform a well-defined, specific task. In more detail, the first pass involves a bottom-up clustering mechanism, the second pass comprises a cluster-splitting operation, and the third pass consists of a cluster-refining process. Each stage allows for data parallelization across multiple CPUs, which leads to faster computation. Two alternative forms of the algorithm are presented to facilitate the clustering: the first uses Gaussian distributions and the second uses i-Vectors. Although the algorithm may find applications in various areas of speech recognition, in this paper the effectiveness of the schemes is evaluated by means of speaker adaptive training (SAT) and speaker-aware training of DNN-HMM acoustic models. In particular, experiments are conducted on the Switchboard task to extract the speaker labels for the utterances in the training and test sets. It is shown that the SAT DNN-HMM trained using the Gaussian-based scheme yields a 7.2% relative improvement in ASR accuracy over the speaker-independent DNN-HMM, whereas the i-Vector approach provides an additional improvement, amounting to a 10.8% relative gain overall. The standard SAT DNN-HMM developed using the ground-truth speaker labels is found to be only 2.7% better, in relative terms, than the proposed scheme. A similar observation is made for speaker-aware training. The analysis of computational complexity, conducted stage by stage, demonstrates the scalability of the proposed algorithms.
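
The abstract names the three passes but gives no implementation details, so the following is only a rough sketch of how a bottom-up / split / refine pipeline could be arranged. It assumes each utterance has already been reduced to a fixed-length vector (a stand-in for an i-Vector), uses plain Euclidean distance, and relies on two hypothetical thresholds, `merge_threshold` and `spread_threshold`. All function names are illustrative and are not taken from the paper, and the parallelization across CPUs is omitted.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def bottom_up(points, merge_threshold):
    """Pass 1 (assumed form): merge the two closest clusters until none are close."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        d, i, j = min(
            (dist(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > merge_threshold:
            break  # remaining clusters are all farther apart than the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def split(clusters, spread_threshold):
    """Pass 2 (assumed form): split each overly wide cluster once, into two parts."""
    out = []
    for c in clusters:
        cent = centroid(c)
        far = max(c, key=lambda p: dist(p, cent))
        if dist(far, cent) <= spread_threshold:
            out.append(c)
            continue
        # Seed the two halves with the farthest member and the member farthest from it.
        other = max(c, key=lambda p: dist(p, far))
        out.append([p for p in c if dist(p, far) <= dist(p, other)])
        out.append([p for p in c if dist(p, far) > dist(p, other)])
    return out


def refine(clusters, iterations=5):
    """Pass 3 (assumed form): k-means style reassignment to the nearest centroid."""
    points = [p for c in clusters for p in c]
    groups = [list(c) for c in clusters]
    cents = [centroid(g) for g in groups]
    for _ in range(iterations):
        groups = [[] for _ in cents]
        for p in points:
            nearest = min(range(len(cents)), key=lambda k: dist(p, cents[k]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]  # drop clusters that lost all members
        cents = [centroid(g) for g in groups]
    return groups


def cluster_utterances(points, merge_threshold, spread_threshold):
    """Run the three passes in sequence and return a list of clusters."""
    merged = bottom_up(points, merge_threshold)
    return refine(split(merged, spread_threshold))
```

For example, calling `cluster_utterances` on six 2-D points that form two tight, well-separated groups, with both thresholds set to `3.0`, returns two clusters of three points each. In a real system the cluster index would then serve as the pseudo speaker label for each utterance.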