Download ISI English article No. 112650
Persian translation of the article title

برآورد پاداش برای بهینه سازی سیاست گفتمان

English title
Reward estimation for dialogue policy optimisation
Article code: 112650
Publication year: 2018
Length: 20 pages (English PDF)
Source

Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)

Journal : Computer Speech & Language, Volume 51, September 2018, Pages 24-43

English keywords
Dialogue systems; Reinforcement learning; Deep learning; Reward estimation; Gaussian process; Active learning;

English abstract

Viewing dialogue management as a reinforcement learning task enables a system to learn to act optimally by maximising a reward function. This reward function is designed to induce the system behaviour required for the target application; for goal-oriented applications, this usually means fulfilling the user’s goal as efficiently as possible. However, in real-world spoken dialogue system applications, the reward is hard to measure because the user’s goal is frequently known only to the user. Of course, the system can ask the user whether the goal has been satisfied, but this can be intrusive. Furthermore, in practice, the accuracy of the user’s response has been found to be highly variable. This paper presents two approaches to tackling this problem. Firstly, a recurrent neural network is utilised as a task success predictor which is pre-trained from off-line data to estimate task success during subsequent on-line dialogue policy learning. Secondly, an on-line learning framework is described whereby a dialogue policy is jointly trained alongside a reward function modelled as a Gaussian process with active learning. This Gaussian process operates on a fixed-dimension embedding which encodes each varying-length dialogue. This dialogue embedding is generated in both a supervised and an unsupervised fashion using different variants of a recurrent neural network. The experimental results demonstrate the effectiveness of both off-line and on-line methods. These methods enable practical on-line training of dialogue policies in real-world applications.
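To illustrate the second approach described in the abstract, the following is a minimal sketch (not the paper's implementation) of a Gaussian-process reward estimator operating on fixed-dimension dialogue embeddings, with an active-learning rule that queries the user only when the predictive uncertainty is high. All class/function names, the RBF kernel choice, and the thresholds are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between embedding matrices A (n,d) and B (m,d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / length_scale ** 2)

class GPRewardEstimator:
    """Hypothetical GP reward model over dialogue embeddings (a sketch,
    not the authors' system)."""

    def __init__(self, noise=0.1, query_std=0.4):
        self.noise = noise          # observation noise on user-provided labels
        self.query_std = query_std  # ask the user when predictive std exceeds this
        self.X = np.empty((0, 0))   # dialogue embeddings labelled so far
        self.y = np.empty(0)        # success labels (+1 success, -1 failure)

    def add(self, x, label):
        x = np.atleast_2d(x)
        self.X = x if self.X.size == 0 else np.vstack([self.X, x])
        self.y = np.append(self.y, label)

    def predict(self, x):
        """Return (mean, std) of estimated task success for embedding x."""
        x = np.atleast_2d(x)
        if self.y.size == 0:
            return 0.0, 1.0         # uninformed prior before any labels
        K = rbf_kernel(self.X, self.X) + self.noise * np.eye(len(self.y))
        k = rbf_kernel(self.X, x)
        mean = float(k.T @ np.linalg.solve(K, self.y))
        var = float(rbf_kernel(x, x) - k.T @ np.linalg.solve(K, k))
        return mean, np.sqrt(max(var, 0.0))

    def reward(self, x, ask_user):
        """If uncertain, query the user (intrusive, so done sparingly);
        otherwise trust the GP posterior mean as the reward signal."""
        mean, std = self.predict(x)
        if std > self.query_std:
            label = ask_user()
            self.add(x, label)
            return label
        return mean
```

The point of the sketch is the active-learning trade-off the paper motivates: user queries are intrusive and noisy, so the model asks for a success label only where its posterior variance is large, and otherwise supplies an estimated reward for policy optimisation.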