Approximate dynamic programming with a fuzzy parameterization
| Article code | Publication year | English article | Persian translation | Word count |
|---|---|---|---|---|
| 25662 | 2010 | 11-page PDF | available on order | 10,490 words |
Publisher : Elsevier - Science Direct (الزویر - ساینس دایرکت)
Journal : Automatica, Volume 46, Issue 5, May 2010, Pages 804–814
Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control. Computing exact DP solutions is in general only possible when the process states and the control actions take values in a small discrete set. In practice, it is necessary to approximate the solutions. Therefore, we propose an algorithm for approximate DP that relies on a fuzzy partition of the state space, and on a discretization of the action space. This fuzzy Q-iteration algorithm works for deterministic processes, under the discounted return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained in a finite number of iterations is also derived. Under continuity assumptions on the dynamics and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it asymptotically obtains the optimal solution as the approximation accuracy increases. These properties hold both when the parameters of the approximator are updated in a synchronous fashion, and when they are updated asynchronously. The asynchronous algorithm is proven to converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is illustrated in a two-link manipulator control problem.
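To make the abstract's premise concrete — that exact DP solutions are computable only when states and actions take values in a small discrete set — the sketch below runs classical Q-iteration on a tiny, made-up deterministic MDP under the discounted return criterion. The transitions `f` and rewards `rho` are illustrative assumptions, not taken from the paper; only the Bellman backup itself is the standard algorithm.

```python
import numpy as np

# Toy deterministic MDP (made up for illustration): 3 states, 2 actions.
n_states, n_actions = 3, 2
f = np.array([[1, 2], [2, 0], [0, 1]])                 # f[x, u] = next state
rho = np.array([[0.0, 1.0], [0.0, 0.0], [5.0, 0.0]])   # rho[x, u] = reward
gamma = 0.9                                            # discount factor

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # Q-iteration backup: Q(x,u) <- rho(x,u) + gamma * max_u' Q(f(x,u), u')
    Q = rho + gamma * Q.max(axis=1)[f]

policy = Q.argmax(axis=1)                              # greedy policy from Q
```

Because the backup is a contraction with factor `gamma`, the iterates converge geometrically to the unique optimal Q-function; with continuous states this exact tabular representation is no longer available, which is what motivates the fuzzy approximation.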
Dynamic programming (DP) is a powerful paradigm for solving optimal control problems, thanks to its mild assumptions on the controlled process, which can be nonlinear or stochastic (Bertsekas, 2007; Bertsekas & Tsitsiklis, 1996). In the DP framework, a model of the process is assumed to be available, and the immediate performance is measured by a scalar reward signal. The controller then maximizes the long-term performance, measured by the cumulative reward. DP algorithms can be extended to work without requiring a model of the process, in which case they are usually called reinforcement learning (RL) algorithms (Sutton & Barto, 1998). Most DP and RL algorithms work by estimating an optimal value function, i.e., the maximal cumulative reward as a function of the process state and possibly also of the control action. Representing value functions exactly is only possible when the state-action space contains a relatively small number of discrete elements. In large discrete spaces and in continuous spaces, the value function generally has to be approximated. This is especially the case in automatic control, where the state and action variables are usually continuous. Therefore, this paper proposes an algorithm for approximate DP that represents state-action value functions (called Q-functions in the sequel) using a fuzzy rule base with singleton consequents (Kruse, Gebhardt, & Klawonn, 1994, Section 4.2). This algorithm works for deterministic problems, under the discounted return criterion. It is called fuzzy Q-iteration, because it combines the classical Q-iteration algorithm with a fuzzy approximator. The fuzzy rule base receives the state as input, and produces the Q-values of the discrete actions as outputs. The set of discrete actions is selected beforehand from the (possibly continuous) original action space. The membership functions of the fuzzy antecedents can also be seen as state-dependent basis functions or features (Bertsekas & Tsitsiklis, 1996).
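The approximator described above — a fuzzy rule base with singleton consequents, taking the state as input and producing the Q-values of the discrete actions as outputs — can be sketched as follows for a one-dimensional state. The uniform grid of MF centers and the three discrete actions are illustrative assumptions; the essential structure is that each approximate Q-value is a weighted sum of membership degrees times singleton parameters.

```python
import numpy as np

centers = np.linspace(-1.0, 1.0, 5)    # MF centers (assumed uniform grid)
h = centers[1] - centers[0]            # grid spacing
actions = np.array([-1.0, 0.0, 1.0])   # assumed discretized action set
theta = np.zeros((len(centers), len(actions)))  # one singleton per (MF, action)

def mf(x):
    """Triangular membership degrees phi_i(x); on a uniform grid the
    triangles overlap so that they sum to 1 inside the domain (a fuzzy
    partition of the state space)."""
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / h)

def q_hat(x, j, theta):
    """Approximate Q-value of discrete action u_j at state x:
    Q(x, u_j) = sum_i phi_i(x) * theta[i, j]  (singleton consequents)."""
    return mf(x) @ theta[:, j]
```

Viewed this way, the membership degrees `phi_i(x)` are exactly the state-dependent basis functions mentioned in the text, and the parameters `theta[i, j]` are the quantities that fuzzy Q-iteration updates.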
We show that fuzzy Q-iteration asymptotically converges to an approximate Q-function that lies within a bounded distance from the optimal Q-function. The suboptimality of the Q-function obtained after a finite number of iterations is also bounded. Both of these Q-functions lead to policies with a bounded suboptimality. We also show that fuzzy Q-iteration is consistent: under appropriate continuity assumptions on the process dynamics and on the reward function, the approximate Q-function converges to the optimal one as the approximation accuracy increases. These properties hold both when the parameters of the approximator are updated in a synchronous fashion, and when they are updated asynchronously. Additionally, the asynchronous algorithm is proven to converge at least as fast as the synchronous one. In a simulation example, fuzzy Q-iteration is used to control a two-link manipulator, and compared with the state-of-the-art fitted Q-iteration algorithm (Ernst, Geurts, & Wehenkel, 2005). The remainder of this paper is structured as follows. Section 2 gives a brief overview of the literature related to our results. Section 3 describes Markov decision processes and the Q-iteration algorithm. Section 4 introduces fuzzy Q-iteration. This novel algorithm is analyzed in Section 5, and applied to a two-link manipulator example in Section 6. Section 7 concludes the paper and outlines some ideas for future work.
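The synchronous/asynchronous distinction discussed above can be sketched in code: the synchronous variant computes every new parameter from the old parameter vector, while the asynchronous variant overwrites parameters in place, so later updates already see refreshed values. The one-dimensional dynamics `f` and reward `rho` below are made-up toy assumptions, not the paper's two-link manipulator problem.

```python
import numpy as np

gamma = 0.95
centers = np.linspace(-1.0, 1.0, 5)    # MF centers (assumed uniform grid)
h = centers[1] - centers[0]
actions = np.array([-0.1, 0.0, 0.1])   # assumed discretized action set

def mf(x):
    """Triangular membership degrees over the MF centers (sum to 1 in-domain)."""
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / h)

def f(x, u):
    """Toy deterministic dynamics (illustrative assumption)."""
    return np.clip(x + u, -1.0, 1.0)

def rho(x, u):
    """Toy reward penalizing distance from the origin (illustrative assumption)."""
    return -x ** 2

def sync_iteration(theta):
    """Synchronous update: all theta[i, j] computed from the old theta."""
    new = np.empty_like(theta)
    for i, x in enumerate(centers):
        for j, u in enumerate(actions):
            xn = f(x, u)
            new[i, j] = rho(x, u) + gamma * np.max(mf(xn) @ theta)
    return new

def async_iteration(theta):
    """Asynchronous update: theta is refreshed in place, so later backups
    already use the new values (proven to converge at least as fast)."""
    for i, x in enumerate(centers):
        for j, u in enumerate(actions):
            xn = f(x, u)
            theta[i, j] = rho(x, u) + gamma * np.max(mf(xn) @ theta)
    return theta

theta = np.zeros((len(centers), len(actions)))
for _ in range(300):
    theta = sync_iteration(theta)
```

Both variants iterate a contraction and therefore share the same fixed point; they differ only in how quickly the updates propagate information across the parameters.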
Conclusion
In this paper, we have considered fuzzy Q-iteration, an approximate dynamic programming algorithm that represents state-action value functions using a fuzzy partition of the state space and a discretization of the action space. The algorithm was shown to converge to a near-optimal solution, and to be consistent under continuity assumptions on the dynamics and the reward function. The performance of fuzzy Q-iteration was illustrated in an example involving the control of a two-link manipulator, in which fuzzy Q-iteration also compared favorably with fitted Q-iteration. An important next step is extending fuzzy Q-iteration to stochastic problems; Section 4 indicated some possible approaches to this extension. The fuzzy approximator influences both the computational complexity of fuzzy Q-iteration and the accuracy of the solution. While in this paper the membership functions (MFs) were assumed to be given a priori, a technique that constructs, for a given accuracy, an approximator with a small number of MFs would be very useful in practice. Furthermore, action-space approximators more powerful than discretization could be studied, e.g., approximators based on a fuzzy partition of the action space.