# انتخاب تقریبی و طراحی تابع جریمه برای یک روش کنترل مبتنی بر برنامه ریزی پویا تقریبی

کد مقاله | سال انتشار | مقاله انگلیسی | ترجمه فارسی | تعداد کلمات |
---|---|---|---|---|

24938 | 2006 | 22 صفحه PDF | سفارش دهید | 11540 کلمه |

**Publisher :** Elsevier - Science Direct (الزویر - ساینس دایرکت)

**Journal :** Journal of Process Control, Volume 16, Issue 2, February 2006, Pages 135–156

#### چکیده انگلیسی

This paper investigates the choice of function approximator for an approximate dynamic programming (ADP) based control strategy. The ADP strategy allows the user to derive an improved control policy given a simulation model and some starting control policy (or alternatively, closed-loop identification data), while circumventing the ‘curse-of-dimensionality’ of the traditional dynamic programming approach. In ADP, one fits a function approximator to state vs. ‘cost-to-go’ data and solves the Bellman equation with the approximator in an iterative manner. A proper choice and design of function approximator is critical for convergence of the iteration and the quality of final learned control policy, because an approximation error can grow quickly in the loop of optimization and function approximation. Typical classes of approximators used in related approaches are parameterized global approximators (e.g. artificial neural networks) and nonparametric local averagers (e.g. k-nearest neighbor). In this paper, we assert on the basis of some case studies and a theoretical result that a certain type of local averagers should be preferred over global approximators as the former ensures monotonic convergence of the iteration. However, a converged cost-to-go function does not necessarily lead to a stable control policy on-line due to the problem of over-extrapolation. To cope with this difficulty, we propose that a penalty term be included in the objective function in each minimization to discourage the optimizer from finding a solution in the regions of state space where the local data density is inadequately low. A nonparametric density estimator, which can be naturally combined with a local averager, is employed for this purpose.

#### مقدمه انگلیسی

In dynamic programming (DP), one computes an optimal policy off-line by solving the ‘Bellman equation’ [1]. The objective of DP is to obtain the optimal ‘cost-to-go’ function, which is a map between a starting state and minimum achievable total cost. The optimal cost-to-go information allows us to calculate an optimal control decision for any given system state without having to consider the system’s dynamic behavior in a long future time horizon. Hence it can potentially reduce the large on-line computational burden associated with the math programming based approaches like model predictive control (MPC). In addition, unlike MPC, DP can be used to derive an optimal feedback policy for an uncertain system. However, the approach leads to an exponential growth in the computation with respect to the state dimension, and the computational and storage requirements for all problems of practical interest remain unwieldy even with today’s computing hardware. Whereas the control community was quick to discard the DP approach as a viable approach, researchers in the fields of artificial intelligence (AI) and machine learning (ML) have explored the possibility of applying the theories of psychology and animal learning to solve DP problems [2]. These efforts were made under the umbrella of reinforcement learning (RL) [3] and neuro-dynamic programming (NDP) [4]. Though RL and NDP are established techniques for robotics and operations research problems, it is difficult to apply them directly to process control problems for several reasons. Most RL algorithms improve the cost-to-go function on-line by trial and error. This is potentially costly and risky for process control because one cannot, for example, operate a chemical reactor in a random manner just to explore the state space and learn. Though NDP algorithms are based on off-line update of cost-to-go function, their main premise is that a very large set of data densely populating all relevant parts of the state space for control are available. A typical chemical process, however, has a huge state space and it is unrealistic to attempt to populate all regions of the state space that may potentially come into play during operation. Furthermore, both the RL and NDP approaches assume that a failure in learning merely means a need for new data through additional exploration rather than a catastrophic situation. Finally, their common update rules based on a ‘temporal difference’ term are suited for “discrete” state space rather than a continuous space common for process control problems [4]. Inspired by the work done in the RL/NDP community, we have recently introduced an approximate dynamic programming (ADP) strategy suited for process control problems [5] and [6]. Our approach is to solve the Bellman equation for those state points sampled from closed-loop simulations or experiments and to obtain improved control policies through ‘value-’ or ‘policy iteration’. Since typical process control problems involve a continuous state space, a function approximator is used to estimate a cost-to-go value for any given state. In our previous work, we have shown the efficacy of the approach through applications to several nonlinear process control problems [6], [7], [5] and [8]. One important lesson learned from our experience with the approach is that stability of learning and quality of a learned control policy are critically dependent on the structure of the function approximator. In this paper, the stability for off-line learning means that the infinity norm of the difference between cost-to-go values of current and previous rounds of iteration continues to converge within any specified tolerance value. The typical approach in the NDP and RL literature is to fit a global approximator like a neural network to cost-to-go data. While this approach has seen some notable successes in certain applications (e.g. a backgammon player at a world champion level [9]), it has also met with failures in many other applications, for example, due to the problems of local convergence and overfitting [10] and [11]. In certain instances, the off-line iteration would even fail to converge, with the cost-to-go approximation showing extreme nonmonotonic behavior or instability with the iteration. The failure with a general function approximator was first explained by Thrun and Schwartz [12] with what they called an “overestimation” effect. They assumed uniformly distributed and independent error in the approximation and derived bounds on the necessary accuracy of the function approximator and discount factor. Sabes [13] showed that bias in optimality can be large when a global approximator with a linear combination of basis functions is employed. Boyan and Moore [14] listed several simple simulation examples where popular approximators fail miserably in off-line learning. Sutton [15] modified the experimental setup for the same examples and adopted a model-free on-line learning scheme to make them work. In summary, experiments with global function approximation schemes have produced mixed results, which are consistent with our experience. Gordon [16] presented a ‘stable’ cost-to-go learning scheme with off-line iteration for a fixed set of states. A class of function approximators with a ‘nonexpansion’ property (e.g. k-nearest neighbor) was shown to guarantee off-line convergence of cost-to-go values during the value iteration. Gordon also provided a result for the accuracy of converged cost-to-go values in the case of 1-nearest neighbor, which has a fixed point property in the approximation. Tsitsiklis and Van Roy [17] provided a proof of convergence and its accuracy for linear function approximators when applied to finite MDPs under temporal difference learning with a particular on-line state sampling scheme. They commented that convergence properties with general nonlinear function approximators (e.g. neural network) remained unclear. For systems with a continuous state space (infinite MDP), there are no proofs for convergence in the learning of cost-to-go function with general function approximators. For linear quadratic regulation problems, there exist proofs of convergence and its accuracy bounds for continuous state-action space problems with specific approximator structures [18]. Recently, a kernel-based local averaging structure was shown to have a property of convergence to optimal cost-to-go with increasing number of samples and decreasing kernel bandwidth under a certain model-free learning scheme [19] and [20]. In this paper we show that, even for problems involving continuous state and action spaces, the nonexpansion property guarantees stable learning in the off-line value iteration step of our ADP strategy. We back up this assertion with a theoretical result and several case studies. It still remains to be ascertained that the converged cost-to-go indeed leads to optimal or at least improved closed-loop performance. In this regard, we show that simply using a local averager does not necessarily guarantee closed-loop stability and performance, especially for a system with high state/input dimension. Analysis of the failed cases will show that a safeguard against ‘over-extrapolation’ is needed for the approach to be successful. This problem is fundamental in that high dimensionality, complex nonlinear dynamics, and limited opportunities for experimentation often lead to sparse training data sets. One noteworthy work dealing with the issue of excessive extrapolation in the context of cost-to-go function approximation is found in Smart and Kaelbling [21]. They construct an approximate convex hull, which is called independent variable hull (IVH) taking an elliptic form. Their scheme finds a fixed number of training data that lie closer than some threshold value from a query point and builds the IVH. Whenever they have to estimate cost-to-go for a query point, the IVH is calculated and the query point is checked whether it lies inside the hull or not. Any queries within the convex hull are considered to be reliable and those outside are deemed unreliable. This approach is not computationally attractive. It also gives up making a prediction for the point outside the hull and requires more random exploration. In addition, the design of a convex hull may be misleading if the elliptic hull contains a significant empty region around the query point. One plausible approach to deal with this problem is to calculate and use information on the local data density in solving the minimization appearing in the Bellman equation. To this end, we propose to use a probability density estimator proposed by Parzen [22] for estimation of local data density. The density estimate is then translated into a quadratic penalty term added to the cost-to-go estimate. The penalty term discourages the optimizer from finding a solution (i.e., a successor state) in regions where the training data density is inadequately low. This way we can make the off-line iteration steps and on-line control calculations more robust against potential approximation errors from insufficient sampling. The rest of the paper is organized as follows. In Section 2, we explain the basic ADP strategy suited for process control problems and discuss a class of local approximators with the nonexpansion property. In Section 3, our criteria for evaluating different types of approximators are first provided, and then several case studies are presented to compare the performances of global approximators vs. local approximators. Section 4 presents the design of a penalty function based on a multi-dimensional Parzen density estimator and discusses the modified ADP strategy that makes use of the penalty function. In Section 5, we revisit the failed case study of Section 3, and show that the modified ADP strategy is able to provide good performance. Finally, Section 6 offers some conclusions.

#### نتیجه گیری انگلیسی

In this paper, we first investigated the choice of approximators for the approximate dynamic programming strategy for process control. Specifically, we compared a parameterized global approximator and a nonparameterized local averager with the nonexpansion property in terms of their off-line convergence behavior and eventual on-line performance. Some theoretical analysis as well as simulation results for three process control problems let us conclude that local averagers result in more predictable off-line convergence behavior and often better on-line performance. Though global and parametric representation such as artificial neural networks has been used successfully as a cost-to-go function approximator in many reported applications, training data should be large and densely occupy the state space, which is rare for process control problems. Despite the nice behavior of the local averagers in the off-line learning phase, it is shown that its use does not automatically guarantee an acceptable on-line performance, especially in the case that learning data is sparse in the state space. This is because uncontrolled extrapolations can deteriorate the quality of cost-to-go approximations since the learned cost-to-go information is valid only in the regions of the state space where learning data were available—a fact that is obvious but not always given attention to in applications. To address the issue, a penalty function based approach was proposed. In the minimization for off-line value iteration and on-line control, a penalty function based on an estimate of local data density was added to systematically bias the search to those regions of acceptable data density, thereby “trapping” the solution within them. The modification provided a control policy that regulated the system successfully and significantly improved on-line performances compared to the starting control policy. Simulations with different scenarios and control policies will enlarge the domain of solution and help improve the solution significantly. We finally note that the suggested approach in this paper requires either a fundamental or an empirical model for learning and using a cost-to-go function. The suggested scheme, however, can be extended to a case where any type of model is unavailable by including an action vector into the argument of a cost-to-go function. Readers who are interested in the model-free ADP approach are referred to Lee and Lee [35].