Maximum Reward Reinforcement Learning: A Non-cumulative Reward Criterion
Publisher: Elsevier - Science Direct
Journal : Expert Systems with Applications, Volume 31, Issue 2, August 2006, Pages 351–359
Existing reinforcement learning paradigms in the literature are guided by two performance criteria, namely the expected cumulative reward and the average reward criteria. Both criteria assume that rewards are inherently cumulative, or additive. However, such additivity of rewards is not a necessity in some contexts. Two possible scenarios are presented in this paper, summarized as follows. The first concerns learning an optimal policy that lies further away when a sub-optimal policy lies nearer. Cumulative-reward paradigms suffer from slower convergence in this setting, because accumulating the lower rewards of the sub-optimal policy drags the estimates down, and its effect takes time to fade away. The second scenario concerns approximating the supremum payoff values of an optimal stopping problem. The payoffs are non-cumulative in nature, so the cumulative-reward paradigm is not applicable. Hence, a non-cumulative reward reinforcement-learning paradigm is needed in these application contexts. A maximum reward criterion is proposed in this paper, and the resulting reinforcement-learning model with this learning criterion is termed maximum reward reinforcement learning. It addresses the learning of non-cumulative reward problems, where the agent exhibits a maximum reward-oriented behavior towards the largest rewards in the state space. Intermediate lower rewards that lead to sub-optimal policies are ignored in this learning paradigm. The maximum reward reinforcement learning is subsequently modeled with the FITSK-RL model. Finally, the model is applied to an optimal stopping problem with non-cumulative rewards, and its performance is encouraging when benchmarked against other models.
Reinforcement learning is a trial-and-error learning method, where the agent explores and tries to maximize the numerical state values from the feedback it receives over the long run. The process that the agent follows is modeled as a Markov decision process (MDP). In the MDP context, the two most well-studied optimality criteria are the expected cumulative reward and the average reward criteria (Puterman, 1994). The expected cumulative reward criterion has been adopted as the de facto objective measure in reinforcement learning research (Bertsekas & Tsitsiklis, 1996; Kaelbling, Littman & Moore, 1996; Rummery & Niranjan, 1994; Sutton, 1984, 1988; Sutton & Barto, 1998; Watkins, 1989). This criterion exponentially discounts future rewards, so that near-future rewards are more valuable than far-future rewards. However, some research has demonstrated that the expected cumulative reward criterion is unsuitable for some problems, particularly in the undiscounted reward domain (Schwartz, 1993). Reinforcement learning methods based on the average reward criterion have been proposed to resolve the undiscounted reward reinforcement learning problem (Mahadevan, 1996; Schwartz, 1993; Tadepalli & Ok, 1998). In essence, both classes of methods employ some form of cumulative rewards criterion: the former discounts rewards so that their infinite cumulative sums are finite, and the latter averages rewards so that short- and long-term rewards are indistinguishable. However, a central issue is whether such a cumulative rewards formulation is a necessary requirement in all learning contexts. Are there problems for which the cumulative rewards formulation is unnecessary, or even unjustifiable? This paper discusses two scenarios where the cumulative reward formulation should be avoided for a superior learning result.
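For concreteness, the expected cumulative (discounted) reward criterion leads to the familiar tabular Q-learning update of Watkins (1989). The sketch below is illustrative; the function and parameter names are our own, not the paper's:

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update under the expected cumulative
    (discounted) reward criterion:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Note the target is a SUM of the immediate reward and the discounted
    future value -- rewards are accumulated along the trajectory."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Because every reward along the path is added into the target, small intermediate rewards on a sub-optimal cycle directly inflate the value estimates of that cycle, which is the source of the slow convergence discussed next.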
This is demonstrated in Sections 3 and 4. In Section 3, a two-cycle task from (Mahadevan, 1996) is discussed. One cycle leads to a sub-optimal policy and the other to an optimal policy, but the reward of the optimal policy is further away. The focus of the learning is therefore on how to bypass the sub-optimal policy in the search for the optimal one. Classical Q-learning (Watkins, 1989) is demonstrated to suffer from this problem: it is dragged down by the sub-optimal policy, causing it to lag in switching to the better policy. In Section 4, an optimal stopping problem arising from financial derivative pricing (Tsitsiklis & Van Roy, 1999) is discussed. This is a case where the cumulative rewards formulation is unjustifiable. Instead, the problem is concerned with finding the supremum of the payoff values of each state, which is a function of the expected discounted maximum rewards in a non-cumulative manner. Furthermore, a discounted reward is significant in this domain, as the discounting operation compensates future rewards with respect to the drift rate of financial pricing. This contradicts the claim of Schwartz (1993) that discounting of rewards is not necessarily important. Accordingly, a reinforcement learning formulation with a non-cumulative maximum reward criterion is proposed to accommodate the learning needs of such tasks. The maximum reward reinforcement-learning agent exhibits a maximum reward-oriented behavior: it searches for the policy that leads to the largest reward, regardless of intermediate lower rewards along the path. This gives it superior immunity to the existence of sub-optimal policies in the problem domain, such as the two-cycle problem in Section 3. Furthermore, the maximum reward learning is demonstrated to be effective in approximating the future payoff in a financial pricing context.
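The paper's precise update formulae are given in its Section 2; as a plausible sketch of the non-cumulative idea, the summed target of Q-learning can be replaced by the maximum of the immediate reward and the discounted best future value. The form below is an assumption for illustration, not the authors' exact formula:

```python
def max_reward_q_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Hypothetical non-cumulative update: the target is the LARGER of the
    immediate reward and the discounted best future value, so a large distant
    reward is not diluted by small intermediate rewards along the path."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = max(r, gamma * best_next)  # max, not sum
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

Under this target, a state one step away from a large reward inherits that reward's discounted value even if the immediate reward on the way is small, which is exactly the immunity to sub-optimal cycles described above.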
The learning system exhibits an asymptotic behavior of approximating the supremum value under the maximum reward formulation, as will be discussed in Section 4. This paper is organized as follows. Section 2 describes the general formulation of the maximum reward reinforcement learning and its update formulae as part of the modular design of the generic reinforcement-learning framework proposed earlier in (Quah & Quek, submitted for publication). Section 3 studies the learning behavior of the maximum reward reinforcement learning on a two-cycle task and compares it against two existing reinforcement-learning models. Section 4 applies the proposed maximum reward reinforcement learning, with its generic learning framework, to an optimal stopping problem arising from the financial derivative pricing context. Section 5 concludes the paper.
Conclusion
In this paper, a new maximum reward criterion is proposed as the objective of reinforcement learning. The maximum reward criterion is incorporated into the reinforcement learning update formulae from the Markov decision process perspective. Subsequently, the maximum reward reinforcement learning is mapped to the generic reinforcement-learning framework to solve the modelling issue of this new learning paradigm. The distinct characteristic of the maximum reward reinforcement learning is the non-cumulative, maximum reward-oriented behaviour of the learning agent, as opposed to the usual cumulative rewards-oriented behaviour. The agent searches for the policy that leads to the largest reward, regardless of intermediate lower rewards. This gives it superior immunity to the existence of sub-optimal policies, as demonstrated by the example in Section 3, where the maximum reward agent converged to the optimal policy in the shortest timeframe. By contrast, agents with an underlying cumulative rewards-oriented behaviour were dragged down by the sub-optimal policy and its lower rewards, and consequently converged to the optimal policy more slowly. Section 4 demonstrates a successful application of maximum reward reinforcement learning to asymptotically approximate the supremum of payoff values in an optimal stopping problem. The maximum reward FITSK-RL model proposed in Section 2.3 is adopted to resolve this problem, which cannot be effectively solved under the usual cumulative rewards reinforcement-learning paradigm because the underlying payoff values are non-cumulative. Despite reducing the problem to only two state dimensions, the approximation result of the FITSK-RL model is slightly superior to the original formulation with 100 state dimensions in (Tsitsiklis & Van Roy, 1999).
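To see why the stopping payoffs are non-cumulative: the value of a state in an optimal stopping problem is the larger of the immediate payoff (stop now) and the discounted expected continuation value, not their sum. A minimal value-iteration sketch of this recursion, with an illustrative toy state space (the representation and names are assumptions, not the paper's FITSK-RL formulation):

```python
def stopping_values(payoff, transitions, gamma=0.95, iters=500):
    """Value iteration for an optimal stopping problem:
        V(s) = max( payoff(s), gamma * E[V(s')] )
    i.e. stop and collect the payoff, or continue and discount the future.
    `transitions[s]` is a list of (probability, next_state) pairs."""
    V = {s: 0.0 for s in transitions}
    for _ in range(iters):
        V = {s: max(payoff[s],
                    gamma * sum(p * V[s2] for p, s2 in transitions[s]))
             for s in transitions}
    return V
```

The max operation in the recursion is exactly the non-cumulative structure that the maximum reward criterion targets: no reward is ever added to another along the trajectory.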
Furthermore, this improves the interpretability of the learning results and reduces the modelling complexity of the problem. The approximation result can be visualized so that its rationale can be examined with ease. This would not be possible with the original formulation of the problem, as the state dimensions are too large and it is hard to gain insight into the learning result of the manually defined local basis functions. It is possible with the FITSK-RL model because its operations are systematically defined and the state dimensions are small. In summary, maximum reward reinforcement learning is proposed as a possible new learning paradigm that exhibits superior performance on problems with an underlying non-cumulative reward structure, as illustrated by the examples and simulations provided. The modeling issue of the maximum reward reinforcement learning is solved by mapping it into the generic reinforcement learning framework. However, the theoretical research on maximum reward reinforcement learning is still in its infancy, particularly from the optimal control perspective. The convergence of maximum reward reinforcement learning with a linear function approximator is yet to be established; an analogous convergence result may possibly be extended from Tsitsiklis' papers (Tsitsiklis & Van Roy, 1997; Tsitsiklis & Van Roy, 1999), which justifies further research. The relationship to the sensitive discount optimality criterion (Mahadevan, 1996; Puterman, 1994) has not been studied, and is worthy of further research effort.