Download English-language ISI Article No. 25692
Article Title

Approximate dynamic programming for an inventory problem: Empirical comparison
Article Code | Publication Year | Length
25692 | 2011 | 25 pages (PDF)
Source

Publisher: Elsevier - Science Direct

Journal : Computers & Industrial Engineering, Volume 60, Issue 4, May 2011, Pages 719–743

Keywords
Approximate dynamic programming, Inventory control, Reinforcement learning, Simulation, Heterogeneity, AR(1)/GARCH(1,1)

Abstract

This study investigates the application of learning-based and simulation-based Approximate Dynamic Programming (ADP) approaches to an inventory problem under the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model. Specifically, we explore the robustness of a learning-based ADP method, Sarsa, with a GARCH(1,1) demand model, and provide an empirical comparison between Sarsa and two simulation-based ADP methods: Rollout and Hindsight Optimization (HO). Our findings assuage a concern regarding the effect of GARCH(1,1) latent state variables on learning-based ADP and provide practical strategies for designing an appropriate ADP method for inventory problems. In addition, we expose a relationship between ADP parameters and conservative control behavior. Our empirical results are based on a variety of problem settings, including different demand correlations, demand variances, and cost structures.

Introduction

Inventory management is one of the major business activities. A well-managed inventory can help a business stay competitive by keeping its cash flow at a controllable level. Recently, Zhang (2007) analyzed data from the Census Bureau's M3 survey and established the appropriateness of the GARCH(1,1) model for inventory demand. He also showed that inventory costs may increase significantly when the GARCH(1,1) model is not accounted for, and he analytically developed an order-up-to-level policy for a problem without a setup cost. However, his solution is highly problem-specific: a change in the structure of the problem requires reanalysis. For example, a setup cost is common in practice, but it is not included in Zhang (2007). Including a setup cost makes the cost function highly nonlinear and renders an analytical approach very difficult. Furthermore, an analytical approach is time consuming and requires highly specialized skill and effort, as discussed in Kleinau and Thonemann (2004) and Jiang and Sheng (2009). Inventory problems appear in various forms, and their forms often change over time; an analytical approach is therefore not well suited to practical inventory problems. Many articles, e.g., Silver (1981), Lee and Billington (1992), and Bertsimas and Thiele (2006), have addressed the need for an efficient, flexible inventory solution that is simple to implement in practice.

Approximate Dynamic Programming (ADP) is a method for solving practical Markov decision problems. It does not rely on an analytical treatment of the problem or on hard-to-obtain information such as transition probabilities, and it has therefore received extensive attention in many control applications, as discussed by Werbos (2004). ADP uses an approximation technique, usually realized through either a learning-based or a simulation-based scheme. A learning-based ADP method uses a learning technique, e.g., temporal difference learning, to approximate state-action costs; a simulation-based ADP method uses simulation, e.g., Monte Carlo simulation, for the same purpose.

Most previous studies of ADP applied to inventory management focused on learning-based ADP; see, e.g., Van Roy et al. (1997), Giannoccaro and Pontrandolfo (2002), Shervais et al. (2003), Kim et al. (2005), Kwon et al. (2008), Kim et al. (2008), Chaharsooghi et al. (2008), and Jiang and Sheng (2009). Only Choi, Realff, and Lee (2006) investigated simulation-based ADP. Nonetheless, the two schemes are not mutually exclusive; for example, Kim et al. (2008) used simulation to evaluate the consequences of actions not taken in order to accelerate the learning process in their learning-based ADP. Although all of these studies showed satisfying results on their domain problems, to the best of our knowledge no previous work has investigated an ADP application to a problem with the GARCH(1,1) model. GARCH(1,1) introduces two latent state variables, which are inadvertently left out if the GARCH(1,1) model is not accounted for. This poses a challenge to the model-free property of a learning-based ADP method.
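To make the latent structure concrete, the following minimal Python sketch simulates demand with an AR(1) mean equation and GARCH(1,1) innovations. The parameter values, the normality of the shocks, and the truncation at zero are illustrative assumptions rather than the paper's specification; `nu0` echoes the paper's GARCH constant ν0. The point is that the conditional variance h_t and the shock ε_t evolve recursively and remain latent unless the GARCH model is explicitly accounted for.

```python
import numpy as np

def simulate_ar1_garch11(T, mu=100.0, a1=0.4, nu0=70.0, alpha=0.1,
                         beta=0.8, seed=0):
    """Simulate T periods of AR(1)/GARCH(1,1) demand.

    The recursion carries two latent state variables: the conditional
    variance h_t and the last shock eps_t.  All parameter values here
    are illustrative, not those of the paper.
    """
    rng = np.random.default_rng(seed)
    d = np.empty(T)
    h = nu0 / (1.0 - alpha - beta)            # start at unconditional variance
    eps, d_prev = 0.0, mu
    for t in range(T):
        h = nu0 + alpha * eps ** 2 + beta * h      # latent variance update
        eps = np.sqrt(h) * rng.standard_normal()   # latent shock
        d[t] = mu + a1 * (d_prev - mu) + eps       # AR(1) mean dynamics
        d_prev = d[t]
    return np.maximum(d, 0.0)   # truncate at zero (a practical assumption)

demand = simulate_ar1_garch11(60)
```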
When a model of the problem is available, a simulation-based ADP method is another alternative, intermediate between an analytical approach and a learning-based ADP method: an analytical approach requires a model of the problem and the development of an analytical solution; a learning-based ADP method requires minimal knowledge of the problem; and a simulation-based ADP method requires a model of the problem but, rather than relying purely on analysis, uses simulation to provide information for action selection. Two simulation-based ADP methods are investigated here: Rollout and Hindsight Optimization (HO). Although Rollout has been used in other applications, it had been investigated for inventory problems in only one prior study, and HO, as discussed by Chong, Givan, and Chang (2000), had not previously been studied for an inventory problem. Nor had the performance of these simulation-based ADP methods been compared with that of a learning-based ADP method. Our study is intended to fill this gap. It investigates the application of ADP to an inventory problem with AR(1)/GARCH(1,1) demand and provides an empirical comparison among Sarsa, Rollout, and HO. The findings provide a practical approach to designing an ADP method for an inventory problem and insight into the relations among ADP components, performance, and control behavior. Furthermore, the results reaffirm the model-free property of a learning-based ADP method even in the presence of the latent state variables introduced by GARCH(1,1). The understanding exposed in this paper should help promote efficient inventory management and aid the transfer of inventory research into practice.

This section has provided an overview of our study. Section 2 reviews the related literature. Section 3 establishes the general background of the three methods. Section 4 explains the problem under investigation, how our empirical study is conducted, how each controller is set up, how the experiments are performed, and how the experimental results are evaluated. Section 5 explains what the experimental results indicate. The last section presents conclusions and further discussion.

Conclusion

Regarding the performance of ADP methods under different scenarios, Fig. 4 and Fig. 5 indicate the following.

• As the demand correlation decreases, all ADP methods perform much better than the Look-Ahead method. The improvement settles around 55% for zero to negative correlation.
• As the demand variance decreases, Sarsa performs worse relative to the Look-Ahead method; at low demand variance, the two perform similarly. This may be explained by the better performance of the Look-Ahead method: its future projection uses zero demand noise, which provides a good approximate cost when variance is low.
• Across different setup cost ratios, the performance of Sarsa is roughly constant.
• As the penalty cost ratio increases, Sarsa performs better relative to the Look-Ahead method.
• As the holding cost ratio increases, Sarsa and HO perform substantially worse than the Look-Ahead method, while Rollout remains comparable to the Look-Ahead method even at a high holding cost ratio. Since the costs of holding inventory and of stocking out are then the same, a high holding cost ratio (h/b = 1) may confuse Sarsa's learning mechanism and lead to substantially worse performance.

Both Sarsa (with all state variables) and Sarsa without the GARCH(1,1) state variables perform well compared to the Look-Ahead method when the holding cost is low relative to the penalty cost. Regarding the significance of the GARCH model when applying learning-based ADP, the performance of Sarsa with all state variables and of Sarsa without the GARCH(1,1) state variables is not significantly different. This result assuages the concern about learning-based ADP for problems with latent state variables. The explanation may be that the GARCH(1,1) state variables have no direct effect on a period cost (Eq. (16)); they only give extra information about the uncertainty in the variance of the demand forecast. The robustness of Sarsa with some information missing, shown in our experiments, confirms the observation of Csáji and Monostori (2008) that temporal difference learning, the learning technique used in Sarsa, can tolerate inaccurate information to some extent. However, the good performance of Zhang's solution (in problems without setup costs) and of Rollout (in all scenarios) indicates the significance of the GARCH model in inventory management and supports Zhang's claim.

Sarsa's ability to tolerate some missing information is beneficial and can be exploited: including only the more relevant state variables allows the use of a smaller RBF, and Sarsa with a smaller RBF requires less computational effort. Bertsekas and Tsitsiklis (1996) support this exploitation within the more general concept of feature extraction, an ADP state preparation technique in which the raw state variables are preprocessed to extract features that represent the more important aspects of state; these features are then used as ADP state variables.

To address the issue of function approximation, the evenly distributed structure and the midpoint strategy were used and were effective in setting up the RBF. Together they provide a systematic approach to designing an RBF for an ADP application to inventory problems; a minimal sketch follows.
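The Python sketch below pairs evenly spaced Gaussian RBF features with a semi-gradient Sarsa update for a cost-minimizing Q function. The one-dimensional state, the center grid, the width rule (half the spacing between neighboring centers, one reading of the midpoint strategy), and the step-size and discount values are assumptions for illustration, not the paper's settings.

```python
import numpy as np

# Evenly spaced RBF centers over a one-dimensional state (e.g., an
# inventory level); range, center count, and width rule are illustrative.
centers = np.linspace(0.0, 200.0, 11)
width = (centers[1] - centers[0]) / 2.0

def rbf_features(x):
    """Gaussian RBF feature vector for a scalar state x."""
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

def sarsa_update(w, x, a, cost, x_next, a_next, gamma=0.99, lr=0.05):
    """One semi-gradient Sarsa step for a cost-minimizing Q function.

    w is a (num_actions, num_features) weight matrix; Q(x, a) is
    approximated by w[a] @ rbf_features(x).
    """
    phi, phi_next = rbf_features(x), rbf_features(x_next)
    td_error = cost + gamma * (w[a_next] @ phi_next) - w[a] @ phi
    w[a] += lr * td_error * phi     # move Q(x, a) toward the TD target
    return w

def greedy_action(w, x):
    """Pick the action with the lowest approximate cost-to-go."""
    return int(np.argmin(w @ rbf_features(x)))
```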
Pertinent to a comparison among the different methods, Rollout performs significantly better than the Look-Ahead method, Sarsa, and HO in most cases. HO results in lower costs (aggregate costs over Periods 13–60) than Sarsa in most cases, but only in a few cases (demand correlations a1 = 0 and −0.8) can significance tests confirm the differences.

Rollout was previously investigated by Choi et al. (2006), who used heuristic search over sets of pre-specified (s, S) policies as Rollout's base policy. That heuristic search is computationally expensive compared to the EOQ-parameterized (s, S) policy used in our study. In addition to a base policy, implementing Rollout requires selecting its parameter values. Rollout has two parameters: the number of iterations N and the simulation horizon T (a sketch of how both enter the algorithm follows this discussion). Rollout with N = 1 shows a monotonic downward trend in the relation between the simulation horizon length and the resulting average total cost; this trend is not apparent for N > 1. To examine this behavior, the average on-site inventory x, the average period cost, and the maximum period cost were plotted, as shown in Fig. 8 (based on the Problem 1 experiment). For each N (N = 1, 5, 10, and 50, by row), Fig. 8 plots the average on-site inventory x versus T (left column), the average single-period cost versus T (middle column), and the maximum single-period cost versus T (right column).

Fig. 8. Average inventory and average and maximum single-period cost for each Rollout setting, plotted against the number of periods T of Rollout simulation.

The plots of average on-site inventory show an upward trend for every N. The average cost when Rollout has a lower N, and the maximum cost when Rollout has a higher N, show a more noticeable downward trend. The explanation may lie in the role the combination of the algorithmic parameters N and T plays in approximating the state-action cost-to-go. With few iterations (small N), Rollout may not see many rare demand surges, and a longer horizon (large T) promotes an averaging effect, which results in better decisions for the average case. With more iterations and a longer horizon (large N and T), Rollout has a better chance of seeing rare demand surges, which, even though rare, can substantially affect the total cost when they occur. This may drive Rollout to choose more conservative actions, such as stocking a higher inventory level, to safeguard against any chance of unusually high demand; it therefore incurs a higher average cost but a lower maximum cost. This finding on the relationship between the Rollout parameters and conservative behavior is new and needs further study to confirm its nature and to formulate the relationship more precisely. Understanding it may contribute to new ADP designs that explicitly incorporate a factor to control conservative behavior, or to a systematic strategy for determining ADP parameters that balances average and worst-case performance.
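The following Python sketch shows where the base policy and the two parameters enter Rollout's action selection: each candidate first order is evaluated by simulating N demand paths, following the (s, S) base policy for T further periods, and averaging the total cost. The EOQ parameterization of S and the demand and cost stubs are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def eoq_ss_policy(K, h, mean_demand, s):
    """(s, S) thresholds with S parameterized by the EOQ quantity.
    S = s + Q, with Q from the classic EOQ formula, is one plausible
    parameterization; the paper's exact mapping may differ."""
    Q = np.sqrt(2.0 * K * mean_demand / h)
    return s, s + Q

def rollout_action(x, candidate_orders, sample_demand_path, step_cost,
                   s, S, N=10, T=12):
    """One-step Rollout.  sample_demand_path(n) must return n demand
    samples and step_cost(inv, order, demand) the single-period cost;
    both are problem-specific stubs here."""
    best_order, best_cost = candidate_orders[0], np.inf
    for order in candidate_orders:
        total = 0.0
        for _ in range(N):                       # N simulated futures
            demands = sample_demand_path(T + 1)
            inv, cost, o = x, 0.0, order
            for t in range(T + 1):               # first step + T base-policy steps
                cost += step_cost(inv, o, demands[t])
                inv = inv + o - demands[t]
                o = S - inv if inv < s else 0.0  # (s, S) base policy
            total += cost
        avg = total / N
        if avg < best_cost:
            best_order, best_cost = order, avg
    return best_order

# Example wiring (values illustrative): s fixed, S = s + EOQ quantity.
# s, S = eoq_ss_policy(K=100.0, h=1.0, mean_demand=100.0, s=50.0)
```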
Fig. 9 shows the average inventory level for each Rollout parameter setting under different demand scenarios: the upper row shows different demand correlations (a1 = 0.8, 0.4, 0, −0.4, and −0.8), the lower row different demand variances (ν0 = 70, 40, and 10), and the left, middle, and right columns show Rollout with N = 1, 5, and 10, respectively.

Fig. 9. Average inventory levels for each Rollout parameter setting under different demand correlations (upper row) and variances (lower row).

Fig. 10 shows the average inventory level for each Rollout parameter setting under different cost structures: setup cost ratios K/g = 1.5, 1, 0.5, and 0 (top row), penalty cost ratios b/g = 2, 1, and 0.5 (middle row), and holding cost ratios h/b = 1, 0.2, and 0.02 (bottom row); the columns correspond to values of the Rollout parameter N.

Fig. 10. Average inventory levels for each Rollout parameter setting under different setup costs (top row), penalty costs (middle row), and holding costs (bottom row).

The same upward relationship between the Rollout simulation horizon T and the average inventory level is found in every scenario, as shown in Fig. 9 and Fig. 10. In addition, at N = 10 and T = 12, these two figures indicate the following.

• As the demand correlation increases, the average inventory level increases.
• As the demand variance increases, the average inventory level increases; the inventory is apparently raised to safeguard against a higher risk of large demand.
• As the setup cost ratio increases, the average inventory level tends to increase; the inventory is apparently raised to reduce the number of transactions.
• As the penalty cost ratio increases, the average inventory level increases; the inventory is raised to reduce the chance of inventory shortage.
• As the holding cost ratio increases, the average inventory level decreases, in response to the higher cost of holding items.

In summary, Sarsa, Rollout, and HO can be used as inventory controllers for a problem with temporal demand heteroscedasticity in most cases. Sarsa and HO are not recommended for an inventory problem with a high holding-to-penalty cost ratio. In other cases, Sarsa can be employed effectively without a model of the problem; the absence of the latent state variables introduced by GARCH(1,1) does not appreciably degrade Sarsa's performance, and omitting these latent state variables is beneficial in that it shortens the computation time. When a model of the problem is available, Rollout provides better performance in all scenarios.

Our study compares learning-based and simulation-based ADP methods with each other. As mentioned earlier, each method has its own strengths: learning-based ADP is more adaptive, while simulation-based ADP is more effective at reducing cost and does so more reliably. A combination of the two schemes may provide a good balance between adaptability and reliability. A forecasting technique can be used to estimate the uncertainty and variance, or a distribution, for simulation; then, with known system dynamics, simulation can be used to approximate a state-action value. The forecasting technique provides adaptability, while simulation provides reliability, as it reduces variation in control quality.
Such a hybrid system may provide a new inventory management approach that is both more adaptive and more reliable, as well as new approaches for other decision and control applications; a minimal sketch of the idea appears below. Kim et al. (2008) have taken a step in this direction: they used known system dynamics to simulate the consequences of actions not taken, after the demand is observed, so that not only the value of the taken action but the values of all other actions are updated as well. Their approach therefore provides a fast-adapting inventory control system. However, the work of Kim et al. (2008) is limited to a stateless inventory problem; the extension to a state-based case would produce a wider range of applications.
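As a concrete illustration of the forecast-then-simulate idea, the sketch below couples a simple exponentially weighted forecaster (a stand-in for a fitted AR/GARCH model) with Monte Carlo evaluation of a candidate order. Everything here, including the normality assumption, the (s, S) continuation policy, and the `step_cost` stub, is an assumption for illustration rather than a method from the paper.

```python
import numpy as np

def ew_forecast(history, lam=0.9):
    """Exponentially weighted mean and variance of past demand: a simple
    adaptive stand-in for a fitted AR/GARCH forecaster."""
    mean, var = float(history[0]), 1.0
    for d in history[1:]:
        err = d - mean
        mean = lam * mean + (1.0 - lam) * d
        var = lam * var + (1.0 - lam) * err ** 2
    return mean, var

def hybrid_order_value(x, order, history, step_cost, s, S,
                       n_paths=100, horizon=12, seed=0):
    """Approximate the cost of placing `order` now by simulating demand
    drawn from the forecast distribution (normality assumed) and
    following an (s, S) policy thereafter; lower is better."""
    rng = np.random.default_rng(seed)
    mean, var = ew_forecast(history)
    total = 0.0
    for _ in range(n_paths):
        inv, cost, o = x, 0.0, order
        for _ in range(horizon):
            d = max(rng.normal(mean, np.sqrt(var)), 0.0)
            cost += step_cost(inv, o, d)
            inv = inv + o - d
            o = S - inv if inv < s else 0.0
        total += cost
    return total / n_paths
```

The forecaster supplies adaptability (its estimates track the observed demand), while the simulation loop supplies reliability (averaging over sampled paths reduces variation in the resulting control decisions), matching the division of labor suggested above.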