Dynamic Policy Programming - Computer Science > Learning

Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the ℓ∞-norm of the average accumulated error, as opposed to the ℓ∞-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform the other RL methods by a wide margin.
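
The contrast between "the ℓ∞-norm of the error" and "the ℓ∞-norm of the average accumulated error" can be illustrated with a toy simulation. This is a minimal sketch, not the authors' experiment. It assumes that the per-iteration estimation error is an i.i.d. zero-mean Gaussian vector over a hypothetical number of state-action pairs (`num_pairs`), which stands in for Monte-Carlo sampling noise. The function `compare_error_norms` is a hypothetical helper introduced only for illustration.

```python
import random


def linf(v):
    """l-infinity norm of a vector: the largest absolute entry."""
    return max(abs(x) for x in v)


def compare_error_norms(num_iters, num_pairs, sigma=1.0, seed=0):
    """Return (||eps_k||_inf, ||(1/k) * sum_j eps_j||_inf) after k = num_iters.

    Illustrative assumption only: each iteration's error eps_j is an i.i.d.
    zero-mean Gaussian vector (std sigma) over num_pairs state-action pairs,
    mimicking Monte-Carlo sampling noise. This is not a model from the paper.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    running_sum = [0.0] * num_pairs
    last_error = [0.0] * num_pairs
    for _ in range(num_iters):
        # Fresh noise for this iteration.
        last_error = [rng.gauss(0.0, sigma) for _ in range(num_pairs)]
        # Accumulate the errors across iterations.
        running_sum = [s + e for s, e in zip(running_sum, last_error)]
    average_error = [s / num_iters for s in running_sum]
    # AVI/API-style bounds scale with the size of an individual error;
    # DPP-style bounds scale with the averaged accumulated error.
    return linf(last_error), linf(average_error)


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        last, avg = compare_error_norms(num_iters=k, num_pairs=50)
        print(f"k={k:5d}  ||eps_k||_inf={last:.3f}  ||avg error||_inf={avg:.3f}")
```

Under these assumptions, the norm of the single-iteration error stays roughly constant as k grows. The norm of the averaged error shrinks roughly like 1/sqrt(k), because independent zero-mean errors partially cancel. If the errors were biased or strongly correlated across iterations, averaging would help much less.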

Authors: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen

Source: https://arxiv.org/