The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Regret - Mathematics > Optimization and Control

Abstract: In the classic Bayesian restless multi-armed bandit (RMAB) problem, there are $N$ arms, with rewards on all arms evolving at each time as Markov chains with known parameters. A player seeks to activate $K \geq 1$ arms at each time in order to maximize the expected total reward obtained over multiple plays. RMAB is a challenging problem that is known to be PSPACE-hard in general. We consider in this work the even harder non-Bayesian RMAB, in which the parameters of the Markov chain are assumed to be unknown \emph{a priori}. We develop an original approach to this problem that is applicable when the corresponding Bayesian problem has the structure that, depending on the known parameter values, the optimal solution is one of a prescribed finite set of policies. In such settings, we propose to learn the optimal policy for the non-Bayesian RMAB by employing a suitable meta-policy which treats each policy from this finite set as an arm in a different non-Bayesian multi-armed bandit problem for which a single-arm selection policy is optimal. We demonstrate this approach by developing a novel sensing policy for opportunistic spectrum access over unknown dynamic channels. We prove that our policy achieves near-logarithmic regret (the difference in expected reward compared to a model-aware genie), which leads to the same average reward that can be achieved by the optimal policy under a known model. This is the first such result in the literature for a non-Bayesian RMAB.
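The meta-policy idea described in the abstract can be illustrated with a minimal sketch. The code below is not the paper's exact construction: it simply runs a UCB1-style index over a hypothetical finite set of candidate policies, where `run_policy` (an assumed interface, not defined in the paper) executes one candidate for an epoch and reports its average reward. Any selection rule with logarithmic regret for the standard non-Bayesian MAB could play the same role.

```python
import math
import random


def ucb1_meta_policy(candidate_policies, run_policy, horizon):
    """Select among a finite set of candidate policies with a UCB1-style index.

    Each candidate policy is treated as a single arm of a standard
    non-Bayesian multi-armed bandit; playing that arm means executing the
    policy for one epoch and observing its average reward.

    candidate_policies: the prescribed finite set of policies.
    run_policy(policy): assumed callable returning that epoch's average reward.
    horizon: number of epochs to play.
    """
    n = len(candidate_policies)
    counts = [0] * n        # epochs in which each candidate was selected
    means = [0.0] * n       # empirical mean epoch reward of each candidate

    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1       # initialization: play every candidate once
        else:
            # UCB1 index: empirical mean plus an exploration bonus that
            # shrinks as a candidate accumulates plays
            i = max(range(n),
                    key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = run_policy(candidate_policies[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]

    return max(range(n), key=lambda j: means[j])   # index of the best candidate


if __name__ == "__main__":
    # Toy check with two hypothetical candidates whose epoch rewards are
    # Bernoulli with unknown means; the meta-policy should favor "stay".
    true_means = {"stay": 0.6, "switch": 0.4}
    policies = ["stay", "switch"]
    best = ucb1_meta_policy(
        policies,
        run_policy=lambda p: 1.0 if random.random() < true_means[p] else 0.0,
        horizon=5000,
    )
    print("best candidate policy:", policies[best])
```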



Authors: Wenhan Dai, Yi Gai, Bhaskar Krishnamachari, Qing Zhao

Source: https://arxiv.org/
