Files in this item

File: MAO-THESIS-2021.pdf (1MB)
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Model-free reinforcement learning in non-stationary Markov Decision Processes
Author(s): Mao, Weichao
Advisor(s): Başar, Tamer
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree: M.S.
Genre: Thesis
Subject(s): reinforcement learning; Markov decision process; non-stationarity
Abstract: Reinforcement learning (RL) studies the problem where an agent maximizes its cumulative reward through sequential interactions with an initially unknown environment, usually modeled by a Markov Decision Process (MDP). The classical RL literature typically assumes that the state transition functions and the reward functions of the MDP are time-invariant. Such a stationary model, however, cannot capture the dynamic nature of many sequential decision-making problems in practice. In this thesis, we consider the problem of reinforcement learning in \emph{non-stationary} MDPs. In our setting, both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. We theoretically show that RestartQ-UCB outperforms existing solutions in terms of dynamic regret, a notion commonly utilized to measure the performance of an online learning algorithm in a non-stationary environment. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further show that our algorithm is nearly optimal by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, which to the best of our knowledge is the first impossibility result that characterizes the fundamental limits of non-stationary RL in general. To the best of our knowledge, RestartQ-UCB is the first model-free algorithm for non-stationary RL. Compared with model-based solutions, our algorithm is more time- and space-efficient, flexible, and compatible with modern deep RL architectures. We empirically evaluate RestartQ-UCB on RL tasks with both abrupt and gradual types of non-stationarity. Simulation results validate the advantages of RestartQ-UCB in terms of cumulative rewards and computational efficiency. We further demonstrate the power of our results through a ``learning to collaborate'' example in the context of multi-agent RL, where non-stationarity is a key challenge.
Issue Date: 2021-04-09
Type: Thesis
URI: http://hdl.handle.net/2142/110451
Rights Information: Copyright 2021 Weichao Mao
Date Available in IDEALS: 2021-09-17
Date Deposited: 2021-05
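
The abstract above describes RestartQ-UCB as optimistic Q-learning combined with a periodic restart of the learned estimates. The Python sketch below illustrates only that restart-plus-optimism idea on a hypothetical two-state toy MDP with gradually drifting rewards; the Hoeffding-style bonus, the restart schedule, the toy environment, and all constants are simplifying assumptions made for illustration and do not reproduce the Freedman-type bonuses, extra optimism term, or tuned epoch length analyzed in the thesis.

"""
Illustrative sketch of a restarted, optimistic Q-learning loop for a
non-stationary episodic tabular MDP. This is NOT the exact RestartQ-UCB
algorithm from the thesis: the bonus is a simplified Hoeffding-style term,
and the toy environment, restart schedule, and constants are assumptions
made for demonstration only.
"""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: 2 states, 2 actions, drifting rewards.
S, A, H = 2, 2, 5          # states, actions, steps per episode
K = 2000                   # number of episodes (T = K * H time steps)

def transition(s, a):
    # Fixed transitions for simplicity; only the rewards drift over time.
    return (s + a) % S

def reward(s, a, k):
    # Mean reward of action 1 in state 0 drifts gradually with episode k.
    drift = 0.4 * np.sin(2 * np.pi * k / K)
    means = np.array([[0.5, 0.5 + drift], [0.3, 0.7]])
    return float(np.clip(means[s, a] + 0.1 * rng.standard_normal(), 0.0, 1.0))

num_epochs = 10            # restart schedule: assumed, not the thesis' tuning
episodes_per_epoch = K // num_epochs
c_bonus = 1.0              # bonus scaling constant (assumed)
total_reward = 0.0

for epoch in range(num_epochs):
    # Restart: discard all learned estimates at the start of each epoch,
    # so stale data from a drifted environment cannot mislead the learner.
    Q = np.full((H, S, A), float(H))   # optimistic initialization
    V = np.zeros((H + 1, S))
    V[:H] = float(H)
    N = np.zeros((H, S, A), dtype=int)

    for ep in range(episodes_per_epoch):
        k = epoch * episodes_per_epoch + ep
        s = 0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))            # greedy w.r.t. optimistic Q
            r = reward(s, a, k)
            s_next = transition(s, a)
            total_reward += r

            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)              # step size from optimistic Q-learning
            bonus = c_bonus * np.sqrt(H**3 * np.log(K * H) / t)  # Hoeffding-style optimism

            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(float(H), float(Q[h, s].max()))
            s = s_next

print(f"average per-episode return: {total_reward / K:.3f}")

The point of the sketch is the trade-off the abstract highlights: each restart throws away estimates built under an outdated environment, which lets the learner track non-stationary drift at the cost of periodically re-exploring.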

