Files in this item
File: Ramachandran_Deepak.pdf (application/pdf, 6MB), no description provided
Description
Title:  Knowledge and Ignorance in Reinforcement Learning 
Author(s):  Ramachandran, Deepak 
Director of Research:  Roth, Dan 
Doctoral Committee Chair(s):  Roth, Dan 
Doctoral Committee Member(s):  Amir, Eyal; DeJong, Gerald F.; Singh Baveja, Satinder 
Department / Program:  Computer Science 
Discipline:  Computer Science 
Degree Granting Institution:  University of Illinois at Urbana-Champaign 
Degree:  Ph.D. 
Genre:  Dissertation 
Subject(s):  Reinforcement Learning; Apprenticeship Learning; Partially Observable Markov Decision Process (POMDP) 
Abstract:  The field of Reinforcement Learning is concerned with teaching agents to make optimal decisions that maximize their total utility in complex environments. A Reinforcement Learning problem, generally described by the Markov Decision Process formalism, has several complex interacting components, unlike other machine learning settings. I distinguish three: the state space/transition model, the reward function, and the observation model. In this thesis, I present a framework for studying how the state of knowledge or uncertainty of each component affects the Reinforcement Learning process. I focus on the reward function and the observation model, which have traditionally received little attention. Algorithms for learning good policies when these components are completely specified are well understood; however, it is less clear what to do when they are unknown, uncertain, or irrelevant. In this thesis, I describe how to adapt Reinforcement Learning algorithms to cope with these situations. Recently there has been great interest in the Inverse Reinforcement Learning (IRL) problem, where the objective is to learn the reward function from evidence of an agent's reward-maximizing policy. The usual goal is apprenticeship learning, where the agent learns the optimal action to perform from an expert; however, sometimes the reward function is of independent interest as well. I describe a Bayesian Inverse Reinforcement Learning (BIRL) approach to this problem. BIRL uses a generative model to describe the decision-making process, and by inverting it we can infer the reward function from action observations. It is distinguished from other IRL approaches by placing emphasis on the accuracy of the reward function in itself, not just as an intermediate step to apprenticeship learning. BIRL is also able to handle incomplete and contradictory information from the expert. 
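The generative-model-inversion idea behind BIRL can be sketched in a few lines. The following is a hypothetical toy setup, not the thesis's algorithm: a 3-state chain MDP, a softmax-rational expert model P(action | state, R) proportional to exp(ALPHA * Q*(state, action; R)), and a Bayes-rule inversion over a small enumerated set of candidate reward vectors (the rationality parameter ALPHA, the chain dynamics, and the binary reward candidates are all assumptions for illustration).

```python
import itertools
import math

N, GAMMA, ALPHA = 3, 0.9, 5.0  # ALPHA = assumed expert rationality

def step(s, a):
    """Deterministic chain dynamics: action 0 = left, 1 = right."""
    return max(0, s - 1) if a == 0 else min(N - 1, s + 1)

def q_star(R, iters=200):
    """Optimal Q-values for candidate reward vector R via value iteration."""
    V = [0.0] * N
    for _ in range(iters):
        Q = [[R[s] + GAMMA * V[step(s, a)] for a in (0, 1)] for s in range(N)]
        V = [max(qs) for qs in Q]
    return Q

def likelihood(demos, R):
    """P(demonstrations | R) under the softmax-rational expert model."""
    Q = q_star(R)
    p = 1.0
    for s, a in demos:
        z = sum(math.exp(ALPHA * q) for q in Q[s])
        p *= math.exp(ALPHA * Q[s][a]) / z
    return p

def birl_posterior(demos):
    """Invert the generative model: posterior over rewards, uniform prior."""
    candidates = list(itertools.product((0.0, 1.0), repeat=N))
    weights = {R: likelihood(demos, R) for R in candidates}
    total = sum(weights.values())
    return {R: w / total for R, w in weights.items()}

# Observed expert always moves right, consistent with reward at the last state.
demos = [(0, 1), (1, 1), (2, 1)]
post = birl_posterior(demos)
```

Here the posterior concentrates on reward vectors that place reward near the right end of the chain; note that the output is a full distribution over rewards, reflecting BIRL's emphasis on the reward function itself rather than only the induced policy.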
It has been applied successfully to the preference elicitation problem for computer games and robot manipulation. In a recent comparison of IRL approaches, BIRL was the best-performing general IRL algorithm. I also extend this model to a related task, Reward Shaping. In reward shaping, we seek to adjust a known reward function so that the learning agent converges on the optimal policy as fast as possible. Reward shaping has been proposed and studied previously in many applications, typically using additive potential functions. However, the requirement of absolute policy-invariance is too strict to admit many useful cases of shaping. I define Bayesian Reward Shaping, a generalization to a soft form of reward shaping, and provide algorithms for achieving it. The impact of observation models on reinforcement learning has been studied even less than that of reward functions. This is surprising, considering how adding partial observability to an MDP model blows up the computational complexity; a better understanding of the tradeoffs between representational accuracy and efficiency would therefore be helpful. In this work, I describe how in certain cases POMDPs can be approximated by MDPs, or by slightly more complicated models, with bounded performance loss. I also present an algorithm, called Smoothed Sarsa, for learning policies when the observation models are uncertain. Smoothed Sarsa is based on the idea that in many real-world POMDPs, better state estimates can be made at later time steps; thus, delaying the backup step of a temporal-difference-based algorithm can shortcut the uncertainty in the observation model and better approximate the underlying MDP. Combining these approaches (Bayesian Reward Shaping and Smoothed Sarsa), a mobile robot was trained to execute delivery tasks in an office environment. 
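The "additive potential functions" the abstract refers to are the classic potential-based shaping form from the literature, F(s, s') = gamma * Phi(s') - Phi(s), whose terms telescope along any trajectory so the optimal policy is unchanged. A minimal numerical sketch of that telescoping identity (the potential Phi and the example trajectory are arbitrary illustrations, not taken from the thesis):

```python
GAMMA = 0.9

def phi(s):
    """A hypothetical potential, e.g. a progress-toward-goal heuristic."""
    return float(s)

def shaping(s, s_next):
    """Potential-based shaping bonus: F(s, s') = GAMMA * Phi(s') - Phi(s)."""
    return GAMMA * phi(s_next) - phi(s)

def discounted_return(rewards):
    return sum((GAMMA ** t) * r for t, r in enumerate(rewards))

# An arbitrary trajectory: states s_0..s_T and environment rewards r_0..r_{T-1}.
states = [0, 1, 2, 3, 2, 3]
rewards = [0.0, 0.0, 1.0, -0.5, 1.0]
shaped = [r + shaping(s, s2) for r, s, s2 in zip(rewards, states, states[1:])]

T = len(rewards)
lhs = discounted_return(shaped)
rhs = discounted_return(rewards) + (GAMMA ** T) * phi(states[-1]) - phi(states[0])
# The shaping terms telescope, so lhs == rhs for every trajectory:
# shaping shifts all returns by a constant that depends only on the
# start and end states, leaving the ranking of policies intact.
```

This strict policy-invariance guarantee is exactly the requirement the abstract argues is too rigid, motivating the softer Bayesian Reward Shaping formulation.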
Issue Date:  2011-05-25 
URI:  http://hdl.handle.net/2142/24337 
Rights Information:  Copyright 2011 Deepak Ramachandran 
Date Available in IDEALS:  2011-05-25 
Date Deposited:  2011-05 
This item appears in the following Collection(s)

Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science 
Graduate Dissertations and Theses at Illinois
Graduate Theses and Dissertations at Illinois