Files in this item

File: Chakraborty, Neeloy-Thesis.pdf (1MB), Restricted to U of Illinois
Description: (no description provided)
Format: PDF (application/pdf)

Description

Title: Hierarchical self-imitation learning in single-agent sparse reward environments
Author(s): Chakraborty, Neeloy
Contributor(s): Driggs-Campbell, Katherine
Degree: B.A. (bachelor's)
Genre: Thesis
Subject(s): reinforcement learning
sparse/delayed rewards
self-imitation learning
hierarchical learning
Abstract: Reinforcement learning problems with sparse and delayed rewards are challenging to solve because algorithms must explore the environment extensively to gain experience from high-performing rollouts. Classical methods of encouraging exploration during training, such as epsilon-greedy and noise-based exploration, are not adequate on their own to explore large state spaces (Fortunato et al., 2018). Self-imitation learning (SIL) has been shown to allow an agent to learn to mimic high-performing, long-horizon trajectories, but SIL is heavily reliant on exploration to find such trajectories (Oh et al., 2018). On the other hand, hierarchical learning (HL) may be unstable during training, but it incorporates noise and failures that explore the environment effectively and can learn tasks with higher sample efficiency (Levy et al., 2019). This thesis presents a single-agent reinforcement learning algorithm that combines the effects of SIL and HL: Generative Adversarial Self-Imitation Learning + Hierarchical Actor-Critic (GASIL+HAC). GASIL+HAC represents the policy as multiple trainable levels of Deep Deterministic Policy Gradient (DDPG) optimizers from Lillicrap et al. (2016), where the higher-level policies set waypoints that guide the lower-level policies toward the highest cumulative return. The highest-level policy of the hierarchy is trained with GASIL on the sparse environment reward to set goals that imitate past well-performing trajectories, while the lower levels are trained on an artificial reward signal to set intermediate goals and achieve the desired high-level path. We perform experiments in OpenAI's Multi-Agent Particle Environment in sparse- and delayed-reward stochastic scenarios to identify benefits and hindrances of GASIL+HAC compared to DDPG, GASIL, and HAC in terms of sample efficiency, generalizability, exploration, and goal reachability. Through these experiments, we find that GASIL+HAC has the potential to increase sample efficiency in stochastic tasks and to increase the number of explored states during training. However, training hierarchical methods introduces inherent instability, and SIL-based methods remain highly dependent on exploration to find high-return trajectories. Further experiments over additional random seeds must be run to reach a definitive conclusion on the effectiveness of the proposed algorithm.
Issue Date: 2021-05
Genre: Dissertation / Thesis
Type: Text
Language: English
URI: http://hdl.handle.net/2142/110267
Date Available in IDEALS: 2021-08-11
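
The abstract above describes a hierarchy in which a higher-level policy sets waypoints (goals) that a lower-level policy pursues, with the top level trained on the sparse environment reward and the lower level on an artificial goal-reaching reward. The following Python sketch illustrates a two-level version of that control loop under a Gym-style environment interface. The class names, the goal-distance reward, and the goal horizon are hypothetical placeholders introduced for illustration, and the DDPG and GASIL updates are stubbed out; this is not the thesis implementation.

# Minimal, illustrative sketch of a two-level goal-conditioned control loop.
# All names below (HighLevelPolicy, LowLevelPolicy, artificial_reward,
# goal_horizon) are hypothetical placeholders, not the thesis code.
import numpy as np


class HighLevelPolicy:
    """Sets waypoints (goals) for the level below; trained on the sparse
    environment return (via GASIL in the thesis)."""
    def select_goal(self, state):
        # Placeholder: a real implementation would use a DDPG actor network.
        return state + np.random.uniform(-1.0, 1.0, size=state.shape)

    def update(self, trajectory):
        pass  # A GASIL-style update on high-return trajectories would go here.


class LowLevelPolicy:
    """Acts in the environment to reach the goal set from above; trained on an
    artificial (goal-distance) reward rather than the sparse environment reward."""
    def select_action(self, state, goal):
        # Placeholder: move toward the goal, with exploration noise added.
        return np.clip(goal - state, -1.0, 1.0) + 0.1 * np.random.randn(*state.shape)

    def update(self, transition):
        pass  # A DDPG update on the artificial reward would go here.


def artificial_reward(state, goal, threshold=0.1):
    """Dense reward for the lower level: 0 if the goal is reached, -1 otherwise
    (one common choice for goal-conditioned subpolicies; an assumption here)."""
    return 0.0 if np.linalg.norm(state - goal) < threshold else -1.0


def run_episode(env, high, low, goal_horizon=10, max_steps=100):
    """One episode: the high level re-sets a goal every `goal_horizon` steps,
    the low level pursues it, and the sparse environment reward is collected
    for the high-level (GASIL-style) update. Classic Gym step API assumed."""
    state = env.reset()
    trajectory = []
    goal = None
    for t in range(max_steps):
        if t % goal_horizon == 0:
            goal = high.select_goal(state)
        action = low.select_action(state, goal)
        next_state, sparse_reward, done, _ = env.step(action)
        low.update((state, goal, action,
                    artificial_reward(next_state, goal), next_state))
        trajectory.append((state, goal, action, sparse_reward))
        state = next_state
        if done:
            break
    high.update(trajectory)
    return trajectory

In this arrangement the lower level never sees the sparse task reward directly; it only optimizes progress toward the current waypoint, while the top level is responsible for placing waypoints that lead to high sparse returns.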

