|Abstract:||Deterministic Bellman residual minimization is the problem of learning values in a Markov Decision Process (MDP) by optimizing for the Bellman residuals directly without using any heuristic surrogates. Compared to the traditional Approximate Dynamic Programming (ADP) methods, this approach can have both advantages and disadvantages. One of the advantages of such an approach is that the underlying optimization problem can provably converge to the desired solution with better theoretical sample efficiency guarantees, while ADP heuristics are prone to divergence and have worse theoretical sample efficiency. On the other hand, the disadvantage of Bellman residual minimization is the requirement of two independent future samples, a.k.a. the double sampling requirement, in order to form an unbiased estimate of the Bellman residual. Despite that some versions of BRM have superior theoretical properties, the superiority comes from the double sampling trick, limiting their applicability to simulator environments with state resetting functionality. Hence the algorithms cannot be applied to non-simulator environments where state resetting is not available. This requirement can trivially be waived for deterministic dynamics, since any combination of observations and actions are guaranteed to generate identical future observations. For this double sampling requirement, Bellman residual minimization methods tended to be overlooked in the literature, and this work tries to evaluate the efficacy of this approach compared to the more common ADP heuristics.
In this work, we make a simple observation that BRM can be applied without state resetting if the environment is deterministic, which Baird has hinted in his original paper on residual algorithms from 1995. The resulting algorithm is simple, convergent, and works well in benchmark control problems. Also, its implementation could be as simple as changing a line of code in their ADP counterparts. We compare Q-learning to its DBRM version theoretically, confirm the theoretical predictions from experiments, and also discover some surprising empirical behaviors and provide explanations.