Abstract: Online recommendation systems are widely used by retailers, in digital marketing, and especially in e-commerce applications. Popular sites such as Netflix and Amazon suggest movies or general merchandise to their clients based on recommendations from peers. At the core of a recommendation system resides a prediction algorithm which, based on recommendations received from a set of experts (users), recommends objects to other users. After a user ``consumes'' an object, the feedback he provides to the system is used to assess the performance of the experts at that round and to adjust the predictions of the recommendation system for future rounds. This so-called ``learning from expert advice'' framework has been extensively studied in the literature. In this dissertation, we investigate various settings and applications, ranging from partial information and adversarial scenarios to limited resources. We propose provable algorithms for such systems, along with theoretical and experimental results.
In the first part of the thesis, we focus our attention on a generalized model of learning from expert advice in which experts may abstain from participating in some rounds. Our proposed online algorithm falls into the class of weighted average predictors and uses a time-varying multiplicative weight update rule. This update rule changes the weight of an expert based on his performance relative to the average performance of the experts available at the current round. We prove the convergence of our algorithm to the best expert, defined in terms of both availability and accuracy, in the stochastic setting.
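The flavor of such an update can be illustrated with a small sketch. This is not the thesis's algorithm; it is a generic weighted-average predictor in which only available experts are averaged and updated, experts are compared against the average loss of the available experts, and the learning rate `eta`, the squared loss, and the availability probability are all illustrative choices.

```python
import numpy as np

def weighted_average_predict(weights, preds, available):
    """Predict with the weight-normalized average over available experts."""
    w = weights * available
    return float(np.dot(w, preds) / w.sum())

def update_weights(weights, preds, available, outcome, eta=0.5):
    """Multiplicative update relative to the average loss of the experts
    available this round; abstaining experts keep their weight unchanged."""
    losses = (preds - outcome) ** 2  # squared loss, for illustration
    avg = np.dot(weights * available, losses) / np.dot(weights, available)
    new = weights * np.where(available, np.exp(-eta * (losses - avg)), 1.0)
    return new / new.sum()

# Toy run: three experts, each abstaining ~20% of rounds; the true
# outcome is 1.0, so the first expert (predicting near 0.9) is best.
rng = np.random.default_rng(0)
weights = np.ones(3) / 3
for t in range(100):
    available = rng.random(3) < 0.8
    if not available.any():
        continue  # no expert participated this round
    preds = np.array([0.9, 0.5, 0.1]) + 0.05 * rng.standard_normal(3)
    weights = update_weights(weights, preds, available, outcome=1.0)
```

Because abstaining experts are neither rewarded nor penalized, the weights track performance only over the rounds in which an expert actually participated, which is the intuition behind scoring experts on both availability and accuracy.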
Next, we study the optimal adversarial strategies against the weighted average prediction algorithm. All but one expert are honest, and the malicious expert's goal is to sabotage the performance of the algorithm by strategically providing dishonest recommendations. We formulate the problem as a Markov decision process (MDP) and apply policy iteration to solve it. For the logarithmic loss, we prove that the optimal strategy for the adversary is the greedy policy, whereas for the absolute loss, in the $2$-experts, discounted cost setting, we prove that the optimal strategy is a threshold policy. We extend the results to the infinite horizon problem and find the exact thresholds for the stationary optimal policy. In an effort to investigate the extended problem, we use a mean field approach in the $N$-experts setting to find the optimal strategy when the predictions of the honest experts are i.i.d.
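For readers unfamiliar with the solution method, policy iteration alternates exact policy evaluation with greedy improvement until the policy is stable. The sketch below implements the textbook procedure on a hypothetical two-state, two-action MDP with a 0.9 discount factor; the transition and reward numbers are made up for illustration and have no connection to the adversarial-expert MDP studied in the thesis.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Textbook policy iteration for a finite MDP.
    P[a][s, s'] = transition probability, R[a][s] = expected reward."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        r_pi = np.array([R[policy[s]][s] for s in range(n_states)])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        Q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

# Hypothetical 2-state, 2-action MDP (numbers chosen for illustration).
P = [np.array([[0.9, 0.1], [0.1, 0.9]]),
     np.array([[0.2, 0.8], [0.8, 0.2]])]
R = [np.array([1.0, 0.0]),
     np.array([0.5, 0.8])]
policy, v = policy_iteration(P, R)
```

A greedy policy, in this vocabulary, is one that maximizes the immediate (one-step) reward, while a threshold policy switches actions once the state crosses a cutoff; both are special cases that policy iteration can converge to.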
In addition to designing an effective weight update rule and investigating optimal strategies of malicious experts, we also consider active learning applications for the learning-with-expert-advice framework. In this application, the target is to reduce the number of label requests while still keeping the regret bound as small as possible. We propose two algorithms, EPSL and EPAL, which are able to efficiently request a label for each object. In essence, the idea behind both algorithms is to examine the opinion ranges of the experts and decide whether to acquire a label based on the maximum difference of those opinions, using a randomized policy. Both algorithms achieve nearly optimal regret bounds up to a constant depending on the characteristics of the experts' predictions.
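The querying idea can be conveyed with a minimal sketch. This is not EPSL or EPAL themselves, whose details are not given here; it is a generic disagreement-based sampler, assuming predictions lie in $[0,1]$, in which the query probability grows with the spread of expert opinions and the hypothetical `scale` knob controls how aggressively labels are requested.

```python
import random

def should_query(preds, scale=1.0, rng=random):
    """Randomized label request: query with probability proportional to
    the spread of expert opinions, since large disagreement means the
    label is informative and small disagreement means it is wasted."""
    spread = max(preds) - min(preds)  # maximum difference of opinions
    p = min(1.0, scale * spread)
    return rng.random() < p
```

When all experts agree the spread is zero and no label is ever requested, so the labeling budget is spent only on the objects where feedback can actually change the weights.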
Last but not least, we turn our attention to a generalized ``best arm identification'' problem in which, at each time, there is a subset of products whose rewards or profits are unknown (but follow fixed distributions), and the goal is to select the best product to recommend to users after a number of sampling rounds. We propose UCB-based (Upper Confidence Bound) algorithms that provide flexible parameter tuning based on the availability of each arm in the collection. We also propose a simple, yet efficient, uniform sampling algorithm for this problem. We prove that, for these algorithms, the probability of selecting an incorrect arm decays exponentially over time.
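As background, a plain UCB sampler for best-arm identification looks as follows. This sketch ignores arm availability, which is the thesis's extension, and the Bernoulli arm means, the horizon, and the exploration constant `c` are all illustrative assumptions.

```python
import math
import random

def ucb_best_arm(pull, n_arms, horizon, c=2.0):
    """Pull the arm with the highest upper confidence bound at each round,
    then recommend the arm with the best empirical mean."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1  # pull each arm once to initialize
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(c * math.log(t) / counts[i]))
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(n_arms), key=lambda i: sums[i] / counts[i])

# Toy usage: three hypothetical Bernoulli arms with means 0.9, 0.2, 0.1.
rng = random.Random(0)
means = [0.9, 0.2, 0.1]
best = ucb_best_arm(lambda a: float(rng.random() < means[a]), 3, 2000)
```

The confidence radius shrinks as an arm is pulled more often, so suboptimal arms are sampled only logarithmically often; availability-aware variants, as studied in the thesis, additionally adapt this radius to how often each arm can actually be pulled.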