Policy Evaluation Under Markovian Noise Using the Online Bootstrap Inference Algorithm

We can view the world as an 'environment': in each period the agent occupies a state and takes one of the actions available in that state; depending on how the environment is configured, the agent receives a reward and moves to the next state. A policy, in turn, is a vector of probabilities that, for each state, specifies the probability of taking each available action. An environment paired with a fixed policy is what the reinforcement learning (RL) literature formally calls a Markov Reward Process (MRP).
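
To make this concrete, the sketch below uses made-up numbers (the arrays P, R, and pi do not come from any particular environment) to show how fixing a policy turns an environment's dynamics into an MRP.

```python
import numpy as np

# Hypothetical 2-state, 2-action environment: transition tensor P[s, a, s'],
# reward matrix R[s, a], and a policy pi[s, a] giving action probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Fixing the policy collapses the environment into an MRP: a state-to-state
# transition matrix and an expected reward per state.
P_pi = np.einsum('sa,sab->sb', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
r_pi = np.einsum('sa,sa->s', pi, R)     # r_pi[s]     = sum_a pi(a|s) R(s, a)
```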

The objective is to estimate the value of following that policy from a trajectory of dependent observations (that is, from the MRP). These observations may be generated by the same policy being evaluated or by a different one; the first case is the on-policy setting, while the second is the off-policy setting (Ramprasad et al., 2022).

Knowing the value of a policy allows us to compare different policies and choose one over another. A simple example involves quarantines, social welfare, and COVID-19: quarantining allows greater control of contagion, but it also increases poverty because people's capacity to generate income is reduced. Suppose you are a ruler who must decide, for each of the 365 days of the year, whether or not to impose a quarantine that day. One policy might be to quarantine only on even days; another might be to quarantine every three months; and so on. These are part of the range of policies you could implement; you want to compare them and, based on that comparison, decide which one to adopt.

Now, finding the value function of a policy is not trivial. In the best case, you know everything about the environment and can access that information easily; when this happens, the Bellman equation lets you compute the value function without much effort (Sutton & Barto, 2018). In RL, however, it is common for there to be uncertainty, or for the environment to be very large, so this equation cannot be used directly.
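
For instance, when the MRP's transition matrix and expected rewards are fully known, as in the toy example above, the Bellman equation V = r_π + γ P_π V is just a linear system and can be solved in closed form; a minimal sketch with the same hypothetical numbers:

```python
import numpy as np

# The toy MRP from the sketch above (hypothetical values) and a discount factor.
P_pi = np.array([[0.69, 0.31],
                 [0.30, 0.70]])
r_pi = np.array([0.7, 1.4])
gamma = 0.9

# Bellman equation in matrix form: V = r_pi + gamma * P_pi @ V,
# whose closed-form solution is V = (I - gamma * P_pi)^{-1} r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)   # exact value of each state under the policy
```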

In these cases, RL turns to alternatives, one of which is Linear Stochastic Approximation (LSA). This is an iterative algorithm for solving linear equations from noisy observed data (Haskell, 2018). In the general framework, the LSA error is assumed to be a martingale difference: roughly speaking, the independence assumption is relaxed to the requirement that the error have zero mean given the past, which is enough for the algorithm to converge. The problem is that when LSA is applied to RL with a linear approximation of the value function, this assumption is violated and the noise becomes Markovian, meaning that it now depends on a discrete stochastic process in which the probability of each event depends on the immediately preceding one.
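
To fix ideas, here is a minimal sketch of a generic LSA iteration on a made-up linear system A θ = b, where the per-step noise is i.i.d. with zero mean (hence a martingale difference); the matrices, noise scale, and step-size schedule are illustrative assumptions, not taken from any reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target linear system A @ theta* = b.
A = np.array([[2.0, 0.3],
              [0.3, 1.5]])
b = np.array([1.0, 2.0])

theta = np.zeros(2)
for t in range(1, 5001):
    # Noisy observations of (A, b); here the noise is i.i.d. with zero mean,
    # i.e. a martingale difference, which is the classical LSA assumption.
    A_t = A + 0.1 * rng.standard_normal((2, 2))
    b_t = b + 0.1 * rng.standard_normal(2)
    alpha_t = 1.0 / t ** 0.75            # decreasing step size
    theta = theta + alpha_t * (b_t - A_t @ theta)

print(theta, np.linalg.solve(A, b))      # the iterates approach A^{-1} b
```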

Still, Liang (2010) shows that it is theoretically possible to decompose the Markovian noise into a martingale part and a residual part that decreases over time. Although this justifies continuing to apply LSA, in practice the two parts cannot be obtained separately, so the results will be biased. Among the classical LSA methods applied to RL are Temporal Difference (TD) learning for the on-policy case and Gradient Temporal Difference (GTD) methods for the off-policy case. These methods perform quite acceptably, but they can be improved by correcting for the bias mentioned above.
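
As an illustration, the sketch below shows a bare-bones TD(0) update with linear value approximation; env_step and phi are hypothetical placeholders for a sampler that follows the evaluated policy and a feature map, and the step-size schedule is an arbitrary choice, not any reference's implementation.

```python
import numpy as np

def td0_linear(env_step, phi, d, gamma=0.95, n_steps=5000, alpha0=0.5, s0=0):
    """Bare-bones TD(0) with linear approximation V(s) ~ phi(s) @ theta.

    env_step(s) is assumed to return (reward, next_state) sampled by
    following the policy being evaluated; phi(s) returns a length-d
    feature vector. A sketch, not the implementation of any reference.
    """
    theta = np.zeros(d)
    s = s0
    for t in range(1, n_steps + 1):
        r, s_next = env_step(s)
        # TD error; because (s, s_next) comes from a Markov chain, this
        # noise is Markovian rather than a martingale difference.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += (alpha0 / t ** 0.75) * delta * phi(s)
        s = s_next
    return theta
```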

Figure 1: TD vs. the Online Bootstrap.

Precisely to make this improvement, Ramprasad et al. (2022) propose the Online Bootstrap Inference method. It adds a statistical bootstrap to the classical methods: at each period, B randomly perturbed copies of the LSA parameter are updated alongside the original estimate, and from them confidence intervals for the value function are constructed. Figure 1 compares TD with the Online Bootstrap on the Frozen Lake environment from OpenAI Gym, under an epsilon-greedy policy with ε = 0.2, using the code provided by Ramprasad et al. (2022). The results show that TD (black line) closely approximates the true value function (dashed line), although it retains noticeable error up to roughly 2,000 iterations, while the Online Bootstrap (blue lines) keeps the true value function inside its confidence interval from about 600 iterations onwards. We also found that as the randomness of the evaluated policy increases, the convergence of the Online Bootstrap algorithm slows down and the width of the confidence intervals grows significantly.
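
The following sketch conveys the idea of the method as we understand it: alongside the main TD iterate, B perturbed copies are updated online with random mean-one weights, and the spread of the copies around the main iterate yields confidence intervals. The weight distribution, step sizes, and the env_step/phi placeholders are our own illustrative assumptions; the authors' code should be consulted for the exact algorithm.

```python
import numpy as np

def online_bootstrap_td(env_step, phi, d, B=100, gamma=0.95,
                        n_steps=5000, alpha0=0.5, s0=0, seed=0):
    """Sketch of online bootstrap inference built around TD(0).

    Each of the B perturbed copies re-weights its own TD update with a
    random mean-one weight; quantiles of phi(s) @ theta_boot around
    phi(s) @ theta give a confidence interval for V(s).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                       # main TD iterate
    theta_boot = np.zeros((B, d))             # B perturbed copies
    s = s0
    for t in range(1, n_steps + 1):
        r, s_next = env_step(s)
        alpha_t = alpha0 / t ** 0.75
        f, f_next = phi(s), phi(s_next)
        # Main (unperturbed) TD update.
        theta = theta + alpha_t * (r + gamma * f_next @ theta - f @ theta) * f
        # Perturbed updates: each copy scales its own TD error by a random
        # mean-one weight (exponential(1) here, as an assumption).
        w = rng.exponential(1.0, size=B)
        delta_b = r + gamma * theta_boot @ f_next - theta_boot @ f
        theta_boot = theta_boot + alpha_t * (w * delta_b)[:, None] * f
        s = s_next
    return theta, theta_boot
```

A 95% interval for V(s), for example, could then be taken from the 2.5% and 97.5% quantiles of phi(s) @ theta_boot.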

References:

Haskell, W. B. (2018). Introduction to dynamic programming. National University of Singapore, ISE 6509: Theory and Algorithms for Dynamic Programming.
Liang, F. (2010). Trajectory averaging for stochastic approximation MCMC algorithms. The Annals of Statistics, 38(5), 2823-2856.
Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W., and Cheng, G. (2022). Online bootstrap inference for policy evaluation in reinforcement learning. Journal of the American Statistical Association, 1-14.
Sutton, R. and Barto, A. (2018). Reinforcement learning: An introduction. The MIT Press.
