Policy Evaluation Under Markovian Noise Using the Online Bootstrap Inference Algorithm

We can view the world as an 'environment' in which, at each period, an agent is in a state and takes one of the actions available in that state; depending on how the environment is configured, the agent receives a reward and moves to the next state. A policy, in turn, assigns to each state a vector of probabilities that determines how likely the agent is to take each of the available actions. An environment paired with a given policy is what the RL literature formally calls a Markov Reward Process (MRP).
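
As a toy illustration, a policy can simply be stored as one probability vector per state. The environment below is hypothetical (three states, two actions, with invented probabilities), just to make the definition concrete:

```python
import numpy as np

# Hypothetical 3-state, 2-action environment (numbers invented for illustration).
# policy[s] is the probability vector over the actions available in state s.
policy = {
    0: np.array([0.8, 0.2]),   # in state 0: action 0 w.p. 0.8, action 1 w.p. 0.2
    1: np.array([0.5, 0.5]),
    2: np.array([0.1, 0.9]),
}

rng = np.random.default_rng(0)

def act(state):
    """Sample an action in `state` according to the policy."""
    probs = policy[state]
    return rng.choice(len(probs), p=probs)
```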

The objective is to establish the value of following that given policy, using a trajectory of dependent observations (this setup is framed as an MRP). The observations may be generated by the same policy being evaluated or by a different one; the first case is the on-policy scenario, while the second is the off-policy scenario (Ramprasad et al., 2022).

Knowing the value of a policy allows us to compare different policies and choose one over another. A simple example is the case of quarantines, social welfare, and COVID-19: quarantining allows greater control of contagion, but at the same time increases poverty levels because people's income-generating capacity is reduced. Suppose you are a ruler who must decide, for each of the 365 days of the year, whether or not to quarantine that day. One policy might be to quarantine only on even days; another might be to quarantine every three months; and so on. These policies are part of the range you could implement, and you want to compare them and, based on that comparison, decide which one to adopt.

Now, finding the value function of a policy is not trivial. In the best case, you know all the information about the environment and can easily access it; when this happens, the Bellman equation lets you find the value function without much effort (Sutton & Barto, 2018). However, in RL there is usually uncertainty, or the environment is very large, and therefore this equation cannot be used directly.
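
When the transition probabilities and rewards of the MRP are fully known, policy evaluation reduces to solving the linear Bellman system V = r + γPV. A minimal sketch with an invented two-state MRP (the numbers are illustrative, not taken from any real environment):

```python
import numpy as np

gamma = 0.9                                  # discount factor
P = np.array([[0.7, 0.3],                    # P[s, s']: transition probabilities
              [0.4, 0.6]])                   # under the fixed policy (invented)
r = np.array([1.0, 0.0])                     # expected one-step reward per state

# Bellman equation: V = r + gamma * P @ V  =>  (I - gamma * P) V = r
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(V)   # exact value function of the policy
```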

In these cases, RL can resort to alternatives, one of them being Linear Stochastic Approximation (LSA). This is an iterative algorithm for finding solutions to linear equations from noisy observed data (Haskell, 2018). In the general framework, the LSA error is assumed to be a martingale difference: in simple terms, the independence assumption is relaxed and the error is only required to have zero mean conditional on the past, which is enough for the LSA algorithm to converge. The problem is that when LSA is applied to RL with a linear approximation of the value function, this assumption is violated and the noise becomes Markovian, meaning that the noise now follows a discrete stochastic process in which the probability of each event depends on the immediately preceding one.
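
In its general form, LSA seeks the solution θ* of a linear system Aθ = b when only noisy observations of A and b are available, via the recursion θ_{t+1} = θ_t + α_t (b_t − A_t θ_t). The sketch below uses synthetic martingale-difference noise and invented matrices, purely to show the shape of the recursion:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5], [0.5, 1.5]])    # true (unknown) system matrix
b = np.array([1.0, 2.0])                  # true (unknown) right-hand side
theta_star = np.linalg.solve(A, b)        # target: A @ theta = b

theta = np.zeros(2)
for t in range(1, 20001):
    alpha = 1.0 / t                                  # decreasing step size
    A_t = A + rng.normal(scale=0.1, size=A.shape)    # noisy observation of A
    b_t = b + rng.normal(scale=0.1, size=b.shape)    # noisy observation of b
    theta = theta + alpha * (b_t - A_t @ theta)      # LSA update

print(theta, theta_star)                  # the iterate approaches theta*
```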

Still, Liang (2010) shows that it is theoretically possible to decompose the Markovian noise into a martingale part plus residual terms that decrease over time. Although this allows LSA to keep being applied, in practice these components cannot be obtained separately, so the results will be biased. Among the classical LSA methods applied to RL are Temporal Difference (TD) methods for the on-policy case and Gradient Temporal Difference (GTD) methods for the off-policy case. While these methods perform quite acceptably, they can be improved by correcting for the bias mentioned above.
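
For the on-policy case, TD with linear function approximation is itself an instance of LSA: each observed transition (s, r, s') produces the update θ ← θ + α (r + γ φ(s')ᵀθ − φ(s)ᵀθ) φ(s), where φ(s) is the feature vector of state s. A sketch of that update follows; the feature map, step size, and transition format are placeholders, not the paper's code:

```python
import numpy as np

def td0_linear(transitions, phi, dim, gamma=0.95, alpha=0.05):
    """One pass of TD(0) with linear value approximation V(s) ~ phi(s) @ theta.

    transitions: iterable of (state, reward, next_state) from a single trajectory.
    phi:         feature map, phi(state) -> np.ndarray of length `dim`.
    """
    theta = np.zeros(dim)
    for s, r, s_next in transitions:
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta = theta + alpha * td_error * phi(s)   # LSA-style update
    return theta
```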

Figure 1: TD vs. the Online Bootstrap.

It is precisely to make this improvement that Ramprasad et al. (2022) propose the Online Bootstrap Inference method. It consists of adding a statistical bootstrap to the classical methods: at each period, B randomly perturbed iterations of the LSA parameter are performed, and from these, confidence intervals for the value function are constructed. Figure 1 compares TD and the Online Bootstrap for the Frozen Lake environment of OpenAI Gym and an epsilon-greedy policy with epsilon = 0.2, built with the code provided by Ramprasad et al. (2022). The results show that TD (black line) closely approximates the true value function (dashed line), although it shows noticeable error up to about 2,000 iterations, while the Online Bootstrap (blue lines) guarantees that the true value function lies within the confidence interval from roughly 600 iterations onwards. On the other hand, we found that as the randomness of the policy being evaluated increases, the convergence of the Online Bootstrap algorithm slows down and the width of the confidence intervals grows significantly.
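
The idea can be sketched as follows: alongside the ordinary TD iterate, B perturbed copies of the parameter are updated with random multiplicative weights of mean one, and at any time the spread of those copies yields a confidence interval for the value estimate. This is a simplified illustration of the scheme rather than the authors' implementation; the weight distribution, step sizes, and interval construction below are assumptions made for the sketch:

```python
import numpy as np

def online_bootstrap_td(transitions, phi, dim, B=200, gamma=0.95, alpha=0.05,
                        seed=0):
    """TD(0) plus B randomly perturbed copies of the iterate (simplified sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)                 # ordinary TD iterate
    theta_b = np.zeros((B, dim))          # B bootstrap copies
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        delta = r + gamma * f_next @ theta - f @ theta
        theta = theta + alpha * delta * f
        # Each copy reuses the same transition, scaled by a random weight with mean 1
        # (exponential weights are an assumption for this sketch).
        w = rng.exponential(1.0, size=B)
        delta_b = r + gamma * theta_b @ f_next - theta_b @ f
        theta_b = theta_b + alpha * (w * delta_b)[:, None] * f
    return theta, theta_b

# A 95% interval for the value of a state s0 with features phi(s0):
# values = theta_b @ phi(s0); np.percentile(values, [2.5, 97.5])
```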

References:

Haskell, W. B. (2018). Introduction to dynamic programming. National University of Singapore, ISE 6509: Theory and Algorithms for Dynamic Programming.
Liang, F. (2010). Trajectory averaging for stochastic approximation MCMC algorithms. The Annals of Statistics, 38(5), 2823-2856.
Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W., and Cheng, G. (2022). Online bootstrap inference for policy evaluation in reinforcement learning. Journal of the American Statistical Association, pp. 1-14.
Sutton, R. and Barto, A. (2018). Reinforcement Learning: An Introduction. The MIT Press.
