We can view the world as an 'environment': in each period an agent is in a state and takes one of the actions available in that state; depending on how the environment is configured, the agent receives a reward and moves to a next state. A policy, in turn, is a vector of probabilities that, for each state, determines the probability of taking each of the available actions. An environment combined with a fixed policy is what the RL literature formally calls a Markov Reward Process (MRP).
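To make this concrete, here is a minimal sketch of how fixing a policy turns an environment into an MRP. All the states, actions, transition probabilities, and rewards below are made-up illustrative numbers, not taken from any particular environment:

```python
import numpy as np

# Hypothetical toy MDP with 3 states and 2 actions (numbers are purely illustrative).
# P[a, s, s_next]: probability of moving from s to s_next when taking action a.
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.3, 0.0, 0.7]],   # action 1
])
# R[a, s]: expected reward for taking action a in state s.
R = np.array([
    [1.0, 0.0, 2.0],
    [0.5, 1.5, 0.0],
])
# pi[s, a]: the policy, a vector of probabilities over actions for each state.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Fixing the policy collapses the MDP into an MRP:
# P_pi[s, s_next] = sum_a pi[s, a] * P[a, s, s_next],  r_pi[s] = sum_a pi[s, a] * R[a, s]
P_pi = np.einsum("sa,asn->sn", pi, P)
r_pi = np.einsum("sa,as->s", pi, R)
print(P_pi.sum(axis=1))  # each row sums to 1: a valid Markov Reward Process
```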
The objective is to estimate the value of following a given policy from a trajectory of dependent observations (this is what is framed as an MRP). The observations may be generated by the same policy being evaluated or by a different one; the first case is known as the on-policy scenario and the second as the off-policy scenario (Ramprasad et al., 2022).
Knowing the value of a policy allows us to compare different policies and choose one over another. A simple example is the case of quarantines, social welfare, and COVID-19: quarantining allows for greater control of contagion, but at the same time it increases poverty levels because people's income-generating capacity is reduced. Suppose you are a ruler who must decide, for each of the 365 days of the year, whether or not to impose a quarantine that day. One policy might be to quarantine only on even days, another might be to quarantine every 3 months, and so on. These are part of the range of policies you can implement; you want to compare them and, based on that, decide which one to implement.
Now, finding the value function of a policy is not trivial. In the best case, you know all the information about the environment and can access it easily; when that happens, the Bellman equation lets you find the value function without much effort (Sutton & Barto, 2018). However, in RL it is common for there to be uncertainty, or for the environment to be very large, so we cannot use this equation directly.
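In matrix form, the Bellman equation for a fully known MRP reads V = r_pi + gamma * P_pi V, so V = (I - gamma * P_pi)^(-1) r_pi. Reusing the hypothetical P_pi and r_pi from the snippet above, with an assumed discount factor:

```python
gamma = 0.95  # assumed discount factor, chosen for illustration

# Bellman equation in matrix form: V = r_pi + gamma * P_pi @ V
# When the environment is fully known and small, solve (I - gamma * P_pi) V = r_pi directly.
V = np.linalg.solve(np.eye(P_pi.shape[0]) - gamma * P_pi, r_pi)
print(V)  # exact value of the policy in each state
```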
In these cases, RL can turn to alternatives, one of them being Linear Stochastic Approximation (LSA). This is an iterative algorithm for finding solutions to linear equations from noisy observed data (Haskell, 2018). In the general framework, the LSA error is a martingale difference; in simple words, the independence assumption is relaxed and the error only needs to have "zero mean" given the past for the LSA algorithm to converge. The problem is that when LSA is applied to RL with a linear approximation of the value function, this assumption is violated and the noise becomes Markovian, meaning that the noise now depends on a discrete stochastic process in which the probability of an event depends on the immediately preceding event.
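In the abstract, LSA looks for the solution of a linear system A_bar @ theta = b_bar using only noisy observations (A_t, b_t) whose expectations are A_bar and b_bar. The sketch below uses made-up values and Gaussian noise standing in for the martingale-difference errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target linear system A_bar @ theta = b_bar (illustrative values).
A_bar = np.array([[2.0, 0.5], [0.5, 1.5]])
b_bar = np.array([1.0, 2.0])

theta = np.zeros(2)
for t in range(1, 5001):
    # Noisy observations: zero-mean, independent (martingale-difference) errors.
    A_t = A_bar + 0.1 * rng.standard_normal((2, 2))
    b_t = b_bar + 0.1 * rng.standard_normal(2)
    alpha_t = 0.5 / t ** 0.75        # decreasing step size
    theta = theta + alpha_t * (b_t - A_t @ theta)

print(theta, np.linalg.solve(A_bar, b_bar))  # LSA iterate vs. exact solution
```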
Still, Liang (2010) shows that it is theoretically possible to decompose the Markovian noise into a martingale-difference part plus a decreasing residual part. Although this allows us to keep applying LSA, in practice these parts cannot be obtained separately, so the results will be biased. Among the classical LSA methods applied to RL are Temporal Difference (TD) methods for the on-policy case and Gradient Temporal Difference (GTD) methods for the off-policy case. These methods, while performing quite acceptably, can be improved by trying to correct for the bias mentioned above.
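As one concrete instance, TD(0) with a linear approximation V(s) ≈ phi(s)ᵀ theta fits the LSA template above. This is only a rough sketch: `env` (assumed to follow the newer Gymnasium-style reset/step signatures), `policy`, and the feature map `phi` are hypothetical names, not part of any specific library:

```python
import numpy as np

def td0_linear(env, policy, phi, dim, gamma=0.95, alpha=0.05, episodes=500):
    """On-policy TD(0) with linear function approximation (illustrative sketch)."""
    theta = np.zeros(dim)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD error; the noise is Markovian because s and s_next come from
            # the same dependent trajectory rather than i.i.d. draws.
            v_next = 0.0 if done else phi(s_next) @ theta
            delta = r + gamma * v_next - phi(s) @ theta
            theta = theta + alpha * delta * phi(s)
            s = s_next
    return theta
```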
Figure 1: TD vs. the Online Bootstrap.
It is precisely to make this improvement that Ramprasad et al. (2022) propose the Online Bootstrap Inference method. It consists of adding a statistical bootstrap to the classical methods: in each period, B randomly perturbed copies of the LSA iterate are updated, and from these, confidence intervals for the value function are constructed. Figure 1 compares TD and the Online Bootstrap on the Frozen Lake environment of OpenAI Gym, with an epsilon-greedy policy with epsilon = 0.2, built with the code proposed by Ramprasad et al. (2022). The results show that TD (black line) closely approximates the true value function (dashed line), although it shows noticeable error up to about 2,000 iterations, while the Online Bootstrap (blue lines) keeps the true value function within the confidence interval from roughly 600 iterations onwards. On the other hand, we found that as the randomness of the policy being evaluated increases, the convergence of the Online Bootstrap algorithm slows down and the width of the confidence intervals increases significantly.
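To give a flavour of the idea, the sketch below is our own simplified rendering of an online multiplier bootstrap around TD(0): B randomly reweighted copies of the iterate are updated on every transition, and a confidence interval for a state's value is read from their quantiles. It reuses the hypothetical `env`, `policy`, and `phi` names from the previous sketch and is not the authors' exact algorithm or code:

```python
import numpy as np

def online_bootstrap_td(env, policy, phi, dim, B=200, gamma=0.95,
                        alpha=0.05, episodes=500, seed=0):
    """Simplified online multiplier bootstrap around TD(0) (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)        # main TD iterate
    boot = np.zeros((B, dim))    # B randomly perturbed copies
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            feat = phi(s)
            feat_next = np.zeros(dim) if done else phi(s_next)
            # Main TD update.
            delta = r + gamma * feat_next @ theta - feat @ theta
            theta = theta + alpha * delta * feat
            # Each copy sees the same transition scaled by a random mean-one
            # weight, so the spread of the copies tracks estimation uncertainty.
            w = rng.exponential(1.0, size=B)
            delta_b = r + gamma * boot @ feat_next - boot @ feat
            boot = boot + alpha * (w * delta_b)[:, None] * feat
            s = s_next
    return theta, boot

def value_interval(boot, phi_s, level=0.95):
    """Percentile confidence interval for the value of one state."""
    vals = boot @ phi_s
    q = (1.0 - level) / 2.0
    return np.quantile(vals, q), np.quantile(vals, 1.0 - q)
```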
References:
Haskell, W. B. (2018). Introduction to dynamic programming. National University of Singapore, ISE 6509: Theory and Algorithms for Dynamic Programming.
Liang, F. (2010). Trajectory averaging for stochastic approximation MCMC algorithms. The Annals of Statistics, Vol. 38, No. 5 (October 2010), pp. 2823-2856.
Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W., and Cheng, G. (2022). Online bootstrap inference for policy evaluation in reinforcement learning. Journal of the American Statistical Association, pp. 1-14.
Sutton, R. and Barto, A. (2018). Reinforcement learning: An introduction. The MIT Press.