In the previous articles in this series, we discussed Monte-Carlo (MD) and Temporal-Difference (TD) Learning methods and how they estimate value functions of unknown Markov Decision Processes (MDPs). In this article, we compare these methods to understand their similarities and differences.

This article is the third of a multi-part series that focuses on model-free agents, arranged into individual articles for prediction and control algorithms. Model-Free Prediction (MFP) focuses on estimating the value function of an unknown MDP. While its counterpart, Model-Free Control (MFC), focuses on optimising the unknown MDPs value function. For more information on MFC, check out the article series!

## Article Contents

## MC and TD Comparsion

MC and TD methods are great RL algorithms that provide a solid foundation for solving various environments, where MC solves episodic tasks, and TD solves both continuous and episodic tasks.

Recall that a Markov property captures all the relevant information from the history in a single state (the present state). When using this property, algorithms can increase their computational efficiency as they do not have to rely on looking through the whole environment history. Interestingly, MC methods rarely use the Markov property, making them more efficient in non-Markov environments (e.g., partially observed environments). Comparatively, TD methods tend to exploit it, making these methods way more efficient in Markov environments. Without utilising the Markov property, MC is likely to perform slower with large MDPs.

### Bias/Variance Trade-off

Next, let’s discuss a critical topic in any Machine Learning application: bias/variance trade-off. In RL, bias and variance refer to how well the reinforcement signal (agent’s performance) reflects the true value function \(v_\pi(S_t)\). Variance refers to noise in the average accuracy of the value estimate, and bias refers to a stable but inaccurate value estimate.

MC methods have high variance and zero bias, providing good convergence properties (even with function approximation) and are simple to implement. Its return \(G_t\) (equation 1.1.1) depends on many random actions, transitions and rewards, causing an unbiased estimate of the true value function \(v_\pi(S_t)\) and produces lots of noise within the data samples.

\(\begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T \end{equation}\)

(1.1.1)

\(\begin{equation} R_{t+1} + \gamma v_\pi(S_{t+1}) \end{equation}\)

(1.1.2)

TD methods, however, have low variance and low bias, making them more efficient than MC. In rare circumstances (when using function approximation), TD methods cannot converge to the true value function. Its form of return, the TD target (equation 1.1.2), starts as a biased estimate and changes to an unbiased one when it reaches the true TD target. Due to the TD target only depending on one random action, transition and reward, the noise generated is substantially lower than the MC return.

### Bootstrapping vs Sampling

Over the past few articles, we have discussed the three foundational RL methods in detail: Monte-Carlo Learning, Temporal-Difference Learning, and Dynamic Programming. However, we haven't examined the format that the algorithms use. These formats can be combined or used separately and are a critical component to how they operate. There are two formats: bootstrapping and sampling.

Algorithm | Bootstrapping | Sampling |

Monte-Carlo | No | Yes |

Temporal-Difference | Yes | Yes |

Dynamic Programming | Yes | No |

Bootstrapping is an approach that uses one or more estimated values during the update step to update other estimated values. More formally, they update estimates based on other estimates. To better understand this, consider the update step in TD(0) (equation 1.2.1).

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \Big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \Big) \end{equation}\)

(1.2.1)

Focusing on the secondary component, we can see that \(R_{t+1} + \gamma V(S_{t+1})\) represents the TD target, an estimate of the true value function, and \(V(S_t)\) is the current estimated value function. Here we update an approximation of the value function with another one.

Sampling is a widely used technique in Machine Learning. In RL specifically, it involves taking batches of observations to train the agent. New samples are generated each training iteration, where different algorithms use different sampling methods, such as random, reward-based, and stratified.

Table 1.2.1 highlights a clear comparison between the three algorithms. In summary, the table shows that MC is a non-bootstrapping method that uses sampling, Dynamic Programming is exclusively a bootstrapping method, and TD uses both formats.

Another way to see a clear difference between the algorithms is through their backup diagrams. Figure 3.2.1 shows backup diagrams of each of the algorithms. Each one contains a red shaded area that indicates the algorithm's learning path, a region the agent explores before updating its value function. As the diagrams progress from DP to TD, the exploration area gets smaller, reducing the work needed before the agent learns.