# CS394R Categorical Temporal-Difference Learning

## Categorical Temporal-Difference Learning

Categorical dynamic programming (CDP) computes a sequence $\left(\eta_k\right)_{k \geq 0}$ of return-distribution functions, defined by iteratively applying the projected distributional Bellman operator $\Pi_{\mathrm{C}} \mathcal{T}^\pi$ to an initial return-distribution function $\eta_0$:
$$\eta_{k+1}=\Pi_{\mathrm{C}} \mathcal{T}^\pi \eta_k .$$
As we established in Section 5.9, the sequence generated by CDP converges to the fixed point $\hat{\eta}_{\mathrm{C}}^\pi$. Let us express this fixed point in terms of a collection of probabilities $\left(\left(p_i^\pi(x)\right)_{i=1}^m: x \in \mathcal{X}\right)$ associated with $m$ particles located at $\theta_1, \ldots, \theta_m$:
$$\hat{\eta}_{\mathrm{C}}^\pi(x)=\sum_{i=1}^m p_i^\pi(x) \delta_{\theta_i} .$$
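The CDP iteration can be sketched numerically on a small example. The following is a minimal sketch rather than a reference implementation: it assumes a Markov reward process (the policy is folded into a transition matrix `P`), a deterministic reward `r[x]` per state, and evenly spaced particles. The names `categorical_projection`, `categorical_dp`, `P`, and `r` are illustrative and do not come from the text.

```python
import numpy as np


def categorical_projection(atoms, probs, support):
    """Project sum_j probs[j] * delta_{atoms[j]} onto an evenly spaced support.

    Each atom is clipped to [support[0], support[-1]] and its mass is split
    between the two neighbouring support points in proportion to proximity.
    """
    m = len(support)
    gap = support[1] - support[0]
    out = np.zeros(m)
    for z, p in zip(atoms, probs):
        pos = (np.clip(z, support[0], support[-1]) - support[0]) / gap
        lo = min(int(np.floor(pos)), m - 1)
        hi = min(lo + 1, m - 1)
        w = pos - lo  # fraction of the mass sent to the upper neighbour
        out[lo] += p * (1.0 - w)
        out[hi] += p * w
    return out


def categorical_dp(P, r, gamma, support, num_iters):
    """Iterate eta_{k+1} = Pi_C T^pi eta_k for a Markov reward process.

    P[x, y] is the probability of moving from state x to state y, and r[x]
    is a deterministic reward received in state x (assumptions of this sketch).
    """
    n, m = P.shape[0], len(support)
    eta = np.full((n, m), 1.0 / m)  # eta_0: uniform over the particles
    for _ in range(num_iters):
        new_eta = np.empty_like(eta)
        for x in range(n):
            # (T^pi eta)(x): a mixture over next states of the shifted and
            # scaled particles r[x] + gamma * theta_i.
            atoms = np.tile(r[x] + gamma * support, n)
            probs = np.concatenate([P[x, y] * eta[y] for y in range(n)])
            new_eta[x] = categorical_projection(atoms, probs, support)
        eta = new_eta
    return eta


# Hypothetical two-state example: state 1 is absorbing with reward 1.
support = np.linspace(0.0, 4.0, 5)
P = np.array([[0.5, 0.5], [0.0, 1.0]])
r = np.array([0.0, 1.0])
eta = categorical_dp(P, r, gamma=0.5, support=support, num_iters=100)
print(eta)  # row x holds the probabilities p_i(x) on the particles
```

Each row of the result plays the role of the probabilities $p_i^\pi(x)$ in the fixed-point expression above. In this example no particle is ever shifted outside the support, so nothing is clipped and the projection should preserve the mean of each distribution.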
To derive an incremental algorithm from the categorical-projection Bellman operator, let us begin by expressing the projected distributional operator $\Pi_{\mathrm{C}} \mathcal{T}^\pi$ in terms of an expectation over the sample transition $\left(X=x, A, R, X^{\prime}\right)$ :
$$\left(\Pi_{\mathrm{C}} \mathcal{T}^\pi \eta\right)(x)=\Pi_{\mathrm{C}} \mathbb{E}_\pi\left[\left(\mathrm{b}_{R, \gamma}\right)_{\#} \eta\left(X^{\prime}\right) \mid X=x\right] . \tag{6.8}$$
Following the line of reasoning from Section 6.2, in order to construct an unbiased sample target by substituting $R$ and $X^{\prime}$ with their realisations, we need to rewrite Equation 6.8 with the expectation outside of the projection $\Pi_{\mathrm{C}}$. The following establishes the validity of exchanging the order of these two operations.
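Exchanging the two operations amounts to the identity $\left(\Pi_{\mathrm{C}} \mathcal{T}^\pi \eta\right)(x)=\mathbb{E}_\pi\left[\Pi_{\mathrm{C}}\left(\mathrm{b}_{R, \gamma}\right)_{\#} \eta\left(X^{\prime}\right) \mid X=x\right]$. It can be checked numerically on a toy case. The sketch below assumes evenly spaced particles and a projection that splits each atom's mass between its two nearest particles. The distributions in `eta` and the transitions in `outcomes` are made-up values for illustration.

```python
import numpy as np


def categorical_projection(atoms, probs, support):
    """Project sum_j probs[j] * delta_{atoms[j]} onto an evenly spaced support."""
    m = len(support)
    gap = support[1] - support[0]
    out = np.zeros(m)
    for z, p in zip(atoms, probs):
        pos = (np.clip(z, support[0], support[-1]) - support[0]) / gap
        lo = min(int(np.floor(pos)), m - 1)
        hi = min(lo + 1, m - 1)
        w = pos - lo  # fraction of the mass sent to the upper neighbour
        out[lo] += p * (1.0 - w)
        out[hi] += p * w
    return out


support = np.linspace(0.0, 4.0, 5)
gamma = 0.9

# Hypothetical categorical return distributions at two next states.
eta = {
    "a": np.array([0.1, 0.2, 0.4, 0.2, 0.1]),
    "b": np.array([0.0, 0.0, 0.5, 0.25, 0.25]),
}
# Hypothetical outcomes (probability, reward, next state) from a fixed state x.
outcomes = [(0.3, 0.0, "a"), (0.5, 1.0, "b"), (0.2, 2.0, "a")]

# Projection outside the expectation: build the full mixture, project once.
atoms = np.concatenate([rew + gamma * support for p, rew, nxt in outcomes])
probs = np.concatenate([p * eta[nxt] for p, rew, nxt in outcomes])
projected_mixture = categorical_projection(atoms, probs, support)

# Expectation outside the projection: average the projected sample targets.
mean_of_projections = sum(
    p * categorical_projection(rew + gamma * support, eta[nxt], support)
    for p, rew, nxt in outcomes
)

print(np.allclose(projected_mixture, mean_of_projections))
```

The two quantities agree because this projection treats each atom independently and is therefore linear in the probabilities. Each term `categorical_projection(rew + gamma * support, eta[nxt], support)` is the kind of sample target that a single realised transition would produce.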

## Quantile Temporal-Difference Learning

Quantile regression is a method for determining the quantiles of a probability distribution incrementally and from samples. In this section, we develop an algorithm that aims to find the fixed point $\hat{\eta}_{\mathrm{Q}}^\pi$ of the quantile-projected Bellman operator $\Pi_{\mathrm{Q}} \mathcal{T}^\pi$ via quantile regression.

To begin, suppose that given $\tau \in(0,1)$ we are interested in estimating the $\tau^{\text {th }}$ quantile of a distribution $\nu$, corresponding to $F_\nu^{-1}(\tau)$. Quantile regression maintains an estimate $\theta$ of this quantile. Given a sample $z$ drawn from $\nu$, it adjusts $\theta$ according to
$$\theta \leftarrow \theta+\alpha\left(\tau-\mathbb{1}_{\{z<\theta\}}\right) . \tag{6.11}$$
One can show that quantile regression follows the negative gradient of the quantile loss
$$\begin{aligned} \mathcal{L}_\tau(\theta) &=\left(\tau-\mathbb{1}_{\{z<\theta\}}\right)(z-\theta) \\ &=\left|\mathbb{1}_{\{z<\theta\}}-\tau\right| \times|z-\theta| . \end{aligned} \tag{6.12}$$
In Equation 6.12, the term $\left|\mathbb{1}_{\{z<\theta\}}-\tau\right|$ is an asymmetric step size which is either $\tau$ or $1-\tau$, according to whether the sample $z$ is greater or smaller than $\theta$, respectively. When $\tau<0.5$, samples greater than $\theta$ have a lesser effect on it than samples smaller than $\theta$; the effect is reversed when $\tau>0.5$. The update rule in Equation 6.11 will continue to adjust the estimate until the equilibrium point $\theta^*$ is reached (Exercise 6.4 asks you to visualise the behaviour of quantile regression with different distributions). This equilibrium point is the location at which smaller and larger samples have an equal effect in expectation. At that point, letting $Z \sim \nu$, we have
$$\begin{aligned} 0 &=\mathbb{E}\left[\tau-\mathbb{1}_{\left\{Z<\theta^*\right\}}\right] \\ &=\tau-\mathbb{E}\left[\mathbb{1}_{\left\{Z<\theta^*\right\}}\right] \\ &=\tau-\mathbb{P}\left(Z<\theta^*\right) \\ \Longrightarrow \mathbb{P}\left(Z<\theta^*\right) &=\tau \\ \Longrightarrow \theta^* &=F_\nu^{-1}(\tau) . \end{aligned}$$
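The claim that the update follows the negative gradient of the quantile loss can be checked case by case. This is a short derivation assuming $z \neq \theta$; the loss has a kink at $\theta=z$, where only a subgradient exists.
$$\begin{aligned} z>\theta: \quad \mathcal{L}_\tau(\theta) &=\tau(z-\theta), & \frac{\partial}{\partial \theta} \mathcal{L}_\tau(\theta) &=-\tau ; \\ z<\theta: \quad \mathcal{L}_\tau(\theta) &=(\tau-1)(z-\theta), & \frac{\partial}{\partial \theta} \mathcal{L}_\tau(\theta) &=1-\tau . \end{aligned}$$
In both cases $\frac{\partial}{\partial \theta} \mathcal{L}_\tau(\theta)=-\left(\tau-\mathbb{1}_{\{z<\theta\}}\right)$, so a gradient step $\theta \leftarrow \theta-\alpha \frac{\partial}{\partial \theta} \mathcal{L}_\tau(\theta)$ with step size $\alpha$ reproduces the quantile regression update.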

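The quantile regression update can be tried on samples from a known distribution. The sketch below assumes a standard normal $\nu$, a constant step size $\alpha=0.01$, and an initial estimate of $0$. These choices, and the function name `quantile_regression`, are illustrative and not taken from the text.

```python
import numpy as np


def quantile_regression(samples, tau, alpha, theta0=0.0):
    """Run theta <- theta + alpha * (tau - 1{z < theta}) over a stream of samples."""
    theta = theta0
    for z in samples:
        theta += alpha * (tau - float(z < theta))
    return theta


rng = np.random.default_rng(0)
samples = rng.normal(size=50_000)  # assumed distribution nu: standard normal

estimates = {
    tau: quantile_regression(samples, tau, alpha=0.01)
    for tau in (0.1, 0.5, 0.9)
}
for tau, theta in estimates.items():
    print(f"tau={tau}: estimate {theta:.3f}")
```

With a constant step size the estimate does not settle exactly: it keeps fluctuating in a small neighbourhood of $F_\nu^{-1}(\tau)$. A decaying step size is one common way to shrink these fluctuations, at the cost of slower adaptation.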

