# CS285 Policy Improvement

## Policy Improvement

Given a policy $\pi$ and its known value function, construct an improved policy $\pi^{\prime}$.

Greedy policy improvement
$$\pi^{\prime}(a \mid s)= \begin{cases}1 & \text { if } a=\arg \max _a \sum_{s^{\prime} \in \mathcal{S}} P\left(s^{\prime} \mid s, a\right)\left(r\left(s, a, s^{\prime}\right)+\gamma V_\pi\left(s^{\prime}\right)\right) \\ & \text { i.e., } a=\arg \max _a Q_\pi(s, a) \\ 0 & \text { otherwise }\end{cases}$$

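Obtaining the greedy policy from $Q_\pi(s, a)$ is model-free and simpler: it is a plain argmax over actions, whereas improving from $V_\pi$ requires the transition model $P$ and the reward $r$ for the one-step lookahead.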
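The two routes to the greedy policy can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state MDP, the dictionary layout `P[s][a] = [(next_state, probability, reward), ...]`, the value estimate `V`, and the helper names `q_from_v` and `greedy_policy` are all assumptions made for the example.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (next_state, probability, reward) triples.
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 2.0)]},
}
GAMMA = 0.9


def q_from_v(P, V, gamma):
    """One-step lookahead (needs the model):
    Q(s, a) = sum_{s'} P(s'|s, a) * (r(s, a, s') + gamma * V(s'))."""
    return {
        s: {
            a: sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
            for a, outcomes in actions.items()
        }
        for s, actions in P.items()
    }


def greedy_policy(Q):
    """Model-free step: pick argmax_a Q(s, a) in every state.
    Ties are broken in favour of the first maximal action."""
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}


V = {0: 0.0, 1: 10.0}       # assumed value estimate for some policy pi
Q = q_from_v(P, V, GAMMA)   # requires P and r
pi_new = greedy_policy(Q)   # requires only Q
```

Only `q_from_v` touches the model; once `Q` is available, `greedy_policy` needs nothing else, which is the point made above.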
$\epsilon$-greedy policy improvement
$$\pi^{\prime}(a \mid s)= \begin{cases}\frac{\epsilon}{|\mathcal{A}|}+1-\epsilon, & a=\arg \max _a Q_\pi(s, a) \\ \frac{\epsilon}{|\mathcal{A}|}, & \text { otherwise }\end{cases}$$
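For $\epsilon>0$, the $\epsilon$-greedy policy ensures continual exploration: every action has probability at least $\epsilon /|\mathcal{A}|$ in every state, so all actions keep being tried.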
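As a small sketch of the $\epsilon$-greedy rule, the code below builds the action distribution for one state and samples from it. The action names, the values in `q_s`, and the helper names `epsilon_greedy_probs` and `sample_action` are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy_probs(q_s, epsilon):
    """Return pi'(a|s) for one state, given its action values q_s (a dict).

    Every action gets epsilon / |A|; the greedy action gets the
    remaining 1 - epsilon on top of that.
    """
    n = len(q_s)
    best = max(q_s, key=q_s.get)  # ties: first maximal action wins
    probs = {a: epsilon / n for a in q_s}
    probs[best] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw one action according to the probabilities in probs."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


q_s = {"left": 0.2, "right": 1.0, "stay": -0.5}  # hypothetical action values
probs = epsilon_greedy_probs(q_s, epsilon=0.1)
action = sample_action(probs)
```

With `epsilon=0.1` and three actions, the greedy action `"right"` receives $0.1/3 + 0.9$ and each of the others $0.1/3$, so no action ever has zero probability.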

## Policy iteration

Policy evaluation

For a given policy $\pi$, evaluate the state-value function $V_\pi(s)$ at each state $s \in \mathcal{S}$.

Iterative application of the Bellman expectation backup:
$$\begin{aligned} V(s) & \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} P\left(s^{\prime} \mid s, a\right)\left[r\left(s, a, s^{\prime}\right)+\gamma V\left(s^{\prime}\right)\right] \\ \text { or } Q(s, a) & \leftarrow \sum_{s^{\prime} \in \mathcal{S}} P\left(s^{\prime} \mid s, a\right)\left[r\left(s, a, s^{\prime}\right)+\gamma \sum_{a^{\prime} \in \mathcal{A}} \pi\left(a^{\prime} \mid s^{\prime}\right) Q\left(s^{\prime}, a^{\prime}\right)\right] \end{aligned}$$

For $\gamma<1$, repeating this backup converges to $V_\pi$ (respectively $Q_\pi$), the solution of the Bellman expectation equation.
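A minimal sketch of iterative policy evaluation for the state-value form of the backup is given below. The two-state MDP, the uniform random policy, the stopping tolerance, and the choice of in-place (rather than synchronous) sweeps are assumptions of this example, not details taken from the notes.

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (next_state, probability, reward) triples.
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 2.0)]},
}


def policy_evaluation(P, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Repeat the Bellman expectation backup until values stop changing.

    pi[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update; later states see the new value
        if delta < tol:   # largest change in this sweep is negligible
            break
    return V


# Uniform random policy: each of the two actions with probability 0.5.
pi_uniform = {s: {a: 0.5 for a in P[s]} for s in P}
V_pi = policy_evaluation(P, pi_uniform, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: solving the two linear Bellman equations gives $V_\pi(0)=7.25$ and $V_\pi(1)=7.75$, which the iteration approaches.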

Policy improvement

Update the policy from the evaluated value function, using either the greedy rule or the $\epsilon$-greedy rule.
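Policy iteration alternates the two steps, evaluation and improvement, until the policy stops changing. The sketch below uses the greedy rule and deterministic policies; the two-state MDP, the helper names `evaluate`, `improve`, and `policy_iteration`, the tolerance, and the iteration cap are all assumptions made for illustration.

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (next_state, probability, reward) triples.
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 2.0)]},
}


def evaluate(P, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterative policy evaluation for a deterministic policy pi[s] -> a."""
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            v_new = sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][pi[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    return V


def improve(P, V, gamma):
    """Greedy improvement: argmax over a one-step lookahead on V."""
    pi = {}
    for s in P:
        q_s = {
            a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
            for a in P[s]
        }
        pi[s] = max(q_s, key=q_s.get)
    return pi


def policy_iteration(P, gamma, max_iters=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    pi = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    V = evaluate(P, pi, gamma)
    for _ in range(max_iters):
        pi_new = improve(P, V, gamma)
        if pi_new == pi:  # no state changed its action: stop
            break
        pi = pi_new
        V = evaluate(P, pi, gamma)
    return pi, V


pi_star, V_star = policy_iteration(P, gamma=0.9)
```

In this toy MDP the loop settles on action `1` in both states, with values close to 19 and 20. Swapping `improve` for an $\epsilon$-greedy rule would require `evaluate` to handle stochastic policies, which this sketch does not do.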


