# Free Energy

EM can be viewed as optimizing the model parameters $\theta$ together with the distribution $\xi$. The Free Energy for a Hidden Markov Model is:
$$\begin{aligned} F(\theta, \xi)= & -\sum_i \gamma_1(i) \ln a_i-\sum_{i, j} \sum_{t=1}^{T-1} \xi_t(i, j) \ln A_{i j}-\sum_i \sum_{t=1}^T \gamma_t(i) \ln p\left(\mathbf{y}_t \mid s_t=i\right) \\ & +\sum_{i, j} \sum_{t=1}^{T-1} \xi_t(i, j) \ln \xi_t(i, j)-\sum_i \sum_{t=2}^{T-2} \gamma_t(i) \ln \gamma_t(i) \end{aligned}$$
where $\gamma$ is defined as a function of $\xi$ as:
$$\gamma_t(i)=\sum_k \xi_t(i, k)=\sum_k \xi_{t-1}(k, i)$$
Warning! Since we weren’t able to find any formula for the free energy, we derived it from scratch (see below). In our tests, it didn’t precisely match the negative log-likelihood. So there might be a mistake here, although the free energy did decrease as expected.
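The two expressions for $\gamma_t(i)$ given above can be checked numerically on a toy model. The sketch below is only an illustration: the parameter values, the table `b` (where `b[t][i]` stands in for $p(\mathbf{y}_t \mid s_t=i)$), and the brute-force enumeration used to obtain $\xi$ are all assumptions made for this example, not part of the notes.

```python
from itertools import product

# Hypothetical toy HMM with K = 2 states and T = 4 observations.
a = [0.5, 0.5]                        # initial state probabilities a_i
A = [[0.9, 0.1], [0.3, 0.7]]          # transition probabilities A_ij
b = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.5, 0.5]]  # b[t][i] ~ p(y_t | s_t = i)
K, T = len(a), len(b)

def joint(s):
    """p(y_{1:T}, s) for a single state sequence s."""
    p = a[s[0]] * b[0][s[0]]
    for t in range(1, T):
        p *= A[s[t - 1]][s[t]] * b[t][s[t]]
    return p

# Normalizer: the likelihood p(y_{1:T}), summed over all K**T sequences.
Z = sum(joint(s) for s in product(range(K), repeat=T))

# xi[t][i][j] = p(s_t = i, s_{t+1} = j | y_{1:T}), with zero-based t = 0..T-2.
xi = [[[0.0] * K for _ in range(K)] for _ in range(T - 1)]
for s in product(range(K), repeat=T):
    w = joint(s) / Z
    for t in range(T - 1):
        xi[t][s[t]][s[t + 1]] += w

# gamma computed by summing out the second index of xi_t (covers t = 0..T-2)...
gamma_rows = [[sum(xi[t][i][k] for k in range(K)) for i in range(K)]
              for t in range(T - 1)]
# ...and by summing out the first index of xi_t (this is gamma at time t + 1).
gamma_cols = [[sum(xi[t][k][i] for k in range(K)) for i in range(K)]
              for t in range(T - 1)]

# On the interior time steps both expressions should give the same gamma.
max_gap = max(abs(gamma_rows[t + 1][i] - gamma_cols[t][i])
              for t in range(T - 2) for i in range(K))
print("largest disagreement between the two expressions:", max_gap)
```

Enumeration costs $K^T$ operations, so it is only practical for very short sequences, but it makes the marginalization explicit.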

Derivation. This material is very advanced and not required for the course. It is mainly here because we couldn’t find it elsewhere.

As a short-hand, we define $\mathbf{s}=s_{1: T}$ to be a variable representing an entire state sequence. The likelihood of a data sequence is:
$$p\left(\mathbf{y}_{1: T}\right)=\sum_{\mathbf{s}} p\left(\mathbf{y}_{1: T}, \mathbf{s}\right)$$
where the summation is over all possible state sequences.
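To make the sum over state sequences concrete, here is a minimal sketch that evaluates it literally for a tiny model and compares the result with the standard forward recursion. The forward recursion is not introduced in this section; it is brought in here only as a cross-check. The parameter values and the table `b` (with `b[t][i]` standing in for $p(\mathbf{y}_t \mid s_t=i)$) are made up for illustration.

```python
from itertools import product

def joint(a, A, b, s):
    """p(y_{1:T}, s) for one state sequence s."""
    p = a[s[0]] * b[0][s[0]]
    for t in range(1, len(s)):
        p *= A[s[t - 1]][s[t]] * b[t][s[t]]
    return p

def likelihood_brute_force(a, A, b):
    """Sum the joint probability over all K**T state sequences."""
    K, T = len(a), len(b)
    return sum(joint(a, A, b, s) for s in product(range(K), repeat=T))

def likelihood_forward(a, A, b):
    """Same quantity via the forward recursion, in O(T * K**2) time."""
    K, T = len(a), len(b)
    alpha = [a[i] * b[0][i] for i in range(K)]
    for t in range(1, T):
        alpha = [b[t][j] * sum(alpha[i] * A[i][j] for i in range(K))
                 for j in range(K)]
    return sum(alpha)

# Hypothetical toy parameters: 2 states, 3 observations.
a = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]]

p_brute = likelihood_brute_force(a, A, b)
p_forward = likelihood_forward(a, A, b)
print(p_brute, p_forward)
```

The two values should agree up to floating-point rounding; the brute-force version grows exponentially with $T$, which is why it is useful only as a check.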

## Most likely state sequences

Suppose we wanted to compute the most likely states $s_t$ for each time in a sequence. There are two ways that we might do it: we could take the most likely state sequence:
$$s_{1: T}^*=\arg \max _{s_{1: T}} p\left(s_{1: T} \mid \mathbf{y}_{1: T}\right)$$
or we could take the sequence of most-likely states:
$$s_t^*=\arg \max _{s_t} p\left(s_t \mid \mathbf{y}_{1: T}\right)$$
While these sequences may often be similar, they can be different as well. For example, it is possible that the most likely states for two consecutive time-steps do not have a valid transition between them, i.e., if $s_t^*=i$ and $s_{t+1}^*=j$, it is possible (though unlikely) that $A_{i j}=0$.

This illustrates that these two ways to create sequences of states answer two different questions: what sequence is jointly most likely? And, for each time-step, what is the most likely state just for that time-step?
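The difference between the two answers can be seen in a small constructed example. The sketch below uses a max-product (Viterbi-style) recursion for the jointly most likely sequence and a forward-backward pass for the per-time-step posteriors. Neither algorithm is described in this section, so treat both as assumed tools. The three-state model, with uninformative observations (`b[t][i] = 1`) and zero-based state labels, is hypothetical and chosen so that the two answers disagree.

```python
def most_likely_sequence(a, A, b):
    """Jointly most likely state sequence via a max-product recursion."""
    K, T = len(a), len(b)
    delta = [a[i] * b[0][i] for i in range(K)]
    back = []
    for t in range(1, T):
        # Best predecessor of each state j at time t.
        prev = [max(range(K), key=lambda i: delta[i] * A[i][j]) for j in range(K)]
        delta = [delta[prev[j]] * A[prev[j]][j] * b[t][j] for j in range(K)]
        back.append(prev)
    path = [max(range(K), key=lambda i: delta[i])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]

def posterior_marginals(a, A, b):
    """gamma[t][i] = p(s_t = i | y_{1:T}) via a forward-backward pass."""
    K, T = len(a), len(b)
    alpha = [[a[i] * b[0][i] for i in range(K)]]
    for t in range(1, T):
        alpha.append([b[t][j] * sum(alpha[-1][i] * A[i][j] for i in range(K))
                      for j in range(K)])
    beta = [[1.0] * K]
    for t in range(T - 1, 0, -1):
        beta.insert(0, [sum(A[i][j] * b[t][j] * beta[0][j] for j in range(K))
                        for i in range(K)])
    Z = sum(alpha[-1])
    return [[alpha[t][i] * beta[t][i] / Z for i in range(K)] for t in range(T)]

# Hypothetical model: state 0 can never follow itself (A[0][0] = 0).
a = [0.4, 0.35, 0.25]
A = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
b = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]   # uninformative observations, T = 2

joint_best = most_likely_sequence(a, A, b)
gamma = posterior_marginals(a, A, b)
marginal_best = [max(range(len(a)), key=lambda i: g[i]) for g in gamma]
print("jointly most likely sequence:", joint_best)
print("sequence of most likely states:", marginal_best)
```

In this toy model the jointly most likely sequence is `[1, 0]`, while the per-time-step choice is `[0, 0]`, which has zero probability as a sequence because `A[0][0]` is zero.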

Suppose we are given $N$ training vectors $\left\{\left(\mathbf{x}_i, y_i\right)\right\}$, where $\mathbf{x} \in \mathbb{R}^D, y \in\{-1,1\}$. We want to learn a classifier
$$f(\mathbf{x})=\mathbf{w}^T \phi(\mathbf{x})+b$$
so that the classifier’s output for a new $\mathbf{x}$ is $\operatorname{sign}(f(\mathbf{x}))$.
Suppose that our training data are linearly separable in the feature space $\phi(\mathbf{x})$, i.e., as illustrated in Figure 32, the two classes of training exemplars are sufficiently well separated in the feature space that one can draw a hyperplane between them (e.g., a line in 2D, or a plane in 3D). If they are linearly separable then in almost all cases there will be many possible choices for the linear decision boundary, each one of which will produce no classification errors on the training data. Which one should we choose? If we place the boundary very close to some of the data, there seems to be a greater danger that we will misclassify some data, especially when the training data are almost certainly noisy.

This motivates the idea of placing the boundary to maximize the margin, that is, the distance from the hyperplane to the closest data point in either class. This can be thought of as having the largest “margin for error” – if you are driving a fast car between a scattered set of obstacles, it’s safest to find a path that stays as far from them as possible.
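As a rough illustration of the margin, the sketch below measures the distance from a hyperplane to the closest training point for two different boundaries that both separate the same data. It assumes the identity feature map $\phi(\mathbf{x})=\mathbf{x}$; the four data points and both choices of $(\mathbf{w}, b)$ are invented for the example.

```python
import math

def margin(w, b, data):
    """Smallest value of y_i * f(x_i) / ||w|| over the data.

    With phi(x) = x this is the distance from the hyperplane f(x) = 0 to the
    closest point, and it is positive only if every point is on the correct side.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
               for x, y in data)

# Hypothetical linearly separable data in 2D, labels in {-1, +1}.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two boundaries with the same orientation but different offsets.
centered = margin((1.0, 1.0), -2.0, data)   # passes midway between the classes
skewed = margin((1.0, 1.0), -3.5, data)     # passes close to the point (2, 2)
print("margin of centered boundary:", centered)
print("margin of skewed boundary:", skewed)
```

Both boundaries classify the training points correctly, but the centered one leaves a larger distance to the nearest point, which is the quantity the max-margin criterion prefers.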


