# 3.9 Connections between Binomial and Hypergeometric

The Binomial and Hypergeometric distributions are connected in two important ways. As we will see in this section, we can get from the Binomial to the Hypergeometric by conditioning, and we can get from the Hypergeometric to the Binomial by taking a limit. We’ll start with a motivating example.
Example 3.9.1 (Fisher exact test). A scientist wishes to study whether women or men are more likely to have a certain disease, or whether they are equally likely. A random sample of $n$ women and $m$ men is gathered, and each person is tested for the disease (assume for this problem that the test is completely accurate). The numbers of women and men in the sample who have the disease are $X$ and $Y$ respectively, with $X \sim \operatorname{Bin}\left(n, p_{1}\right)$ and $Y \sim \operatorname{Bin}\left(m, p_{2}\right)$, independently. Here $p_{1}$ and $p_{2}$ are unknown, and we are interested in testing whether $p_{1}=p_{2}$ (this is known as a null hypothesis in statistics).

Consider a $2 \times 2$ table with rows corresponding to disease status and columns corresponding to gender. Each entry is the count of how many people have that disease status and gender, so $n+m$ is the sum of all 4 entries. Suppose that it is observed that $X+Y=r$.

The Fisher exact test is based on conditioning on both the row and column sums, so $n, m, r$ are all treated as fixed, and then seeing if the observed value of $X$ is “extreme” compared to this conditional distribution. Assuming the null hypothesis, find the conditional PMF of $X$ given $X+Y=r$.

Solution:
First we’ll build the $2 \times 2$ table, treating $n, m$, and $r$ as fixed.
\begin{tabular}{lcc|c}
\hline & Women & Men & Total \\
\hline Disease & $x$ & $r-x$ & $r$ \\
No disease & $n-x$ & $m-r+x$ & $n+m-r$ \\
\hline Total & $n$ & $m$ & $n+m$ \\
\hline
\end{tabular}
Next, let’s compute the conditional PMF $P(X=x \mid X+Y=r)$. By Bayes’ rule,
$$\begin{aligned} P(X=x \mid X+Y=r) &=\frac{P(X+Y=r \mid X=x) P(X=x)}{P(X+Y=r)} \\ &=\frac{P(Y=r-x) P(X=x)}{P(X+Y=r)} . \end{aligned}$$
The step $P(X+Y=r \mid X=x)=P(Y=r-x)$ is justified by the independence of $X$ and $Y$. Assuming the null hypothesis and letting $p=p_{1}=p_{2}$, we have $X \sim \operatorname{Bin}(n, p)$ and $Y \sim \operatorname{Bin}(m, p)$, independently, so $X+Y \sim \operatorname{Bin}(n+m, p)$. Thus,
$$\begin{aligned} P(X=x \mid X+Y=r) &=\frac{\binom{m}{r-x} p^{r-x}(1-p)^{m-r+x}\binom{n}{x} p^{x}(1-p)^{n-x}}{\binom{n+m}{r} p^{r}(1-p)^{n+m-r}} \\ &=\frac{\binom{n}{x}\binom{m}{r-x}}{\binom{n+m}{r}} . \end{aligned}$$
So the conditional distribution of $X$ is Hypergeometric with parameters $n, m, r$.
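
The result can be checked numerically. The following is a minimal sketch, not part of the original example: the function names `binom_pmf`, `hgeom_pmf`, and `conditional_pmf` and the values of $n$, $m$, $r$, and $p$ are illustrative assumptions. It evaluates the Bayes' rule expression for several values of $p$ and compares it with the Hypergeometric PMF, which should agree regardless of $p$ because the powers of $p$ and $1-p$ cancel.

```python
from math import comb

def binom_pmf(k, n, p):
    """PMF of Bin(n, p) at k; zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hgeom_pmf(x, n, m, r):
    """PMF of HGeom(n, m, r) at x; zero outside the support."""
    if x < 0 or x > n or r - x < 0 or r - x > m:
        return 0.0
    return comb(n, x) * comb(m, r - x) / comb(n + m, r)

def conditional_pmf(x, n, m, r, p):
    """P(X = x | X + Y = r) from Bayes' rule, assuming p1 = p2 = p."""
    return binom_pmf(r - x, m, p) * binom_pmf(x, n, p) / binom_pmf(r, n + m, p)

# Hypothetical sample sizes and total number of diseased people.
n, m, r = 6, 9, 5
for p in (0.1, 0.5, 0.8):
    for x in range(r + 1):
        # The conditional PMF should not depend on p.
        assert abs(conditional_pmf(x, n, m, r, p) - hgeom_pmf(x, n, m, r)) < 1e-12
```

The check passing for several different values of $p$ reflects the key feature of the derivation: once we condition on $X+Y=r$, the unknown common probability $p$ drops out.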
To understand why the Hypergeometric appeared, seemingly out of nowhere, let’s connect this problem to the elk story for the Hypergeometric. In the elk story, we are interested in the distribution of the number of tagged elk in the recaptured sample. By analogy, think of women as tagged elk and men as untagged elk. Instead of recapturing $r$ elk at random from the forest, we infect $X+Y=r$ people with the disease; under the null hypothesis, the set of diseased people is equally likely to be any set of $r$ people. Thus, conditional on $X+Y=r$, $X$ represents the number of tagged elk in the recaptured sample, which is distributed $\operatorname{HGeom}(n, m, r)$.
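
To make "extreme" concrete, one option is to sum the Hypergeometric PMF over values at least as large as the observed $X$. The sketch below assumes this upper-tail convention, which the example itself does not specify (a lower-tail or two-sided version is equally possible); the function name `fisher_upper_tail` and the counts are hypothetical.

```python
from math import comb

def hgeom_pmf(x, n, m, r):
    """PMF of HGeom(n, m, r) at x; zero outside the support."""
    lo, hi = max(0, r - m), min(n, r)
    if x < lo or x > hi:
        return 0.0
    return comb(n, x) * comb(m, r - x) / comb(n + m, r)

def fisher_upper_tail(x_obs, n, m, r):
    """P(X >= x_obs | X + Y = r) under the null hypothesis p1 = p2."""
    return sum(hgeom_pmf(x, n, m, r) for x in range(x_obs, min(n, r) + 1))

# Hypothetical data: 10 women, 10 men, 8 diseased in total, 7 of them women.
p_value = fisher_upper_tail(7, 10, 10, 8)
print(p_value)
```

A small value here would indicate that, with the row and column totals held fixed, seeing this many diseased women is unlikely under the null hypothesis.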
