# Oracle testing procedure

We have shown that $\delta^{\lambda}(\Lambda, 1 / \lambda)=\left[I\left\{\Lambda\left(x_{1}\right)<1 / \lambda\right\}, \ldots, I\left\{\Lambda\left(x_{m}\right)<1 / \lambda\right\}\right]$ is the oracle rule in the weighted classification problem. The equivalence between multiple testing and weighted classification implies that the optimal testing rule is also of the form $\delta^{\lambda(\alpha)}[\Lambda, 1 / \lambda(\alpha)]$ if $\Lambda \in \mathcal{T}$, although the cutoff $1 / \lambda(\alpha)$ is not obvious. Note that $\Lambda(x)=\operatorname{Lfdr}(x) /[1-\operatorname{Lfdr}(x)]$ is monotonically increasing in $\operatorname{Lfdr}(x)$, where $\operatorname{Lfdr}(\cdot)=(1-p) f_{0}(\cdot) / f(\cdot)$ is the local false discovery rate (Lfdr) introduced by Efron et al. (2001) and Efron (2004), so the optimal rule for mFDR control is of the form $\delta(\operatorname{Lfdr}(\cdot), c)=\left\{I\left[\operatorname{Lfdr}\left(x_{i}\right)<c\right]: i=1, \ldots, m\right\}$. The Lfdr has been widely used in the FDR literature to provide a Bayesian version of the frequentist FDR measure and to interpret results for individual cases (Efron 2004). We rediscover it here as the optimal (oracle) statistic in the multiple testing problem, in the sense that the thresholding rule based on $\operatorname{Lfdr}(X)$ controls the mFDR at the nominal level with the smallest mFNR.
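
The correspondence between the two cutoffs can be made explicit. As a short worked step, assuming $0<\operatorname{Lfdr}(x)<1$ (which holds when $0<p<1$ and both mixture components have positive density at $x$), and with $\lambda>0$:
$$\Lambda(x)<\frac{1}{\lambda} \iff \lambda \operatorname{Lfdr}(x)<1-\operatorname{Lfdr}(x) \iff \operatorname{Lfdr}(x)<\frac{1}{1+\lambda}.$$
Thresholding $\Lambda$ at $1 / \lambda$ is therefore the same rule as thresholding the Lfdr at $c=1 /(1+\lambda)$.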

The MRC implies that in order to minimize the mFNR level, we should choose the largest threshold for the Lfdr statistic. Therefore the oracle testing procedure is
$$\delta\left(\operatorname{Lfdr}, c_{O R}\right)=\left\{I\left[\operatorname{Lfdr}\left(x_{i}\right)<c_{O R}\right]: i=1, \ldots, m\right\}, \tag{3.13}$$
where the oracle threshold $c_{O R}=\sup \{c \in(0,1): \operatorname{mFDR}(c) \leqslant \alpha\}$. The oracle procedure (3.13) provides an ideal target for evaluating different multiple testing procedures. In particular, it is more efficient than the $p$-value oracle procedure proposed in Genovese and Wasserman (2002). Hence the $z$-value oracle procedure is more efficient than all $p$-value based FDR procedures.
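
To illustrate how the oracle threshold could be computed when the distributional information is known, the following is a minimal sketch, not the authors' implementation. It assumes a two-group normal mixture $f=(1-p) N(0,1)+p N(\mu_{1}, \sigma_{1}^{2})$ with known parameters. The function names (`lfdr`, `mfdr`, `oracle_threshold`, `oracle_rule`), the integration grid, and the bisection tolerance are all illustrative choices. The sketch relies on the fact that $\operatorname{mFDR}(c)$, the ratio of expected false rejections to expected rejections for the rule $\operatorname{Lfdr}(x)<c$, is nondecreasing in $c$, so the supremum can be located by bisection.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def lfdr(x, p, mu1, sigma1):
    """Lfdr(x) = (1 - p) f0(x) / f(x), assuming f0 = N(0, 1) and f1 = N(mu1, sigma1^2)."""
    null_part = (1.0 - p) * normal_pdf(x)
    return null_part / (null_part + p * normal_pdf(x, mu1, sigma1))


def mfdr(c, p, mu1, sigma1, lo=-12.0, hi=12.0, n=4801):
    """Approximate mFDR(c) for the rule Lfdr(x) < c by a Riemann sum on a uniform grid.

    The grid spacing cancels in the ratio, so plain sums of densities suffice.
    """
    step = (hi - lo) / (n - 1)
    false_mass = 0.0   # proportional to expected false rejections
    reject_mass = 0.0  # proportional to expected rejections
    for k in range(n):
        x = lo + k * step
        null_part = (1.0 - p) * normal_pdf(x)
        mix = null_part + p * normal_pdf(x, mu1, sigma1)
        if null_part < c * mix:  # equivalent to Lfdr(x) < c
            false_mass += null_part
            reject_mass += mix
    return false_mass / reject_mass if reject_mass > 0.0 else 0.0


def oracle_threshold(alpha, p, mu1, sigma1, tol=1e-4):
    """Bisection for c_OR = sup{c in (0, 1): mFDR(c) <= alpha}."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mfdr(mid, p, mu1, sigma1) <= alpha:
            lo = mid  # mid is feasible, so the supremum is at least mid
        else:
            hi = mid  # mid violates the mFDR constraint
    return lo


def oracle_rule(xs, c, p, mu1, sigma1):
    """Return the 0/1 decisions I[Lfdr(x_i) < c] for each observation."""
    return [1 if lfdr(x, p, mu1, sigma1) < c else 0 for x in xs]


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    p, mu1, sigma1, alpha = 0.2, 2.5, 1.0, 0.10
    c_or = oracle_threshold(alpha, p, mu1, sigma1)
    print("oracle threshold:", c_or)
    print("decisions:", oracle_rule([0.0, 1.5, 3.0, 5.0], c_or, p, mu1, sigma1))
```

Because $\operatorname{mFDR}(c)$ is the average of the Lfdr over the rejection region, it is always below $c$, so the oracle threshold found this way is at least $\alpha$.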

## A data-driven procedure

The oracle procedure is not applicable in practice because the distributional information is usually unknown. This section first discusses the estimation of the null distribution and the non-null proportion in large-scale multiple comparisons. Then we introduce a data-driven procedure that mimics the oracle procedure.

Efron (2004) raised the important issue that, in many large-scale studies, the usual assumption of a known null distribution is incorrect, and seemingly negligible differences in the null may result in large differences in subsequent studies. It was demonstrated that the null distribution should be estimated from the data instead of being assumed known. Besides the null distribution, the proportion of non-null effects $p$ is also an important quantity. The implementation of many FDR procedures requires knowledge of $p$ (BH 2000; Storey 2002; GW 2004). Developing good estimators for the proportion of non-nulls is a challenging task. Recent work includes that of Genovese and Wasserman (2004), Langaas, Lindqvist and Ferkingstad (2005), Meinshausen and Rice (2006), Cai, Jin and Low (2007), and Jin and Cai (2007).

Jin and Cai (2007) developed an approach based on the empirical characteristic function and Fourier analysis for simultaneous estimation of both the null distribution $f_{0}$ and proportion of non-null effects $p$. The estimators are shown to be uniformly consistent over a wide class of parameters. Numerical results also showed that the estimators perform favorably in comparison to other existing methods. This method will be used in our data-driven procedure.

