The function me.test() provides a nonparametric test of equality of two distributions when the observed, contaminated data follow the classical additive measurement error model. We want to test the hypothesis \(H_0: F_x=F_y\) when the observed data are \(D_w=\{\mathbf{W}_1, \dots, \mathbf{W}_{n_x}\}\) and \(D_v=\{\mathbf{V}_1, \dots, \mathbf{V}_{n_y}\}\), where \(\mathbf{W}^T_j=(W_{j1}, \dots, W_{jm_x})\) and \(\mathbf{V}^T_k=(V_{k1}, \dots, V_{km_y})\) for \(j=1, \dots, n_x\) and \(k=1, \dots, n_y\), and \(F_x\) and \(F_y\) are the cumulative distribution functions (CDFs) of \(X\) and \(Y\). The observed \(W\)'s and \(V\)'s are surrogates of the unobserved \(X\)'s and \(Y\)'s, and we assume that \(m_x\geq 2\) and \(m_y\geq 2\), that is, we have multiple replicates for each unobservable true value. By the classical additive measurement error model we mean \[ W_{jl} = X_j + U_{x, jl}, \qquad V_{kl'} = Y_k + U_{y, kl'}, \] for \(j=1, \dots, n_x\), \(l=1, \dots, m_x\), \(k=1, \dots, n_y\), and \(l'=1, \dots, m_y\). The measurement errors \(U_{x, jl}\) are assumed to be iid, independent of \(X_j\), and to follow a distribution \(F_{u_x}\) that is symmetric around \(0\). Similarly, the measurement errors \(U_{y, kl'}\) are assumed to be iid, independent of \(Y_k\), and to follow a distribution \(F_{u_y}\) that is symmetric around \(0\). Further, \(X\), \(Y\), \(U_x\) and \(U_y\) are assumed to be mutually independent. It is important to note that \(\{X_1, \dots, X_{n_x}\}\) and \(\{Y_1, \dots, Y_{n_y}\}\) are never observed.
We will simulate true values and measurement errors from the following design: \[ X \sim {\rm Normal}(0, 1), Y \sim {\rm Normal}(0.2, 1) \mbox{ and } U_x, U_y \sim DE(0, 0.35), \] where \(DE(a, b)\) stands for the double exponential (Laplace) distribution with mean \(a\) and variance \(2b^2\). Note that the standard deviation of the measurement errors \(U_x\) and \(U_y\) is \(\sqrt{2}\times 0.35 \approx 0.495\), roughly half the standard deviation of the true signals.
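To see the claimed noise level directly (a quick check in base R, no extra packages): the variance of \(DE(0, b)\) is \(2b^2\), so its standard deviation is \(\sqrt{2}\,b\).

```r
## Standard deviation of DE(0, b) is sqrt(2) * b,
## since the variance of the double exponential is 2 * b^2.
b <- 0.35
sd_error  <- sqrt(2) * b   # ~0.495
sd_signal <- 1             # sd of the Normal(0, 1) true signals
sd_error / sd_signal       # noise-to-signal ratio, roughly 0.5
```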
library(smoothmest) ## to generate double exponential random variables
## Loading required package: MASS
set.seed(1234)
n <- 200
mx <- my <- 2
X <- rnorm(n, mean = 0, sd = 1)
Y <- rnorm(n, mean = 0.2, sd = 1)
Ux <- matrix(rdoublex(n*mx, mu = 0, lambda = 0.35), ncol = mx)
Uy <- matrix(rdoublex(n*my, mu = 0, lambda = 0.35), ncol = my)
W <- X + Ux
V <- Y + Uy
In this example, \(n_x = n_y = 200\) and \(m_x = m_y = 2\).
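The replicates are what make the error distribution identifiable: since \(W_{j1}-W_{j2} = U_{x,j1}-U_{x,j2}\), half the variance of the within-subject differences estimates \({\rm var}(U_x)\). A minimal sketch in base R (the sign-flipped exponential below is just one way to draw double exponential variates; variable names are chosen not to clash with the simulation above):

```r
set.seed(1)
nn <- 1e5; b <- 0.35
## Draw DE(0, b) variates as a random sign times an Exponential with mean b.
rlaplace <- function(n, b) sample(c(-1, 1), n, replace = TRUE) * rexp(n, rate = 1/b)
x.true <- rnorm(nn)                                   # true signals (never observed)
w.rep  <- cbind(x.true + rlaplace(nn, b),             # two replicates per subject
                x.true + rlaplace(nn, b))
## var(W1 - W2) = 2 * var(U), so halve it to recover var(U) = 2 * b^2 = 0.245
var(w.rep[, 1] - w.rep[, 2]) / 2
```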
par(mfrow = c(1,2))
boxplot(W, main = "W")
boxplot(V, main = "V")
Now we load the required packages into the current workspace:
library(statmod)
library(MEtest) # This package conducts the hypothesis test.
The package statmod is necessary for the numerical integration. The function me.test() in MEtest has the following arguments:
W: an \(n_x\) by \(m_x\) matrix of observations, with \(m_x \geq 2\) replicates per row (one row per subject).
V: an \(n_y\) by \(m_y\) matrix of observations, with \(m_y \geq 2\) replicates per row.
B: the number of bootstrap samples. Default is 1000.
wt: type of the weight function; must be one of "Uniform" or "Normal".
wt.bd: lower and upper bounds of the weight function. If wt.bd is not specified, the bounds are computed from the deconvoluted distribution function.
wt.prob: the probability used to compute the lower and upper bounds. Default is 0.99; ignored if wt.bd is provided.
nGL: the number of nodes for Gaussian quadrature. Default is 32.
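If you prefer to fix the integration region yourself rather than rely on the deconvolution-based default, wt.bd takes a length-two numeric vector. One hypothetical choice (an illustration, not the package's default rule) is a pair of extreme pooled-sample quantiles; stand-in data are used here so the vignette's W and V are untouched:

```r
## Illustration only: explicit weight-function bounds from the
## 0.5% and 99.5% quantiles of the pooled observed data.
set.seed(99)
W0 <- matrix(rnorm(400), ncol = 2)              # stand-in observed data
V0 <- matrix(rnorm(600, mean = 0.2), ncol = 3)
bd <- as.numeric(quantile(c(W0, V0), probs = c(0.005, 0.995)))
bd                                              # length-two vector for wt.bd
## then: me.test(W0, V0, wt = "Uniform", wt.bd = bd)
```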
Once we have W and V, it is straightforward to compute the \(p\)-value for \(H_0\).
# Let's view the observed data
head(W)
## [,1] [,2]
## [1,] -0.3544622 -1.7772733
## [2,] 0.5700091 -0.2169628
## [3,] 0.7204120 1.6900299
## [4,] -2.2969413 -2.3486976
## [5,] -0.2219204 0.5253668
## [6,] 0.8506322 0.2758073
head(V)
## [,1] [,2]
## [1,] 0.49508086 -0.08912287
## [2,] 0.77135540 0.92884779
## [3,] -0.05501657 0.46879317
## [4,] 1.75113131 1.57563234
## [5,] 0.67080298 0.10767074
## [6,] 0.98045858 0.67460554
me.test(W, V, wt = "Uniform")
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 56.606, p-value = 0.006
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Normal")
##
## Homogeneity test under Measurement Error with Normal weight
##
## data: W and V
## T (Normal) = 18.232, p-value < 2.2e-16
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Uniform", wt.prob = 0.95)
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 46.34, p-value < 2.2e-16
## alternative hypothesis: True signals come from different distributions
The output of me.test() contains the test statistic, the \(p\)-value, and information about the weight function.
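The printed output follows the layout of base R's htest objects. Assuming me.test() likewise returns such a list (an assumption worth checking with str() on the result), its components can be extracted the same way as for base tests such as t.test():

```r
## Sketch using a base-R htest object; an me.test() result that prints
## in the same format should expose analogous fields.
ht <- t.test(rnorm(30), rnorm(30, mean = 0.2))
names(ht)        # includes "statistic", "p.value", ...
ht$statistic     # test statistic
ht$p.value       # p-value, usable programmatically
```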
In this example, we set \(n_x = 200\), \(n_y = 300\), \(m_x = 2\), and \(m_y = 3\). The true signals now have identical distributions, so \(H_0\) holds: \[ X \sim {\rm Normal}(0, 1), Y \sim {\rm Normal}(0, 1) \mbox{ and } U_x, U_y \sim DE(0, 0.35). \]
set.seed(12345)
nx <- 200; ny <- 300
mx <- 2; my <- 3
X <- rnorm(nx, mean = 0, sd = 1)
Y <- rnorm(ny, mean = 0, sd = 1)
Ux <- matrix(rdoublex(nx*mx, mu = 0, lambda = 0.35), ncol = mx)
Uy <- matrix(rdoublex(ny*my, mu = 0, lambda = 0.35), ncol = my)
W <- X + Ux
V <- Y + Uy
## Observed data are W and V
head(W)
## [,1] [,2]
## [1,] -0.22771326 0.91312650
## [2,] 0.46886646 -0.09680206
## [3,] 1.67410510 0.06787347
## [4,] -0.70181978 0.80573355
## [5,] -0.01687766 -0.41074833
## [6,] -1.96293518 -1.65193027
head(V)
## [,1] [,2] [,3]
## [1,] -0.94299160 -1.1055595 -1.8662883
## [2,] -0.11082930 -0.5856071 -0.7795955
## [3,] -0.09263611 0.8909308 -0.7878444
## [4,] 1.18335963 1.4697445 1.5616781
## [5,] 1.62380283 2.1007738 1.1184517
## [6,] 0.10053012 0.3129272 0.6286603
me.test(W, V, wt = "Uniform")
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 13.165, p-value = 0.199
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Normal")
##
## Homogeneity test under Measurement Error with Normal weight
##
## data: W and V
## T (Normal) = 3.4635, p-value = 0.209
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Uniform", wt.prob = 0.95)
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 9.6725, p-value = 0.141
## alternative hypothesis: True signals come from different distributions
In this example, we again set \(n_x = 200\), \(n_y = 300\), \(m_x = 2\), and \(m_y = 3\), but scale both the signals and the errors up by a factor of 10; \(H_0\) still holds: \[ X \sim {\rm Normal}(0, 10^2), Y \sim {\rm Normal}(0, 10^2) \mbox{ and } U_x, U_y \sim DE(0, 3.5). \]
set.seed(12345)
nx <- 200; ny <- 300
mx <- 2; my <- 3
X <- rnorm(nx, mean = 0, sd = 10)
Y <- rnorm(ny, mean = 0, sd = 10)
Ux <- matrix(rdoublex(nx*mx, mu = 0, lambda = 3.5), ncol = mx)
Uy <- matrix(rdoublex(ny*my, mu = 0, lambda = 3.5), ncol = my)
W <- X + Ux
V <- Y + Uy
## Observed data are W and V
head(W)
## [,1] [,2]
## [1,] -2.2771326 9.1312650
## [2,] 4.6886646 -0.9680206
## [3,] 16.7410510 0.6787347
## [4,] -7.0181978 8.0573355
## [5,] -0.1687766 -4.1074833
## [6,] -19.6293518 -16.5193027
head(V)
## [,1] [,2] [,3]
## [1,] -9.4299160 -11.055595 -18.662883
## [2,] -1.1082930 -5.856071 -7.795955
## [3,] -0.9263611 8.909308 -7.878444
## [4,] 11.8335963 14.697445 15.616781
## [5,] 16.2380283 21.007738 11.184517
## [6,] 1.0053012 3.129272 6.286603
me.test(W, V, wt = "Uniform")
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 642770905, p-value = 0.713
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Normal")
##
## Homogeneity test under Measurement Error with Normal weight
##
## data: W and V
## T (Normal) = 15933, p-value = 0.767
## alternative hypothesis: True signals come from different distributions
me.test(W, V, wt = "Uniform", wt.prob = 0.95)
##
## Homogeneity test under Measurement Error with Uniform weight
##
## data: W and V
## T (Uniform) = 2.27e+09, p-value = 0.517
## alternative hypothesis: True signals come from different distributions
Lee, D., Lahiri, S. and Sinha, S. A test of homogeneity of distributions when observations are subject to measurement errors. Under revision.