Question: A serious problem with Statistics:-Sample()

Hi, 
The choice method=default (the implicit choice) gives dramatically false results as demonstrated here for the sampling of a normal r.v.
It's likely to be the case for any continuous distribution.
In a few words the tails of the distribution are not correctly sampled (far from the mean by 4 standard deviations for a normal r.v., not an extremely rare event in practical applications).

I think the attached file is sufficiently documented for you to understand the problem.


 

restart:


The problem presented here also exists with Maple 2018 and Maple 2019

interface(version)

`Standard Worksheet Interface, Maple 2015.2, Mac OS X, December 21 2015 Build ID 1097895`

(1)

with(Statistics):

X := RandomVariable(Normal(0, 1)):


Generate a sample of X of size 10^6

Here is the R code for those who would like to verify the results and the performances

 

N <- 10^6

R <- 20

Q <- c(0:5)

 

M <- matrix(nrow=R, ncol=length(Q))

 

for (r in c(1:20)) {

  S <- rnorm(N, mean=0, sd=1)

  for (q in Q) {

    M[r, q+1] <-length(S[S>q])

  }

}

 

Expected.Number.of.Outliers <- floor(N - pnorm(Q, mean=0, sd=1) * N)

Observed.Number.of.Outliers <-  round(colMeans(M))

 

Expected.Number.of.Outliers

Observed.Number.of.Outliers

 

Absolute.Differences <- M - kronecker(t(Expected.Number.of.Outliers), rep(1, R))

boxplot(differences)

Remark the loop below takes about 30 seconds to run (to be compared to the 20 minutes
it would take am I used the build-in procedure Select, and to  the 3 seconds R demands).

N := 10^6:
R := 20:

Q  := [$0..5]:
nQ := numelems(Q):
M  := Matrix(R, nQ, 0):

for r from 1 to R do
  S := Sample(X, N):
  for q in Q do
    M[r, q+1] := add(Heaviside~(S -~ q)):
  end do:
end do:

Expected_Number_Of_Outliers := Vector[row](nQ, i -> Probability(X > Q[i], numeric)*N);

Expected_Number_Of_Outliers := Vector[row](6, {(1) = HFloat(500000.0), (2) = HFloat(158655.25393145697), (3) = HFloat(22750.13194817921), (4) = HFloat(1349.8980316301036), (5) = HFloat(31.67124183311998), (6) = HFloat(0.2866515719235352)})

(2)

Observed_Number_Of_Outliers := Mean(M);

Observed_Number_Of_Outliers := Vector[row](6, {(1) = 500291.4, (2) = 158782.3, (3) = 22761.35, (4) = 1349.5, (5) = 51.6, (6) = 1.9}, datatype = float[8])

(3)


While the values of  the Observed_Number_of_Outliers seem to be reasonably correct for q =0, 1, 2 and 3
there are highly suspicious for q = 4 and likely for q = 5 (but N is too small to be categorical)

So let's examine more closely rhe results for q = 4

`q=4`:= convert(M[..,5], list);
 min(`q=4`)

[HFloat(49.0), HFloat(63.0), HFloat(42.0), HFloat(47.0), HFloat(44.0), HFloat(68.0), HFloat(46.0), HFloat(51.0), HFloat(58.0), HFloat(55.0), HFloat(50.0), HFloat(51.0), HFloat(68.0), HFloat(45.0), HFloat(58.0), HFloat(44.0), HFloat(59.0), HFloat(49.0), HFloat(47.0), HFloat(38.0)]

 

HFloat(38.0)

(4)


This means we obtained 20 estimations out of 20 of the probability that X exceeds q=4.
There are only about on chance out of 1 million to get such a result (roughly (1/2)^20 ).

Here is the result R returned for the same quantile q=4

 M[,5]

 [1] 24 37 23 38 30 43 23 30 25 39 25 29 29 36 39 41 35 37 36 34

(9 values out of 20 are less than the expected value of 31.67)

A very simple test based on the Binomial distribution will prove, without any doubt, that the
distribution of the 20 numbers of outliers for q = 4 implies the sampling algorithm is of
is of very poor quality.



Is it possible to obtain correct results with Maple ?


(simulations done below for the case q=4 only)

N := 10^6:
R := 20:

Menv := Vector[column](R, 0):

for r from 1 to R do
  S := Sample(X, N, method=envelope):
  Menv[r] := add(Heaviside~(S -~ 4)):
end do:

`q=4`:= convert(Menv, list);
Mean(Menv);
Variance(Menv);

[HFloat(26.0), HFloat(26.0), HFloat(35.0), HFloat(34.0), HFloat(30.0), HFloat(30.0), HFloat(28.0), HFloat(24.0), HFloat(27.0), HFloat(31.0), HFloat(28.0), HFloat(21.0), HFloat(34.0), HFloat(33.0), HFloat(29.0), HFloat(38.0), HFloat(38.0), HFloat(33.0), HFloat(39.0), HFloat(25.0)]

 

HFloat(30.45)

 

HFloat(24.892105263157895)

(5)



Conclusion, the "default" sampling strategy suffers strong flaw.

Remark: it's well known that one of the best method to sample normal random variables
is Box-Mueller's: why it has not been programmed by default ?

 

 


 

Download Bug_Normal_Sampler.mw


 

Please Wait...