More Testing Randomness: Autocorrelation

September 21 2010 John May 1827
Maple

9

In this post I'll introduce is a nice visual test of randomness from signal processing. The main idea of this test to look at how a random sequence correlates with itself.

The correlation between two sequences a and b is the normalized covariance (γab) given by:

gamma[ab] = ``(1/n)*sum((a[i]-mu[a])*(b[i]-mu[b]), i=1..n)

and the correlation is γab divided by the product of σa and σb, the standard deviation of a and b. These can be computed with ?Statistics,Covariance and ?Statistics,Correlation respectively.  The basic upshot is that the correlation is 1 if the two sequences are identical, 0 if they are completely independent, and -1 if they are opposites (in the sence that any time sequence a is greater than its mean, sequence b is less than its mean by exactly the same amount).  The Covariance of two random sequences should be zero in theory, but in practice, for finite length sequences, it will just be a quanity close to zero. 

(**) r1 := rand(0..1):
(**) L1 := [seq(r1(), i=1..10000)]: L2 := [seq(r1(), i=1..10000)]:
(**) Statistics:-Covariance(L1,L2);
                                0.0009372800000

(**) Statistics:-Correlation(L1,L2);
                                0.003749373837

For testing randomness of a single sequence, we want to look at how it correlated with itself.  We can do this by computing the correlation of a sequence with a shifted version of itself.  This is called Autocorrelation.  For any given 'lag' we can compute the autocorrelation using  ?Statistics,Correlation

(**) lag:=10: Statistics:-Correlation(L1[1..-lag], L1[lag..-1]);
                                 0.01819979011

Of course, we are not interested in just one lag, since we do not know in advance what kind of periodic behavior a sequence might have. We want to compute all possible lags. This could be done with a simple nested seq and add :

AutoCorrBF := proc(X::list, maxlag::posint:=nops(X)-1)
local N, m, cv, R, i, j; # Brute Force Autocorrelation
    N := nops(X);
    m := Statistics:-Mean(X);
    cv := add( (X[i]-m)^2, i=1..N);
    R := [ 1.0, seq( add( (X[j]-m)*(X[j+i]-m), j=1..N-i) / cv, i=1..maxlag ) ];
end proc;

However, this will be quite slow for even moderately sized sequences. This computation can be seen as computing a large number of shifted dot products and that can be reformulated as a discrete fourier transform which is much faster.  It is so fast that it is not really worth while to try to limit the number of lags computed, but instead we just compute them all and return the ones we are interested in.

AutoCorrFFT := proc(X, maxlag::posint:=nops(X)-1)
local mu, N, M, Y, F, S, R, B;
    mu := Statistics:-Mean(X)
    N := nops(X);
    M := 2^ceil(log[2](2*N-1));
    Y := Vector(M, X -~ mu, datatype = complex[8]); # create a padded 0-mean version of X
    F := DiscreteTransforms:-FourierTransform(Y); # FFT
    S := Vector(M, i->Im(F[i])^2+Re(F[i])^2, datatype=float[8]); # compute products
    R := DiscreteTransforms:-InverseFourierTransform(S)[1..maxlag+1];
    return Re~(convert(R,'list')) / abs(R[1]);
end proc;

Once we have computed the autocorrelation, the easiest way to make sense of the information is probably by graphing it. This is a pretty standard thing to do in signal processing, and the graph is known as a correlogram.  A simple procedure to create correlograms is in the attached worksheet (Correlogram.mw) and is based on the FFT autocorrelation procedure above.

The dotted red lines below denote the 95% significance level (z-score > 2)  i.e lines below it do not denote significant correlation.  Below each plot is a summary of the number of correlations with z-scores >2, >3 and >4.  Looking at 1000 lags of a random sequence we should expect 50 lags outside of 95% (z>2), 3 outside of 99.7% (z>3) and maybe 1 outside of 99.99% (z>4).

The mersenne twister in Maple gives pretty much the expected results of a random sequence:

(**) L := [seq(r1(), i=1..10000)]:
(**) Correlogram(L);

The correlogram does not seem to detent non-uniformity, so the unfair coin sequence, also looks random.

(**) r2a := rand(0..2): r2 := ()-> r2a() mod 2:
(**) L := [seq(r2(), i=1..10000)]:
(**) Correlogram(L);

Here is an example of a low period sequence that produces lots of significant spikes:

(**) L := [seq(round(1/3*(evalf(sin(17.2*i)*cos(13.8*i)+1.17)+r1()*2/3)), i=1..10000)]:
(**) Correlogram(L);

The low period "broken" linear congruence example is clearly not random from its correlogram: there are many significant spikes.

 (**) LCState := 1:
r3 := proc()
global LCState;
    LCState := irem(89853*LCState, 50027);
    irem(LCState, 2);
    end proc:
(**) L := [seq(r3(), i=1..10000)]:
(**) Correlogram(L);

Correlograms seem like the perfect thing to apply to the baseball win-loss sequences we looked at in the last post. Intuitively, the performance in a game should be dependant, at least a little, on the performance in last few games.

(**) read("TorontoWL.txt"):
(**) Correlogram(TorWL, 101);

(**) read("MontrealWL.txt"):
(**) Correlogram(MonWL, 101);

Niether the Toronto nor the Montreal reconds have any really significant spikes (z>3) but both seem to have a more positive than expected correlation for truely random sequences.  Perhaps this suggests teams really do have hot and cold streaks, but that the effect is not very strong.

As always, computations are in the worksheet: Correlogram.mw using the baseball data files from my previous post MontrealWL.txt and TorontoWL.txt.

Please Wait...