Wald-Wolfowitz Runs Test and Baseball

September 16 2010 John May 1827
Maple
7

It's been a while since I wrote one of these random posts, but I still have a couple more I wanted to write.  In this post, I want to describe one of the tests used in the paper that initially inspired this series of posts: the Wald-Wolfowitz runs test.  This test is interesting in that it does not test for uniformity, but only for independence.  In other words, unfair coins and loaded dice will pass the Wald-Wolfowitz test and only inherently non-random sequences should fail.

The test itself is simple: for a sequence of two symbols we count the number of "runs" (a sequence of adjacent indentical symbols with the other symbol on each end) and then compare that count against the expected number of runs given the distribution of the two symbols.

If n0 is the number of zeros in a sequence and n1 is the number of ones, then the expected number of runs is

mu=2*(n_1*n_0)/(n_1+n_0)+1

with a variance of

sigma^2=((mu-1)*(mu-2))/(n_1+n_0-1)

We can use mu and sigma to compute a z-score for the number of runs in our sequence.

Here is a simple bit of Maple to compute the number of runs in a sequence (exercise: write one that is better).

Runs := proc(L)
local i, state, runs;
    state := -1;
    runs := 0;
    for i in L do
        if i <> state then
            state := i;
            runs := runs + 1;
        end if;
    end do;
    return runs;
end proc: # Runs

Then, using that, here is the runs test:

WaldWolfowitz := proc(L::list, {level::numeric:=0.05}, $)
local tally, ones, zeros, runs, mu, sigmasq, stat, p;

    tally := Statistics:-Tally(L);
    if {op(map(lhs,tally))} <> {0,1} then
        error `invalid input: expected a list of {0,1}`;
    end if;

    zeros := eval(0, tally);
    ones := eval(1, tally);
    userinfo(1, hints, nprintf("sequence contains %d zeros and %d ones", zeros, ones));

    runs := Runs(L);
    userinfo(1, hints, nprintf("sequence has %d runs", runs));

    mu := ( (2*ones*zeros) / (ones+zeros) ) + 1; # Expected # of runs
    userinfo(1, hints, nprintf("expected number of runs is %d", round(mu)));

    sigmasq := ((mu-1)*(mu-2)/(ones+zeros-1)); # variance
    userinfo(1, hints, nprintf("the variance is %.2f", sigmasq));

    stat := (runs - mu) / sqrt( sigmasq ); # statistic
    userinfo(1, hints, nprintf("the statistic for this sequence is is %f", stat));

    p := 2*(Statistics:-Probability(Statistics:-RandomVariable(Normal(0, 1))>abs(stat), ':-numeric' )); # pvalue

    return evalb(p>level), p;

end proc: # WaldWolfowitz

I found it difficult to find a pseudo-random sequence that failed this test. All my usual generators were able to pass.  Of course, by design, the unfair coin generator passes:

(**) r2a := rand(0..2): r2 := ()-> r2a() mod 2:
(**) L := [seq(r2(), i=1..10000)]:
WaldWolfowitz(L);
WaldWolfowitz: sequence contains 6748 zeros and 3252 ones
WaldWolfowitz: sequence has 4434 runs
WaldWolfowitz: expected number of runs is 4390
WaldWolfowitz: the variance is 1926.00
WaldWolfowitz: the statistic for this sequence is is 1.004890
                       true, 0.3149497052

Even the linear congruence with small period passes:

(**) LCState := 1: r3 := proc() global LCState;
    LCState := irem(89853*LCState, 50027); irem(LCState, 2); end proc:
(**) L := [seq(r3(), i=1..10000)]:
(**) WaldWolfowitz(L);
WaldWolfowitz: sequence contains 4840 zeros and 5160 ones
WaldWolfowitz: sequence has 4950 runs
WaldWolfowitz: expected number of runs is 4996
WaldWolfowitz: the variance is 2494.63
WaldWolfowitz: the statistic for this sequence is is -0.918587
                       true, 0.3583118358

Searching around for something that might fail this test without being too contrived, I decided to take a look at sports statistics since I remember having read a paper as an undergraduate that suggested that winning and losing streaks of sports teams might actually be more random that fans would care to admit. Wald-Wolfowitz seemed like a good test to apply to this hypothesis.

In order to get a large enough set of data to be meaninful, I decided to use win-loss sequences of baseball teams.  The easiest way I found to get this information into Maple was to download all the game logs from RetroSheet.org and grab (way more data that I wanted) and just filter out the win loss information for my chosen team.  The files are called "GLXXXX.TXT" where XXXX is the year.

wlarray := Vector();
cnt := 0;
ocnt := 1;
team := "TOR";

GLFiles := sort(select(t->t[1..2]="GL", FileTools:-ListDirectory(".")));

for filename in GLFiles do
FileTools:-Text:-Open(filename);
printf("Procesing %s\n", filename);
do
    tmp := FileTools:-Text:-ReadLine(filename);
    if tmp = NULL then
        FileTools:-Text:-Close(filename);
        break;
    end if;

    (otmp,tmp) := tmp, StringTools:-SubstituteAll(tmp, ",,", ",0,");
    while otmp <> tmp do
        (otmp,tmp) := tmp, StringTools:-SubstituteAll(tmp, ",,", ",0,");
    end do;

    if tmp[-1] = "," then
        tmp := cat(tmp,"0");
    end if;

    try
        tmp := [parse(tmp)];
    catch:
        printf("Parsing file %s, failed to parse: \"%s\"\n", filename, tmp);
        next;
    end try;

    if tmp[4] = team then
    # Away
        if tmp[11] < tmp[10] then
            cnt := cnt+1;
            wlarray(cnt) := 1;
        elif tmp[11] = tmp[10] then
           print("TIE", tmp[11] = tmp[10]);
        else
           cnt := cnt+1;
           wlarray(cnt) := 0;
        end if;
    elif tmp[7] = team then
    # Home
        if tmp[11] > tmp[10] then
            cnt := cnt+1;
            wlarray(cnt) := 1;
        elif tmp[11] = tmp[10] then
           print("TIE", tmp[11] = tmp[10]);
        else
           cnt := cnt+1;
           wlarray(cnt) := 0;
        end if;
    else
        next;
    end if;
end do:
print(Statistics:-Tally([seq(wlarray[i], i=ocnt..cnt)]));
ocnt := cnt+1;
end do:

TorWL:=[seq(wlarray[i], i=1..cnt)]:

save TorWL, "TorontoWL.txt";

Having created the win-loss sequence for my chosen team (The Blue Jays), I ran the runs test on it:

(**) read("TorontoWL.txt"):
(**) WaldWolfowitz(TorWL); WaldWolfowitz: sequence contains 2632 zeros and 2589 ones WaldWolfowitz: sequence has 2554 runs WaldWolfowitz: expected number of runs is 2611 WaldWolfowitz: the variance is 1304.82 WaldWolfowitz: the statistic for this sequence is is -1.586911 true, 0.1125328014

It passes! But just barely with a p=0.11. I hoped that other teams might be a little less random, and I was Happy to see that the (now defunct) Montreal Expos were less random.

(**) read("MontrealWL.txt"):
(**) WaldWolfowitz(MonWL); WaldWolfowitz: sequence contains 2943 zeros and 2755 ones WaldWolfowitz: sequence has 2766 runs WaldWolfowitz: expected number of runs is 2847 WaldWolfowitz: the variance is 1421.15 WaldWolfowitz: the statistic for this sequence is is -2.145956 false, 0.0318764926

Make of this what you will Toronto fans.

As always, the computations and code are in the attached worksheet WaldWolfowitz.mw and here are my datafiles MontrealWL.txt and TorontoWL.txt.

Please Wait...