## Wald-Wolfowitz Runs Test and Baseball

Maple

It's been a while since I wrote one of these random posts, but I still have a couple more I wanted to write.  In this post, I want to describe one of the tests used in the paper that initially inspired this series of posts: the Wald-Wolfowitz runs test.  This test is interesting in that it does not test for uniformity, but only for independence.  In other words, unfair coins and loaded dice will pass the Wald-Wolfowitz test and only inherently non-random sequences should fail.

The test itself is simple: for a sequence of two symbols we count the number of "runs" (a sequence of adjacent indentical symbols with the other symbol on each end) and then compare that count against the expected number of runs given the distribution of the two symbols.

If n0 is the number of zeros in a sequence and n1 is the number of ones, then the expected number of runs is

with a variance of

We can use mu and sigma to compute a z-score for the number of runs in our sequence.

Here is a simple bit of Maple to compute the number of runs in a sequence (exercise: write one that is better).

```Runs := proc(L)
local i, state, runs;
state := -1;
runs := 0;
for i in L do
if i <> state then
state := i;
runs := runs + 1;
end if;
end do;
return runs;
end proc: # Runs
```

Then, using that, here is the runs test:

```WaldWolfowitz := proc(L::list, {level::numeric:=0.05}, \$)
local tally, ones, zeros, runs, mu, sigmasq, stat, p;

tally := Statistics:-Tally(L);
if {op(map(lhs,tally))} <> {0,1} then
error `invalid input: expected a list of {0,1}`;
end if;

zeros := eval(0, tally);
ones := eval(1, tally);
userinfo(1, hints, nprintf("sequence contains %d zeros and %d ones", zeros, ones));

runs := Runs(L);
userinfo(1, hints, nprintf("sequence has %d runs", runs));

mu := ( (2*ones*zeros) / (ones+zeros) ) + 1; # Expected # of runs
userinfo(1, hints, nprintf("expected number of runs is %d", round(mu)));

sigmasq := ((mu-1)*(mu-2)/(ones+zeros-1)); # variance
userinfo(1, hints, nprintf("the variance is %.2f", sigmasq));

stat := (runs - mu) / sqrt( sigmasq ); # statistic
userinfo(1, hints, nprintf("the statistic for this sequence is is %f", stat));

p := 2*(Statistics:-Probability(Statistics:-RandomVariable(Normal(0, 1))>abs(stat), ':-numeric' )); # pvalue

return evalb(p>level), p;

end proc: # WaldWolfowitz```

I found it difficult to find a pseudo-random sequence that failed this test. All my usual generators were able to pass.  Of course, by design, the unfair coin generator passes:

```(**) r2a := rand(0..2): r2 := ()-> r2a() mod 2:
(**) L := [seq(r2(), i=1..10000)]:
WaldWolfowitz(L);
WaldWolfowitz: sequence contains 6748 zeros and 3252 ones
WaldWolfowitz: sequence has 4434 runs
WaldWolfowitz: expected number of runs is 4390
WaldWolfowitz: the variance is 1926.00
WaldWolfowitz: the statistic for this sequence is is 1.004890
true, 0.3149497052```

Even the linear congruence with small period passes:

```(**) LCState := 1: r3 := proc() global LCState;
LCState := irem(89853*LCState, 50027); irem(LCState, 2); end proc:
(**) L := [seq(r3(), i=1..10000)]:
(**) WaldWolfowitz(L);
WaldWolfowitz: sequence contains 4840 zeros and 5160 ones
WaldWolfowitz: sequence has 4950 runs
WaldWolfowitz: expected number of runs is 4996
WaldWolfowitz: the variance is 2494.63
WaldWolfowitz: the statistic for this sequence is is -0.918587
true, 0.3583118358```

Searching around for something that might fail this test without being too contrived, I decided to take a look at sports statistics since I remember having read a paper as an undergraduate that suggested that winning and losing streaks of sports teams might actually be more random that fans would care to admit. Wald-Wolfowitz seemed like a good test to apply to this hypothesis.

In order to get a large enough set of data to be meaninful, I decided to use win-loss sequences of baseball teams.  The easiest way I found to get this information into Maple was to download all the game logs from RetroSheet.org and grab (way more data that I wanted) and just filter out the win loss information for my chosen team.  The files are called "GLXXXX.TXT" where XXXX is the year.

```wlarray := Vector();
cnt := 0;
ocnt := 1;
team := "TOR";

GLFiles := sort(select(t->t[1..2]="GL", FileTools:-ListDirectory(".")));

for filename in GLFiles do
FileTools:-Text:-Open(filename);
printf("Procesing %s\n", filename);
do
if tmp = NULL then
FileTools:-Text:-Close(filename);
break;
end if;

(otmp,tmp) := tmp, StringTools:-SubstituteAll(tmp, ",,", ",0,");
while otmp <> tmp do
(otmp,tmp) := tmp, StringTools:-SubstituteAll(tmp, ",,", ",0,");
end do;

if tmp[-1] = "," then
tmp := cat(tmp,"0");
end if;

try
tmp := [parse(tmp)];
catch:
printf("Parsing file %s, failed to parse: \"%s\"\n", filename, tmp);
next;
end try;

if tmp[4] = team then
# Away
if tmp[11] < tmp[10] then
cnt := cnt+1;
wlarray(cnt) := 1;
elif tmp[11] = tmp[10] then
print("TIE", tmp[11] = tmp[10]);
else
cnt := cnt+1;
wlarray(cnt) := 0;
end if;
elif tmp[7] = team then
# Home
if tmp[11] > tmp[10] then
cnt := cnt+1;
wlarray(cnt) := 1;
elif tmp[11] = tmp[10] then
print("TIE", tmp[11] = tmp[10]);
else
cnt := cnt+1;
wlarray(cnt) := 0;
end if;
else
next;
end if;
end do:
print(Statistics:-Tally([seq(wlarray[i], i=ocnt..cnt)]));
ocnt := cnt+1;
end do:

TorWL:=[seq(wlarray[i], i=1..cnt)]:

save TorWL, "TorontoWL.txt";
```

Having created the win-loss sequence for my chosen team (The Blue Jays), I ran the runs test on it:

```(**) read("TorontoWL.txt"):(**) WaldWolfowitz(TorWL);
WaldWolfowitz: sequence contains 2632 zeros and 2589 ones
WaldWolfowitz: sequence has 2554 runs
WaldWolfowitz: expected number of runs is 2611
WaldWolfowitz: the variance is 1304.82
WaldWolfowitz: the statistic for this sequence is is -1.586911
true, 0.1125328014```

It passes! But just barely with a p=0.11. I hoped that other teams might be a little less random, and I was Happy to see that the (now defunct) Montreal Expos were less random.

```(**) read("MontrealWL.txt"):(**) WaldWolfowitz(MonWL);
WaldWolfowitz: sequence contains 2943 zeros and 2755 ones
WaldWolfowitz: sequence has 2766 runs
WaldWolfowitz: expected number of runs is 2847
WaldWolfowitz: the variance is 1421.15
WaldWolfowitz: the statistic for this sequence is is -2.145956
false, 0.0318764926```

Make of this what you will Toronto fans.

As always, the computations and code are in the attached worksheet WaldWolfowitz.mw and here are my datafiles MontrealWL.txt and TorontoWL.txt.

﻿