Alec and I thought about the issue for negative correlation a bit more, and thanks to our conversation I've managed to convince at least myself that the following will work.

So let me state the problem once more. We intend to produce a vector *S* of random numbers such that each number individually is Bernoulli distributed with success probability *prob*, and such that the correlation of the *S*[*i*] and the *S*[*i* - *lag*] is *c*. The idea behind the code above is to first generate a sequence of independent numbers, all Bernoulli with success probability *prob*, and then for a fraction abs(*c*) of them, set *S*[*i*] equal to *S*[*i* - *lag*] (if *c* is positive) or to 1 - *S*[*i* - *lag*] (if *c* is negative). This will give the right correlation in all cases, but it will destroy the expected value (or success probability) of the sample if *c* is negative and *prob* is different from 0.5. Since the code above already works for *c* >= 0, we assume *c* < 0 from now on.

One solution is to simply set the initial sample using some other success probability than *prob*, which may require changing the fraction of numbers that we change from abs(*c*) to something else. If in the end, we still get expected value *prob* for each entry, and *lag*-delayed autocorrelation *c*, then all is fine.

The first *lag* entries of *S* are never changed after generating them initially, so we need to generate them with success probability *prob*. Let us assume we generate the *n* - *lag* other entries with success probability *q*, and then for a fraction *r* of them, we set *S*[*i*] equal to 1 - *S*[*i* - *lag*]. We will determine the values of *q* and *r* that give expected value *prob* and correlation *c* later.

So suppose that we have generated all initial values of *S* and are now in the process of setting *S*[*i*] equal to 1 - *S*[*i* - *lag*] for some *i*. In particular, suppose we have processed the entries *S*[*j*] for *j* < *i*. Then currently *E*(*S*[*j*]) is equal to *prob* for *j* < *i* and it is equal to *q* for *j* >= *i*. With probability *r*, we set *S*[*i*] equal to 1 - *S*[*i* - *lag*], and otherwise we leave it as is; the expected value of *S*[*i*] after this operation will be *r*(1 - *prob*) + (1 - *r*)*q*. This should be equal to *prob*, so we have:

*r*(1 - *prob*) + (1 - *r*)*q* = *prob*.

Now for the correlation. The correlation of *S*[*i*] with *S*[*i* - *lag*] can be defined as

(E(S[i] * S[i - lag]) - E(S[i]) * E(S[i - lag]))/sqrt((E(S[i]^2) - E(S[i])^2)*(E(S[i - lag]^2) - E(S[i - lag])^2)) = (E(S[i] * S[i - lag]) - prob^2)/(prob - prob^2).

We now need to find *E*(*S*[*i*] * *S*[*i* - *lag*]). If we set *S*[*i*] equal to 1 - *S*[*i* - *lag*] (which we do with probability *r*), then their product is certainly 0. And if we don't (probability 1 - *r*), then the two are independent; *S*[*i* - *lag*] is equal to 1 with probability *prob*, but *S*[*i*] has probability *q* of being equal to 1. So the expected value we seek to find is equal to (1 - *r*) *prob* *q*. Setting the correlation equal to *c*, we find

((1 - r)*prob*q - prob^2)/(prob - prob^2) = ((1 - r) * q - prob) / (1 - prob) = c,

thus

(1 - r) * q = prob + c * (1 - prob).

Isolating the term (1 - *r*) *q* from the other equation above that came from setting the expected value to *prob*, we find that

prob - r*(1 - prob) = prob + c * (1 - prob),

or in other words, *r* = - *c*. This leads to *q* = (*prob* + *c* (1 - *prob*))/(1 + *c*). This leads to the following change to Alec's implementation:

SampleWithCorr2 := proc(prob::And(numeric,satisfies(c->c>=0 and c<=1)),
lag::posint,
c::And(numeric,satisfies(c->c<=1 and c>=-1)),
n::posint,
$)
local X, B, S, C, s, i, q, r;
uses Statistics;
if c >= 0 then
X := RandomVariable(Bernoulli(prob));
S := Sample(X, n);
if n <= lag or s = 0 then
return S;
end if;
B := RandomVariable(Bernoulli(abs(c)));
C := Sample(B, n - lag);
for i from lag + 1 to n do
if C[i - lag] = 1 then
S[i] := S[i - lag];
end if;
end do;
return S;
else
r := -c;
q := (prob + c * (1 - prob))/(1 + c);
# Initial lag elements
S := Vector(n, Sample(RandomVariable(Bernoulli(prob)), lag));
# Fill with the rest of the elements
ArrayTools:-Copy(Sample(RandomVariable(Bernoulli(q)), n - lag), S, lag);
C := Sample(Bernoulli(r), n - lag);
for i from lag + 1 to n do
if C[i - lag] = 1 then
S[i] := 1 - S[i - lag];
end if;
end do;
return S;
end if;
end proc;

S := SampleWithCorr2(1/3, 1, -1/5, 10^6);
Vector[row](%id = 137642912)
Statistics:-Correlation(S[1..10^6-1], S[2..10^6]);
-0.2010351921
Statistics:-Mean(S);
0.3331040000
S := SampleWithCorr2(1/4, 1, 2/5, 10^6);
Vector[row](%id = 137036492)
Statistics:-Correlation(S[1..10^6-1], S[2..10^6]);
0.4008726203
Statistics:-Mean(S);
0.2510660000

Two final notes:

- There are values for
*prob* and *c* where *q* is not between 0 and 1. I think this reflects the fact that not all combinations of *c* and *prob* actually correspond to a solution at all. For example, if you want the correlation to be -1, then the values will have to alternate; thus the expected value of the entries will be 1/2. I haven't looked into this very deeply though.
- I tried to get the display engine to do Maple formatting, but didn't succeed. I might try to revisit this comment at a later stage and try and get that to work.

All the best, Erik.