A random sample of length n drawn from Bernoulli distribution with probability of success prob, that has a correlation c with itself shifted back lag steps, can be generated using following procedure,
SampleWithCorr:=proc(prob::And(numeric,satisfies(c->c>=0 and c<=1)), lag::posint,c::And(numeric,satisfies(c->c<=1 and c>=-1)),n::posint) local X,B,S,C,s,i; uses Statistics; X:=RandomVariable(Bernoulli(prob)); S:=Sample(X,n); if n<=lag or s=0 then S else s:=signum(c); B:=RandomVariable(Bernoulli(abs(c))); C:=Sample(B,n-lag); if s=1 then for i from lag+1 to n do if C[i-lag]=1 then S[i]:=S[i-lag] fi od; else for i from lag+1 to n do if C[i-lag]=1 then S[i]:=1-S[i-lag] fi od fi fi; S end:
For example,
S:=SampleWithCorr(1/2,1,-0.7,10000);
[ 10000 Element Row Vector ]
S := [ Data Type: float[8] ]
[ Storage: rectangular ]
[ Order: Fortran_order ]
Statistics:-Correlation(S[1..9999],S[2..10000]);
-.6973705688
Edit 1: Erik Postma noticed that it gives a distribution with different probability of success in cases with prob not equal 1/2 for negative correlation. Any suggestions?
Acknowledgements: The idea (for prob = 1/2) belongs to alex_01. The implementation (not perfect, but working) is mine. I am thankful to Erik Postma noticing the problem for negative correlations in cases with prob not equal 1/2.
Alec
Comments
Extension for negative correlation
Alec and I thought about the issue for negative correlation a bit more, and thanks to our conversation I've managed to convince at least myself that the following will work.
So let me state the problem once more. We intend to produce a vector S of random numbers such that each number individually is Bernoulli distributed with success probability prob, and such that the correlation of the S[i] and the S[i - lag] is c. The idea behind the code above is to first generate a sequence of independent numbers, all Bernoulli with success probability prob, and then for a fraction abs(c) of them, set S[i] equal to S[i - lag] (if c is positive) or to 1 - S[i - lag] (if c is negative). This will give the right correlation in all cases, but it will destroy the expected value (or success probability) of the sample if c is negative and prob is different from 0.5. Since the code above already works for c >= 0, we assume c < 0 from now on.
One solution is to simply set the initial sample using some other success probability than prob, which may require changing the fraction of numbers that we change from abs(c) to something else. If in the end, we still get expected value prob for each entry, and lag-delayed autocorrelation c, then all is fine.
The first lag entries of S are never changed after generating them initially, so we need to generate them with success probability prob. Let us assume we generate the n - lag other entries with success probability q, and then for a fraction r of them, we set S[i] equal to 1 - S[i - lag]. We will determine the values of q and r that give expected value prob and correlation c later.
So suppose that we have generated all initial values of S and are now in the process of setting S[i] equal to 1 - S[i - lag] for some i. In particular, suppose we have processed the entries S[j] for j < i. Then currently E(S[j]) is equal to prob for j < i and it is equal to q for j >= i. With probability r, we set S[i] equal to 1 - S[i - lag], and otherwise we leave it as is; the expected value of S[i] after this operation will be r(1 - prob) + (1 - r)q. This should be equal to prob, so we have:
r(1 - prob) + (1 - r)q = prob.
Now for the correlation. The correlation of S[i] with S[i - lag] can be defined as
We now need to find E(S[i] * S[i - lag]). If we set S[i] equal to 1 - S[i - lag] (which we do with probability r), then their product is certainly 0. And if we don't (probability 1 - r), then the two are independent; S[i - lag] is equal to 1 with probability prob, but S[i] has probability q of being equal to 1. So the expected value we seek to find is equal to (1 - r) prob q. Setting the correlation equal to c, we find
thus
Isolating the term (1 - r) q from the other equation above that came from setting the expected value to prob, we find that
or in other words, r = - c. This leads to q = (prob + c (1 - prob))/(1 + c). This leads to the following change to Alec's implementation:
SampleWithCorr2 := proc(prob::And(numeric,satisfies(c->c>=0 and c<=1)), lag::posint, c::And(numeric,satisfies(c->c<=1 and c>=-1)), n::posint, $) local X, B, S, C, s, i, q, r; uses Statistics; if c >= 0 then X := RandomVariable(Bernoulli(prob)); S := Sample(X, n); if n <= lag or s = 0 then return S; end if; B := RandomVariable(Bernoulli(abs(c))); C := Sample(B, n - lag); for i from lag + 1 to n do if C[i - lag] = 1 then S[i] := S[i - lag]; end if; end do; return S; else r := -c; q := (prob + c * (1 - prob))/(1 + c); # Initial lag elements S := Vector(n, Sample(RandomVariable(Bernoulli(prob)), lag)); # Fill with the rest of the elements ArrayTools:-Copy(Sample(RandomVariable(Bernoulli(q)), n - lag), S, lag); C := Sample(Bernoulli(r), n - lag); for i from lag + 1 to n do if C[i - lag] = 1 then S[i] := 1 - S[i - lag]; end if; end do; return S; end if; end proc;S := SampleWithCorr2(1/3, 1, -1/5, 10^6); Vector[row](%id = 137642912) Statistics:-Correlation(S[1..10^6-1], S[2..10^6]); -0.2010351921 Statistics:-Mean(S); 0.3331040000 S := SampleWithCorr2(1/4, 1, 2/5, 10^6); Vector[row](%id = 137036492) Statistics:-Correlation(S[1..10^6-1], S[2..10^6]); 0.4008726203 Statistics:-Mean(S); 0.2510660000Two final notes:
All the best, Erik.
Negative correlation
Suppose X and Y are Bernoulli random variables, both with mean p.
We have E[XY] = Prob(X=1,Y=1) = 2*p-1 - Prob(X=0,Y=0) >= max(0, 2*p-1).
Thus Cov[X,Y] = E[XY] - p^2 >= max(-p^2, -(1-p)^2) and
Corr(X,Y) >= max(p/(p-1), (p-1)/p).
Moreover, this inequality is best possible.
Thanks, Robert
Thanks, Robert. A final improvement then; I also realized that I broke detection of n being less than or equal to lag:
SampleWithCorr3 := proc(prob::And(nonnegative,satisfies(c->c<=1)), lag::posint, c::And(satisfies(c->c<=1 and c>=-1)), n::posint, $) local S, C, i, q, r; uses Statistics; ASSERT(c >= prob/(prob - 1) and c >= (prob - 1)/prob, "impossible to achieve this combination of correlation and probability"); if n <= lag then return Sample(Randomvariable(Bernoulli(prob)), n); end if; if c >= 0 then S := Sample(RandomVariable(Bernoulli(prob)), n); C := Sample(RandomVariable(Bernoulli(c)), n - lag); for i to n - lag do if C[i] = 1 then S[i + lag] := S[i]; end if; end do; else r := -c; q := (prob + c * (1 - prob))/(1 + c); # Initial lag elements S := Vector(n, Sample(RandomVariable(Bernoulli(prob)), lag)); # Fill with the rest of the elements ArrayTools:-Copy(Sample(RandomVariable(Bernoulli(q)), n - lag), S, lag); C := Sample(Bernoulli(r), n - lag); for i to n - lag do if C[i] = 1 then S[i + lag] := 1 - S[i]; end if; end do; end if; return S; end proc;