Bernoulli Distribution with Serial Correlation

alec's picture

A random sample of length n drawn from Bernoulli distribution with probability of success prob, that has a correlation c with itself shifted back lag steps, can be generated using following procedure,

SampleWithCorr:=proc(prob::And(numeric,satisfies(c->c>=0 and c<=1)),
lag::posint,c::And(numeric,satisfies(c->c<=1 and c>=-1)),n::posint)
local X,B,S,C,s,i;
uses Statistics;
X:=RandomVariable(Bernoulli(prob));
S:=Sample(X,n);
if n<=lag or s=0 then S else
s:=signum(c);
B:=RandomVariable(Bernoulli(abs(c)));
C:=Sample(B,n-lag);
if s=1 then 
for i from lag+1 to n do
if C[i-lag]=1 then S[i]:=S[i-lag] fi od;
else for i from lag+1 to n do
if C[i-lag]=1 then S[i]:=1-S[i-lag] fi od
fi fi; S end:

For example,

S:=SampleWithCorr(1/2,1,-0.7,10000);

                       [ 10000 Element Row Vector ]
                  S := [ Data Type: float[8]      ]
                       [ Storage: rectangular     ]
                       [ Order: Fortran_order     ]

Statistics:-Correlation(S[1..9999],S[2..10000]);

                            -.6973705688

Edit 1: Erik Postma noticed that it gives a distribution with different probability of success in cases with prob not equal 1/2 for negative correlation. Any suggestions?

Acknowledgements: The idea (for prob = 1/2) belongs to alex_01. The implementation (not perfect, but working) is mine. I am thankful to Erik Postma noticing the problem for negative correlations in cases with prob not equal 1/2.

Alec

Comments

epostma's picture

Extension for negative correlation

Alec and I thought about the issue for negative correlation a bit more, and thanks to our conversation I've managed to convince at least myself that the following will work.

So let me state the problem once more. We intend to produce a vector S of random numbers such that each number individually is Bernoulli distributed with success probability prob, and such that the correlation of the S[i] and the S[i - lag] is c. The idea behind the code above is to first generate a sequence of independent numbers, all Bernoulli with success probability prob, and then for a fraction abs(c) of them, set S[i] equal to S[i - lag] (if c is positive) or to 1 - S[i - lag] (if c is negative). This will give the right correlation in all cases, but it will destroy the expected value (or success probability) of the sample if c is negative and prob is different from 0.5. Since the code above already works for c >= 0, we assume c < 0 from now on.

One solution is to simply set the initial sample using some other success probability than prob, which may require changing the fraction of numbers that we change from abs(c) to something else. If in the end, we still get expected value prob for each entry, and lag-delayed autocorrelation c, then all is fine.

The first lag entries of S are never changed after generating them initially, so we need to generate them with success probability prob. Let us assume we generate the n - lag other entries with success probability q, and then for a fraction r of them, we set S[i] equal to 1 - S[i - lag]. We will determine the values of q and r that give expected value prob and correlation c later.

So suppose that we have generated all initial values of S and are now in the process of setting S[i] equal to 1 - S[i - lag] for some i. In particular, suppose we have processed the entries S[j] for j < i. Then currently E(S[j]) is equal to prob for j < i and it is equal to q for j >= i. With probability r, we set S[i] equal to 1 - S[i - lag], and otherwise we leave it as is; the expected value of S[i] after this operation will be r(1 - prob) + (1 - r)q. This should be equal to prob, so we have:

r(1 - prob) + (1 - r)q = prob.

Now for the correlation. The correlation of S[i] with S[i - lag] can be defined as

(E(S[i] * S[i - lag]) - E(S[i]) * E(S[i - lag]))/sqrt((E(S[i]^2) - E(S[i])^2)*(E(S[i - lag]^2) - E(S[i - lag])^2)) = (E(S[i] * S[i - lag]) - prob^2)/(prob - prob^2).

We now need to find E(S[i] * S[i - lag]). If we set S[i] equal to 1 - S[i - lag] (which we do with probability r), then their product is certainly 0. And if we don't (probability 1 - r), then the two are independent; S[i - lag] is equal to 1 with probability prob, but S[i] has probability q of being equal to 1. So the expected value we seek to find is equal to (1 - r) prob q. Setting the correlation equal to c, we find

((1 - r)*prob*q - prob^2)/(prob - prob^2) = ((1 - r) * q - prob) / (1 - prob) = c,

thus

(1 - r) * q = prob + c * (1 - prob).

Isolating the term (1 - r) q from the other equation above that came from setting the expected value to prob, we find that

prob - r*(1 - prob) = prob + c * (1 - prob),

or in other words, r = - c. This leads to q = (prob + c (1 - prob))/(1 + c). This leads to the following change to Alec's implementation:

SampleWithCorr2 := proc(prob::And(numeric,satisfies(c->c>=0 and c<=1)),
                        lag::posint,
                        c::And(numeric,satisfies(c->c<=1 and c>=-1)),
                        n::posint,
                        $)
local X, B, S, C, s, i, q, r;
uses Statistics;
  if c >= 0 then
    X := RandomVariable(Bernoulli(prob));
    S := Sample(X, n);
    if n <= lag or s = 0 then 
      return S;
    end if;
    B := RandomVariable(Bernoulli(abs(c)));
    C := Sample(B, n - lag);
    for i from lag + 1 to n do
      if C[i - lag] = 1 then
        S[i] := S[i - lag];
      end if;
    end do;
    return S;
  else
    r := -c;
    q := (prob + c * (1 - prob))/(1 + c);
    # Initial lag elements
    S := Vector(n, Sample(RandomVariable(Bernoulli(prob)), lag));
    # Fill with the rest of the elements
    ArrayTools:-Copy(Sample(RandomVariable(Bernoulli(q)), n - lag), S, lag);
    C := Sample(Bernoulli(r), n - lag);
    for i from lag + 1 to n do
      if C[i - lag] = 1 then
        S[i] := 1 - S[i - lag];
      end if;
    end do;
    return S;
  end if;
end proc;
S := SampleWithCorr2(1/3, 1, -1/5, 10^6);
                        Vector[row](%id = 137642912)
Statistics:-Correlation(S[1..10^6-1], S[2..10^6]);
                                -0.2010351921
Statistics:-Mean(S);
                                0.3331040000
S := SampleWithCorr2(1/4, 1, 2/5, 10^6);
                        Vector[row](%id = 137036492)
Statistics:-Correlation(S[1..10^6-1], S[2..10^6]);
                                0.4008726203
Statistics:-Mean(S);
                                0.2510660000

Two final notes:

  1. There are values for prob and c where q is not between 0 and 1. I think this reflects the fact that not all combinations of c and prob actually correspond to a solution at all. For example, if you want the correlation to be -1, then the values will have to alternate; thus the expected value of the entries will be 1/2. I haven't looked into this very deeply though.
  2. I tried to get the display engine to do Maple formatting, but didn't succeed. I might try to revisit this comment at a later stage and try and get that to work.

All the best, Erik.

Robert Israel's picture

Negative correlation

Suppose X and Y are Bernoulli random variables, both with mean p.

We have E[XY] = Prob(X=1,Y=1) = 2*p-1 - Prob(X=0,Y=0) >= max(0, 2*p-1).
Thus Cov[X,Y] = E[XY] - p^2 >= max(-p^2, -(1-p)^2) and
Corr(X,Y) >= max(p/(p-1), (p-1)/p). 

Moreover, this inequality is best possible.

epostma's picture

Thanks, Robert

Thanks, Robert. A final improvement then; I also realized that I broke detection of n being less than or equal to lag:

SampleWithCorr3 := proc(prob::And(nonnegative,satisfies(c->c<=1)),
                        lag::posint,
                        c::And(satisfies(c->c<=1 and c>=-1)),
                        n::posint,
                        $)
local S, C, i, q, r;
uses Statistics;
  ASSERT(c >= prob/(prob - 1) and c >= (prob - 1)/prob,
      "impossible to achieve this combination of correlation and probability");
  if n <= lag then 
    return Sample(Randomvariable(Bernoulli(prob)), n);
  end if;
  if c >= 0 then
    S := Sample(RandomVariable(Bernoulli(prob)), n);
    C := Sample(RandomVariable(Bernoulli(c)), n - lag);
    for i to n - lag do
      if C[i] = 1 then
        S[i + lag] := S[i];
      end if;
    end do;
  else
    r := -c;
    q := (prob + c * (1 - prob))/(1 + c);
    # Initial lag elements
    S := Vector(n, Sample(RandomVariable(Bernoulli(prob)), lag));
    # Fill with the rest of the elements
    ArrayTools:-Copy(Sample(RandomVariable(Bernoulli(q)), n - lag), S, lag);
    C := Sample(Bernoulli(r), n - lag);
    for i to n - lag do
      if C[i] = 1 then
        S[i + lag] := 1 - S[i];
      end if;
    end do;
  end if;
  return S;
end proc;

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
}