Very slow evaluating matrix computation.

zakyn's picture

Hello,

I would like to use Maple instead of Matlab, well I rewrite the Matlab code to Maple like following. The output is correct, but it spends more time than in Matlab. Is it possible to rewrite my Maple code to get output faster?

restart:
linearni_model:=proc(beta,sigma)
local n1,n2,Y,X,Y1,Y2,se,se2,t1,t2,t3,t4,vbeta,t,si2,e1,e2,e3,e4,Yp,N,s2,n;
use Statistics in
X:=Matrix([[1, 1, 1, 1, 1], [1, 1, 1, 1, -1], [1, 1, 1, -1, 1], [1, 1, 1, -1, -1], [1, 1, -1, 1, 1], [1, 1, -1, 1, -1], [1, 1, -1, -1, 1], [1, 1, -1, -1, -1], [1, -1, 1, 1, 1], [1, -1, 1, 1, -1], [1, -1, 1, -1, 1], [1, -1, 1, -1, -1], [1, -1, -1, 1, 1], [1, -1, -1, 1, -1], [1, -1, -1, -1, 1], [1, -1, -1, -1, -1]]);
n:=16;
N := RandomVariable(Normal(0,sigma)):
n1:=Sample(N,16);
n2:=Sample(N,16):
Y:=X.beta:
Y:=convert(Y,Vector[row]):
Y1 := Y + n1:
Y2 := Y + n2:
Yp := (Y1 + Y2)/2:
e1:=add(Yp[k]*X[k,2]/(n/2),k=1..16):
e2:=add(Yp[k]*X[k,3]/(n/2),k=1..16):
e3:=add(Yp[k]*X[k,4]/(n/2),k=1..16):
e4:=add(Yp[k]*X[k,5]/(n/2),k=1..16):
si2:=Vector[row](1..16):
si2 := 1/2*map(x->x^2, Y1[1..16]-Y2[1..16]):
s2:=add(si2[k],k=1..16)/n:
se2 := 4*s2/(2*n):
se:=sqrt(se2):
t1 := e1/se;
t2 := e2/se;
t3 := e3/se;
t4 := e4/se;
t := 2.1199:
#% z tabulek k=16, alfa=0,05
#% testy vyznamnosti efektu,1-je vyznamny,0-neni vyznamny
vbeta := <0,0,0,0>:
if abs(t1)>t then
vbeta[1]:=1:
end if;
if abs(t2)>t then
vbeta[2]:=1:
end if;
if abs(t3)>t then
vbeta[3]:=1:
end if;
if abs(t4)>t then
vbeta[4]:=1:
end if;
end use;
return vbeta;
end proc:
test:=proc(beta, sigma, pocet_opakovani)
local i,vyznamnost;
vyznamnost := <0,0,0,0>;
for i from 1 to pocet_opakovani do
vyznamnost := vyznamnost + linearni_model(beta,sigma);
end do;
vyznamnost := map(x->x/pocet_opakovani,vyznamnost)*100;
end proc:
beta:=<11,7,5,-9,6>;
linearni_model(beta,sigma);
test(beta,1,1000);

Thank you very much.

Vladimir

View 551_mapleprimes.mw on MapleNet or Download 551_mapleprimes.mw
View file details

Comments

Getting faster: start with CodeTools[Profiling]

The first question you need to answer is how much faster do you need it to be.

Before attempting any serious optimization, you should get some idea as to what steps are taking most of the time. The CodeTools[Profiling] subpackage is useful for that. It can be used as follows:

sigma := 3.0: # this wasn't assigned
with(CodeTools[Profiling]:
Profile(linearni_model):
test(beta,1,1000):
PrintProfiles();
linearni_model
linearni_model := proc(beta, sigma)
local n1, n2, Y, X, Y1, Y2, se, se2, t1, t2, t3, t4, vbeta, t, si2, e1, e2, e3, e4, Yp, N, s2, n;
     |Calls Seconds  Words|
PROC | 1000  13.430 40194262|
   1 | 1000   0.310 1537769| X := Matrix([[1, 1, 1, 1, 1], [1, 1, 1, 1, -1], [1, 1, 1, -1, 1], [1, 1, 1, -1, -1], [1, 1, -1, 1, 1], [1, 1, -1, 1, -1], [1, 1, -1, -1, 1], [1, 1, -1, -1, -1], [1, -1, 1, 1, 1], [1, -1, 1, 1, -1], [1, -1, 1, -1, 1], [1, -1, 1, -1, -1], [1, -1, -1, 1, 1], [1, -1, -1, 1, -1], [1, -1, -1, -1, 1], [1, -1, -1, -1, -1]]);
   2 | 1000   0.000       0| n := 16;
   3 | 1000   0.490  660757| N := Statistics:-RandomVariable(Normal(0,sigma));
   4 | 1000   1.160 3383694| n1 := Statistics:-Sample(N,16);
   5 | 1000   1.200 3267384| n2 := Statistics:-Sample(N,16);
   6 | 1000   1.740 4796540| Y := X.beta;
   7 | 1000   0.240 662748 | Y := convert(Y,Vector[row]);
   8 | 1000   1.870 5232405| Y1 := Y+n1;
   9 | 1000   1.850 5270749| Y2 := Y+n2;
  10 | 1000   0.000  153004| Yp := 1/2*Y1+1/2*Y2;
  11 | 1000   0.470 1638911| e1 := add(2*Yp[k]*X[k,2]/n,k = 1 .. 16);
  12 | 1000   0.410 1737491| e2 := add(2*Yp[k]*X[k,3]/n,k = 1 .. 16);
  13 | 1000   0.370 1770133| e3 := add(2*Yp[k]*X[k,4]/n,k = 1 .. 16);
  14 | 1000   0.430 1740252| e4 := add(2*Yp[k]*X[k,5]/n,k = 1 .. 16);
  15 | 1000   0.210 738338 | si2 := Vector[row](1 .. 16);
  16 | 1000   2.130 6121594| si2 := 1/2*map(x -> x^2,Y1[1 .. 16]-Y2[1 .. 16]);
  17 | 1000   0.350  877050| s2 := add(si2[k],k = 1 .. 16)/n;
  18 | 1000   0.030   58980| se2 := 2*s2/n;
  19 | 1000   0.070  327876| se := sqrt(se2);
  20 | 1000   0.020   19010| t1 := e1/se;
  21 | 1000   0.010   29733| t2 := e2/se;
  22 | 1000   0.010   29423| t3 := e3/se;
  23 | 1000   0.010   32201| t4 := e4/se;
  24 | 1000   0.000       0| t := 2.1199;
  25 | 1000   0.000   23919| vbeta := `<,>`(0,0,0,0);
  26 | 1000   0.030   13338| if t < abs(t1) then
  27 | 1000   0.000    8245|   vbeta[1] := 1
                            end if;
  28 | 1000   0.000   15171| if t < abs(t2) then
  29 | 1000   0.000    5659|   vbeta[2] := 1
                            end if;
  30 | 1000   0.010   25612| if t < abs(t3) then
  31 | 1000   0.010    4869|   vbeta[3] := 1
                            end if;
  32 | 1000   0.000    8043| if t < abs(t4) then
  33 | 1000   0.000    3364|   vbeta[4] := 1
                            end if;
  34 | 1000   0.000       0| return vbeta
end proc

From that it is apparent that there are a few trivial improvements you can make. There is no use reassigning X each time inside the procedure, either make a global, pass it in as an argument, or lexically scope it. Statement 15, the initialization of si2, is not needed and can be deleted. Those should be done, but the overall effect is small. To make a serious dent you're probably going to have rewrite this so that evalhf can be used. That won't work, alas, on the statements that use the Statistics package, which currently take about 20% of the total time. So without redoing that, the best you can get is a 5X improvement.

Another improvement you can make is to precompute X.beta, no since doing that each time in the inner loop. Following is a rewrite with these changes. I've split your procedure into two parts, the second part (now called `compute') can be called with the evalhf procedure.

restart:

X := Matrix([[1, 1, 1, 1, 1], [1, 1, 1, 1, -1], [1, 1, 1, -1, 1], [1, 1, 1, -1, -1], [1, 1, -1, 1, 1], [1, 1, -1, 1, -1], [1, 1, -1, -1, 1], [1, 1, -1, -1, -1], [1, -1, 1, 1, 1], [1, -1, 1, 1, -1], [1, -1, 1, -1, 1], [1, -1, 1, -1, -1], [1, -1, -1, 1, 1], [1, -1, -1, 1, -1], [1, -1, -1, -1, 1], [1, -1, -1, -1, -1]]
            ,datatype=float[8]):
n := 16;

sample :=proc(sigma)
local N;
global n;
uses Statistics;
    N := RandomVariable(Normal(0,sigma)):
    Sample(N,n),Sample(N,n);
end proc:

compute := proc(Y::'Vector[row]'(16,datatype=float[8])
                ,n1,n2,n,vbeta)
local Y1,Y2,se,se2,t1,t2,t3,t4,t,si2,e1,e2,e3,e4,Yp,s2,k;
    Y1 := Y + n1:
    Y2 := Y + n2:
    Yp := (Y1 + Y2)/2:
    e1 := add(Yp[k]*X[k,2]/(n/2),k=1..n):
    e2 := add(Yp[k]*X[k,3]/(n/2),k=1..n):
    e3 := add(Yp[k]*X[k,4]/(n/2),k=1..n):
    e4 := add(Yp[k]*X[k,5]/(n/2),k=1..n):
    si2 := 1/2*map(x->x^2, Y1-Y2):
    s2 := add(si2[k],k=1..n)/n:
    se2 := 4*s2/(2*n):
    se := sqrt(se2):
    t1 := e1/se;
    t2 := e2/se;
    t3 := e3/se;
    t4 := e4/se;
    t := 2.1199:
    if abs(t1)>t then vbeta[1] := 1: end if;
    if abs(t2)>t then vbeta[2] := 1: end if;
    if abs(t3)>t then vbeta[3] := 1: end if;
    if abs(t4)>t then vbeta[4] := 1: end if;
    vbeta;
end proc;

test := proc(beta, sigma, pocet_opakovani)
local i,vyznamnost,n1,n2,Y,vbeta,vbeta0;
    vyznamnost := <0,0,0,0>;
    Y := (X.beta)^%T;
    vbeta0 := Vector(4, datatype=float[8]);
    for i from 1 to pocet_opakovani do
        (n1,n2) := sample(sigma);
        vbeta := copy(vbeta0);
        vyznamnost := vyznamnost + evalhf(compute(Y,n1,n2,n,vbeta));
    end do;
    vyznamnost := map(x->x/pocet_opakovani,vyznamnost)*100;
end proc:

beta := Vector([11,7,5,-9,6],datatype=float[8]):
sigma := 3:

Profiling this we get

with(CodeTools[Profiling]):
Profile(test,sample):
PrintProfiles();
sample
sample := proc(sigma)
local N;
global n;
     |Calls Seconds  Words|
PROC | 1000   2.910 7395969|
   1 | 1000   0.540 674583| N := Statistics:-RandomVariable(Normal(0,sigma));
   2 | 1000   2.370 6721386| Statistics:-Sample(N,n), Statistics:-Sample(N,n)
end proc

test
test := proc(beta, sigma, pocet_opakovani)
local i, vyznamnost, n1, n2, Y, vbeta, vbeta0;
     |Calls Seconds  Words|
PROC |    1   0.440 913992|
   1 |    1   0.000    351| vyznamnost := `<,>`(0,0,0,0);
   2 |    1   0.070 161693| Y := (X.beta)^%T;
   3 |    1   0.000    455| vbeta0 := Vector(4,datatype = float[8]);
   4 |    1   0.010      0| for i to pocet_opakovani do
   5 | 1000   0.010   4292|   n1, n2 := sample(sigma);
   6 | 1000   0.090  88718|   vbeta := copy(vbeta0);
   7 | 1000   0.250 648017|   vyznamnost := vyznamnost+evalhf(compute(Y,n1,n2,n,vbeta))
                            end do;
   8 |    1   0.010  10466| vyznamnost := 100*map(x -> x/pocet_opakovani,vyznamnost)
end proc

This runs about 3-4 times faster than the original. To significantly it improve it you might have to code your own procedures for Statistics[sample], which now use up most of the time.

faster yet

On reflection, a few more improvements can be made. The call to RandomVariable should be taken out of the outer loop. Rather than making multiple calls to Sample, we can use a very large sample to initialize a Matrix, then select one row at a time. The following code does that (compute is not changed):

test := proc(beta, sigma, pocet_opakovani)
local i,vyznamnost,n1,n2,Y,vbeta,vbeta0,N,N1,N2;
global n;
uses Statistics;
    vyznamnost := <0,0,0,0>;
    Y := (X.beta)^%T;
    vbeta0 := Vector(4, datatype=float[8]);
    N := RandomVariable(Normal(0,sigma));
    N1 := Matrix(pocet_opakovani, n, convert(Sample(N,n*pocet_opakovani),list));
    N2 := Matrix(pocet_opakovani, n, convert(Sample(N,n*pocet_opakovani),list));
    for i from 1 to pocet_opakovani do
        n1 := N1[i,1..-1];
        n2 := N2[i,1..-1];
        vbeta := copy(vbeta0);
        vyznamnost := vyznamnost + evalhf(compute(Y,n1,n2,n,vbeta));
    end do;
    vyznamnost := map(x->x/pocet_opakovani,vyznamnost)*100;
end proc:

beta := Vector([11,7,5,-9,6],datatype=float[8]):
sigma := 3:

with(CodeTools[Profiling]):
Profile(test);
time(test(beta, sigma, 1000));
PrintProfiles();
test
test := proc(beta, sigma, pocet_opakovani)
local i, vyznamnost, n1, n2, Y, vbeta, vbeta0, N, N1, N2;
global n;
     |Calls Seconds  Words|
PROC |    1   0.890 2899786|
   1 |    1   0.000    186| vyznamnost := `<,>`(0,0,0,0);
   2 |    1   0.060 161140| Y := (X.beta)^%T;
   3 |    1   0.000    413| vbeta0 := Vector(4,datatype = float[8]);
   4 |    1   0.010  50436| N := Statistics:-RandomVariable(Normal(0,sigma));
   5 |    1   0.190 657490| N1 := Matrix(pocet_opakovani,n,convert(Statistics:-Sample(N,n*pocet_opakovani),list));
   6 |    1   0.190 687860| N2 := Matrix(pocet_opakovani,n,convert(Statistics:-Sample(N,n*pocet_opakovani),list));
   7 |    1   0.000      0| for i to pocet_opakovani do
   8 | 1000   0.000  69326|   n1 := N1[i,1 .. -1];
   9 | 1000   0.000  50770|   n2 := N2[i,1 .. -1];
  10 | 1000   0.070 104326|   vbeta := copy(vbeta0);
  11 | 1000   0.370 1105480|   vyznamnost := vyznamnost+evalhf(compute(Y,n1,n2,n,vbeta))
                            end do;
  12 |    1   0.000  12359| vyznamnost := 100*map(x -> x/pocet_opakovani,vyznamnost)
end proc

That is an improvement of over an order of magnitude from the original.

roman_pearce's picture

thanks

Thank you for your posts on this topic, I've found them insightful.

ArrayTools:-Reshape

You're welcome. I usually learn as much doing this as I hope others do. Here is another improvement, which doubles the speed. As alluded to in my previous post, it is a lot faster to create one large random sample in one call to Sample, then to make multiple calls. Instead of making two calls, to generate N1 and N2, it is better to generate a larger array and then subdivide. I didn't do that originally because I didn't come up with an elegant way of doing so. However, there is one and it works nicely. Use ArrayTools:-Reshape command to reshape a the large Vector returned by Sample to a multidimensional Array. This is very fast and elegant.

test := proc(beta, sigma, pocet_opakovani)
local i,vyznamnost,n1,n2,Y,vbeta,vbeta0,N,V,N12;
global n;
uses Statistics;
    vyznamnost := <0,0,0,0>;
    Y := (X.beta)^%T;
    vbeta0 := Vector(4, datatype=float[8]);
    N := RandomVariable(Normal(0,sigma));
    V := Sample(N,2*n*pocet_opakovani);
    N12 := ArrayTools:-Reshape(V,pocet_opakovani,n,2);
    for i from 1 to pocet_opakovani do
        n1 := N12[i,1..-1,1];
        n2 := N12[i,1..-1,2];
        vbeta := copy(vbeta0);
        vyznamnost := vyznamnost + evalhf(compute(Y,n1,n2,n,vbeta));
    end do;
    vyznamnost := map(x->x/pocet_opakovani,vyznamnost)*100;
end proc:

Running Profiling on this gives

test := proc(beta, sigma, pocet_opakovani)
local i, vyznamnost, n1, n2, Y, vbeta, vbeta0, N, N12, V;
global n;
     |Calls Seconds  Words|
PROC |    1   0.410 1243475|
   1 |    1   0.000    181| vyznamnost := `<,>`(0,0,0,0);
   2 |    1   0.070 162542| Y := (X.beta)^%T;
   3 |    1   0.000    497| vbeta0 := Vector(4,datatype = float[8]);
   4 |    1   0.020  50321| N := Statistics:-RandomVariable(Normal(0,sigma));
   5 |    1   0.070  82392| V := Statistics:-Sample(N,2*n*pocet_opakovani);
   6 |    1   0.000  67716| N12 := ArrayTools:-Reshape(V,pocet_opakovani,n,2);
   7 |    1   0.000      0| for i to pocet_opakovani do
   8 | 1000   0.020  84299|   n1 := N12[i,1 .. -1,1];
   9 | 1000   0.000  66057|   n2 := N12[i,1 .. -1,2];
  10 | 1000   0.050 120740|   vbeta := copy(vbeta0);
  11 | 1000   0.170 597630|   vyznamnost := vyznamnost+evalhf(compute(Y,n1,n2,n,vbeta))
                            end do;
  12 |    1   0.010  11100| vyznamnost := 100*map(x -> x/pocet_opakovani,vyznamnost)
end proc

We should be able, I think, to do even better by not reshaping and using ArrayTools:-Alias in the inner loop to create n1 and n2, however, not enough time can be saved there to make a significant change. Better to work on the long poles. The Matrix multiplication and transpose in line 2 (above) can be improved.

acer's picture

memory usage as component of efficiency

One can also consider memory usage as an important component of efficiency. Apart from wanting to keep memory allocation down for its own benefits, one can also try to keep memory use and re-use down so as to minimize garbage collection time.

Other benefits of this sort of array memory optimization can be that total memory allocation can sometimes be reduced, that large problems can become tractable, and sometimes speed is improved. The reasons for this can be complicated, but it can relate to memory fragmentation and also to the fact that garbage is not immediately freed.

For example,
Y := (X.beta)^%T;
might become
Y := X.beta;
LinearAlgebra:-Transpose(Y,inplace=true);
That should avoid producing an unnecessary object of the size of Y.

The creating of Arrays n1 and n2 by forming each N12 sub-Array might also be improved. One might be able to allocate empty n1 and n2 Vector[row]'s of the desired size and datatype, just once, and then use ArrayTools:-Copy each time through the loop so as to get the right portion of N12 into them. The key would be using the right offset and stride.

One might also be able to allocate space for Y1 and Y2 to be used in procedure `compute`, just the once. Eg, produce Y1 and Y2 as empty Vector[row]'s of the desired datatype, outside of `compute`, just once. Then, outside of `compute` but each time through the loop, use ArrayTools:-Copy to get Y into Y1. Follow that by VectorAdd(Y1,n1,inplace=true). And similarly for Y2.

Notice also that n1 and n2 might only get used in `compute` to produce Yp. So why not produce a re-uasable container for Yp just once, and never produce Y1 and Y2 at all! How about something like this,

Yp:=Vector[row](num,datatype=float); # just once
n1:=Vector[row](num,datatype=float); # just once
# and now, inside the i loop
ArrayTools:-Copy(...,N12,...n1...); # get data into n1
ArrayTools:-Copy(...,N12,...Yp...); # get n2 into Yp
VectorAdd(Yp,n1,inplace=true);
VectorScalarMultiply(Yp,0.5,inplace=true); # (n1+n2)/2
VectorAdd(Yp,Y); # (Y+n1 + Y+n2)/2 = (Y1+Y2)/2

Now consider the lines in `compute` like,
add(Yp[k]*X[k,3]/(n/2),k=1..n):
These line are the only places where Yp gets used, yes? So why not first scale Yp, inplace, by n/2 and then have those lines be like,
add(Yp[k]*X[k,3],k=1..n):

The Y1-Y2 object is just n1-n2, no? So the Y1-Y2 = n1-n2 object could also be created just once, as a re-usable Vector `Y1minusY2` outside the loop. But outside the loop, one already has n1 from above code. And n1 is no longer needed, once Yp is formed. So use it to hold Y1-Y2=n1-n2. Ie, outside the loop do,
VectorAdd(n1,n2,1,-1,inplace=true);

Hopefully I haven't made major mistakes. I'm sure that there are other improvements possible, and some of these techniques above won't make a big difference for smaller problems.

acer

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
}