So I have constructed this example. Naively I would have thought since constructing the big matrix takes N^2 operations and each small matrix (N/2)^2 operations, if three threads run at once I would get a big speedup. Instead it is much slower.

My machine is X64 (Windows XP) quad core and X86 (Windows 7) 2 core. The overall result is similar. What is happening here?

> restart;

> with(LinearAlgebra);with(Threads);

> Digits:=14:

> UseHardwareFloats:=true:

> # some utility programs

>

> Ncols:=proc(A)

> op(2,ArrayDims(A)[2]);

> end proc:

>

> Nrows:=proc(A)

> op(2,ArrayDims(A)[1]);

> end proc:

> kernelopts(numcpus);

2

> kernelopts(numactivethreads);

1

> MakeV0:=proc(nrows::integer,ncols::integer,istart::integer,jstart::integer,N::Matrix(datatype=float[8]), X::Matrix(datatype=float[8]),nPotbasis::integer, sigma::float)

> local foo,Xrdim,Xcdim,Xrfull,ioffset,joffset;

> global MakeV0C;

>

> Xrdim:=nrows;

> Xcdim:=ColumnDimension(X);

> Xrfull :=nPotbasis;

>

> ioffset:=istart;

> joffset:=jstart;

>

> print("numthreads executing = ",kernelopts(numactivethreads));

> if istart=jstart then

> foo:=Matrix(nrows,ncols,(i,j)->MakeV0A(i+ioffset,j+joffset,N[i+ioffset,j+joffset],X,sigma,Xcdim,Xrfull));

> else

> foo:=Matrix(nrows,ncols,(i,j)->MakeV0A(i+ioffset,j+joffset,N[i+ioffset,j+joffset],X,sigma,Xcdim,Xrfull));

> end if;

> return foo;

> end proc:

>

> MakeV0A:=proc(i,j,Nij,X,sigma,Xcdim,Xrfull)

> local V0,den,k,l,diffs,pot,psi,tempdiffs,temppsi,temppot,potn,chsign;

> #option autocompile;

>

> den:=(2*sigma^2);

> chsign:=proc(x) -x;end proc;

> tempdiffs:=Array(1..Xrfull);

> seq(assign('tempdiffs'[l],add(((X[i,k]+X[j,k])/2.0-X[l,k])^2,k=1..Xcdim)/den),l=1..Xrfull);

> temppsi:=exp~(chsign~(tempdiffs));

> temppot:= zip((a,b)->a*b,temppsi,tempdiffs);

> pot:=convert(temppot,`+`);

> psi:=convert(temppsi,`+`);

>

> if pot=0.0 and psi=0.0 then

> V0:=0.0;

> else V0 := Nij*pot/psi;

> end if;

> return V0;

> end proc:

> mergeV:=proc(V1,V2,V3)

> local V1rows, V2rows,V3rows,V1cols,V2cols,V3cols,totrows,totcols,Vtemp;

>

> V1rows:=Nrows(V1);

> V1cols:=Ncols(V1);

> V2rows:=Nrows(V2);

> V2cols:=Ncols(V2);

> V3rows:=Nrows(V3);

> V3cols:=Ncols(V3);

>

> if V3rows <> V1rows then

> error("V3rows=%1 not equal to V1rows = %2",V3rows,V1rows);

> elif V3cols <> V2cols then

> error("V3cols=%1 not equal to V2cols = %2",V3rows,V1rows);

> end if;

>

> totrows:=V1rows+V2rows;

> totcols:=V1cols+V2cols;

>

> Vtemp:=Matrix(V1rows+V2rows,V1cols+V2cols,datatype=float[8]);

> Vtemp[1..V1rows,1..V1cols]:=V1;

> Vtemp[V1rows+1..totrows,V1cols+1..totcols]:=V2;

> Vtemp[1..V1rows,V1cols+1..totcols]:=V3;

> Vtemp[V1rows+1..totrows,1..V1cols]:=HermitianTranspose(V3);

>

> return Vtemp;

>

> end proc:

>

> N:=0.05*RandomMatrix(300,300):

> N:=0.5*(N+HermitianTranspose(N)):

> XX:=0.03*RandomMatrix(3000,18):

> st:=time():Vfull:=MakeV0(40,40,1,1,N,XX,400,0.2);time()-st;

"numthreads executing = ",1

Matrix(%id = 869864272)

208.938

> task:=proc(N,XX)

> local W;

> W:=Task:-Continue(mergeV,Task=[MakeV0,20,20,1,1,N,XX,400,0.2],

> Task=[MakeV0,20,20,21,21,N,XX,400,0.2],Task=[MakeV0,20,20,1,21,N,XX,400,0.2]);

> return W;

> end proc:

> st:=time():Vtask:=Task:-Start(task,N,XX);time()-st;

"numthreads executing = ",4

"numthreads executing = ",4

"numthreads executing = ",4

Matrix(%id = 869867552)

314.734

> Vfull-Vtask;