Using an accelerated BLAS implementation is important, but once a major proportional gain over generic reference BLAS has been attained, the smaller differences between fast implementations become less important. An exception to that idea is that threaded parallel implementations for SMP or multi-core machines can still be a further major benefit. It's also often the case that resolving interpreted-code bottlenecks at the Maple Library level is a fruitful area for focus and improvement, especially once some fast BLAS has been utilized. In other words, once fast BLAS are in use on a platform, further tweaking to get a little more performance out of the BLAS is effort less well spent than finding other bottlenecks in the Maple Library -- with multithreading being a possible exception.

Cleve Moler made a brief mention in Cleve's Corner, Winter 2000, to the effect that there was not much difference between MKL and ATLAS. That statement is dated, of course. Any difference is going to change over time, depending on which makes best use of new chipset extensions in a timely fashion.

There is an AMD equivalent to Intel's MKL: the AMD ACML.

The past few releases of the Intel MKL have also been available for Linux. The Maple-NAG Connector toolbox allows for use of generic, MKL, or ACML BLAS where the corresponding supported NAG C Library allows.

Tip for today: on an MS-Windows machine, the environment variable OMP_NUM_THREADS can be set to the number of SMP CPUs or cores. The Intel MKL should pick this up and allow parallel computation, especially noticeable in large floating-point level-3 BLAS calls. While this can bring a benefit for true multi-core machines, it can degrade performance on a single-core hyperthreaded CPU if nice access to cached data is ruined.
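As a quick sanity check from within a session, Maple's getenv command can report whether the variable was visible when Maple started (it must be set in the environment before Maple is launched, e.g. with set OMP_NUM_THREADS=4 in a Windows command shell; the value 4 here is just an illustrative core count):

```maple
# Inspect what the running Maple session sees; returns the value as a
# string, or NULL if the variable was not set before Maple was launched.
getenv("OMP_NUM_THREADS");
```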

Dave Linder Mathematical Software, Maplesoft

For problem 14, "Gaussian error function over a 1500x1500 matrix"

The original code,

with(LinearAlgebra):
TotalTime:=0:
for i from 1 by 1 to 100 do
a:=RandomMatrix(1500,density=0.5,generator=0.0..1.0,outputoptions=[datatype=float[8]]):
t:=time():
Map(erf,a):
TotalTime:=TotalTime+time()-t:
end do:
print(TotalTime);

was reported as taking about 4800 sec on their "Intel Quad Core Q6600 processor with 2.4 GHz and 2 GB RAM running under Windows Vista Home".

Replacing

LinearAlgebra:-Map(erf,a):

with

map[evalhf](erf,a):

makes it take only 17 sec on a single-CPU AMD Athlon64 3200+ under 64bit Linux with Maple 11.02.

Using,

map[evalhf,inplace](erf,a):

brought it to 10 sec.
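For reference, here is a sketch of the complete timing loop with the in-place replacement applied (same RandomMatrix options as the original):

```maple
with(LinearAlgebra):
TotalTime := 0:
for i from 1 to 100 do
    a := RandomMatrix(1500, density=0.5, generator=0.0..1.0,
                      outputoptions=[datatype=float[8]]):
    t := time():
    map[evalhf, inplace](erf, a):  # overwrites a; no new Matrix is allocated
    TotalTime := TotalTime + time() - t:
end do:
print(TotalTime);
```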

Dave Linder Mathematical Software, Maplesoft

For problem 13, "Gamma function over a 1500x1500 matrix".

The original code,

with(LinearAlgebra):
TotalTime:=0:
for i from 1 by 1 to 100 do
a:=RandomMatrix(1500,generator=0.01..1.0,outputoptions=[datatype=float[8]]):
t:=time():
Map(GAMMA,a):
TotalTime:=TotalTime+time()-t:
end do:
print(TotalTime);

was reported as taking about 10000 sec on their "Intel Quad Core Q6600 processor with 2.4 GHz and 2 GB RAM running under Windows Vista Home".

Replacing

LinearAlgebra:-Map(GAMMA,a):

with

map[evalhf](GAMMA,a):

makes it take only 66 sec on a single-CPU AMD Athlon64 3200+ under 64bit Linux with Maple 11.02.

Using,

map[evalhf,inplace](GAMMA,a):

brought it to 59 sec.

Dave Linder Mathematical Software, Maplesoft

Take, for example, this Matrix U.
restart:
U := Matrix([[a,0,0],[0,0,0],[0,0,c]]);
with(LinearAlgebra):
Evals := Eigenvalues(U);
U1:= U - Evals[1]*IdentityMatrix(3):
Now apply that "trick" of dropping the bottom row of U1. If I've got the wrong idea from your post then please let me know.
V1:= LinearSolve(U1[1..2,1..3],Vector(2));
U.V1 - Evals[1].V1; # not all zero, so not right
That V1 doesn't look like the right parametrization of the eigenspace associated with the first eigenvalue. (Unless _t[3] is set to zero, but there is no mention of such clever assignments, in the trick as described. And how to deduce that, anyway?)
But this looks fine.
V1_1:= LinearSolve(U1,Vector(3));
U.V1_1 - Evals[1].V1_1;
The trick of removing the last row of the Characteristic Matrix (evaluated at an eigenvalue) looks dubious to me.
As for whether using LinearSolve or NullSpace is better, that depends on what form one wants the eigenvectors in. If you want a basis of the eigenspace (like Eigenvectors returns in columns) as separate Vectors then use NullSpace. If you want a parametrized sum, in the case of an eigenspace with more than one Vector in its basis, then use LinearSolve.
The above is all general, with nothing about those side relations.
If you want to avoid mistakenly using as a pivot some entry which is (actually, or under some side relations) zero, then setting Testzero or Normalizer should work with each of these.
LinearSolve(U1,Vector(3),method=LU);
LinearSolve(U1,Vector(3));
NullSpace(U);
The fact that using Normalizer (to include simplification with side relations) together with Eigenvectors doesn't always work so well is something that could be investigated.
Dave Linder
Mathematical Software, Maplesoft

In the FPS system of units, the default unit for length is the foot (ft). That's what the F in FPS stands for.
You can create and use a new system, where inches is used instead of feet.
Here's an example, where inches is used for 1-dimensional length.
restart:
with(Units[Natural]):
Units[AddSystem](FPS_with_inches, Units[GetSystem](FPS), inch);
Units[UseSystem](FPS_with_inches):
85000*pounds/inch^2;
36*lbs/ft;
Dave Linder
Mathematical Software, Maplesoft

Thanks, that's useful, to have something so concrete to consider.
What sizes (N, rho, d, trace_list) are most important? Do you expect to call the routines many, many times at small sizes, or fewer times at larger sizes?
Does your profiling show relatively how much overhead is within Feynman_permutation_matrix() and Permutation_matrix(), for Partial_trace()?
In Parametrize_SU_Euler(), the final loop looks heavy in terms of how many new Matrices get created (eventual garbage). One possibility might be to re-use Matrix temp each time through it (using ArrayTools:-Copy to get lambda[used_lambdas[k]] into it, and then scale it in-place). Hmm. Does that iterated scaling & matrix exponential simplify mathematically, I wonder? There are, what, N^2-1 matrix exponential calls?
It might be worthwhile to look at the looping which creates used_lambdas. Interesting.
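The re-use idea could look something like the following sketch. Only temp, lambda, used_lambdas, and N are names from the original routine; c[k] stands in for whatever scalar the original loop applied before the matrix exponential, and the datatype should match whatever the original lambdas use:

```maple
# Allocated once, outside the loop, then re-used on each pass:
temp := Matrix(N, N, datatype=complex[8]):
for k to nops(used_lambdas) do
    # refill the workspace instead of creating a new Matrix
    ArrayTools:-Copy(lambda[used_lambdas[k]], temp);
    # scale in place, again avoiding a fresh allocation
    LinearAlgebra:-MatrixScalarMultiply(temp, c[k], inplace);
    # ... matrix exponential of temp, etc., as in the original loop
end do:
```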
If there is just too much overhead (function calls, garbage collection, etc), how low-level would you be willing to go? For example, would you consider examining and re-using the matrix-exponential Library internal routine, but altered to re-use workspace Matrices? How about direct BLAS calls from within Maple, instead of LinearAlgebra routines? Of course, those would be last resorts, if no other big improvements could be made.
Dave Linder
Mathematical Software, Maplesoft

Jacques' comments here are not off the mark. Working with Matrix blocks might be said to straddle an area between computational and abstract linear algebra (ALA). And ALA is a big area, with a number of challenging design issues, not the least of which would be to have it usefully mesh with the current computational functionality of LinearAlgebra.
I find it interesting that theoretical physicists have been showing the most interest in this area. We do read MaplePrimes content. And it's very helpful, adding insight for us on what users want to do.
Dave Linder
Mathematical Software, Maplesoft

Thanks for that bit of perspective, Jacques.
Dave Linder
Mathematical Software, Maplesoft

I look forward to seeing any of your results in this area.
An issue for me is that, right now, Sample(X,N) returns a Vector with either a hardware float or a software float datatype, according to UseHardwareFloats and Digits. The same is true for Sample(X), which I believe returns a procedure that produces a Vector whose datatype is determined at invocation time. That is, it isn't fixed at the time that the procedure output of Sample(X) is created.
But Statistics:-ExternalSupport:-DefineExternal("MapleNormalRandomSample") produces a procedure with a specific characteristic. It produces a procedure which returns Vectors of only a single datatype. That datatype is determined when the procedure is created and not when it is called.
For example,

> kernelopts(opaquemodules=false):
> Statistics:-ExternalSupport:-Initialize():
> p_hw := Statistics:-ExternalSupport:-DefineExternal("MapleNormalRandomSample");
p := proc()
option call_external, define_external("STATS_MapleNormalRandomSample_HW",
MAPLE, LIB = "libstatshw.so");
call_external(0, 182945245392, true, args)
end proc
> Digits:=20:
> p_sw := Statistics:-ExternalSupport:-DefineExternal("MapleNormalRandomSample");
p := proc()
option call_external, define_external("STATS_MapleNormalRandomSample_MP",
MAPLE, LIB = "libstatsmp.so");
call_external(0, 182955154176, 0, args)
end proc

So routines p_hw and p_sw behave differently, each static w.r.t. the datatype, regardless of environmental settings like Digits and UseHardwareFloats. Statistics:-Sample(X,N), on the other hand, changes behaviour according to those settings.
Emitting a lean procedure, with as little argument checking and as few optional parameters as possible, but which also respects UseHardwareFloats and Digits on the fly, well, that's very close to what Sample(X) now returns.
ps. Actually, it's a tiny little bit weird, that the software Vectors have datatype=anything instead of datatype=sfloat, but perhaps that's an attempt to be more flexible. It shouldn't affect these discussions.
Dave Linder
Mathematical Software, Maplesoft

Hi John,
We've been working on a KroneckerProduct routine. The two variations you just gave here, entered with exactly the same syntax, both produce the same result in a development version of Maple.
Dave Linder
Mathematical Software, Maplesoft

I'm trying to understand how close (or far) the following distinction is from these ideas:
X := Statistics:-RandomVariable(Normal(0,1));
# numeric vector return, with overhead
V := Statistics:-Sample(X,5);
# or...
G := Statistics:-Sample(X);
# numeric vector return, with less overhead
V := G(5);
There is still some overhead in the returned procedure G. Consider this variant,
kernelopts(opaquemodules=false):
Statistics:-ExternalSupport:-Initialize();
p := Statistics:-ExternalSupport:-DefineExternal("MapleNormalRandomSample");
p(5,0,1);
But it seems that Statistics:-ExternalSupport:-DefineExternal will behave according to UseHardwareFloats and Digits. So an even leaner "trampoline" sort of reassignment might be tricky.
Dave Linder
Mathematical Software, Maplesoft