By using your code (including the above msm and ExtMSM tweaks) I ran the whole Parametrize_SU_Euler procedure again, now with the optimized code. Unfortunately, even "only" 1000 8x8 matrices take 44 sec (whereas Matlab needs 11.76 sec for 10,000(!) 8x8 matrices). The profiling below shows that the main chunk is, of course, taken by the newly optimized matrix_exp() function, which was called 63,000 times.
So I guess the (relatively) poor performance is due to the many procedure calls, and there is nothing further I can do about that, right?
```
Parametrize_SU_Euler := proc(N::posint, params)
local lambda, used_lambdas, temp, i, j, m, l, k, A, alpha, X, param_list, param_ranges, U, a, temp1, temp2, s, RESMAT, workuaf, onevec, PROD, mmm, msm, old_hardware_floats;
|Calls Seconds Words|
PROC | 1000 44.001 297621258|
1 | 1000 0.000 1836| if params::list then
2 | 0 0.000 0| param_ranges := evalf(Feynman_parameters("SU","Euler angles",N));
3 | 0 0.000 0| if not nops(param_ranges) = nops(params) then
4 | 0 0.000 0| error `: incorrect number of parameters! Expected a list of`, nops(param_ranges), ` parameters.`
end if;
5 | 0 0.000 0| alpha := params
elif params = "random" then
6 | 1000 0.921 681478| X := Statistics[RandomVariable](Uniform(0,1000));
7 | 1000 0.609 5121203| alpha := convert(Statistics[Sample](X,N^2-1),list)
else
8 | 0 0.000 0| error `: Either a list of numerical parameters is expected as second argument or the keyword "random".`
end if;
9 | 1000 0.078 303927| lambda := Feynman_hermitian_basis(N,"float");
10 | 1000 0.000 31963| j := m -> (N+m-1)*(N-m);
11 | 1000 0.031 749337| used_lambdas := Vector(N^2-1,datatype = integer);
12 | 1000 0.000 0| i := 1;
13 | 1000 0.000 0| for m from N by -1 to 2 do
14 | 7000 0.032 0| for k from 2 to m do
15 |28000 0.080 232101| used_lambdas[i] := 3;
16 |28000 0.062 242550| used_lambdas[i+1] := (k-1)^2+1;
17 |28000 0.000 0| i := i+2
end do
end do;
18 | 1000 0.000 0| for m from 2 to N do
19 | 7000 0.000 113343| used_lambdas[i] := m^2-1;
20 | 7000 0.000 0| i := i+1
end do;
21 | 1000 0.047 1020696| a := Matrix(N,N,datatype = ('complex')('float'));
22 | 1000 0.079 821553| temp1 := Matrix(N,N,datatype = ('complex')('float'));
23 | 1000 0.031 820575| temp2 := Matrix(N,N,datatype = ('complex')('float'));
24 | 1000 0.000 822536| s := Matrix(N,N,datatype = ('complex')('float'));
25 | 1000 0.062 818949| RESMAT := Matrix(N,N,datatype = ('complex')('float'));
26 | 1000 0.062 402396| workuaf := Vector[row](N,('datatype') = ('float'));
27 | 1000 0.079 513260| onevec := Vector(N,('fill') = 1.0,('datatype') = ('complex')('float'));
28 | 1000 0.047 820535| PROD := Matrix(N,N,datatype = ('complex')('float'));
29 | 1000 0.406 3058690| U := LinearAlgebra:-IdentityMatrix(N,N,compact = false,outputoptions = [datatype = complex[8]]);
30 | 1000 0.030 862553| temp := Matrix(N,N,datatype = complex[8]);
31 | 1000 0.015 2097| kernelopts(opaquemodules = false);
32 | 1000 0.032 4258| mmm := LinearAlgebra:-LA_Main:-LA_External:-MatrixMatrixMultiply;
33 | 1000 0.000 1300| msm := LinearAlgebra:-LA_Main:-LA_External:-MatrixScalarMultiply;
34 | 1000 0.000 0| old_hardware_floats := UseHardwareFloats;
35 | 1000 0.000 14923| UseHardwareFloats := true;
36 | 1000 0.157 54| for k to op(1,used_lambdas) do
37 |63000 0.278 981181| ArrayTools:-Copy(N*N,lambda[used_lambdas[k]],0,1,temp,0,1);
38 |63000 2.325 8443036| msm(temp,I*alpha[k]);
39 |63000 36.270 248603941| matrix_exp(temp,a,temp1,temp2,s,workuaf,onevec,RESMAT);
40 |63000 2.096 21058532| mmm(U,RESMAT,PROD,('inplace') = false);
41 |63000 0.172 1072455| ArrayTools:-Copy(N*N,PROD,0,1,U,0,1)
end do;
42 | 1000 0.000 0| UseHardwareFloats := old_hardware_floats;
43 | 1000 0.000 0| return U
end proc
```

Wow, this was (once more) an impressive lesson. A big thank you!
A short test of your code shows the enormous speed-up. I will still have to do my homework and go through the details but the impact of the main idea is clear.
One question though: for your test code above (with oldmethod, newmethod := false, true), with n=64 and N=1000 my machine needs 8.891 sec. However, N=5000 already needs 212 sec. It's not obvious to me why it scales so strangely!?
I hope I will be able to transfer this approach to other bottlenecks in my program. I really appreciate the time and effort you put into solving my problem! As with many of your posts here in the forum, this has been very instructive for me! Thank you!

Thanks for these ideas. I'll have a look at the corresponding help pages to better understand what you mean.
Mostly, I'm working under WinXP (32bit), which makes external calls a bit more complicated. Meanwhile, I have tried to compile a BLAS.DLL library file that might be used under Windows (although I have never done that before).
But I also have an Ubuntu installation available. In principle, if external calls are really necessary, I would like to support both Windows and Linux platforms if possible. Shouldn't that work if one delivers both a DLL library for Windows and an SO library for Linux?
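My rough idea would be to pick the library by platform at run time and bind it with define_external, along these lines (the wrapper name, argument list, and library names here are just placeholders, not a real API):

```
# Hypothetical sketch: choose the shared library by platform, then
# bind one external entry point with define_external. All names below
# are placeholders for whatever the compiled wrapper actually exports.
mylib := `if`(kernelopts(platform) = "windows", "myblas.dll", "libmyblas.so"):
my_scale := define_external('zscal_wrapper',
    'n'::integer[4],
    'alpha'::complex[8],
    'X'::ARRAY(1 .. n, complex[8]),
    'LIB' = mylib):
```

Would that be the recommended way to keep a single source working on both platforms?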
UPDATE: By setting infolevel[LinearAlgebra]:=2, I checked again and found that NAG hardware-precision routines are already used. Moreover, this showed that the MatrixFunction command does some (probably avoidable?) copying:
"MatrixFunction: copying, to recover complex(float) datatype"
Is this the overhead you were referring to?
BTW: How can one find all those undocumented low-level procedures? I just learned about
kernelopts(opaquemodules=false);
showstat( ... );
But typically I don't know the names of these procedures. So where can I find them?
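So far, the only way I found to poke around is to list a module's exports and then showstat whatever looks interesting, e.g.:

```
# With opaquemodules disabled, internal module members become visible:
kernelopts(opaquemodules = false):
exports(LinearAlgebra:-LA_Main);                    # list the exported names
showstat(LinearAlgebra:-LA_Main:-MatrixFunction);   # then inspect one of them
```

Is there a more systematic way than guessing module paths like this?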

Hi acer,
yes, the lambda[i] matrices are created with datatype=complex[8].
Basically equivalent Matlab 7.5 code (for the creation of the whole SU matrix, not only the matrix exponentials!) takes 11.76 sec for 10,000 (random) 8x8 SU matrices. 10,000 16x16 matrices already take 92.09 sec.
1,000 (not 10,000) 32x32 random matrices take 133.60 sec. All timings done under WinXP (32bit) on a CoreDuo laptop (2x2.33GHz). In Matlab, multithreading was enabled (doesn't make a difference, though).
It seems that 64x64 will stay unrealistic as it takes 44.68 sec for only 10 64x64 matrices...
Of course, I'm curious what you suggest to improve the Maple code as the above Maple implementation is orders of magnitude slower:
13.047 sec for only 100(!) 8x8 SU matrices and
58.922 sec for 100 16x16.
PS: Generally, the lambda[i] matrices do not commute.
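So the individual exponentials really have to be multiplied in order; a tiny check with two (hypothetical) noncommuting 2x2 matrices illustrates that exp(A+B) and exp(A).exp(B) differ:

```
with(LinearAlgebra):
# Two noncommuting matrices (placeholder example, not my lambda[i]):
A := Matrix([[0, 1], [0, 0]], datatype = complex[8]):
B := Matrix([[0, 0], [1, 0]], datatype = complex[8]):
# exp(A+B) <> exp(A).exp(B) because A and B do not commute:
Norm(MatrixExponential(A + B) - MatrixExponential(A) . MatrixExponential(B));
```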

Thank you for your interest in my problem. Let me answer your questions:
- The typical sizes of the matrices will be in the range 8x8 to 64x64.
- When using some higher-level commands, this results in very many calls (possibly many thousands, e.g. in optimisation problems) of the lower-level procedures (see the example profiling below).
- The Feynman_permutation_matrix() command (which is the same as Permutation_matrix(); I just forgot to rename it when pasting here) seems to be less critical. It is often called with the same arguments, and therefore I use the 'remember' option. As for the Partial_trace() command: below you find the profiling from a more or less real-life application. One can see it has been called rather often...

```
Partial_trace := proc(rho::('Matrix')(square), d::list(posint), trace_list::list(posint))
local rho_dim, i, j, k, d_in, d_out, in_list, out_list, U, reordered_rho, new_rho;
|Calls Seconds Words|
PROC |12000 12.460 107208277|
1 |12000 0.016 54220| rho_dim := mul(d[x],x = 1 .. nops(d));
2 |12000 0.092 145846| if not LinearAlgebra[RowDimension](rho) = rho_dim then
3 | 0 0.000 0| error `: The given matrix does not match the specified subspace dimensions.`
end if;
4 |12000 0.016 28394| if nops(d) < max(op(trace_list)) then
5 | 0 0.000 0| error `: One (or more) of the specified target subspaces is invalid.`
end if;
6 |12000 0.124 366797| out_list, in_list := selectremove(has,[seq(1 .. nops(d))],trace_list);
7 |12000 0.000 217701| d_in := mul(d[i],i = in_list);
8 |12000 0.000 169443| d_out := mul(d[i],i = out_list);
9 |12000 0.031 225276| if not [op(in_list), op(out_list)] = [seq(i,i = 1 .. nops(d))] then
10 | 8000 0.575 2826135| if rho[1,1]::complex[8] then
11 | 8000 0.811 12085671| U := Matrix(rho_dim,rho_dim,Feynman_permutation_matrix([op(in_list), op(out_list)],d),datatype = complex[8]);
12 | 8000 0.420 4633864| new_rho := Matrix(d_in,d_in,datatype = complex[8])
else
13 | 0 0.000 0| U := Feynman_permutation_matrix([op(in_list), op(out_list)],d);
14 | 0 0.000 0| new_rho := Matrix(d_in,d_in)
end if;
15 | 8000 4.582 55225765| reordered_rho := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(U,rho,inplace = false,outputoptions = []),LinearAlgebra:-LA_Main:-HermitianTranspose(U,inplace = false,outputoptions = []),inplace = false,outputoptions = [])
else
16 | 4000 0.000 0| reordered_rho := rho;
17 | 4000 1.411 3122064| if rho::('Matrix')(complex[8]) then
18 | 4000 0.171 2840927| new_rho := Matrix(d_in,d_in,datatype = complex[8])
else
19 | 0 0.000 0| new_rho := Matrix(d_in,d_in)
end if
end if;
20 |12000 0.031 0| for i to d_in do
21 |36000 0.142 0| for j to d_in do
22 |120000 4.038 25266174| new_rho[i,j] := add(reordered_rho[(i-1)*d_out+k,(j-1)*d_out+k],k = 1 .. d_out)
end do
end do;
23 |12000 0.000 0| return new_rho
end proc
```
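As an aside, the else-branch above (statements 16 to 22) already shows that no permutation matrix U is needed when the subsystems are in order. Isolated, that fast path looks like this (the helper name is hypothetical; it covers only the case where the traced-out subsystem comes last):

```
# Hypothetical sketch: partial trace over the trailing subsystem,
# with rho of dimension (d_in*d_out) x (d_in*d_out). No permutation
# matrix is formed; only strided diagonal sums are taken.
Partial_trace_last := proc(rho::Matrix, d_in::posint, d_out::posint)
    local new_rho, i, j, k;
    new_rho := Matrix(d_in, d_in, datatype = complex[8]);
    for i to d_in do
        for j to d_in do
            new_rho[i, j] := add(rho[(i-1)*d_out + k, (j-1)*d_out + k],
                                 k = 1 .. d_out);
        end do;
    end do;
    return new_rho;
end proc:
```

In the profile, the two MatrixMatrixMultiply calls on statement 15 dominate, so avoiding the permutation wherever the subsystem order permits should help.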

- Specifically in Parametrize_SU_Euler() (but also in other procedures) I need to calculate matrix functions. This is rather expensive, but not so easy to do via simple external calls, right!? Here's a profiling from a very similar application as the one above. Here, only 200 calls occurred, but when carrying out an optimisation problem over the set of unitary matrices (which are parametrized by Parametrize_SU_Euler), easily many thousands of calls occur (which is of course much too expensive). ...sorry for the ugly line breaks...
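One idea I have been wondering about: since each lambda[k] is Hermitian, its exponential could be built from a single eigendecomposition, which could even be cached per lambda[k] and reused for every new parameter value. A sketch (assuming the eigenvector matrix comes back numerically unitary, which it should for Hermitian hardware-float input):

```
# Hypothetical sketch: for Hermitian H,
#   exp(I*a*H) = V . Diag(exp(I*a*ev[j])) . V^H,
# where ev, V come from one eigendecomposition of H. For a fixed H,
# only the diagonal rescaling depends on the parameter a.
with(LinearAlgebra):
exp_hermitian := proc(H::Matrix, a::numeric)
    local ev, V;
    ev, V := Eigenvectors(H);   # real eigenvalues, (numerically) unitary V
    return V . DiagonalMatrix(map(x -> exp(I*a*x), ev))
             . HermitianTranspose(V);
end proc:
```

Would caching ev and V per basis matrix be worthwhile here, or does MatrixFunction effectively do the same thing internally?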

```
Parametrize_SU_Euler := proc(N::posint, params)
local lambda, used_lambdas, temp, i, j, m, l, k, A, alpha, X, param_list, param_ranges, U;
|Calls Seconds Words|
PROC | 200 40.805 361206053|
1 | 200 0.000 255| if params::list then
2 | 200 0.015 266569| param_ranges := evalf(Feynman_parameters("SU","Euler angles",N));
3 | 200 0.000 1106| if not nops(param_ranges) = nops(params) then
4 | 0 0.000 0| error `: incorrect number of parameters! Expected a list of`, nops(param_ranges), ` parameters.`
end if;
5 | 200 0.000 0| alpha := params
elif params = "random" then
6 | 0 0.000 0| X := Statistics[RandomVariable](Uniform(0,1000));
7 | 0 0.000 0| alpha := convert(Statistics[Sample](X,N^2-1),list)
else
8 | 0 0.000 0| error `: Either a list of numerical parameters is expected as second argument or the keyword "random".`
end if;
9 | 200 0.109 597316| lambda := Feynman_hermitian_basis(N,"float");
10 | 200 0.000 2602| j := m -> (N+m-1)*(N-m);
11 | 200 0.016 138841| used_lambdas := Vector(N^2-1,datatype = integer);
12 | 200 0.000 0| i := 1;
13 | 200 0.000 0| for m from N by -1 to 2 do
14 | 1800 0.016 0| for k from 2 to m do
15 | 9000 0.064 62752| used_lambdas[i] := 3;
16 | 9000 0.000 64130| used_lambdas[i+1] := (k-1)^2+1;
17 | 9000 0.000 0| i := i+2
end do
end do;
18 | 200 0.000 0| for m from 2 to N do
19 | 1800 0.000 20392| used_lambdas[i] := m^2-1;
20 | 1800 0.000 0| i := i+1
end do;
21 | 200 0.000 585435| temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(lambda[used_lambdas[1]],I*alpha[1],inplace = false,outputoptions = [datatype = complex[8]]);
22 | 200 0.407 2661130| U := LinearAlgebra:-LA_Main:-MatrixFunction(temp,exp(dummy),dummy,outputoptions = [datatype = complex[8]]);
23 | 200 0.031 105| for k from 2 to op(1,used_lambdas) do
24 |19600 5.127 51781164| temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(lambda[used_lambdas[k]],I*alpha[k],inplace = false,outputoptions = []);
25 |19600 35.020 305024256| U := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(U,LinearAlgebra:-LA_Main:-MatrixFunction(temp,exp(dummy),dummy,outputoptions = []),inplace = false,outputoptions = [])
end do
end proc
```

- Some rough empirical tests seemed to show that garbage collection works best at gcfreq=200000000.
- If you think it pays off to call BLAS, LAPACK, or whatever routines, then I'm willing to learn how to do it. But currently I mainly use Windows, where I would first have to create the proper DLLs. How complicated would it be to offer such an external-calling solution for Windows and Linux (if possible, I would like to support both platforms)? But, as you pointed out already, LinearAlgebra already uses external routines, so I wonder if there is really so much room for improvement!?
EDIT: I should mention that the parametrize_SU command above is actually already the result of a forum topic here (see http://www.mapleprimes.com/forum/efficient-parametrization-of-unitary-matrices ).
Once again: Thank you for your support!
