<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
  <channel>
    <title>MaplePrimes - answers and comments on Question, Advice on profiling?</title>
    <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling</link>
    <language>en-us</language>
    <copyright>2026 Maplesoft, A Division of Waterloo Maple Inc.</copyright>
    <generator>Maplesoft Document System</generator>
    <lastBuildDate>Wed, 17 Jun 2026 00:47:56 GMT</lastBuildDate>
    <pubDate>Wed, 17 Jun 2026 00:47:56 GMT</pubDate>
    <itunes:subtitle />
    <itunes:summary />
    <description>The latest answers and comments added to the Question, Advice on profiling?</description>
    <image>
      <url>http://www.mapleprimes.com/images/mapleprimeswhite.jpg</url>
      <title>MaplePrimes - answers and comments on Question, Advice on profiling?</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling</link>
    </image>
    <item>
      <title>LinearAlgebra efficiency</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer74128</link>
      <itunes:summary>LinearAlgebra does make extensive use of compiled external code, the NAG routines.  In fact, this might be connected to the troubles you're having in profiling.  See the help page "Efficient Numeric Linear Algebra Computation" (?EffNumLA) for pointers on how to make your code more efficient.  In particular, it may make a significant difference if you use datatype=float[8].
</itunes:summary>
      <description>LinearAlgebra does make extensive use of compiled external code, the NAG routines.  In fact, this might be connected to the troubles you're having in profiling.  See the help page "Efficient Numeric Linear Algebra Computation" (?EffNumLA) for pointers on how to make your code more efficient.  In particular, it may make a significant difference if you use datatype=float[8].
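
A minimal sketch of how the difference could be measured, assuming a recent Maple session. The size n = 200 and the names A8, B8, A, B are illustrative choices. The same random data is stored once with datatype=float[8] and once with the default datatype anything, and each Matrix product is timed with time(). Timings depend on the Maple version and the machine, so no figures are claimed here.

&lt;code&gt;
# Same random data stored two ways: hardware float[8] and the default 'anything'.
n := 200:
A8 := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                  outputoptions = [datatype = float[8]]):
B8 := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                  outputoptions = [datatype = float[8]]):
A := Matrix(A8, datatype = anything):
B := Matrix(B8, datatype = anything):

# CPU time for each product; the second case may include a conversion
# to hardware floats done internally by LinearAlgebra.
st := time():  C8 := A8 . B8:  t_float8 := time() - st;
st := time():  C  := A . B:    t_anything := time() - st;

# The float[8] product keeps hardware storage.
MatrixOptions(C8, 'datatype');
&lt;/code&gt;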
</description>
      <guid>74128</guid>
      <pubDate>Thu, 10 Jan 2008 00:09:05 Z</pubDate>
      <itunes:author>Robert Israel</itunes:author>
      <author>Robert Israel</author>
    </item>
    <item>
      <title>code sample</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer74126</link>
      <itunes:summary>
If you can post a portion of the code, that might give the best chances for useful advice.

Dave Linder
Mathematical Software, Maplesoft</itunes:summary>
      <description>
If you can post a portion of the code, that might give the best chances for useful advice.

Dave Linder
Mathematical Software, Maplesoft</description>
      <guid>74126</guid>
      <pubDate>Thu, 10 Jan 2008 00:39:36 Z</pubDate>
      <itunes:author>Dave L</itunes:author>
      <author>Dave L</author>
    </item>
    <item>
      <title>Digits=14</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer74124</link>
      <itunes:summary>Having &lt;code&gt;Digits=14&lt;/code&gt; does not guarantee that the computations are done with hardware arithmetic.  The &lt;code&gt;Digits&lt;/code&gt; value indicates how many digits of precision you would like in the result of a primitive computation.  The computation itself might be done with, say, &lt;code&gt;Digits+2&lt;/code&gt; precision, in order to achieve &lt;code&gt;Digits&lt;/code&gt; precision in the result.

Some Maple routines increase &lt;code&gt;Digits&lt;/code&gt; internally.  I do not know whether that is the case with the LinearAlgebra routines you are using.  To force hardware floats, I think that you should set &lt;code&gt;UseHardwareFloats := true&lt;/code&gt; and ensure that your Matrices/Vectors have their &lt;code&gt;datatype&lt;/code&gt; set appropriately (for example, &lt;code&gt;datatype=float[8]&lt;/code&gt;).  See &lt;code&gt;?EffNumLA&lt;/code&gt; for details.

I do not have much experience with this.  Perhaps others can suggest more.</itunes:summary>
      <description>Having &lt;code&gt;Digits=14&lt;/code&gt; does not guarantee that the computations are done with hardware arithmetic.  The &lt;code&gt;Digits&lt;/code&gt; value indicates how many digits of precision you would like in the result of a primitive computation.  The computation itself might be done with, say, &lt;code&gt;Digits+2&lt;/code&gt; precision, in order to achieve &lt;code&gt;Digits&lt;/code&gt; precision in the result.

Some Maple routines increase &lt;code&gt;Digits&lt;/code&gt; internally.  I do not know whether that is the case with the LinearAlgebra routines you are using.  To force hardware floats, I think that you should set &lt;code&gt;UseHardwareFloats := true&lt;/code&gt; and ensure that your Matrices/Vectors have their &lt;code&gt;datatype&lt;/code&gt; set appropriately (for example, &lt;code&gt;datatype=float[8]&lt;/code&gt;).  See &lt;code&gt;?EffNumLA&lt;/code&gt; for details.
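
A minimal sketch of that suggestion, assuming a fresh session. UseHardwareFloats is an environment variable whose default value is deduced. The 2x2 system M, b is an arbitrary example and is not taken from the question.

&lt;code&gt;
# Ask Maple to use hardware floats rather than deciding from Digits.
UseHardwareFloats := true:

# Store the data with an explicit hardware datatype.
M := Matrix([[2.0, 1.0], [1.0, 3.0]], datatype = float[8]):
b := Vector([1.0, 2.0], datatype = float[8]):

x := LinearAlgebra:-LinearSolve(M, b):   # exact solution is [0.2, 0.6]

# Confirm that the result also uses hardware storage.
rtable_options(x, 'datatype');
&lt;/code&gt;

Because UseHardwareFloats is an environment variable, an assignment to it inside a procedure lasts only until that procedure returns. The top-level assignment above affects the rest of the session.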

I do not have much experience with this.  Perhaps others can suggest more.</description>
      <guid>74124</guid>
      <pubDate>Thu, 10 Jan 2008 01:24:32 Z</pubDate>
      <itunes:author>DJKeenan</itunes:author>
      <author>DJKeenan</author>
    </item>
    <item>
      <title>Two example procedures...</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer74121</link>
      <itunes:summary>Thank you for your comments. I know that my question was/is rather general and therefore hard to answer. In principle I am aware of the following rules of thumb:

- use appropriate datatypes in the matrices, e.g. float[8] or complex[8]

- use programming layer commands of the LinearAlgebra package, i.e. LA_Main:-...

- use cache/remember tables when suitable

- use inlining (typically not possible)


Below I'll show some sample procedures to give some idea...

&lt;code&gt;
#################################################################################
#################################################################################

Parametrize_SU_Euler := proc(N::posint, params)

#
# returns a NxN SU(N) unitary matrix as described by the given list of
# parameters (or the keyword "random").
# The procedure follows [Tilma, Sudarshan, J. Phys. A 35 (2002) 10467]
# (see Eq. (19))
#

options hfloat;

local lambda, used_lambdas, temp, i, j, m, l, k, A, alpha, X, param_list, param_ranges, U;

if params::list then
   #
   # if a list of parameters [alpha[1], ..., alpha[N^2-1]]
   # is given, only its length is checked here (against the number
   # of parameter ranges); the values themselves are not range-checked.
   #
   param_ranges := evalf(Feynman_parameters("SU", "Euler angles", N));
   if not nops(param_ranges) = nops(params) then
      error(`\: incorrect number of parameters! Expected a list of`, nops(param_ranges),` parameters.`);
   end if;

   alpha := params;

elif params = "random" then
   #
   # if no explicit list of numerical parameters is given but
   # the key "random" then the necessary parameters are generated
   # randomly
   #
   X := Statistics[RandomVariable](Uniform(0,1000));
   alpha := convert(Statistics[Sample](X, N^2-1), list);


else
   error(`\: Either a list of numerical parameters is expected as second argument or the keyword "random".`)
end if;

#
# first, a list of the generalized Gell-Mann matrices is needed as
# a hermitian basis of the space of NxN matrices.
#
lambda := Hermitian_basis(N, "float");

#
# define auxiliary function j(m) as in the reference
#
#j := m -&gt; piecewise(m=N, 0, sum(2*(m+l), l=0..N-m-1));
#
# actually the same but easier is the following
j := m -&gt; (N+m-1)*(N-m);


#
# create a list of the lambda matrices that are actually used
# (in the order of later use, including multiple occurrences)
#
used_lambdas := Vector(N^2-1, datatype=integer):
i := 1;
for m from N to 2 by -1 do
   for k from 2 to m do
      used_lambdas[i]   := 3;
      used_lambdas[i+1] := (k-1)^2 + 1;
      i := i+2;
   end do:
end do:
for m from 2 to N do
   used_lambdas[i] := m^2-1;
   i := i+1;
end do:

temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
              lambda[used_lambdas[1]], I*alpha[1],
              inplace=false, outputoptions=[datatype=complex[8]]);
U := LinearAlgebra:-LA_Main:-MatrixFunction(
              temp, exp(dummy), dummy, outputoptions=[datatype=complex[8]]);

for k from 2 to op(1,used_lambdas) do
   temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
                 lambda[used_lambdas[k]], I*alpha[k],
                 inplace=false, outputoptions=[]);
   U := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
                 U, LinearAlgebra:-LA_Main:-MatrixFunction(
                             temp, exp(dummy), dummy, outputoptions=[]),
                             inplace=false, outputoptions=[]);
end do:

end proc:

#################################################################################
#################################################################################
&lt;/code&gt;

&lt;code&gt;
#################################################################################
#################################################################################


Partial_trace := proc(rho::'Matrix'(square), d::list(posint), trace_list::list(posint))

#
# returns the partial trace of a square (density) matrix with respect
# to one or more subsystems. 'd' is the list of the dimensions of each
# subspace, 'trace_list' is the list of the subsystem indices which are
# traced out.
# A matrix is returned
#

local rho_dim, i, j, k, d_in, d_out, in_list, out_list, U, reordered_rho,
      new_rho;

#
# check if dimension of the given matrix is compatible with the specified subspace
# dimensions
#
rho_dim := mul(i, i = d);  # product of all subspace dimensions
if not LinearAlgebra[RowDimension](rho) = rho_dim then
   error(`\: The given matrix does not match the specified subspace dimensions.`);
end if;

if max(op(trace_list)) &gt; nops(d) then
   error(`\: One (or more) of the specified target subspaces is invalid.`);
end if;


#
# reorder the subsystems so that the surviving ones (in ascending order)
# come first and the ones to be traced out come last
#
out_list, in_list := selectremove(member, [seq(i, i = 1 .. nops(d))], trace_list);
d_in  := mul(d[i], i=in_list);  # dimension of the remaining subspace
d_out := mul(d[i], i=out_list); # dimension of the subspace to be traced out

if not [op(in_list), op(out_list)] = [seq(i,i=1..nops(d))] then
   # test the storage datatype of rho directly (constant time) instead of its entries
   if rtable_options(rho, 'datatype') = complex[8] then
      U := Matrix(rho_dim, rho_dim, Permutation_matrix([op(in_list), op(out_list)], d), datatype=complex[8]);
      new_rho := Matrix(d_in, d_in, datatype=complex[8]);
   else
      U := Feynman_permutation_matrix([op(in_list), op(out_list)], d);
      new_rho := Matrix(d_in, d_in);
   end if;
   # conjugate by the permutation: reordered_rho = U . rho . U^H
   reordered_rho := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
                       LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(U, rho, inplace=false, outputoptions=[]),
                       LinearAlgebra:-LA_Main:-HermitianTranspose(U, inplace=false, outputoptions=[]),
                       inplace=false, outputoptions=[]);
else
   reordered_rho := rho;
   if rtable_options(rho, 'datatype') = complex[8] then
      new_rho := Matrix(d_in, d_in, datatype=complex[8]);
   else
      new_rho := Matrix(d_in, d_in);
   end if;
end if;

for i from 1 to d_in do
   for j from 1 to d_in do
      new_rho[i,j] := add(reordered_rho[(i-1)*d_out+k, (j-1)*d_out+k], k=1..d_out);
   end do;
end do;


return(new_rho);

end proc:


#################################################################################
#################################################################################
&lt;/code&gt;


Problems become apparent, e.g., when I combine several such procedures, often involving eigenvalues of matrices, into one higher-level command. When such a command is used, e.g., with the Optimization package or the Global Optimization Toolbox, evalhf is typically not possible (as with most commands involving matrices). In such scenarios some procedures are called very often, and that is where the bottlenecks really kick in...</itunes:summary>
      <description>Thank you for your comments. I know that my question was/is rather general and therefore hard to answer. In principle I am aware of the following rules of thumb:

- use appropriate datatypes in the matrices, e.g. float[8] or complex[8]

- use programming layer commands of the LinearAlgebra package, i.e. LA_Main:-...

- use cache/remember tables when suitable

- use inlining (typically not possible)
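
Here is a minimal sketch of the cache/remember-table point. Basis_cached is a hypothetical stand-in for a routine such as Hermitian_basis(N, "float"), whose result depends only on N. Its payload (the N diagonal matrix units as float[8] Matrices) is placeholder data.

&lt;code&gt;
# Hypothetical memoized constructor: the result depends only on N,
# so it is computed once per N and then answered from the remember table.
Basis_cached := proc(N::posint)
   option remember;
   local k;
   # Placeholder payload: the N diagonal matrix units E[k,k].
   [seq(Matrix(N, N, {(k, k) = 1.0}, datatype = float[8]), k = 1 .. N)];
end proc:

B1 := Basis_cached(4):
B2 := Basis_cached(4):                    # served from the remember table
evalb(addressof(B1) = addressof(B2));    # the very same list object
&lt;/code&gt;

One caveat with this design is that the remembered Matrices are mutable. An inplace=true operation applied to one of them would silently change the cached value seen by every later call. Working on a copy (for example copy(B1[1])) avoids that.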


Below I'll show some sample procedures to give some idea...

&lt;code&gt;
#################################################################################
#################################################################################

Parametrize_SU_Euler := proc(N::posint, params)

#
# returns a NxN SU(N) unitary matrix as described by the given list of
# parameters (or the keyword "random").
# The procedure follows [Tilma, Sudarshan, J. Phys. A 35 (2002) 10467]
# (see Eq. (19))
#

options hfloat;

local lambda, used_lambdas, temp, i, j, m, l, k, A, alpha, X, param_list, param_ranges, U;

if params::list then
   #
   # if a list of parameters [alpha[1], ..., alpha[N^2-1]]
   # is given, only its length is checked here (against the number
   # of parameter ranges); the values themselves are not range-checked.
   #
   param_ranges := evalf(Feynman_parameters("SU", "Euler angles", N));
   if not nops(param_ranges) = nops(params) then
      error(`\: incorrect number of parameters! Expected a list of`, nops(param_ranges),` parameters.`);
   end if;

   alpha := params;

elif params = "random" then
   #
   # if no explicit list of numerical parameters is given but
   # the key "random" then the necessary parameters are generated
   # randomly
   #
   X := Statistics[RandomVariable](Uniform(0,1000));
   alpha := convert(Statistics[Sample](X, N^2-1), list);


else
   error(`\: Either a list of numerical parameters is expected as second argument or the keyword "random".`)
end if;

#
# first, a list of the generalized Gell-Mann matrices is needed as
# a hermitian basis of the space of NxN matrices.
#
lambda := Hermitian_basis(N, "float");

#
# define auxiliary function j(m) as in the reference
#
#j := m -&gt; piecewise(m=N, 0, sum(2*(m+l), l=0..N-m-1));
#
# actually the same but easier is the following
j := m -&gt; (N+m-1)*(N-m);


#
# create a list of the lambda matrices that are actually used
# (in the order of later use, including multiple occurrences)
#
used_lambdas := Vector(N^2-1, datatype=integer):
i := 1;
for m from N to 2 by -1 do
   for k from 2 to m do
      used_lambdas[i]   := 3;
      used_lambdas[i+1] := (k-1)^2 + 1;
      i := i+2;
   end do:
end do:
for m from 2 to N do
   used_lambdas[i] := m^2-1;
   i := i+1;
end do:

temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
              lambda[used_lambdas[1]], I*alpha[1],
              inplace=false, outputoptions=[datatype=complex[8]]);
U := LinearAlgebra:-LA_Main:-MatrixFunction(
              temp, exp(dummy), dummy, outputoptions=[datatype=complex[8]]);

for k from 2 to op(1,used_lambdas) do
   temp := LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
                 lambda[used_lambdas[k]], I*alpha[k],
                 inplace=false, outputoptions=[]);
   U := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
                 U, LinearAlgebra:-LA_Main:-MatrixFunction(
                             temp, exp(dummy), dummy, outputoptions=[]),
                             inplace=false, outputoptions=[]);
end do:

end proc:

#################################################################################
#################################################################################
&lt;/code&gt;
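
For reference, the closed form used for the auxiliary function j(m), in place of the commented-out piecewise/sum definition, follows from summing the arithmetic series. At m = N the sum is empty and both sides give 0, which matches the piecewise branch.

$$ j(m) = \sum_{l=0}^{N-m-1} 2(m+l) = 2m(N-m) + (N-m-1)(N-m) = (N-m)(N+m-1) $$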

&lt;code&gt;
#################################################################################
#################################################################################


Partial_trace := proc(rho::'Matrix'(square), d::list(posint), trace_list::list(posint))

#
# returns the partial trace of a square (density) matrix with respect
# to one or more subsystems. 'd' is the list of the dimensions of each
# subspace, 'trace_list' is the list of the subsystem indices which are
# traced out.
# A matrix is returned
#

local rho_dim, i, j, k, d_in, d_out, in_list, out_list, U, reordered_rho,
      new_rho;

#
# check if dimension of the given matrix is compatible with the specified subspace
# dimensions
#
rho_dim := mul(i, i = d);  # product of all subspace dimensions
if not LinearAlgebra[RowDimension](rho) = rho_dim then
   error(`\: The given matrix does not match the specified subspace dimensions.`);
end if;

if max(op(trace_list)) &gt; nops(d) then
   error(`\: One (or more) of the specified target subspaces is invalid.`);
end if;


#
# reorder the subsystems so that the surviving ones (in ascending order)
# come first and the ones to be traced out come last
#
out_list, in_list := selectremove(member, [seq(i, i = 1 .. nops(d))], trace_list);
d_in  := mul(d[i], i=in_list);  # dimension of the remaining subspace
d_out := mul(d[i], i=out_list); # dimension of the subspace to be traced out

if not [op(in_list), op(out_list)] = [seq(i,i=1..nops(d))] then
   # test the storage datatype of rho directly (constant time) instead of its entries
   if rtable_options(rho, 'datatype') = complex[8] then
      U := Matrix(rho_dim, rho_dim, Permutation_matrix([op(in_list), op(out_list)], d), datatype=complex[8]);
      new_rho := Matrix(d_in, d_in, datatype=complex[8]);
   else
      U := Feynman_permutation_matrix([op(in_list), op(out_list)], d);
      new_rho := Matrix(d_in, d_in);
   end if;
   # conjugate by the permutation: reordered_rho = U . rho . U^H
   reordered_rho := LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(
                       LinearAlgebra:-LA_Main:-MatrixMatrixMultiply(U, rho, inplace=false, outputoptions=[]),
                       LinearAlgebra:-LA_Main:-HermitianTranspose(U, inplace=false, outputoptions=[]),
                       inplace=false, outputoptions=[]);
else
   reordered_rho := rho;
   if rtable_options(rho, 'datatype') = complex[8] then
      new_rho := Matrix(d_in, d_in, datatype=complex[8]);
   else
      new_rho := Matrix(d_in, d_in);
   end if;
end if;

for i from 1 to d_in do
   for j from 1 to d_in do
      new_rho[i,j] := add(reordered_rho[(i-1)*d_out+k, (j-1)*d_out+k], k=1..d_out);
   end do;
end do;


return(new_rho);

end proc:


#################################################################################
#################################################################################
&lt;/code&gt;
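
Since the question is about profiling, here is a minimal sketch of per-statement profiling with CodeTools:-Profiling, assuming a Maple version that provides that package. block_sum is a hypothetical, self-contained stand-in for the final double loop of Partial_trace: it sums the diagonal of each d_out x d_out block. The 64x64 test Matrix is arbitrary.

&lt;code&gt;
# Hypothetical stand-in for the inner double loop of Partial_trace.
block_sum := proc(R::Matrix, d_in::posint, d_out::posint)
   local i, j, k, S;
   S := Matrix(d_in, d_in, datatype = float[8]);
   for i to d_in do
      for j to d_in do
         S[i, j] := add(R[(i-1)*d_out + k, (j-1)*d_out + k], k = 1 .. d_out);
      end do;
   end do;
   S;
end proc:

R := LinearAlgebra:-RandomMatrix(64, 64, generator = 0.0 .. 1.0,
                                 outputoptions = [datatype = float[8]]):

CodeTools:-Profiling:-Profile(block_sum):        # start collecting data
for n to 20 do block_sum(R, 8, 8) end do:
CodeTools:-Profiling:-PrintProfiles(block_sum);  # per-statement report
CodeTools:-Profiling:-UnProfile(block_sum):      # stop profiling
&lt;/code&gt;

PrintProfiles reports call counts, time and memory use for each statement. That should show whether the add inside the double loop dominates.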


Problems become apparent, e.g., when I combine several such procedures, often involving eigenvalues of matrices, into one higher-level command. When such a command is used, e.g., with the Optimization package or the Global Optimization Toolbox, evalhf is typically not possible (as with most commands involving matrices). In such scenarios some procedures are called very often, and that is where the bottlenecks really kick in...</description>
      <guid>74121</guid>
      <pubDate>Thu, 10 Jan 2008 03:58:32 Z</pubDate>
      <itunes:author>quantum</itunes:author>
      <author>quantum</author>
    </item>
    <item>
      <title>a little more</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer74108</link>
      <itunes:summary>A little more.

LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
    temp, I*alpha[1], inplace=true,
    outputoptions=[datatype=complex[8]]):

in the final loop could be replaced by

msm:=LinearAlgebra:-LA_Main:-LA_External:-MatrixScalarMultiply:

msm(temp,I*alpha[1]): # safer is msm(temp,evalf(I*alpha[1]))

where the assignment to `msm` is done right after that to `mmm`, before the loop begins.

Also, there is a call to LinearAlgebra:-LA_Main:-MatrixScalarMultiply inside `matexp`. This could be set up to call an external function directly, just as is done for addition, norm, etc. The following lines could be added in `matexp` in the relevant places (the hw_ version where hardware floats are used, the sw_ version where software floats are used), and ExtMSM declared as a new local.

ExtMSM := ExternalCalling:-DefineExternal('hw_f06jdf', extlib);

ExtMSM := ExternalCalling:-DefineExternal('sw_f06jdf', extlib);

Then the call in `matexp`

    LinearAlgebra:-LA_Main:-MatrixScalarMultiply(a, M, 'inplace' = 'true', 'outputoptions' = []);

could be replaced by

ExtMSM(n*n, M, a, 1);

Together those give a further 10%-15% time saving over the original at size 16x16.

Keep in mind that all this deliberately bypasses a lot of sanity checks. The Matrices had better have datatype complex[8] and full rectangular storage, or else it will crash and burn.

It's not just garbage collection that slows Maple down for these computations. It's also the cost and overhead of Maple function calls, some of which have been avoided in the code I've posted on this example.

Having Maple be a general system capable of exact or arbitrary-precision floating-point computation brings with it the overhead of smart runtime selection of modes. Systems like Matlab don't necessarily have that sort of overhead if they are primarily pure hardware double-precision engines. There are alternative schemes for a general-purpose system like Maple. For example, on-the-fly generation of code tailored to a single mode of computation (exact, hardware, arbitrary precision, etc.) is one possibility. Another is giving very low-level routines such as the BLAS direct, individual interfaces at the user level.

acer</itunes:summary>
      <description>A little more.

LinearAlgebra:-LA_Main:-MatrixScalarMultiply(
    temp, I*alpha[1], inplace=true,
    outputoptions=[datatype=complex[8]]):

in the final loop could be replaced by

msm:=LinearAlgebra:-LA_Main:-LA_External:-MatrixScalarMultiply:

msm(temp,I*alpha[1]): # safer is msm(temp,evalf(I*alpha[1]))

where the assignment to `msm` is done right after that to `mmm`, before the loop begins.

Also, there is a call to LinearAlgebra:-LA_Main:-MatrixScalarMultiply inside `matexp`. This could be set up to call an external function directly, just as is done for addition, norm, etc. The following lines could be added in `matexp` in the relevant places (the hw_ version where hardware floats are used, the sw_ version where software floats are used), and ExtMSM declared as a new local.

ExtMSM := ExternalCalling:-DefineExternal('hw_f06jdf', extlib);

ExtMSM := ExternalCalling:-DefineExternal('sw_f06jdf', extlib);

Then the call in `matexp`

    LinearAlgebra:-LA_Main:-MatrixScalarMultiply(a, M, 'inplace' = 'true', 'outputoptions' = []);

could be replaced by

ExtMSM(n*n, M, a, 1);

Together those give a further 10%-15% time saving over the original at size 16x16.

Keep in mind that all this deliberately bypasses a lot of sanity checks. The Matrices had better have datatype complex[8] and full rectangular storage, or else it will crash and burn.

It's not just garbage collection that slows Maple down for these computations. It's also the cost and overhead of Maple function calls, some of which have been avoided in the code I've posted on this example.
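
Here is a minimal sketch of the allocation side of this. The first loop creates a fresh 16x16 complex[8] Matrix on every iteration. The second rescales a single workspace Matrix with inplace=true. Both use the LA_Main:-MatrixScalarMultiply calling sequence that appears earlier in this thread. The routine is bound to the name msm_main once, before the loops, so neither loop repeats the module-path lookup. The scalar c and the iteration count are arbitrary, and no timings are claimed.

&lt;code&gt;
msm_main := LinearAlgebra:-LA_Main:-MatrixScalarMultiply:  # bound once
A := LinearAlgebra:-RandomMatrix(16, 16, generator = -1.0 .. 1.0,
                                 outputoptions = [datatype = complex[8]]):
c := evalf(exp(0.001*I)):   # unit modulus, so repeated scaling stays bounded
nIter := 20000:

# Variant 1: a new complex[8] Matrix (future garbage) on every iteration.
st := time():
for k to nIter do
   C := msm_main(A, c, inplace = false, outputoptions = []);
end do:
t_alloc := time() - st;

# Variant 2: one workspace Matrix, scaled in place; no new storage.
W := copy(A):
st := time():
for k to nIter do
   msm_main(W, c, inplace = true, outputoptions = []);
end do:
t_inplace := time() - st;
&lt;/code&gt;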

Having Maple be a general system capable of exact or arbitrary-precision floating-point computation brings with it the overhead of smart runtime selection of modes. Systems like Matlab don't necessarily have that sort of overhead if they are primarily pure hardware double-precision engines. There are alternative schemes for a general-purpose system like Maple. For example, on-the-fly generation of code tailored to a single mode of computation (exact, hardware, arbitrary precision, etc.) is one possibility. Another is giving very low-level routines such as the BLAS direct, individual interfaces at the user level.
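
For comparison with a pure hardware double-precision engine, the single-mode path Maple already offers for scalar code is evalhf. Here is a minimal sketch with an arbitrary test procedure f. The same call is timed once under ordinary evaluation (software floats at the current Digits) and once under evalhf. As noted earlier in the thread, evalhf does not accept most Matrix-based LinearAlgebra code, so this only illustrates the overhead for plain scalar loops.

&lt;code&gt;
# A scalar loop that evalhf can handle.
f := proc(n)
   local i, s;
   s := 0.0;
   for i to n do
      s := s + sin(0.001*i)^2;
   end do;
   s;
end proc:

st := time():  r_generic := f(20000):      t_generic := time() - st;
st := time():  r_hw := evalhf(f(20000)):   t_evalhf := time() - st;
&lt;/code&gt;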

acer</description>
      <guid>74108</guid>
      <pubDate>Sat, 12 Jan 2008 08:50:30 Z</pubDate>
      <itunes:author>acer</itunes:author>
      <author>acer</author>
    </item>
    <item>
      <title>Perfect target for partial evaluation</title>
      <link>http://www.mapleprimes.com/questions/40265-Advice-On-Profiling?ref=Feed:MaplePrimes:Advice on profiling?:Comments#answer73975</link>
      <itunes:summary>As I have &lt;a href="http://www.mapleprimes.com/forum/procedure-becomes-slower-and-slower-with-every-call#comment-11186"&gt;mentioned before&lt;/a&gt;, this kind of inlining and optimization of Maple code can actually be automated via partial evaluation.  Thanks for providing an additional test case.</itunes:summary>
      <description>As I have &lt;a href="http://www.mapleprimes.com/forum/procedure-becomes-slower-and-slower-with-every-call#comment-11186"&gt;mentioned before&lt;/a&gt;, this kind of inlining and optimization of Maple code can actually be automated via partial evaluation.  Thanks for providing an additional test case.</description>
      <guid>73975</guid>
      <pubDate>Tue, 22 Jan 2008 06:53:07 Z</pubDate>
      <itunes:author>JacquesC</itunes:author>
      <author>JacquesC</author>
    </item>
  </channel>
</rss>