Items tagged with performance

I made a small change to the Task Filtering code I uploaded a few weeks ago. The new code has better memory performance and, most importantly, more stable memory usage, which means it can actually run very large examples. Here is the new version of the code:

FilterCombPara := module( )
    local ModuleApply,
            doSplit,
            splitCombs,
            makeNewPrefix,
            lessThanX,
            filterCombs;

    # true when x <= i; used with remove() to keep only elements greater than i
    lessThanX := proc( x, i ) x <= i; end;

    # add i to the prefix and recurse on the elements of rest larger than i
    doSplit := proc( i::integer, prefix::set(posint), rest::set(posint),
                                                k::posint, filter::appliable, $ )
        splitCombs( prefix union {i}, remove( lessThanX, rest, i ), k-1, filter );
    end;

    # split the problem into one subproblem per element of rest,
    # executed in parallel via Threads:-Map
    splitCombs := proc( prefix::set(posint), rest::set(posint), k::posint,
                                                                filter::appliable, $ )
        if ( numelems( rest ) < k ) then
            NULL;    # not enough elements left to complete a combination
        elif ( k = 1 ) then
            filterCombs( prefix, rest, filter );
        else
            op( Threads:-Map( doSplit, rest, prefix, rest, k, filter ) );
        end;
    end;

    makeNewPrefix := proc( i, prefix ) { i, op( prefix ) } end;

    # complete each combination with one final element and keep those
    # accepted by the filter
    filterCombs := proc( prefix::set(posint), rest::set(posint), filter::appliable, $ )
        op( select( filter, map( makeNewPrefix, rest, prefix ) ) );
    end;

    ModuleApply := proc( n::posint, k::posint, filter::appliable, $ )
        local i;
        [ splitCombs( {}, {seq( i, i=1..n )}, k, filter ) ];
    end;

end:

This code has the small mapping functions as module members instead of declaring them inline. This means that less memory is churned as the code is executed. For long runs, this helps keep memory usage stable.

As an example, I ran the following:

> CodeTools:-Usage( FilterCombPara( 113,7, x->false ) );
memory used=17.39TiB, alloc change=460.02MiB, cpu time=88.52h, real time=20.15h
                                                  []

It used 88 CPU hours to run, but only 20 hours of real time (go parallelism!). It used 17 terabytes of memory, but only allocated about 500 MB. This example is pretty trivial, as the filter returned false for all combinations, so it did not collect any matches during the run. However, as long as the number of matches is small, that shouldn't be an issue. If the number of matches is too large to fit in memory, then this code may need to be modified to write the matches out to disk instead of trying to hold them all in memory at once (a rough sketch of such a change is shown below).
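Here is a rough sketch of what that modification might look like: a variant of filterCombs that appends matches to a file rather than returning them. The name filterCombsToDisk, the file-name argument, and the mutex handling are illustrative additions, not part of the uploaded code.

    # Hypothetical replacement for filterCombs inside the module: write the
    # matches to disk instead of returning them.  outMutex would be another
    # module local, created once with outMutex := Threads:-Mutex:-Create().
    filterCombsToDisk := proc( prefix::set(posint), rest::set(posint),
                               filter::appliable, fname::string, $ )
        local matches;
        matches := select( filter, map( makeNewPrefix, rest, prefix ) );
        if ( numelems( matches ) > 0 ) then
            Threads:-Mutex:-Lock( outMutex );    # only one thread writes at a time
            try
                fprintf( fname, "%a\n", matches );
            finally
                Threads:-Mutex:-Unlock( outMutex );
            end try;
        end if;
        NULL;
    end;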

Darin

-- Kernel Developer, Maplesoft

FilterComb.mw

As described on the help page ?updates,Maple17,Performance, Maple 17 uses a new data structure for polynomials with integer coefficients. Our goal was to improve the performance and parallel speedup of polynomial algorithms that underpin much of the system and create a platform for large scale polynomial computations. Shown below is the new representation for 9xy^3z.
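The diagram of the representation is not reproduced in this excerpt. As a side note (not part of the original post), one way to peek at how Maple stores a given polynomial internally is the dismantle command; the multi-term polynomial below is just an illustration:

# dismantle prints the internal DAG structure of an expression; on Maple 17
# or later, an expanded integer-coefficient polynomial that qualifies for the
# new data structure is displayed using the packed POLY representation.
dismantle( 9*x*y^3*z - 6*x*y^2*z - 8*x^3 - 5 );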

Dear Maple users

I have had Maple creating graphics for me that I cannot do in other programs I have access to: 3D pictures of circle waves interfering, or even the result of an interference pattern from a diffraction grating in Physics. But when it comes to simple animations, I am not all that impressed.

Basically I have three complaints:

a)  Maple seems to use a lot of space to save frame information, resulting in large file sizes.

Way back in Maple 6, the rtable was introduced. You might be more familiar with its three types: Array, Matrix, and Vector. The name rtable comes from "rectangular table", since its entries can be stored contiguously in memory, which is important in the case of "hardware" datatypes. This is a key aspect of the external-calling mechanism which allows Maple to use functions from the NAG and CLAPACK external libraries. In essence, the contiguous data portion of a hardware-datatype rtable can be passed to a compiled C or Fortran function without any need for copying or preliminary conversion. In such cases, the data structure in Maple is storing its numeric data portion in a format which is also directly accessible within external functions.

You might have noticed that Matrices and Arrays with hardware datatypes (e.g. float[8], integer[4], etc.) also have an order. The two orders, Fortran_order and C_order, correspond to column-major and row-major storage respectively. The Wikipedia page on row-major order explains it nicely.
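As a small illustration (not from the help pages), rtable_options reports the storage order of a given rtable:

# The default order for a Matrix is Fortran_order; C_order must be requested.
A := Matrix( 2, 3, (i,j) -> 10*i + j, datatype = float[8] ):
B := Matrix( 2, 3, (i,j) -> 10*i + j, datatype = float[8], order = C_order ):
rtable_options( A, 'order' ), rtable_options( B, 'order' );   # Fortran_order, C_order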

There is even a help page which illustrates that the method of accessing entries can affect performance. Since Fortran_order means that the individual entries in any column are contiguous in memory, code which accesses those entries in the same order in which they are stored can perform better. This relates to the fact that computers cache data: blocks of nearby data can be moved from slower main memory (RAM) to very fast cache memory, often speculatively, which can have very real benefits.

What I'd like to show here is that the relatively small performance improvement from matching the entry access to the storage order when using evalhf can become a more significant improvement when using Maple's Compile command. For procedures which walk all entries of a hardware-datatype Matrix or multidimensional Array to apply a simple operation to each value, the improvement can account for a significant part of the total computation time.

What makes this more interesting is that in Maple the default order of a float[8] Matrix is Fortran_order, while the default order of a float[8] Array used with the ImageTools package is C_order. It can sometimes pay off to write your for-do loops appropriately.

If you are walking through all entries of a Fortran_order float[8] Matrix, then it can be beneficial to access entries primarily by walking down each column. By this I mean accessing entries M[i,j] by changing i in the innermost loop and j in the outermost loop. This means walking the data entries, one at a time as they are stored. Here is a worksheet which illustrates a performance difference of about 30-50% in a Compiled procedure (the precise benefit can vary with platform, size, and what else your machine might be doing that interferes with caching).

Matrixorder.mw
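For readers who don't open the worksheet, here is a minimal sketch of the kind of comparison it makes; the procedure names, the summing operation, and the 4000x4000 size are illustrative choices, not the worksheet's exact code.

# Same elementwise walk, two loop orders.  With a Fortran_order float[8]
# Matrix, p_byColumn visits entries in the order they are stored in memory.
p_byColumn := proc( M::Matrix(datatype=float[8]), m::integer, n::integer )
    local i::integer, j::integer, s::float;
    s := 0.0;
    for j to n do
        for i to m do
            s := s + M[i,j];    # i varies fastest: contiguous accesses
        end do;
    end do;
    s;
end proc:

p_byRow := proc( M::Matrix(datatype=float[8]), m::integer, n::integer )
    local i::integer, j::integer, s::float;
    s := 0.0;
    for i to m do
        for j to n do
            s := s + M[i,j];    # j varies fastest: strided accesses
        end do;
    end do;
    s;
end proc:

cByColumn := Compiler:-Compile( p_byColumn ):
cByRow := Compiler:-Compile( p_byRow ):

M := LinearAlgebra:-RandomMatrix( 4000, datatype = float[8] ):
CodeTools:-Usage( cByColumn( M, 4000, 4000 ) ):
CodeTools:-Usage( cByRow( M, 4000, 4000 ) ):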

If you are walking through all entries of an m-by-n-by-3 C_order float[8] Array (which is a common structure for a color "image" used by the ImageTools package) then it can be beneficial to access entries A[i,j,k] by changing k in the innermost loop and i in the outermost loop. This means walking the data entries, one at a time as they are stored. Here is a worksheet which illustrates a performance difference of about 30-50% in a Compiled procedure (the precise benefit can vary with platform, size, and what else your machine might be doing that interferes with caching).

Arrayorder.mw
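Again as a rough sketch (not the attached worksheet): for a C_order m-by-n-by-3 Array the last index varies fastest in memory, so the innermost loop should run over the third index.

# Summing all entries of a C_order float[8] image Array in storage order.
p_image := proc( A::Array(datatype=float[8]), m::integer, n::integer )
    local i::integer, j::integer, k::integer, s::float;
    s := 0.0;
    for i to m do
        for j to n do
            for k to 3 do
                s := s + A[i,j,k];   # k varies fastest for C_order
            end do;
        end do;
    end do;
    s;
end proc:
cImage := Compiler:-Compile( p_image ):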

Based on the title, once again I'll mention that the program I have made using Maple 14 is not running fast enough.

Maple should run efficiently compared to other programs.

I am attaching the program here for your further action.

That's all, thank you.

I have to solve a linear system in Maple, where the coefficient matrix A is tridiagonal.

My problem is that my matrix A is big, and therefore I have to wait too long. Is there some function in Maple that can help with this problem?

I'm using Maple to simulate particle collision and interaction (absorption and scattering), and I keep tracking each particle's energy until it either gets absorbed or leaks (energy less than a predefined minimum energy). I also use the Maple Statistics package. I also need to call the MATLAB link to generate a histogram with logarithmically spaced bins, a feature that I could not find anywhere in Maple, and I think it is one of the reasons for the calculation performance issue.

The problem...

Yesterday I wrote a post that began,

"I realized recently that, while 64bit Maple 15 on Windows (XP64, 7) is now using accelerated BLAS from Intel's MKL, the Operating System environment variable OMP_NUM_THREADS is not being set automatically."

But that first sentence is about where it stopped being correct, as far as how I was interpreting the performance of 64bit Maple on Windows. So I've rewritten the whole post, and this is the revision.

I concluded that, by setting the Windows operating system environment variable OMP_NUM_THREADS to 4, performance would double on a quad core i7. I even showed timings to help establish that. And since I know that memory management and dynamic linking can cause extra overhead, I re-ran all my examples in freshly launched GUI sessions, with the user-interface completely closed between examples. But I got caught out in a mistake, nonetheless. The problem was that there is extra real-time cost to having my machine's Windows operating system dynamically open the MKL dll the very first time after bootup.

So my examples done first after bootup were at a disadvantage. I knew that I could not look just at measured cpu time, since for such threaded applications that reports as some kind of sum of cycles for all threads. But I failed to notice the real-time measurements were being distorted by the cost of loading the dlls the first time. And that penalty is not necessarily paid for each freshly launched, completely new Maple session. So my measurements were not fair.

Here is some illustration of the extra real-time cost, which I was not taking into account. I'll do Matrix-Matrix multiplication for a 1x1 example, to try and show just how much this extra cost is unrelated to the actual computation. In these examples below, I've done a full reboot on Windows 7 where so annotated. The extra time cost for the very first load of the dynamic MKL libraries can be from 1 to over 3 seconds. That's about the same as the cpu time this i7 takes to do the full 3000x3000 Matrix multiplication! Hence the confusion.

Roman brought up hyperthreading in his comment on the original post. So part of redoing all these examples, with full restarts between them, is testing each case both with and without hyperthreading enabled (in the BIOS).

Quad-core Intel i7 (four physical cores)

Hyperthreading disabled in BIOS
-------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);   # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.18KiB, alloc change=127.98KiB, cpu time=219.00ms, real time=3.10s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.91KiB, alloc change=127.98KiB, cpu time=140.00ms, real time=2.81s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Hyperthreading enabled in BIOS
------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);    # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.00KiB, alloc change=127.98KiB, cpu time=202.00ms, real time=2.84s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=215.56KiB, alloc change=127.98KiB, cpu time=187.00ms, real time=1.12s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Having established that the first use after reboot was incurring a real-time penalty of a few seconds, I redid the timings in order to gauge the benefit of having OMP_NUM_THREADS set appropriately. These too were done with and without hyperthreading enabled. The timings below appear to indicate that slightly better performance can be had for this example when hyperthreading is disabled. The timings also appear to indicate that having OMP_NUM_THREADS unset results in performance competitive with having it set to the number of physical cores.

Hyperthreading disabled in BIOS
-------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=142.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.50s, real time=1.92s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=141.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.38s, real time=7.38s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.11KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.57s, real time=1.94s



Hyperthreading enabled in BIOS
------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.57KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.46s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.35s, real time=7.35s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.56s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "8"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.69s, real time=2.23s

With all those new timing measurements, it appears that setting the global environment variable OMP_NUM_THREADS to the number of physical cores may not be necessary. The performance is comparable when that variable is left unset. So, while this post is now a non-story, it's interesting to know.

And the lesson about comparative timings is also useful. Sometimes, even a complete GUI/kernel relaunch is not enough to get a fair, level playing field for comparison.

I was recently looking at rotating a 3D plot, using plottools:-rotate, and noticed something inefficient.

In the past few releases of Maple, efficient float[8] datatype rtables (Arrays or hfarrays) can be used inside the plot data structure. This can save time and memory, both in terms of the users' creation and manipulation of them as well as in terms of the GUI's ability to use them for graphic rendering.

What I noticed is that, if one starts with a 3D plot data structure containing a float[8] Array in the MESH portion, then following application of plottools:-rotate a much less efficient list-of-lists is produced in the resulting structure.

Likewise, an efficient float[8] Array or hfarray in the GRID portion of a 3D plot structure gets transformed by plottools:-rotate into an inefficient list-of-lists object in the MESH portion of the result. For example,

restart:

p:=plot3d(sin(x),x=-6..6,y=-6..6,numpoints=5000,style=patchnogrid,
          axes=box,labels=[x,y,z],view=[-6..6,-6..6,-6..6]):

seq(whattype(op(3,zz)), zz in indets(p,specfunc(anything,GRID)));
                            hfarray

pnew:=plottools:-rotate(p,Pi/3,0,0):

seq(whattype(op(1,zz)), zz in indets(pnew,specfunc(anything,MESH)));
                              list

The efficiency concern is not just a matter of the space occupied in memory. It also relates to the optimal attainable methods for subsequent manipulation of the data.

It may be nice and convenient for plottools to get as much mileage as it can out of plottools:-transform, internally. But it's suboptimal. And plotting is a topic where dedicated, optimized helper routines for particular data formats are justified and of merit. If we want plot manipulation to be fast, then both Library-side and GUI-side operations need more case-by-case optimization.

Here's an illustrative worksheet, using and comparing memory performance with a (new, alternative) procedure that does in-place rotation of a 3D MESH. plot3drotate.mw
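For readers without the worksheet, here is a rough sketch of the in-place idea, restricted to rotation about the z-axis; the name rotateMeshZ and the details are illustrative, not the attached worksheet's code.

# Walk the float[8] MESH Array of a 3D plot and overwrite the x and y
# coordinates in place, so no list-of-lists copy is ever created.
rotateMeshZ := proc( A::Array(datatype=float[8]), theta::realcons, $ )
    local dims, m, n, i, j, c, s, x, y;
    dims := [ rtable_dims( A ) ];            # e.g. [ 1..m, 1..n, 1..3 ]
    m := rhs( dims[1] );  n := rhs( dims[2] );
    c := evalf( cos( theta ) );  s := evalf( sin( theta ) );
    for i to m do
        for j to n do
            x := A[i,j,1];  y := A[i,j,2];
            A[i,j,1] := c*x - s*y;
            A[i,j,2] := s*x + c*y;
        end do;
    end do;
    NULL;
end proc:

Applied to each float[8] Array found in the MESH portions of a plot structure, this avoids both the list-of-lists conversion and the extra copy.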

Whenever I try to plot a simple/small function in 2D, Maple takes forever to calculate it if the number of frames is too high.

For instance, the following command (from a tutorial, so it should work):

plots:-animate( plot, [exp(-x/5)*sin(x), x=0..t], t=0..20, frames=100 );

will not compute unless I change the number of frames to 50 or below.

I tried this on a friend's computer (he has a Mac, with much weaker performance) and it worked just fine.

Consider a function f depending on a parameter k and a variable y

f:=(k,y)->k*s(y);

where s:=y->add(sin(1/y^i),i=1..1000000) is just some nasty function.

Now, why does it take a long time to evaluate f(0,y)? And how could I speed this up?


More generally, I would like to plug (varying) parameters into a function, simplify and then do analysis on the resulting function. How could I do that?

I am calculating the dot product in Maple 15 of two very large vectors. Are there any tricks/ways to speed up such a calculation? Maple is taking a very long time to execute the calculation and I am unsure whether it will actually return a result.

I'm using Maple code of the form DotProduct(vector1,vector2);


Any advice ?


Most programs will not produce and assign to a large number of global "top-level" names. But it is interesting that the cost associated with such global name assignment is related to the number of entries in libname.

A possible cause of this cost is the need to check whether the name is protected, before assigning.

The following timings were made on 32bit Maple 15 running on Windows 7, on an Intel i7. The set of four timings is...
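The four timings themselves are not reproduced in this excerpt. A minimal sketch of the kind of experiment being described, with illustrative names and counts rather than the original benchmark, is:

# Time the creation and assignment of many distinct global ("top-level")
# names; repeat after lengthening libname to see the effect described above.
N := 10^5:
st := time():
for k to N do
    assign( cat( 'G', k ), k );    # assigns to global names G1, G2, ...
end do:
time() - st;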

The complexplot3d command can color by using the (complex) argument for the hue, and compute the height z from the magnitude. So, when rotated to view the x-y plane straight on, it can provide a nice coloring of the argument of whatever complex-valued expression is being plotted.

Another way to obtain a similar plot is to use densityplot (with appropriate values for its scaletorange option) and apply argument to the expression or function being plotted. For some kinds of complex-valued...
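As a small illustration of the two approaches (the rational expression here is just an example, not from the original post):

# Height by magnitude, hue by (complex) argument:
plots:-complexplot3d( (z^2-1)/(z^2+1), z = -2-2*I .. 2+2*I );

# A flat view of the argument alone, via densityplot:
plots:-densityplot( argument( ((x+y*I)^2-1)/((x+y*I)^2+1) ),
                    x = -2 .. 2, y = -2 .. 2, grid = [101,101],
                    colorstyle = HUE, scaletorange = -Pi .. Pi,
                    style = patchnogrid );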

I'm working on a 3D program that plots stars, star clusters and other objects of interest against a model of the Milky Way. The guts of this program create a few hundred points (< 2000) that represent stars, and move them a little bit in random directions to create a cloud effect. Then the animate procedure rotates them about the center of the galaxy. I would like to add more stars and do more things with them, but I am encountering a very sluggish editor, suggesting that Maple is imposing a memory limit. The Activity Monitor says that Maple is using 140 MB, so gigabytes of RAM remain available. Furthermore, if I save the plot as an animated gif, I can run half a dozen copies simultaneously in separate browser windows without encountering noticeable slowdown. But I can't elaborate the program without better performance from the editor.

The program is written in the Document interface. I tried saving it to the classic interface, but that didn't seem to help, and I found it difficult to edit.

Ideas sought.
