Items tagged with blog blog Tagged Items Feed

Yesterday I wrote a post that began,

"I realized recently that, while 64bit Maple 15 on Windows (XP64, 7) is now using accelerated BLAS from Intel's MKL, the Operating System environment variable OMP_NUM_THREADS is not being set automatically."

But that first sentence is about where it stopped being correct, as far as how I was interpreting the performance on 64bit Maple on Windows. So I've rewritten the whole post, and this is the revision.

I concluded that, by setting the Windows operating system environment variable OMP_NUM_THREADS to 4, performance would double on a quad core i7. I even showed timings to help establish that. And since I know that memory management and dynamic linking can cause extra overhead, I re-ran all my examples in freshly launched GUI sessions, with the user-interface completely closed between examples. But I got caught out in a mistake, nonetheless. The problem was that there is extra real-time cost to having my machine's Windows operating system dynamically open the MKL dll the very first time after bootup.

So my examples done first after bootup were at a disadvantage. I knew that I could not look just at measured cpu time, since for such threaded applications that reports as some kind of sum of cycles for all threads. But I failed to notice the real-time measurements were being distorted by the cost of loading the dlls the first time. And that penalty is not necessarily paid for each freshly launched, completely new Maple session. So my measurements were not fair.

Here is some illustration of the extra real-time cost, which I was not taking into account. I'll do Matrix-Matrix multiplication for a 1x1 example, to try and show just how much this extra cost is unrelated to the actual computation. In these examples below, I've done a full reboot on Windows 7 where so annotated. The extra time cost for the very first load of the dynamic MKL libraries can be from 1 to over 3 seconds. That's about the same as the cpu time this i7 takes to do the full 3000x3000 Matrix multiplication! Hence the confusion.

Roman brought up hyperthreading in his comment on the original post. So part of redoing all these examples, with full restarts between them, is testing each case both with and without hyperthreading enabled (in the BIOS).

Quad core Intel i7. (four physical cores)

Hyperthreading disabled in BIOS
-------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);   # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.18KiB, alloc change=127.98KiB, cpu time=219.00ms, real time=3.10s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.91KiB, alloc change=127.98KiB, cpu time=140.00ms, real time=2.81s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Hyperthreading enabled in BIOS
------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);    # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.00KiB, alloc change=127.98KiB, cpu time=202.00ms, real time=2.84s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);
                              "4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=215.56KiB, alloc change=127.98KiB, cpu time=187.00ms, real time=1.12s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns


Having established that the first use after reboot was incurring a real time penalty of a few seconds, I redid the timings in order to gauge the benefit of having OMP_NUM_THREADS set appropriately. These too were done with and without hyperthreading enabled. The timings below appear to indicate that slightly bettern performance can be had for this example in the case that hyperthreading is disabled. The timings also appear to indicate that having OMP_NUM_THREADS unset results in performance competitive with having it set to the number of physical cores.

Hyperthreading disabled in BIOS
-------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=142.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.50s, real time=1.92s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=141.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.38s, real time=7.38s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.11KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.57s, real time=1.94s



Hyperthreading enabled in BIOS
------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.57KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.46s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.35s, real time=7.35s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS
                              "4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.56s, real time=2.15s


> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);
                              "8"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.69s, real time=2.23s

With all those new timing measurements it appears that having to set the global environment variable OMP_NUM_THREADS to the number of physical cores may not be necessary. The performance is comparable, when that variable is left unset. So, while this post is now a non-story, it's interesting to know.

And the lesson about comparitive timings is also useful. Sometimes, even complete GUI/kernel relaunch is not enough to get a level and fair field for comparison.

Maple 15, Windows7x64, Standard v. Classic

I have noticed that, on my system, the smoothness of some INLINE plots is better in Classic than in Standard. Is this some regression or some installation-specific quirck I wonder?

In Tools->Options, I have plot anti-aliasing enabled (whatever that is).

This looks alright in Classic

plots:-implicitplot(
  [ x^2 + y^2 = 1, x^2 + y^2 = 2 ]
  , x = -2 .. 2
  , y = -2 .. 2

I was recently looking at rotating a 3D plot, using plottools:-rotate, and noticed something inefficient.

In the past few releases of Maple, efficient float[8] datatype rtables (Arrays or hfarrays) can be used inside the plot data structure. This can save time and memory, both in terms of the users' creation and manipulation of them as well as in terms of the GUI's ability to use them for graphic rendering.

What I noticed is that, if one starts with a 3D plot data structure containing a float[8] Array in the MESH portion, then following application of plottools:-rotate a much less efficient list-of-lists is produced in the resulting structure.

Likewise, an effiecient float[8] Array or hfarray in the GRID portion of a 3D plot structure gets transformed by plottools:-rotate into an inefficient list-of-lists object in the MESH portion of the result. For example,

restart:

p:=plot3d(sin(x),x=-6..6,y=-6..6,numpoints=5000,style=patchnogrid,
          axes=box,labels=[x,y,z],view=[-6..6,-6..6,-6..6]):

seq(whattype(op(3,zz)), zz in indets(p,specfunc(anything,GRID)));
                            hfarray

pnew:=plottools:-rotate(p,Pi/3,0,0):

seq(whattype(op(1,zz)), zz in indets(pnew,specfunc(anything,MESH)));
                              list

The efficiency concern is not just a matter of the occupying space in memory. It also relates to the optimal attainable methods for subsequent manipulation of the data.

It may be nice and convenient for plottools to get as much mileage as it can out of plottools:-transform, internally. But it's suboptimal. And plotting is a topic where dedicated, optimized helper routines for some particular data format is justified and of merit. If we want plot manipulation to be fast, then both Library-side as well as GUI-side operations need more case-by-case-optimizated.

Here's an illustrative worksheet, using and comparing memory performance with a (new, alternative) procedure that does inplace rotation of a 3D MESH. plot3drotate.mw

pre-sized plots...

September 24 2011 acer 9996

The goal here is to produce plots for inclusion inside Worksheets or Documents of the Standard GUI at specific sizes.

[update: Maple 18 has this as a new feature for 2D plots. See the `size` option described on ?plot,options]

When manually resizing an existing plot, using the mouse pointer, there is no visual cue as to what pixel size has been attained. Hence any worksheet author who wishes to produce a plot of size 600x600 is presented with two barriers. The first is that resizing must be done manually, and the second is that there is no convenient mechanism showing the actual size attained.

The `Resize` package attempts to address these barriers by allowing construction of a plot, inside a worksheet, with programmatically specified width and height in pixels.

The default behaviour of the package is to produce the plot inside a new Worksheet, from whence it may be selected and copied. An optional behaviour is to show the constructed plot inside a Task Template (a form of help-page), where it may be previewed for correctness and inserted into the current Worksheet or Document at the press of a single button.

It appears to function for both 2D and 3D single plots.

It won't work for so-called Array plots, which are collections of multiple plots displayed side-by-side inside a worksheet table.

This first version is a bit rough. The plot is currently being inserted as input, which is why it isn't centered on the page. I suspect that it would be best to insert the first argument (eg. a `plot` call) as input to an execution group, and then have the plot be the output. That would look, and hopefully act, just as usual. And with the plot call inserted as input, the original `Resize` call could be neatly deleted if desired.

To install this thing, use the File->Open from the Standard GUI's menubar. Choose this .mla file as the thing to open. (You may have to slide a scrollbar, and select a view of "All Files", in order to see it in the pop-up File Manager.) Double-clicking on the file, to launch it, should ideally also open it but it looks like that functionality broke for Maple 15.

Resize_installer.mla

Alternatively, you could run the command,

march( 'open', "...full...path...to...Resize_installer.mla");

The attached .mla archive is a (graphically) self-unpacking installer, when opened in this way.

The bundled materials include a pre_built .mla containing the package itself, the source code and a worksheet that rebuilds it from source if desired, a short example worksheet, and a worksheet that rebuilds the whole installer (and re-bundles all those files into it). I used the `InstallerBuilder` to make the self-unpacking .mla installer, as I think it's a handy tool that is under-appreciated (and, alas, under documented!).

It's supposed to work without the usual hassle of having to set `libname`. This is an automatic consequence of the place in which it gets installed.

It seems to work in Maple 12, 14, and 15, on Windows 7. Let me know if you have problems with it.

acer

Paper Models of 3D Plots...

September 09 2011 Paul 390 Maple

 

                

3D Paper Physical Model

Most programs will not produce and assign to a large number of global "top-level" names. But it is interesting that the cost associated with such global name assignment is related to the number of entries in libname.

A possible cause of this cost is the need to check whether the name is protected, before assigning.

The following timings were made on 32bit Maple 15 running on Windows 7, on an Intel i7. The set of four timings is...

The complexplot3d command can color by using (complex) argument for the hue, and compute height z by magnitude. So, when rotated to view the x-y plane straight on, it can provide a nice coloring of the argument of whatever complex-valued expression is being plotted.

Another way to obtain a similar plot is to use densityplot (with appropriate values for its scaletorange option) and apply argument to the expression or function being plotted. For some kinds of complex-valued...

A short remark. This would be a blog entry if those still existed.

I am comparing plot,options.

The style=point option has versatile options for symbols, their sizes, the number of points:

  , 'style' = point
  , 'numpoints' = 50
  , 'symbol' = solidcircle
  , 'symbolsize' = 8

 

The ability to color a 3D plot using a color function is geared more for functions of x and y only.

But quite often, the surface or pointwise 3D position of the plot is itself being specified as a height z which is a function of x and y. For the plot3d comand, that's pretty much the way it works (whether using an expression or a procedure).

So, of course, the very same rule that...

A Plot Component (on the right below) can act as a kind of 2-dimensional "slider" for inputing values of two parameters at once.

The polar plot (on the left below) makes use of both values.

If you click in the right Plot, and drag around the mouse cursor for a while, then the left Plot will be continuously updated.

Make sure to execute the collapsed code-edit region, to initialize it. (Just click on it, to execute. Or expand, look, and right-click on it.)

 

 

 

Click any point above, & Drag

 

Download plotslider1.mw

 

This is in response to a Question about the speed and memory use of an animated DEplot. The problems are that the example's animation was slow to create, and prohibitively expensive to save in a Document.

An alternative approach is to combine multiple calls to plots:-odeplot with a call to plots:-fieldplot to supply the background flow arrows. This is a lot faster. It takes less memory to create and run, but the GUI may still consume too much resources saving it. The good news is that it's so much faster that it's not inconvenient to re-run the entire thing from scratch. And so it's quite feasible to remove all the expensive output from the Document prior to saving and thus avoid the whole resources problem.

The original questioner also wanted to visualize with resect to two varying parameters. So I've also done an implementation of that using Embedded Components and two Sliders.

Here is the old DEplot animation. It takes about 40 sec to create it animation on an Intel i7.

Here is the new combined odeplot+fieldplot animation. It takes a second or two to create its animation on an Intel i7.

Here is the DEplot in Embedded Components. It's very slow, and the image doesn't change smoothly with the sliders.

Here is the new combined odeplot+fieldplot in Embedded Components. Its image changes pretty smoothly with the sliders.

I encourage completely quitting the GUI (not just restart, or close Document and re-Open) between comparison runs of these implementations, of you want to get a really good feel for the effects of both running them as well as saving them (with and without all output).

For the Embedded Component documents, the functioning code resides inside the "initialize" button. (right-click, go to Component Properties, Action When Clicked, only if you want to inspect it.) To run those two  Documents, execute all the commands (use the triple-exclam from the menubar if you like), and then press the "initialize" button, and then move the sliders.

The difference in performance is related partly to the use of hardware datatype Arrays in the PLOT structures generated by odeplot and fieldplot. (But an `arrow` primitive would help even more!)

And (I think) there is improvement by virtue of using dsolve/numeric/parameters in the use of `odeplot`. That saves overhead from repeated cold invocations of dsolve/numeric. And DEplot doesn't support that, since it expects as argument the system of DEs and ICs. The newer `odeplot` command accepts the procedure returned by dsolve/numeric, and thus allows for efficient repeated setting of parameter values. The `fieldplot` command doesn't need the solution of the DE system at all: it just needs the DEs.

I would have considered wrapping the whole combined approach up into a single command, but it might have to accept separate options for the view ranges, in order to always look its best. A smart version might be able to deduce the computed ranges from the odeplot output, and then create the background fieldplot based on that.

Inside a procedure, local variables are evaluated only one level. Of what good is this, one might ask?

Well, for one thing it allows you to do checks or manipulations of an unevaluated function call without having that function call be evaluated over again. I mean, for function calls to routines which don't happen to remember earlier results.

This is a revision of an Answer

There have been some recent posts about interpolating data.

Attached below is a worksheet that shows some possibilities, with the functionality centering on the CurveFitting:-ArrayInterpolation command.

This is quick summary of parts of a broader document which covers both 2-d and 3-d methods (for regular grids), where I've left out the higher-efficiency methods and instead roughed in some examples involving integration and differentiation.

I've elected not to follow the 3-d Example from the ArrayInterpolation command's help-page, although using a pre-formed grid is a very fast approach to obtain just an interpolated 3-d plot. I also prefer to use the plots:-surfdata command rather than the plots:-matrixplot command, since the former let's one get the axes' tickmarks correct for the x- and y-data ranges.

The scenario is that you have a grid of data points in two dimensions (x- and y, or P- and T-, or what have you).

For each point (ie, for each 2-d pair of values) you have an associated value (or height in z, say). Hence you actually have data points in 3-d space.

How you obtained the associated (z) values depends on your own particular data collection method, or your own program. How you got the data is irrelevant here. What matters is that you have the finite number of data values, and no other easy way to generate data values at more points (let alone data for arbitrary new points). Below, we'll just create the data (once, at the start) for this example using an entirely made-up formula.

The presumption is that you might wish to plot a smooth surface that connects the 3-d data.

But you might also wish to write some program which requires interpolated (z) values at some new (x,y) 2-d points.  And you do not yet know what these 2-d point pairs are. So a pre-formed Array of points  at which to interpolate may not suffice.

Instead of using a pre-formed Array of output points, we'll contruct a procedure named `B` which can be supplied with a new (x,y) 2-d output point and (if that point lies within the original range) return an interpolated (z) value.

This procedure `B` can also be plotted, using the usual `plot3d` comamnd. It won't plot quite as fast as would a pre-computed and pre-interpolated finer grid of (x,y) values, but it should plot nicely. And the surface can be made quite smooth, by merely increasing the number of plotted points using plot3d's usual numpoints option. (Maple does not currently do "adaptive" 3-d plotting, so there's also no advantage in that respect.) But `B` does solve the secondary task, of being able to compute for any subsequent (x,y) point.

We can even integrate and differentiate `B` numerically. Of course we should keep in mind that this is somewhat error prone, since on top of usual issues with numerical differentiation there is also fact that we make the choice of interpolation method! The entire interpolated surface will differ considerably according to whether a spline, cubic (or other) interpolating scheme is chosen!

We'll use P and T as the x- and y- grid points, below, since "a name is just a name" and our choice of variables is arbitarary.

plot_interp.mw

One useful feature of the `evalf` command is that it remembers previous results. But it also stores the current value of Digits as well as its input argument, to be associated with the remembered result.

There are two reasons for this. The fancier reason is so that, when Digits is reduced from that of an earlier successful computation, `evalf` can simply round off the earlier result to the desired number of decimal digits. The more basic reason is that `evalf` might...

The goal of computing only a select number of eigenvectors of a real symmetric floating-point Matrix comes up now and then. For very large Matrices the memory requirements can be more restrictive than the timing.

The attached worksheet and code computes this, more quickly and with significantly less memory allocation than does the usual task of computing all eigenvectors. By using the supplied Matrix itself as a partial "workspace" the amount of additional workspace and memory allocation for the task is negligible.

For example, having created the very large Matrix in the first place,  essentially no significant further memory allocation is required to compute the largest eigenvalue and its associated eigenvector.

A little about this routine `SelectedEigenvectors` follows.

It only works in hardware double precision. It expects a float[8] datatype Matrix (because you are serious about using minimal memory!). It uses the CLAPACK function dsyevx, using the "wrapperless" version of Maple's external-calling mechanism. It seems to work fine in the systems I've tried so far: Maple 13 and 14 on both 32bit and 64bit Linux and Windows.

Whether it computes and returns the selected eigenvectors (alongside the selected eigenvalues, which are always returned) is controlled by the 'vectors=truefalse' optional argument. By default it uses the Matrix argument as partial workspace and so destroys the original data; but this can be overridden with the 'preserve=true' optional argument. The requested accuracy can be relaxed with the 'epsilon=float' optional argument, which might sometimes speed it up.

The input Matrix is presumed to be symmetric. By default it uses the data in the lower triangle, but this can be changed to be the upper triangle with the 'uplo' optional argument.

The choice of eigenvalues is controlled by the two integer arguments `il` and `iu`. If il=iu=n then only the nth largest eigenvalue is computed. If il=1 and iu=4 then the four smallest eigenvalues are computed.

It returns three things: a Vector of dimension n whose first m entries are the selected eigenvalues, a nxm Matrix whose columns are the m associated eigenvectors, and a Vector of dimension n whose entries indicate whether corresponding eigenvectors failed to converge.

I didn't enable float arguments such as `vl` and `vu` which in principle could allow one to supply a floating-point range in which to find eigenvalues.

I didn't make an optimization of having it do an initial "dummy" external call in which no calculation would be done, but which would instead query for and subsequently utilize the optimal-performance additonal float workspace size.

For reasons mysterious to me, on Windows the 64bit version runs almost exactly half as fast as the 32bit version.

Usually, the workspace for eigen-solving is implemented to be at least O(n^2) for an nxn Matrix. But this routine does only O(m+n) extra workspace allocation to compute the m eigenvectors. And that is linear. Which is the Big Deal.

A 5000x5000 datatype=float[8] (ie. hardware double precision) Matrix takes 200MB of memory. With the preserve=true option, this routine can compute just the largest eigenvector with only about 200MB of additional allocation. And if the original Matrix is no longer required then with the preserve=false option this routine can do that task with less than 1MB further allocation. In comparison, the regular LinearAlgebra:-Eigenvectors command would require about 600MB of additional memory allocation while computing all eigenvectors.

At size 5000x5000 this routine is only about four times faster than LinearAlgebra:-Eigenvectors. I suspect that is because it still has to compute in full the reduction to tridiagonal form.

Download dsyevx.mw

1 2 3 4 5 6 7 Page 2 of 7