module, and some comments

acer — Mon, 13 Feb 2012 08:20:42 Z

My first choice for this kind of thing would be to store the Matrix as a member of a module, and have all the working procedures also be exports of that module. You could have the routine which constructs the Matrix be one of the module exports, and if you want, store the Matrix as a local of that module (thus still ensuring that it is accessible to all its members).
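A minimal sketch of that layout, in case it helps. The module name `MatStore` and the exports `Build` and `Apply` are made up for illustration; only the general shape (a Matrix held as a module local, with the working procedures as exports) reflects the suggestion above.

```maple
# Hypothetical module name and exports, for illustration only.
MatStore := module()
    export Build, Apply;
    local M;    # the shared Matrix, visible to every procedure in this module

    # Construct (or rebuild) the stored n x n float[8] Matrix.
    Build := proc(n::posint)
        M := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                         outputoptions = [datatype = float[8]]);
        NULL;
    end proc;

    # A working procedure that uses the shared Matrix without
    # receiving it as an argument.
    Apply := proc(V::Vector)
        M . V;
    end proc;
end module:

MatStore:-Build(3):
MatStore:-Apply(Vector(3, fill = 1.0, datatype = float[8]));
```

Because `M` is a local, code outside the module cannot reassign it by accident, while every export still sees the same container.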

You have a lot of questions, implicit and explicit. The answers all interrelate, so I hope the following addresses most of it.

Note that Maple does not make a copy of an rtable (Matrix, Array, or Vector) when passing it as an argument to a procedure. That is true for several if not most of Maple's mutable data structures (rtables among them). It would be a catastrophe for performance if it did. One consequence of not doing full copy/evaluations is that in-place operations are possible on rtable arguments. In my experience using Maple heavily for floating-point linear algebra since Maple 6, the ability to do in-place operations on hardware datatype=float[8] rtables is one of the single most important factors in getting higher performance.
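A small sketch of what that means in practice. The procedure name `DoubleInPlace` is hypothetical; the point is only that the procedure changes the caller's Matrix, since no copy is made on the way in.

```maple
# Hypothetical helper: doubles every entry of its argument in place.
DoubleInPlace := proc(M::Matrix)
    local i, j, m, n;
    m, n := LinearAlgebra:-Dimensions(M);
    for i to m do
        for j to n do
            M[i, j] := 2.0 * M[i, j];
        end do;
    end do;
    NULL;    # nothing is returned; the caller's Matrix has been changed
end proc:

A := Matrix([[1.0, 2.0], [3.0, 4.0]], datatype = float[8]):
DoubleInPlace(A):
A;    # the entries of A itself are now doubled
```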

Maple does not yet have a data structure which represents an rtable whose data is stored (permanently or semi-permanently) in the GPU's memory. It is still the case, when CUDA is enabled in Maple, that main-memory float[8] hardware double precision rtable data is passed to the GPU's memory with each call to a (the) CUDA function. This involves a very high data transfer cost. Each time you call it, Maple has to push the data of two float[8] Matrices up to the card, and then pull the data of the result float[8] Matrix back down from the card. This is probably so expensive that only an O(n^3) operation like matrix-matrix multiplication would be worthwhile using this behavioural model. It might even be hitting a law of diminishing returns. (See this very recent review, by someone who is usually pretty aware of how to wring the most out of big-M math software.) Also, using `.` or LinearAlgebra:-MatrixMatrixMultiply to invoke the CUDA version of the underlying matrix-matrix function will produce a new container for the result. Doing so repeatedly will mean that the old results will be memory-managed and garbage-collected by Maple, which is additional overhead.

You do not mention your Operating System. It matters. It sounds as if you are going to be using the BLAS function dgemm, repeatedly, via an entry point from Maple. A datatype=float[8] rtable is a Maple data structure whose data portion is a contiguous portion of main memory. It is a rectangular table (hence, rtable). A Matrix is one flavour of rtable. Maple is simply going to pass the address(es) of the data portions of the relevant float[8] rtables as the relevant arguments of a call to a dgemm function in some shared library. If using CUDA, then it will be the dgemm in the CUDA runtime libraries. On MS-Windows, it will be to the dgemm in the MKL runtime bundled with Maple. On Linux, it will be to the dgemm in a shared ATLAS library. And on OSX it will be to a generic dgemm. On Windows, the MKL will automatically detect how many cores and cpus are available, and use all available by default. That scales pretty well. On Linux, depending on cpu vendor, Maple will only use up to 2 cores or so, but you can build an ATLAS tuned for a host with a greater number of cores and then drop it in as a replacement.

If Maple is thus already using all available cores, then there is no need to try and break up the matrix-matrix multiplication as if to do Strassen, say, yourself by using Maple's Threads/Task facilities. In any event, note that the external-calls to the shared libs (with the dgemm) are not being made with the THREAD_SAFE option. Which means that Maple only allows one concurrent instance of dl-opening those external libs to run. So only one Maple thread would ever invoke the external dgemm at a given time, anyway.

The external BLAS routines do not put a mutex lock on the data of the rtables, and those functions are themselves threaded.

Note that Maple's time() routine will report something like the sum of all cycles used by all cores. Hence it may spuriously appear that a float[8] Matrix operation doesn't get sped up with the number of cores, even when it does! An alternative is to measure with the variant time[real]() and so see just how long it actually takes in wall-clock time. Of course, doing this effectively will entail ensuring that your machine load is otherwise as low as possible.
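As a rough sketch, the two timers can be taken side by side around the same multiplication. The size 2000 and the variable names are arbitrary choices for illustration; the numbers you see will depend entirely on your machine and its load.

```maple
n := 2000:
A := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                 outputoptions = [datatype = float[8]]):
B := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                 outputoptions = [datatype = float[8]]):

cpu0  := time():          # CPU time, roughly summed over all cores
wall0 := time[real]():    # wall-clock time

C := A . B:

cpuTime  := time() - cpu0;
wallTime := time[real]() - wall0;
```

On a multi-core host with a threaded BLAS, one would expect cpuTime to come out noticeably larger than wallTime for a multiplication of this size.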

Ok, where am I? All these topics double back on each other.

Right, let's talk about dgemm. That is the Basic Linear Algebra Subprogram which does double precision general matrix-matrix multiplication. It works like this: given Matrices A, B, C, and scalars alpha and beta, then it performs this,

            alpha*A*B + beta*C -> C

There are several things to admire about that. The first is that the Matrix C, which will contain the result, is acted upon in-place. You could re-use the same C (possibly another module local) as the result container for multiple calls (and incur no memory management, as it wouldn't be collected garbage after each use). If beta=0 then the previous contents of C do not get re-added, but of course you'll still want to fill C with zeroes before calling, if you'd used it before. You do not have to scale A or B before calling dgemm, as you can just use alpha to effect that. Also, there are additional arguments to dgemm which denote whether the Matrices are to be considered as transposed! So you would never waste time transposing the Matrices beforehand. (In fact there is no transposition function in all of BLAS or LAPACK, because nobody should ever waste time doing that since accommodating it could be done with mere changes to the indexing used in the functions.)
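To make the semantics concrete, here is a slow, plain-Maple reference sketch of the transposed case, alpha*A^T*B + beta*C -> C. The name `RefGemmT` is hypothetical and this is not the BLAS entry point; it only illustrates that C is updated in place and that the transpose is handled purely by swapping indices.

```maple
# Hypothetical reference routine: C <- alpha * A^T * B + beta * C, in place.
# Illustrative only; this is NOT the BLAS dgemm and is far slower.
RefGemmT := proc(alpha, A::Matrix, B::Matrix, beta, C::Matrix)
    local i, j, k, m, n, p, s;
    p, m := LinearAlgebra:-Dimensions(A);      # A is p x m, so A^T is m x p
    n := LinearAlgebra:-ColumnDimension(B);    # B is p x n
    for i to m do
        for j to n do
            s := 0.0;
            for k to p do
                s := s + A[k, i] * B[k, j];    # A[k, i] is entry (i, k) of A^T
            end do;
            C[i, j] := alpha * s + beta * C[i, j];
        end do;
    end do;
    NULL;
end proc:

A := Matrix([[1.0, 2.0], [3.0, 4.0]], datatype = float[8]):
B := Matrix([[5.0, 6.0], [7.0, 8.0]], datatype = float[8]):
C := Matrix(2, 2, datatype = float[8]):
RefGemmT(1.0, A, B, 0.0, C):
LinearAlgebra:-Norm(C - A^%T . B);    # should be zero, up to rounding
```

No transposed copy of A is ever formed; the only change from the untransposed case is reading A[k, i] instead of A[i, k].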

Yes, there is a way to access dgemm from Maple so as to make use of such optional subtlety. It's not directly available, but I can post an example if anyone is interested.

In Maple, memory management of large rtables -- even when relatively lean like with float[8] datatype -- is expensive. In contrast, a routine like ArrayTools:-Fill can zero out a large float[8] rtable in a very short time. This all contributes to why inplace operations can often win the day, for numerical linear algebra in Maple.
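A sketch of the reuse pattern, assuming a 1000x1000 container and ten passes (both arbitrary): the Matrix is allocated once and zeroed before each pass, instead of being recreated and left for the garbage collector.

```maple
C := Matrix(1000, 1000, datatype = float[8]):    # allocated once

for iter to 10 do
    ArrayTools:-Fill(0.0, C);    # overwrite every entry of the existing container
    # ... accumulate this pass's results into C here ...
end do:
```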

My general advice would include things like these points:

- use only datatype=float[8] (or complex[8]) rtables (or Matrices, Vectors, Arrays)

- try very hard to have the code act inplace on a reusable container

- if transposing often, either do so with the Transpose(...,inplace) command, or use the low-level dgemm entry point to get rid of all such explicit transpositions

- try to make all scalar operations in the code run in evalhf mode, or be Compilable. (you cannot do external calls in either mode, so make the procedures which do the matrix-matrix multiply and linear algebra bits separate routines from any other hopefully-evalhf'able computations)

- never make a computational routine create a Matrix. Instead, if possible have it accept the Matrix (into which results would be put) as an additional argument. You can create a separate parent procedure, which both creates any needed rtables and then invokes a call to the computational procedures. If you always write your routines this way, then beautiful inplace style savings may fall from the sky

- don't focus on using CUDA, for now

- forget about using Maple's Thread/Task facilities for the matrix-matrix multiplication here, since your cores are likely being all used in each individual dgemm call already. If you want to use Maple's Threading for other parts of the whole job, then split that off from the matrix-matrix multiplication (since dlopens of the relevant external libs are blocking -- only one at a time).

- if you want to Thread other parts of the whole job, do so using evalhf'able procedures and calls. For embarrassingly parallelizable jobs I've been able to get an optimal linear speedup with core number. Maple does not put a lock on all the Matrix entries at once -- the mutex seems to give finer grained access. Of course, the fortunate algorithm will involve no mutex lock at all, but we are not always so lucky in our work. That works well. But I have had consistent experiences showing that single (serial use of) Compiled procedures acting inplace and optimally on float[8] rtables perform about eight times as fast as do Task/Threaded parallelized procedures also acting inplace and optimally under evalhf. So the Compiler can often beat the Task model of threading in Maple, up to about 4-8 cores.
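The point above about never creating a Matrix inside a computational routine could be sketched roughly as follows. The names `Kernel` and `Driver` and the arithmetic inside them are made up for illustration; what matters is that the parent allocates once and the worker only writes into the container it is handed.

```maple
# Hypothetical computational routine: writes its result into R
# and creates no rtable of its own.
Kernel := proc(A::Matrix, s::numeric, R::Matrix)
    local i, j, m, n;
    m, n := LinearAlgebra:-Dimensions(A);
    for i to m do
        for j to n do
            R[i, j] := s * A[i, j] + 1.0;
        end do;
    end do;
    NULL;
end proc:

# Hypothetical parent: allocates the containers once,
# then calls the worker repeatedly on the same result Matrix.
Driver := proc(n::posint, reps::posint)
    local A, R, k;
    A := LinearAlgebra:-RandomMatrix(n, n, generator = 0.0 .. 1.0,
                                     outputoptions = [datatype = float[8]]);
    R := Matrix(n, n, datatype = float[8]);
    for k to reps do
        Kernel(A, evalf(k), R);    # R is overwritten in place on each pass
    end do;
    R;
end proc:

Driver(3, 5);
```

With this split, no garbage is produced inside the loop regardless of how many times the worker runs.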

acer

I'll let Win 7 MKL do it, CUDA can wait, yes please example exposing inplace C <= alpha*A*B + C

jimmyinhmb — Mon, 13 Feb 2012 10:13:14 Z

Hi Acer,

That was a great reply, and fast -- thank you very much.

I should have reread the brain-dead part of that draft before I posted. Just as the programming guide says

In Maple, data is always passed by reference, but the immutability of most data types ensures that the procedure cannot modify the caller's copy of the data. The exceptions are Maple's mutable data structures: tables, Arrays, Matrices, Vectors, records, and objects. Modifying these within a procedure will modify the caller's copy. Fortunately, these larger data structures are the ones that you would most often want to pass by reference, since copying such data consumes time and space.

My bad, good of you to correct an RTFM issue so graciously.

I'm working in Windows 7 with an AMD Phenom II X6 11T running at 3.3 GHz, and the MKL seems to be doing a very good job of parsing out the work. The task breakdown I have in mind is coarser than matrix multiply -- the tasks would share the constant matrix but very little else. But if only one Maple thread can use the dgemm routines at a time, it sounds like this might only be useful in a grid computing situation, and of course each grid computing element would have its own version of the matrix. That's down the road.

Your explanation of how the GPU memory objects are handled makes it clear that I shouldn't be thinking about CUDA in Maple for this problem right now. Maybe when GPU-resident objects are readdressable -- actually, maybe when Maple / BLAS do it under the covers. For now I am happy to let Windows MKL keep the hardware busy -- especially if not running Maple threads resolves the mutex problem I was worried about.

Hmm, I guess one could always unbind and redefine a variable that was originally created as read-only, so the readonly=true option in Matrix creation has no effect at the cache level, e.g., letting each processor in a multiprocessor system know that their locally-cached copy of a shared variable will never be out of date. Does it improve Maple's efficiency in other ways, or is it primarily an assert capability that catches an erroneous write for Maple? Inquiring minds want to know, but I will use it anyway.

If the full C <== alpha*A*B + C functionality can be exposed, I think I can make inplace work pretty easily; but without the "+ C" part of the operation it's more contrived. Please count me as being interested in how you exploit that additional dgemm capability, and hoping to study your Win 7 example.

I'll implement as a module-scoped Matrix tomorrow, and save your remaining speedup advice to chew on a bit later. For now, a big THANKS for sharing your deep and helpful insight.

- Jimmy

example

acer — Mon, 13 Feb 2012 12:05:14 Z

@jimmyinhmb Here's an example, for multiplying the transpose of 50x50 Matrix A with 50x50 Matrix B one hundred thousand times.

On a fast i7 running Win 7 Pro (64bit Maple 15.01) it takes about 4 sec using the ideas laid out above, and it takes about 26 sec to do it repeatedly as C := A^%T . B

dgemm_module.mw

The benefit in speed gets less as the Matrix size goes up. But there is also the question of total memory allocation.

MaplePrimes - answers and comments on Question, How do I efficiently share and multiply with a large constant matrix between procs and tasks?
