@Carl Love It's true that it uses the GPU. But it only offers exports for a couple of BLAS operations (e.g. O(n^3) Matrix-Matrix multiplication).

I have mentally blocked it out because the current optimized float[8] BLAS used by the regular LinearAlgebra package (CPU), from the Intel MKL, is so fast on modern CPU chipsets that it actually beats the CUDA dgemm used via Maple for most reasonably sized examples.

The problem is that the CUDA package has to send all the float[8] Matrix data to the GPU up front, and then get it back afterwards. The overhead of that transfer is on the same order as the time it now takes the CPU dgemm to compute, for Matrices of the sizes that one usually encounters in all but the most esoteric uses of Maple.
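To make that trade-off concrete, here is a rough back-of-envelope model in Python. It is only a sketch: the function name `dgemm_times` and every hardware figure in the usage loop (CPU and GPU throughput, bus bandwidth, per-call latency) are assumptions chosen for illustration, not measurements of Maple, the MKL, or any particular card. It counts 2*n^3 flops for an n x n dgemm and three float[8] matrices crossing the bus (two operands up, one result down).

```python
def dgemm_times(n, cpu_flops, gpu_flops, bus_bytes_per_s, call_latency_s=0.0):
    """Estimate wall time (seconds) for an n x n float[8] dgemm on CPU and GPU.

    All rates are caller-supplied assumptions; nothing here is measured.
    Returns (t_cpu, t_gpu, t_transfer), where t_gpu already includes t_transfer.
    """
    flops = 2.0 * n**3            # multiply-adds in a dense n x n product
    bytes_moved = 3 * 8 * n**2    # A and B up, C down, 8 bytes per entry
    t_cpu = flops / cpu_flops
    t_transfer = call_latency_s + bytes_moved / bus_bytes_per_s
    t_gpu = t_transfer + flops / gpu_flops
    return t_cpu, t_gpu, t_transfer


if __name__ == "__main__":
    # Placeholder hardware figures (assumed, not benchmarked).
    CPU_FLOPS = 1e11   # sustained CPU dgemm rate
    GPU_FLOPS = 5e11   # sustained GPU dgemm rate
    BUS = 6e9          # host <-> device bytes per second
    LATENCY = 1e-4     # fixed per-call setup cost in seconds

    for n in (100, 500, 2000, 8000):
        t_cpu, t_gpu, t_tr = dgemm_times(n, CPU_FLOPS, GPU_FLOPS, BUS, LATENCY)
        print(f"n={n:5d}  cpu={t_cpu:.5f}s  gpu={t_gpu:.5f}s  "
              f"transfer share of gpu time={t_tr / t_gpu:.0%}")
```

The transfer term grows like n^2 while the arithmetic grows like n^3, so the transfer share shrinks as n grows. Where the crossover falls depends entirely on the assumed rates, so plug in numbers for your own machine before drawing conclusions.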

Hence the Maple CUDA package only offers a couple of multiplication functions and is no longer generally outperforming the regular LinearAlgebra equivalents on modern CPUs.

There are a couple of CUDA-based general LAPACK projects, but they are still maturing. I believe that, with the current model of having to transfer the data up and down again for each call, operations like SVD and eigen-solving would take enough time for the transfer overhead to be a negligible portion for large but not enormous examples. And so those functions would be most promising for a significant gain. LU decomposition less so. I experimented with this a couple of years ago, but performance results were not encouraging because those functions were not yet fully optimized (and some not yet implemented) on the card.

Another possibility for the future might be something like the Jacket add-on for Matlab. In essence, all the matrix/vector data could get pushed onto the GPU for the duration of an involved multi-operation computation. It can be accessed (transferred back) on demand, but the benefit is that repeated computations with it can be done without repeated up-down transfer. In this scenario only specially treated procedures -- compiled for the GPU using a CUDA compiler -- would compute with such data. So one would need a procedure that 1) did large LinearAlgebra, 2) did enough general computation for it to be worthwhile pushing the data up/down, 3) could be treated by an enhanced Compiler:-Compile to compile for the GPU and call the GPU LAPACK/BLAS. Items 1) and 2) restrict the use cases. Item 3) is problematic because Compiler:-Compile doesn't yet translate Matrix/Vector operations to LAPACK/BLAS calls for regular CPU operations, so getting that for the GPU is an even further goal.
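The difference between the two models can be sketched by simply counting bytes across the bus. The Python below is a toy simulation, not an interface to Maple, Jacket, or CUDA: `DeviceSim`, `per_call_traffic` and `resident_traffic` are hypothetical names invented for this illustration, and the "device" does no arithmetic at all. The scenario is k chained n x n products (repeatedly multiplying a running result by the same Matrix).

```python
BYTES_PER_FLOAT8 = 8


class DeviceSim:
    """Toy stand-in for a GPU that only tracks host <-> device traffic."""

    def __init__(self):
        self.bytes_moved = 0

    def push(self, n):
        # Copy an n x n float[8] Matrix up to the device.
        self.bytes_moved += BYTES_PER_FLOAT8 * n * n
        return ("device-matrix", n)

    def pull(self, handle):
        # Copy a device-resident Matrix back to the host.
        self.bytes_moved += BYTES_PER_FLOAT8 * handle[1] ** 2

    def matmul(self, a, b):
        # Both operands already live on the device: no bus traffic.
        return ("device-matrix", a[1])


def per_call_traffic(n, k):
    """Current model: every product pushes both operands and pulls the result."""
    dev = DeviceSim()
    for _ in range(k):
        a = dev.push(n)
        b = dev.push(n)
        dev.pull(dev.matmul(a, b))
    return dev.bytes_moved


def resident_traffic(n, k):
    """Resident model: push once, run k products on the device, pull once."""
    dev = DeviceSim()
    a = dev.push(n)
    result = a
    for _ in range(k):
        result = dev.matmul(result, a)
    dev.pull(result)
    return dev.bytes_moved


if __name__ == "__main__":
    n, k = 1000, 10
    print("per-call bytes:", per_call_traffic(n, k))
    print("resident bytes:", resident_traffic(n, k))
```

In the per-call model the traffic grows linearly with the number of operations, while in the resident model it stays fixed at one push and one pull. That is why such a scheme only pays off for procedures that do many operations on the same data.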

Having said all that, there is a large chance that the OP is hoping for sweeping general computation for which parallelization would be automatically done under the hood by the Maple engine. That does not happen, even on the CPU. Having it happen on the GPU is more distant still.

The OP has a long history of asking bizarre questions with almost no context given.