I am going to wander away from parallel programming in Maple to talk about GPU programming. However, I will show an example of connecting Maple to a CUDA accelerated external library, so that's close enough. This post is not intended to be a tutorial on CUDA or OpenCL programming, but rather an introduction to how the technology works.
A GPU is "Graphics Process Unit", basically it refers to the chip in recent 3D accelerated video cards, and other chips with similar architecture. So what makes a GPU different than a CPU? The primary difference is that a GPU is designed to perform Data Oriented Parallelism. Data oriented parallelism describes parallel problems that involve executing the exact same code over a large number of data elements. (Maple programmers can think of map). This idea is not all that new. Many modern CPUs include Single Instruction, Multiple Data (SIMD) instructions which apply a single instruction over multiple pieces of data. This style of instruction is also been called vector instructions. GPUs take these concepts even further. CUDA CPUs use Single Instruction, Multiple Threads (SIMT) or what OpenCL calls Single Program, Multiple Data (SPMD). This means that a group of threads are all executing the same instructions in lock step with each other. Further, GPUs have a large number of relatively simple cores designed for this SIMT style processing. For example a Geforce GTX 285 (a $350 dollar video card) has 30 "multi-processors" where each multi-processor contains 8 processors, thus a total of 240.
OpenCL is a standard programming API for accessing GPU-style hardware. It is a relatively new standard, so it has not yet gained the momentum of better-known technologies such as CUDA. However, code written for OpenCL should be device independent: OpenCL code should be able to execute on cards from NVIDIA or ATI, as well as on future hardware such as Intel's Larrabee. This makes it an attractive technology for the future.
Currently most GPUs are available as specialized cards (aka video cards). This creates a separation between the computer (the host) and the card (the device). The device often does not have direct access to the host's memory, and when it does, that access is significantly slower than accessing the device's own memory. This often means an algorithm must copy a block of memory to the device, perform the computation there, then copy the results back to main memory. The expense of this copying limits the types of problems that can be accelerated on a GPU. For example, we have been experimenting with accelerating linear algebra operations on CUDA hardware. For a cheap routine like Matrix-Matrix addition, copying the two Matrices onto the card and copying the result back is more expensive than performing the addition in main memory using optimized C routines. However, for more expensive operations, like Matrix-Matrix multiplication, there are significant speedups to be had.
The actual programming model is very simple. CUDA and OpenCL define a kernel: the function that we want to map over the data. This kernel is mapped over a three dimensional rectangular space, so the kernel is called once per element of the space. When a kernel is called, it has access to global variables that describe its index within the space. This index allows the kernel to determine where it is in the space, and thus what it should be calculating.
So that's the basic programming model, and it really is that simple. There are more details, dealing with memory issues, synchronization tools, what you can and cannot do on the device, and how to best optimize your code; however, you can do a lot without worrying too much about that stuff.
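To make the model concrete before diving into the Mandelbrot example, here is a minimal CUDA sketch of my own (the names are placeholders, not code from the example) that maps a kernel over a one dimensional space to add two vectors:
/* the kernel is called once per element of the index space */
__global__ void addKernel( float *a, float *b, float *out, int n )
{
    /* where does this call live within the space? */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = a[i] + b[i];
}
/* host side: a, b and out must point at device memory (cudaMalloc);
   launch enough blocks of 256 threads to cover n elements */
void addVectors( float *a, float *b, float *out, int n )
{
    addKernel<<< (n + 255)/256, 256 >>>( a, b, out, n );
}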
The following example shows how to use CUDA to compute the Mandelbrot Set, using Maple's external call routines.
First, here is a simple C implementation:
float MandelFunction( float xS, float yS, int32_t iter, float bailout )
{
    int32_t i;
    float x, y, xTemp;

    x = 0;
    y = 0;

    for ( i = 0; i < iter; i++ )
    {
        if ( x*x + y*y > bailout )
        {
            /* the point escaped: return a smoothed iteration count */
            return i + (logf(2*logf(bailout))-logf(logf( sqrtf(x*x + y*y) ) ))/logf(2);
        }

        /* z = z^2 + c, with z = x+y*I and c = xS+yS*I */
        xTemp = x*x - y*y + xS;
        y = 2*x*y + yS;
        x = xTemp;
    }

    /* the point did not escape within iter iterations */
    return -1;
}
#define INDEX( x, y, z, xlen, ylen ) ( (x) + (xlen)*(y) + (xlen)*(ylen)*(z) )

void CMandelLoop( float *color, int32_t xlen, int32_t ylen,
                  float xlow, float xhigh, float ylow, float yhigh,
                  int32_t iter, float bailout )
{
    float x, xStep, y, yStep, c;
    int32_t i, j;

    xStep = (xhigh-xlow)/(xlen-1);
    yStep = (yhigh-ylow)/(ylen-1);

    y = ylow;
    for ( j = 0; j < ylen; j++ )
    {
        x = xlow;
        for ( i = 0; i < xlen; i++ )
        {
            c = MandelFunction( x, y, iter, bailout );

            /* points inside the set are colored black */
            if ( c == -1 ) {
                color[ INDEX( i, j, 0, xlen, ylen ) ] = 0.0;
                color[ INDEX( i, j, 1, xlen, ylen ) ] = 0.0;
                color[ INDEX( i, j, 2, xlen, ylen ) ] = 0.0;
            }
            else {
                color[ INDEX( i, j, 0, xlen, ylen ) ] = c/iter;
                color[ INDEX( i, j, 1, xlen, ylen ) ] = c/iter;
                color[ INDEX( i, j, 2, xlen, ylen ) ] = c/iter;
            }

            x += xStep;
        }
        y += yStep;
    }
}
CMandelLoop iterates over the dimensions of the image and calls MandelFunction for each pixel. MandelFunction's return value is then stored in the output array. This is a pretty straightforward implementation. In the CUDA example, these nested for loops are parallelized by mapping them over the index space.
The following re-implements the above code in CUDA.
__device__ float CudaMandelFunction( float xS, float yS, int32_t iter, float bailout )
{
    int32_t i;
    float x, y, xTemp;

    x = 0;
    y = 0;

    for ( i = 0; i < iter; i++ )
    {
        if ( x*x + y*y > bailout )
        {
            return i + (logf(2*logf(bailout))-logf(logf( sqrtf(x*x + y*y) ) ))/logf(2);
        }

        xTemp = x*x - y*y + xS;
        y = 2*x*y + yS;
        x = xTemp;
    }

    return -1;
}
Now notice that CudaMandelFunction is almost identical to MandelFunction, except that it is declared __device__, which means it is intended to execute on a CUDA device and not on the host CPU.
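As a quick reference, here is a sketch of CUDA's function-type qualifiers (the function names here are just placeholders):
__device__ float devOnly( float x );   /* runs on the device, callable only from device code */
__global__ void kernelFn( float *d );  /* runs on the device, launched from the host */
__host__   float hostOnly( float x );  /* runs on the host; this is the default */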
#define INDEX( x, y, z, xlen, ylen ) ( (x) + (xlen)*(y) + (xlen)*(ylen)*(z) )

__global__ void MandelWarp( float *color, int32_t xlen, int32_t ylen,
                            float xlow, float xhigh, float ylow, float yhigh,
                            int32_t iter, float bailout )
{
    float c;

    /* compute this thread's (i,j) position within the index space */
    int32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    int32_t j = blockIdx.y * blockDim.y + threadIdx.y;

    /* threads whose index falls outside the image do nothing */
    if ( i < xlen && j < ylen )
    {
        c = CudaMandelFunction(
            xlow + ((xhigh-xlow)/(xlen-1))*i,
            ylow + ((yhigh-ylow)/(ylen-1))*j,
            iter, bailout );

        /* points inside the set are colored black */
        if ( c == -1 ) {
            color[ INDEX( i, j, 0, xlen, ylen ) ] = 0.0;
            color[ INDEX( i, j, 1, xlen, ylen ) ] = 0.0;
            color[ INDEX( i, j, 2, xlen, ylen ) ] = 0.0;
        }
        else {
            color[ INDEX( i, j, 0, xlen, ylen ) ] = c/iter;
            color[ INDEX( i, j, 1, xlen, ylen ) ] = c/iter;
            color[ INDEX( i, j, 2, xlen, ylen ) ] = c/iter;
        }
    }
}
MandelWarp is the kernel function. It is declared __global__, which means that it executes on the device but can be called from the host. Notice how MandelWarp uses the blockIdx, blockDim and threadIdx globals to calculate the index corresponding to this particular execution of the kernel. The index determines which pixel is being rendered. An interesting detail is the if statement: it allows the data to be subdivided into evenly sized blocks even when the block size does not evenly divide the size of the data, since threads whose index falls outside the image simply do nothing.
void CudaMandelLoop( float *color, int32_t xlen, int32_t ylen,
                     float xlow, float xhigh, float ylow, float yhigh,
                     int32_t iter, float bailout )
{
    int32_t size;
    dim3 gdim;
    dim3 bdim;
    void *cudaMemory;

    /* allocate the output buffer in device memory */
    size = xlen*ylen*3*sizeof(float);
    cudaMalloc( &cudaMemory, size );

    /* each block is 32x16 threads */
    bdim.x = 32;
    bdim.y = 16;
    bdim.z = 1;

    /* enough blocks to cover the image, rounding up */
    gdim.x = ( xlen + bdim.x - 1 )/bdim.x;
    gdim.y = ( ylen + bdim.y - 1 )/bdim.y;
    gdim.z = 1;

    /* launch the kernel over the grid */
    MandelWarp<<<gdim,bdim>>>( (float*)cudaMemory, xlen, ylen, xlow, xhigh, ylow, yhigh, iter, bailout );

    /* copy the result back to host memory and release the device buffer */
    cudaMemcpy( color, cudaMemory, size, cudaMemcpyDeviceToHost );
    cudaFree( cudaMemory );
}
CudaMandelLoop is the function called on the host. It allocates a block of memory on the device, computes the subdivision of the data range, and then launches the execution on the device. When the execution is complete, it copies the results from device memory into host memory and frees the device memory.
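Note that, for clarity, the example does not check the values returned by the CUDA calls. In real code you would want to; here is a minimal sketch using the runtime API's error functions (the CHECK macro name is my own):
#include <stdio.h>
/* report a failing CUDA runtime call; cudaGetErrorString is part of the runtime API */
#define CHECK( call )                                      \
    do {                                                   \
        cudaError_t err = (call);                          \
        if ( err != cudaSuccess ) {                        \
            fprintf( stderr, "CUDA error: %s\n",           \
                cudaGetErrorString( err ) );               \
        }                                                  \
    } while ( 0 )
/* usage: CHECK( cudaMalloc( &cudaMemory, size ) ); and, after a
   kernel launch, CHECK( cudaGetLastError() ); */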
With CUDA we create a two-level subdivision: small blocks (bdim) and a grid of blocks (gdim). The reason for this is that CUDA hardware supports only a limited block size, so a higher level of division is also required. For example, with xlen = 10000 and bdim.x = 32, gdim.x = (10000 + 32 - 1)/32 = 313 blocks, covering 313*32 = 10016 threads in the x direction; the 16 extra threads fail MandelWarp's bounds check and do nothing. The MandelWarp<<<gdim,bdim>>> line executes the kernel MandelWarp over the space defined by gdim and bdim.
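The actual limits depend on your hardware; the runtime API can report them. Here is a small sketch of my own (not part of the example code) that queries them:
#include <stdio.h>
/* print the device's block and grid limits */
void printDeviceLimits( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );    /* device 0 */
    printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "max block dims: %d x %d x %d\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2] );
    printf( "max grid dims: %d x %d x %d\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
}
With bdim set as above, each block contains 32*16 = 512 threads, which happens to be the per-block limit on the GTX 285's generation of hardware.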
Clearly the CUDA code includes some non-C elements, so the CUDA toolkit includes a program called nvcc, which is a compiler for CUDA code.
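For example, on Linux the CUDA source can be compiled into a shared library with something like the following (the exact flags will depend on your setup, and extmandel.cu is a placeholder file name):
nvcc -shared -Xcompiler -fPIC -o libextmandel.so extmandel.cu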
I will provide complete code for this example, including the Maple external call wrapper; in the meantime, here are the timing results comparing the C implementation to the CUDA implementation.
> mandel := define_external( MapleCMandel, MAPLE, LIB="libextmandel.so" ):
> n := 10000:
> m := 10000:
> values := Array( 1..n, 1..m, 1..3, datatype=float[4] ):
> time[real]( mandel( values, -2.0, 0.7, -1.35, 1.35, 100, 4.0 ) );
28.474
>
> restart;
>
> mandel := define_external( MapleCudaMandel, MAPLE, LIB="libextmandel.so" ):
> n := 10000:
> m := 10000:
> values := Array( 1..n, 1..m, 1..3, datatype=float[4] ):
> time[real]( mandel( values, -2.0, 0.7, -1.35, 1.35, 100, 10.0 ) );
0.681
Thus for a very large image (10000x10000), the CUDA implementation is about 40 times faster than the C implementation.
This example code was written and tested on x86-64 Linux; however, it should also work on 32-bit Linux. If there is enough demand, I will port this example to Windows, building with MSVC.
Download Example Source Code