Maple

In my previous posts I have discussed various difficulties encountered when writing parallel algorithms. At the end of the last post I concluded that the best way to solve some of these problems is to introduce a higher-level programming model. This blog post will discuss the Task Programming Model, the high-level parallel programming model introduced in Maple 13.

Definitions

• Task: a unit of executable code.

Introduction

Consider the following piece of Maple code:

```f := proc()
fc( f1(args1), f2(args2), ... fn(argsn) );
end
```

To evaluate f, we first evaluate the fi's and compute their return values. These return values are then passed into fc as arguments. When fc completes, its return value becomes the return value of f. The Task Programming Model takes this pattern, but creates tasks for the fi's and fc. So what is a task? A task is a piece of executable code. In Maple, a task is a procedure combined with a set of arguments to that procedure. Once a task is created, the Maple kernel can execute it. By allowing the kernel to schedule tasks, Maple can automatically distribute them to the available cores.

In the example above, a function call, fi, can be replaced by a task, ti, in a fairly straightforward way: the procedure is fi and the arguments are argsi. However the task, tc, corresponding to the function call fc is a little more complex. The function call fc will not be executed until all the fi's have completed, thus supplying their return values to fc as arguments. Similarly the task tc must wait for values from the ti's before it can execute. tc's procedure is fc, and its arguments are the values passed from the other tasks. These other tasks are called tc's child tasks. Similarly tc is called the parent of the ti tasks. The task tc is called the continuation task.
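For readers more familiar with conventional threading libraries, the parent/child/continuation relationship can be sketched in Python. This is only an illustrative analogue, not Maple: the function names are made up for the example, and `concurrent.futures` plays the role that the Maple kernel's scheduler plays in the Task Model.

```python
from concurrent.futures import ThreadPoolExecutor

def f1(x):
    return x + 1       # stands in for f1(args1)

def f2(x):
    return x * 2       # stands in for f2(args2)

def fc(a, b):
    return a + b       # the continuation: runs once all children have values

with ThreadPoolExecutor() as pool:
    # Child tasks: each fi paired with its arguments.
    t1 = pool.submit(f1, 10)
    t2 = pool.submit(f2, 10)
    # The continuation fc consumes the values produced by its child tasks.
    result = fc(t1.result(), t2.result())

print(result)  # 11 + 20 = 31
```

The important structural point is the same as in the Maple model: the children can run concurrently, and the continuation cannot run until every child has produced its value.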

Here is a Task-based implementation of the simple example we have been working with:

```(**) doSum := proc( A, first, last )
(**)     local i, l;
(**)
(**)     l := 0;
(**)     for i from first to last
(**)     do
(**)         l := l+A[i];
(**)     end;
(**)
(**)     l;
(**) end:
(**)
(**) taskCont := proc( l, r )
(**)     l+r;
(**) end proc:
(**)
(**) f := proc( A )
(**)     local n, n2;
(**)     n := rtable_num_elems( A );
(**)     n2 := floor( n/2 );
(**)
(**)     Threads:-Task:-Start( taskCont, Task=[doSum, A, 1, n2],
(**)                                         Task=[doSum, A, n2+1, n ] );
(**) end:
(**)
(**) A := LinearAlgebra:-RandomVector( 2000 ):
(**) print( f( A ) );
1932

(**) print( add( A[i], i=1..2000 ) );
1932
```

Now, there are a few flaws in this implementation; for example, we only create two tasks, each of which performs a significant amount of work. This limits our parallelism. I promised that using the Task Programming Model would help with scaling, so we need to modify this example slightly to achieve that result. We'll use the ability of a task to create more tasks to improve this.

```
(**) taskCont := proc( l, r )
(**)     l + r;
(**) end:
(**)
(**) task := proc( A, low, high )
(**)     local i, n, mid;
(**)     n := high-low;
(**)
(**)     if ( n > 10000 ) then
(**)         mid := floor(n/2) + low;
(**)         Threads:-Task:-Continue( taskCont, Task=[task, A, low, mid],
(**)                                             Task=[task, A, mid+1, high] );
(**)     else
(**)         n := 0;
(**)         for i from low to high
(**)         do
(**)             n := n + A[i];
(**)         end do;
(**)
(**)         n;
(**)     end;
(**) end:
(**)
(**) n := 10^6:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):
(**)
(**) print( Threads:-Task:-Start( task, A, 1, n ) );
4001

(**) print( add( A[i], i=1..n ) );
4001
```

This example divides the input range until a base case is reached, creating a new task for each subdivision. This creates a large number of relatively small tasks, which the kernel is able to spread over the available cores. As long as there are more tasks ready to execute than there are cores, the kernel will be able to keep all the cores busy. This allows algorithms written using the Task Programming Model to scale without requiring the programmer to worry about the number of available cores. The ability to scale can be illustrated by using the kernelopts( numcpus ) feature (note: if you have executed some code in the Task Programming Model, you must restart the kernel before setting kernelopts( numcpus )).
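The divide-until-base-case shape is not specific to Maple, and can be sketched in Python as a rough analogue. Assumptions to flag: the function name `task_sum` and the cutoff constant mirror the Maple `task` procedure above but are otherwise invented for this sketch, the scheduling is done by a `ThreadPoolExecutor` rather than the Maple kernel, and CPython's global interpreter lock means this particular sketch illustrates the structure, not an actual CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor

CUTOFF = 10_000  # base-case size, mirroring the "n > 10000" test in the Maple code

def task_sum(pool, A, low, high):
    # Sum A[low:high], splitting the range into a child task plus an
    # inline half until the range drops below CUTOFF.
    n = high - low
    if n > CUTOFF:
        mid = low + n // 2
        left = pool.submit(task_sum, pool, A, low, mid)  # child task for one half
        right = task_sum(pool, A, mid, high)             # this thread keeps working
        return left.result() + right                     # combine, like taskCont
    return sum(A[low:high])                              # base case: sum directly

A = list(range(50_000))
with ThreadPoolExecutor(max_workers=8) as pool:
    total = task_sum(pool, A, 0, len(A))

print(total == sum(A))  # True
```

As in the Maple version, the recursion generates many small tasks, so a scheduler always has spare work to hand to an idle core regardless of how many cores exist.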

```(**) n := 10^8:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):
(**) kernelopts( numcpus );
1

56.714
```

... some code has been removed for brevity's sake, see the attached worksheet for the complete code ...

```(**) kernelopts( numcpus );
2

34.182
```

```(**) kernelopts( numcpus );
3

21.026
```

```(**) kernelopts( numcpus );
4

19.685
```

As you can see, by simply increasing the number of cores available to Maple, the code gets faster. Thus even this relatively simple example is able to scale. Unfortunately the scaling from 3 to 4 cores (and beyond) is not very good. This is a limitation of the kernel. There is still more work to be done in the kernel to increase its parallelism and future releases of Maple should get better and better. Luckily, code written in the Task Model today should automatically take advantage of increased parallelism in the future.

It is always good to compare the multi-threaded version of the code to the fastest single-threaded version. In Maple, the best way to sum over an array is to use add.
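This comparison methodology is easy to automate in any language; here is a minimal Python sketch of it (the helper names `timed` and `chunked_sum` are invented for the example, and `chunked_sum` is only a stand-in for a parallel decomposition, so no speedup claim is made):

```python
import time

def timed(fn, *args):
    # Return (elapsed_seconds, result) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result

A = list(range(100_000))

# Best serial baseline: the language's built-in sum.
t_serial, serial_sum = timed(sum, A)

def chunked_sum(A, chunks=4):
    # Stand-in for a decomposed (parallelizable) sum: same answer,
    # computed chunk by chunk.
    step = (len(A) + chunks - 1) // chunks
    return sum(sum(A[i:i + step]) for i in range(0, len(A), step))

t_par, par_sum = timed(chunked_sum, A)

# The answers must agree before the timings mean anything.
assert serial_sum == par_sum
```

The key discipline, which the Maple worksheet below follows, is to run both versions on the same data and confirm they agree before comparing times.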

```(**) n := 10^8:
(**) A := LinearAlgebra:-RandomVector( n, outputoptions=[datatype=integer[8]] ):