dohashi

1172 Reputation

10 Badges

14 years, 348 days
I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel, but recently I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teach parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure Design and Analysis.

MaplePrimes Activity


These are replies submitted by dohashi

@Jarekkk When timing parallel code, you need to use time[real]() instead of time().  time() measures CPU time, which sums the time spent on each CPU when multiple CPUs are running.

Also, when running this example, even 10^5 is relatively short.  However, using time[real] and 10^7, I see a decent speed up.  Also, notice that this example only creates two tasks, so the speed up won't be larger than 2 times.  To utilize more than 2 CPUs you will need to sub-divide the for loops themselves.

> restart;
> p1 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i-1
>       end;
>  
>       op(result);
>  end:
>  
>  p2 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i;
>       end;
>  
>       op(result);
>  end:
>  
>  st:=time[real]():
>  Threads:-Task:-Start( passed, Task=[p1,10^7], Task=[p2,10^7] ):
memory used=11.4MB, alloc=37.1MB, time=0.18
memory used=25.3MB, alloc=53.1MB, time=0.41
memory used=52.7MB, alloc=85.1MB, time=0.91
memory used=88.7MB, alloc=125.1MB, time=1.40
memory used=189.3MB, alloc=253.9MB, time=3.31
memory used=331.4MB, alloc=381.9MB, time=5.09
memory used=470.8MB, alloc=381.9MB, time=6.81
memory used=610.5MB, alloc=512.0MB, time=8.61
memory used=757.0MB, alloc=768.0MB, time=12.52
memory used=1172.7MB, alloc=900.0MB, time=16.42
>  time[real]()-st;
                                         10.863

>
> restart;
> st := time[real]():
> for i to 10^7 do results[i] := 2*i-1 end do:
memory used=1307.9MB, alloc=18.0MB, time=19.21
memory used=1321.7MB, alloc=42.0MB, time=19.36
memory used=1344.0MB, alloc=66.0MB, time=19.65
memory used=1388.1MB, alloc=122.0MB, time=20.25
memory used=1447.3MB, alloc=253.7MB, time=20.81
memory used=1536.9MB, alloc=253.7MB, time=22.10
memory used=1627.1MB, alloc=382.7MB, time=23.39
memory used=1685.0MB, alloc=383.0MB, time=24.65
memory used=1807.8MB, alloc=511.0MB, time=25.96
> for i from 10^7+1 to 2*10^7 do results[i] := 2*i end do:
memory used=1983.9MB, alloc=513.0MB, time=28.66
memory used=2092.6MB, alloc=513.0MB, time=31.21
memory used=2216.3MB, alloc=641.0MB, time=32.89
memory used=2420.8MB, alloc=773.0MB, time=34.17
memory used=2573.0MB, alloc=773.0MB, time=36.93
> op(results):
> time[real]()-st;
                                         18.309

Darin

@samiyare 

In that case you'll have to do something like this:

> p1 := proc(i, r)
>     r + i;
> end:
>  
> p2 := proc(i, r)
>     r + i;
> end:
>
> result1[1] := 1:
> result2[1] := 2:
> for i from 2 to 10
> do
>     result1[i], result2[i] :=
>         Threads:-Task:-Start( passed, Task=[p1,i,result1[i-1]],
>                                         Task=[p2,i,result2[i-1]] ):
> end:
>
> print(result1);
table([1 = 1, 2 = 3, 3 = 6, 4 = 10, 5 = 15, 6 = 21, 7 = 28, 9 = 45, 8 = 36,

    10 = 55

    ])

> print(result2);
table([1 = 2, 2 = 4, 3 = 7, 4 = 11, 5 = 16, 6 = 22, 7 = 29, 9 = 46, 8 = 37,

    10 = 56

    ])

However, this is only going to be useful if the calculations in p1 and p2 are relatively expensive.  Starting and stopping tasks for each iteration of the loop is going to add some overhead.  


@marvin There should be a help page for Threads:-Task:-Return.

?Threads:-Task:-Return

brings up the help page in Maple 14 (for me).

Darin

P.S.  I seem to have stopped getting notifications about comments on this post, so sorry for the late reply.

@erik10 In Maple, evalf is much more than a rounding function.  evalf[n](x) evaluates x numerically using n digits of precision.  This is particularly useful when you increase n above the default of 10: you can numerically evaluate an expression using 20, 50 or 1000 digits of precision.  In some sense, there is no "full precision" when using evalf, especially if your calculations include constants like Pi.  Try calling evalf[n](Pi) for various values of n; I think you'll begin to understand why evalf works the way it does.

You can evaluate your expression using evalf to compute the result, then call evalf[2] on the result to reduce it to 2 digits of precision.
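For example, in a standard Maple session:

> evalf[5](Pi);
                                    3.1416

> evalf[20](Pi);
                             3.1415926535897932385

> evalf[2](evalf(exp(1)));
                                     2.7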

Darin

@fwchapman That setup sounds fine.  The most important aspect is probably the processor and RAM.  The Core i7 performs better when executing parallel code than older processors.   Of course, more RAM is almost always better.  There are no limitations that are particular to your setup, but parallel programming in Maple does have some limitations, which I outlined in a previous blog post.

http://www.mapleprimes.com/maplesoftblog/36467-The-Current-Limitations-Of-Parallel

Can you tell I write more C code than Maple?  Thanks for pointing out the mistakes in my code.  I've updated the example.

Darin

-- Kernel Developer Maplesoft

The idea of racing algorithms is often an attractive one, but it has limitations.  In particular, there is usually a fixed number of algorithms that can be tried, which inherently limits the scalability of this type of solution.  Eventually you'll have more cores than algorithms, and then you want the algorithms themselves to be parallel.  As well, you often have some idea of which algorithms are more likely to find a solution, so it would be better for the algorithms themselves to be parallel and to run them in the most promising order than to run all of the algorithms in parallel.
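That said, racing can be expressed in the Task model: if each competing algorithm calls Threads:-Task:-Return when it finishes, the first value returned aborts the remaining tasks.  A rough sketch (the two competing procedures here are placeholders I made up, both computing the same sum):

> # slow competitor: sums the terms one by one
> slow := proc(n)
>     local s, i;
>     s := 0;
>     for i to n do s := s + i end do;
>     Threads:-Task:-Return( s );
> end:
> # fast competitor: the closed form will almost certainly finish
> # first, and its Return aborts the other task
> fast := proc(n)
>     Threads:-Task:-Return( n*(n+1)/2 );
> end:
> Threads:-Task:-Start( null, Task=[slow,10^7], Task=[fast,10^7] );
                                50000005000000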

Darin

-- Kernel Developer Maplesoft

 

The reason we do this is that OS X sets a really low datalimit by default (that's where the 6144 number comes from).  This limit is too low for Maple to run successfully, so all of this is a workaround for that data limit.  It looks like the script is not smart enough to handle the case where users have set a different data limit themselves.

Darin

-- Kernel Developer Maplesoft

http://www.elsevierdirect.com/companion.jsp?ISBN=9780123705914

It is listed on the page opposite the Preface.

Darin

-- Kernel Developer Maplesoft

Yep, that is basically the problem, and the solution that I came up with.  I think this example shows how easy it is to overlook the fact that variables are being shared between threads.  The solution, wrapping the Add'ed expression in a procedure, is also a nice way of fixing the problem.
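As a rough illustration of the wrapping idiom (the expression here is made up, not the one from the original example):

> # giving Add a call to a procedure, rather than a bare expression
> # that assigns to variables of the enclosing scope, means each
> # evaluation gets its own copy of the local t
> f := proc(i)
>     local t;
>     t := i^2;
>     t + 1;
> end:
> Threads:-Add( f(i), i = 1..1000 );
                                   333834500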

Having the kernel automatically determine which variables are shared and create thread local copies of those variables is possible.  It is the step beyond simply determining whether the given expression is thread safe: actually modifying the expression to make it thread safe.  Parallelizing compilers already do things like this.

Allowing the user to explicitly list the shared variables so the kernel can replace them with thread local instances is quite possible.  However, I'd like a consistent way of specifying additional arguments (there are other parameters that the parallelized versions could accept) that would work for all of Add, Map, Mul and Seq while keeping them compatible with add, map, mul and seq.  Compatibility with add, map, mul and seq is important because eventually these functions will be using the parallel implementations, and I'd like users to be able to specify these types of arguments to them as well.  map is the real problem because it passes any additional arguments to the mapped function.

Darin

-- Kernel Developer Maplesoft

If someone does not find a correct solution for my example, I'll post one on Friday.

Darin

-- Kernel Developer Maplesoft

The first argument to add or Add (and mul, seq, etc.) is evaluated once for each value specified by the second argument.  The parallel versions of these functions divide the values between multiple threads, and the threads perform the evaluations in parallel.  In this case, the expression given to Add needs a value for k to perform the evaluation.  So there is definitely a speed up to be found using Threads:-Add in this case.
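Schematically (my own example, not the expression from the original question):

> # the range 1..10^6 is divided among the threads, and k^2 is
> # evaluated once per value of k, in parallel
> Threads:-Add( k^2, k = 1..10^6 );
                              333333833333500000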

Darin

-- Kernel Developer Maplesoft

I'm glad that you were able to make the code faster.  It seems like the recursive task creation has a larger overhead than I was expecting.  That is definitely something for me to investigate.

As for the crash, unfortunately that is undoubtedly a bug.  I will investigate that one too.

I notice that you are using

Xpre:=Vector(Xcdim,datatype=float[8]);
seq(assign('Xpre'[k], (X[i,k] + X[j,k])/2.0), k=1..Xcdim ) ;

this is probably faster:

Xpre := Vector(Xcdim, k -> (X[i,k] + X[j,k])/2.0, datatype=float[8]);

Darin

-- Kernel Developer Maplesoft

I think the big issue with this example is that the base case of 1000 is too small given how fast it is to compute the add.  Using a larger base case improves the performance.  That is effectively what you are doing by specifying the number of tasks; however, adjusting the base case size allows the code to remain independent of the number of processors.
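For reference, a typical recursive splitting routine with a tunable base case might look like this (a sketch of the general pattern, not the code from this thread):

> psum := proc( f, lo, hi, base )
>     local m;
>     if hi - lo < base then
>         # base case: small enough to sum serially
>         add( f(i), i = lo..hi );
>     else
>         m := iquo( lo + hi, 2 );
>         # split into two child tasks; `+` combines their results
>         Threads:-Task:-Continue( `+`,
>             Task = [ psum, f, lo, m, base ],
>             Task = [ psum, f, m+1, hi, base ] );
>     end if;
> end:
> # larger base values trade parallelism for lower task overhead
> Threads:-Task:-Start( passed, Task = [ psum, i->i, 1, 10^6, 10^4 ] );
                                 500000500000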

Darin

-- Kernel Developer Maplesoft
