dohashi

1172 Reputation

10 Badges

14 years, 348 days
I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel; however, recently I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teach parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure Design and Analysis.

MaplePrimes Activity


These are replies submitted by dohashi

I suspect that the new version could still be faster once compiled, although I have not tested that.

No, I don't think that breaking your code into smaller chunks would help speed up the example.  Reducing the total amount of memory used could help.  The current garbage collector can misbehave in parallel, which leads to Maple allocating more memory than is probably necessary.  This can slow Maple down.  Notice that the single threaded code uses about 100 MB, and that amount remains quite stable.  The parallel version starts at 100 MB and grows to about 800 MB by the time it is done.

Now, the Task Programming Model works best when there are a large number of small(ish) tasks, but for your example, running on machines with 2 or 4 cores, it probably does not make a big difference.
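For reference, here is a minimal sketch of what "a large number of small(ish) tasks" looks like in the Task Programming Model.  The name parallelSum and the cutoff of 1000 are made up for illustration; the idea is to recursively split the range until each piece is small enough to compute directly:

```maple
# Sum f(i) for i = lo..hi by recursively splitting the range into tasks.
parallelSum := proc( f, lo, hi )
    local mid;
    if hi - lo < 1000 then
        # base case: the range is small, compute it directly
        add( f(i), i = lo..hi );
    else
        mid := floor( (lo + hi)/2 );
        # spawn two child tasks for the two halves; their results
        # are combined by the continuation function `+`
        Threads:-Task:-Continue( `+`,
            Task = [ parallelSum, f, lo, mid ],
            Task = [ parallelSum, f, mid + 1, hi ] );
    end if;
end proc:

# start the root task
Threads:-Task:-Start( parallelSum, x -> x^2, 1, 10^6 );
```

With a small cutoff this creates many more tasks than there are cores, which is what lets the scheduler keep all the cores busy.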

Darin

-- Kernel Developer Maplesoft

I took a look at this code the last time you submitted it.  The big problem is memory usage and garbage collection.  If you can reduce the memory used by the code, then it will probably parallelize better.

That said, there is still some room for general improvements.  I cleaned up your MakeV0A function and was able to speed it up a bit, both single and multi-threaded.

MakeV0A := proc( i, j, Nij, X, sigma, Xcdim, Xrfull )
    local V0, den, k, l, pot, psi, tempdiffs, Xpre;

    den := 2*sigma^2;

    # midpoint of rows i and j of X
    Xpre := [ seq( (X[i,k] + X[j,k])/2.0, k = 1..Xcdim ) ];
    # scaled squared distance from the midpoint to each row of X
    tempdiffs := [ seq( add( (Xpre[k] - X[l,k])^2, k = 1..Xcdim )/den, l = 1..Xrfull ) ];
    pot := add( k*exp(-k), k in tempdiffs );
    psi := add( exp(-k), k in tempdiffs );

    # avoid a 0/0 when both sums underflow to zero
    if pot = 0.0 and psi = 0.0 then
        V0 := 0.0;
    else
        V0 := Nij*pot/psi;
    end if;

    return V0;
end proc:

Also, I think there might be a bug in your original code.  When you call MakeV0 in the single threaded case you pass 1 for istart and jstart, which means that when you call MakeV0A for (i,j) you pass (i+1, j+1, N[i+1,j+1] ... ).  Is that what you wanted?

Darin

-- Kernel Developer Maplesoft

Please see my latest blog post for comments and answers to some of your questions.

Darin

I think if you take a look at the Mandelbrot example code, you will see it is very similar to how you would implement your #2.  The Mandelbrot code accepts a Maple Matrix and fills it in.  If your matrix is triangular, I would suggest simply modifying the if statement that checks that the indices are in bounds.

There will be some differences for Windows, however it should not be too hard for you to figure those out.

As for double precision, only the most recent CUDA hardware (compute level 1.3 and higher) supports double precision, and it is slower than single precision (although still faster than doing it on the host).

Darin

-- Kernel Developer Maplesoft

 

One thing I may need to point out is the difference between locking and blocking.  You can lock a structure without causing blocking.  Locking only causes blocking when two or more threads attempt to acquire the same lock.  If these sub-matrices are not shared between multiple threads, then your code will still lock when you access the table, but this won't cause blocking.  Now, there is some performance hit from locking in this case, because it is not strictly necessary, but currently the kernel needs to do this because it does not know whether the rtable is shared with another thread.
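To make the distinction concrete, here is a minimal sketch using explicit mutexes (the procedure work and the structure it guards are made up for illustration).  Both threads lock, but since they never ask for the same mutex, neither ever blocks:

```maple
# two independent mutexes, one per thread
m1 := Threads:-Mutex:-Create():
m2 := Threads:-Mutex:-Create():

work := proc( m )
    Threads:-Mutex:-Lock( m );    # locking: this always happens
    # ... update the structure guarded by m ...
    Threads:-Mutex:-Unlock( m );
end proc:

# each thread uses its own mutex, so the Lock calls never contend;
# blocking would only occur if both threads were passed the same mutex
Threads:-Wait( Threads:-Create( work( m1 ) ),
               Threads:-Create( work( m2 ) ) );
```

The cost you pay in the uncontended case is just the overhead of the Lock/Unlock calls themselves, which is what happens when the kernel conservatively locks an rtable that is not actually shared.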

Darin

-- Kernel Developer Maplesoft

 

I am planning on doing a blog post on GPU computations in general.  I will definitely post a complete example then.

Do you think you are having trouble with the CUDA side or the external call side?

 

Darin

-- Kernel Developer Maplesoft

Unfortunately I can't really guess when any particular feature will be done.  One of the big problems is that the Math Library programmers have way more code to deal with than we have in the kernel.  Even putting together the plan for how to start parallelizing the library is going to take some time.  I'll talk more about this when I do my "limitations" post.

As for Grid, it depends on how Grid works; unfortunately I'm not that familiar with it.  If there is only one kernel running on each node computer, and these nodes have multiple cores, then parallel programming can be useful.  However, if each node is running one kernel per core, then parallel programming on the nodes is probably not a good idea.

I have spent some time investigating and experimenting with CUDA and OpenCL.  They are very fast for a limited set of problems.  In particular, single precision, data oriented parallel programming is where they really excel.  By "data oriented" I mean you want to do the same (or a very similar) thing to a large number of data points.  Numeric linear algebra is a typical example.  The latest generation of cards does support double precision, but it is slower.  Currently we don't have any built-in way of accessing GPUs from Maple, but you can connect to either of these APIs via external call.  I have written a test app that generates a Mandelbrot set using CUDA via Maple external call.
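To give a feel for the external call side, here is a sketch.  The library name libgpuscale.so, the function gpu_scale, and its signature are assumptions for illustration, not an actual Maplesoft library; the idea is that the C function copies the array to the GPU, runs a kernel over it, and copies the result back:

```maple
# Suppose libgpuscale.so exports a C function
#     void gpu_scale( double *v, int n, double s );
# that scales an n-element array by s on the GPU.
gpu_scale := define_external( 'gpu_scale',
    'v'::ARRAY( datatype = float[8] ),   # hardware double array, passed by reference
    'n'::integer[4],
    's'::float[8],
    'LIB' = "./libgpuscale.so" ):

V := Vector( 4, [ 1.0, 2.0, 3.0, 4.0 ], datatype = float[8] ):
gpu_scale( V, 4, 2.5 );    # V is modified in place
```

The key points are using a hardware datatype (float[8]) so the data can be passed to C without copying or conversion, and keeping the per-call work large enough that the transfer to and from the card does not dominate.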

Darin

-- Kernel Developer Maplesoft

I have already been tasked with writing a parallel programming chapter for the Advanced Programming Guide.  These blog posts are definitely going to be helpful in that regard. 

This is another reason I'd like to encourage feedback.  Anything that I could improve with these posts will help improve the chapter.

As for the corporate site, we discussed that briefly before I started blogging.  I think that the corporate blog people wanted me to make the posts a bit more "corporate", which would cause me to spend more time writing and less time doing my actual job.

Darin

-- Kernel Developer Maplesoft

Thanks, these are good ideas.

I will post a blog about the current limitations of parallelism in Maple. 

Thread safety is a tricky thing to describe, so how data can be shared in Maple is probably worth a post of its own as well.

Darin

-- Kernel Developer Maplesoft

"GPU is an interesting topic. Is there a way to use it from Maple?"

Well, like almost anything you can connect Maple to CUDA or OpenCL via external call.  However there is currently no built in support for accessing GPU hardware from Maple.  It is something we are investigating.

"As far as I understand the current situation, with code in Maple language being 500-1000 slower than, say, in C, it doesn't have much sense to use parallel programming for Maple code other than for some worksheet effects"

I would disagree with this assessment.  My main argument is described in the Why Go Parallel blog post.  If Maple does not go parallel, it won't show significant speed-ups on new hardware.  Now, I am not claiming we have achieved this goal, but we have started taking steps toward it.  In addition, I think you may also be making a false assumption: that we could simply make Maple as a whole 500 to 1000 times faster.  Even doubling Maple's performance, in general, would take a significant amount of work.  However, these kinds of speed-ups are available from going parallel.  Getting anywhere close to C-type performance requires compiling Maple code.  Having something like a JIT would be great, but it would also be a huge amount of work.

Darin

-- Kernel Developer Maplesoft

Unfortunately, the debugger does not work very well with threads or Tasks.  Fixing this is relatively high on our priority list.

Currently the debugger does not support any explicit threading commands (listing threads, changing threads, etc).  However the debugger will work in any thread.  If one thread is stopped in the debugger other threads continue to run, unless they hit breakpoints as well.  When this happens, there can be multiple debugger sessions attached to multiple threads.  Debugging like this can be confusing as it can be hard to tell which DBG> prompt corresponds to which thread.

Similar rules apply to the Task Programming Model, with the added caveat that the call stack does not work properly.  We'd like the call stack for a Task to show its parent tasks; however, that does not currently work.

Darin

-- Kernel Developer Maplesoft

 

I don't think that particular issue will be resolved in Maple 14.  It will require a fairly significant re-write of the communication code.  There are a few places in the kernel that still don't work optimally with multiple threads, and we are focusing on fixing the ones that are performance bottlenecks first.

 

Darin

-- Kernel Developer Maplesoft

 

I think the Question and Answer system with voting for replies is ok, if each reply still has nested comments that move with the original reply.  One potential solution would also be to allow for different types of forums: maybe a set of Q & A forums with these voted replies, and a set of discussion forums where the comments work as they do now.

As for the collaborative editing, it seems like instead of allowing people to edit other people's posts, we should just add a wiki to MaplePrimes (an idea that is not new).  When good information is developed in the forum, it can be made into an article in the wiki.

Darin

-- Kernel Developer Maplesoft

 

I wish MaplePrimes had the ability to save drafts of blog posts so they could be worked on over time.  I'm currently writing most of my posts off site, and just copying and pasting them when I'm ready to post.

Darin

-- Kernel Developer Maplesoft

 

Except for a bit of lag when first starting, this seems to work:

Download 122_timer.mw

Darin

-- Kernel Developer Maplesoft
