dohashi

I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel; recently I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teach parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure Design and Analysis.

MaplePrimes Activity


These are replies submitted by dohashi

@Carl Love

The paragraph is a little confusing.  The phrase "called outside the grid, on a non-compute node, not as part of a parallel computation" is distinguishing two cases: either the kernel calling Map is outside of the Grid, or the kernel calling Map is part of the Grid.  If the kernel is outside of the grid, then Map calls Launch to create nodes to compute the Map command, and the result is returned to the calling kernel.  This is the case most users want, and it is the case shown in the first few examples.  The second case is intended to be used with the Grid Toolbox (although it does work without it), where kernels running on multiple machines are managed by a system external to Maple.  That case is really just for sophisticated users who are writing complex Grid code.  The last example shows this case, although it may not do a very good job of illustrating the difference.
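For the common first case, the call is as simple as this (a minimal sketch; the procedure and list are made up for illustration):

f := x -> x^2:

# called from the top-level kernel, outside the Grid: Map implicitly
# launches compute nodes, distributes the list, and returns the result
Grid:-Map( f, [ 1, 2, 3, 4, 5 ] );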

Darin

@itsme The anonymous procedure issue definitely looks like a bug.  The other issue could be a problem with value sharing.  Since the Grid nodes are separate kernels, any values needed for the computation on a particular node need to be shared from the main kernel.  Map is a simplified interface that tries to automatically detect and share required values, but it may not be robust enough for your computation.

Take a look at the 4th paragraph on the Grid:-Map help page.

If it is important enough, I'd suggest taking a look at the full Grid API.  However, you are then responsible for making sure that all the required data is shared, as in the sketch below.
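Something along these lines (a rough sketch, assuming the imports option described on the Grid:-Launch help page; the variable a and the per-node computation are made up):

a := 10:

# share the global a with every node explicitly via imports
# (see the Grid:-Launch help page for the exact calling sequence)
Grid:-Launch( proc()
        # a was shared explicitly, so each node can read it
        Grid:-MyNode() + a;
    end proc,
    imports = [ "a" = a ] );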

Darin

@itsme Can you try executing the following bit of code and report back what is printed?  The f here just busy-waits for 5 seconds and then returns its argument.  You should see a time reported of around 5 seconds while i <= NumNodes, and then around 10 seconds for NumNodes < i <= 2*NumNodes.

f := proc( x )
    local t;
    t := time[real]();

    # busy wait for 5 seconds to burn cpu
    while ( time[real]() - t < 5 ) do end do;

    x;
end proc;

n := Grid:-NumNodes();

for i from 1 to 2*n do
    t := time[real]():
    L := [ $1..i ]:
    Grid:-Map( f, L );
    print( i, time[real]() - t );
end do:

Another question: are the elements of the list very large?  If you are willing to post your actual example, that would be the quickest way for me to help.

Darin

I have a few questions before I know what to suggest:

1.  How many elements in your list?

2.  Does the function you are mapping take approximately the same amount of time per element in the list?

Darin

-- Kernel Developer Maplesoft

@Markiyan Hirnyk From a quick reading of that thread, it appears as though he needs to use a procedure which has not been verified as thread-safe, and so it may not work properly when run in parallel.  The decision to use Grid is probably the best solution for now.

@Markiyan Hirnyk There is not really anything new.  This page is really just meant to be a single location that lists the articles.  In particular, I often tell people to go look at these blog posts I made, but until now I didn't have a nice link to send people to.

That said, I am thinking about working on some more content for the near future.

I've updated my code for solving this example; you can see the new code here.  This change fixes a few memory issues, which means the new code can actually run large examples.  Here is a (trivial) example that I ran to make sure the code could successfully make it through all the combinations.

> CodeTools:-Usage( FilterCombPara( 113,7, x->false ) );
memory used=17.39TiB, alloc change=460.02MiB, cpu time=88.52h, real time=20.15h
                                                  []

So it filtered all of C(113,7) in 20 hours of real time (88 CPU hours on an 8-core machine).  It used 17 terabytes of memory, but only had to allocate 500MB.  With a "real" filter function this will take longer (potentially much longer, depending on the filter's complexity), but we are able to filter all the combinations in memory.  If the number of matches is small, then it should be possible to hold them all in memory too.  If the number of matches is too large to fit in memory, then it will probably be necessary to write them to disk.  (Which is not something this code does.)

Anyway, try out the new code.  I'd love to know if it works for you, and if it doesn't, I'd like to know what's not working.  These huge examples are a great way to stress test Maple.

Darin

 

@Carl Love Modifying this code to write the matching combinations to disk is relatively simple.  In filterComb, instead of collecting and returning the matching combinations, we can write them to disk or add them to a buffer that is periodically written to disk.  Whether or not the writing to disk is going to cause problems with parallelism will depend on what fraction of the combinations are matches.  If the number of matches is small (relative to the total number of combinations), it shouldn't be too big a deal.  Otherwise writing is going to be the bottleneck, and parallelizing the search may not actually help much.  Of course, perhaps the OP could invest in an SSD, which would significantly speed up the writing or act as a fast virtual disk.  A sketch of the buffering idea is below.
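The buffering could look something like this (a hedged sketch; flushBuffer, the buffer size, and the file name are all made up, and in threaded code the flush would also need a mutex, e.g. Threads:-Mutex, around it):

# hypothetical flush routine: append a buffer of matching
# combinations to a file, one combination per line
flushBuffer := proc( buf::list, fname::string )
    local fd, c;
    fd := fopen( fname, APPEND, TEXT );
    for c in buf do
        fprintf( fd, "%a\n", c );
    end do;
    fclose( fd );
end proc:

# usage sketch: inside the filter loop, collect matches and flush
# whenever the buffer fills up
# buffer := [ op( buffer ), newMatch ];
# if nops( buffer ) >= 1000 then
#     flushBuffer( buffer, "matches.txt" );
#     buffer := [];
# end if;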

@Carl Love The Threads:-Seq, etc. functions use a heuristic where they attempt to create n^(1/3) tasks.  However, this is not necessarily the best thing for your code to do.  The Threads functions need to work decently in all possible situations, so they need a heuristic that handles odd cases that you won't need to worry about.  For your own code, you have a better understanding of how the timing will work out, so you can probably do better by picking the "right" base case size for your code.

Darin

For optimal performance, you want to use as small a base case as possible.  However, you need to be careful, as the overhead of creating lots of new tasks can become a problem.  Ideally this overhead would not be a significant issue, but it still is.

The reason you want as many tasks as possible is that it provides both load balancing and scaling.  From a load balancing perspective, if you have a small number of tasks, one task could take much longer than the others to complete.  When this happens, one core will be executing the long task while the other cores are idle, having finished their tasks.  With lots of tasks, the other cores will be able to do more work while the one core is dealing with the long task.  The fastest your Task Model execution can ever be is the length of the longest task, so keeping tasks small is a good thing.  From a scaling perspective, lots of tasks can be spread across lots of processors, without needing to know how many processors there are.  Yes, you can use numcpus, but it's not necessary.  See the sketch below.
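For example, a divide-and-conquer sum with a tunable base case might look like this (a sketch; f, BASE, and the range are made up for illustration):

f := x -> x^2:   # hypothetical per-element work
BASE := 1000:    # tunable base-case size

work := proc( lo, hi )
    local mid;
    if hi - lo < BASE then
        # small enough: compute serially
        add( f( i ), i = lo..hi );
    else
        # split in half and let the Task Model balance the pieces
        mid := iquo( lo + hi, 2 );
        Threads:-Task:-Continue( `+`,
            Task = [ work, lo, mid ],
            Task = [ work, mid + 1, hi ] );
    end if;
end proc:

Threads:-Task:-Start( work, 1, 10^6 );

Making BASE smaller creates more, shorter tasks, improving load balancing at the cost of more task-creation overhead.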

Darin

-- Kernel Developer Maplesoft

@TomJohnRiddle Hmmm... You said that while you are waiting, the CPU utilization is basically zero, so you're not seeing Maple's kernel (mserver process) running at 100%, correct?  If you were seeing the kernel running during that time, I would suspect that some seriously inefficient algorithm was being used in the kernel.  However, if you are not seeing that, and it is not a latency problem from trying to pull each element across the network one at a time, then I'm not sure what else it could be.

What version of MS SQL server are you using and what is the type of data stored in the table?

Thanks

Darin

There is a difference between calling a method using the standard function call syntax and using :-.

> module A() option object;
>     export foo::static := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'A' end;
> end module:
>
> module B() option object;
>     export foo::static := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'B' end;
> end module:
>
> foo := proc() print( 'procname'( _passed ) ) end proc:
>
> foo( A );
                                        A:-foo(A)

> foo( B );
                                        B:-foo(B)

> foo( A, B );
                                      A:-foo(A, B)

> foo( B, A );
                                      B:-foo(B, A)

 

Notice how the following two calling sequences differ:


> foo( A, B );
                                      A:-foo(A, B)

> A:-foo(B);

                                        A:-foo(B)

 

In the first case, the object A is passed as a parameter into A:-foo; in the second case it is not.  With objects, methods are generally declared static.  Thus, if a method is going to modify the object (A in this case), the object needs to be passed in as a parameter.

So foo( A, B ) and A:-foo( B ) are very different.

I'm not sure what you mean by "nested objects".  When scanning a parameter sequence, Maple is looking for objects; anything else (like a list containing objects) is not an object.
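For example, continuing with the definitions above, wrapping A in a list means no object is found in the parameter sequence, so the global foo runs (a small illustrative example, not part of the original transcript):

> foo( [ A ] );
                                       foo([A])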

With respect to :- (I think you accidentally wrote :=), :- is used to specify which package a name is found in.  When used as a unary prefix operator, it specifies the global, top-level name.  This is generally used when there is a module or package procedure that would be used if the :- were not present.  For an object method, the global name is the name that is mapped to the object method, so :-foo( A ) and foo( A ) are the same, assuming foo is not overloaded.  When foo is overloaded, :- can be used to access the global foo, which is the foo that maps onto the object method.

> module A() option object;
>     export foo := proc() print( 'procname'( _passed ) ) end proc;
>     export ModulePrint := proc() 'A' end;
> end module:
>
> module fooExporter()
>     option package;
>
>     export foo := proc() print( 'procname'( _passed ) ) end proc;
> end:
>
> foo( A );
                                        A:-foo(A)

> :-foo( A );
                                        A:-foo(A)

>
> fooExporter:-foo( A );
                                   fooExporter:-foo(A)

> with( fooExporter );
                                          [foo]

>
> foo( A );
                                   fooExporter:-foo(A)

> :-foo( A );

                                        A:-foo(A)

Darin

-- Kernel Developer Maplesoft

@Jarekkk When timing parallel code, you need to use time[real]() instead of time().  time() measures CPU time, which sums the time spent on each CPU when multiple CPUs are running.

Also, when running this example, even 10^5 is relatively short.  However, using time[real] and 10^7, I see a decent speed up.  Also notice that this example only creates two tasks, so the speed up won't be larger than 2 times.  To utilize more than 2 CPUs, you will need to sub-divide the for loops themselves.

> restart;
> p1 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i-1
>       end;
>  
>       op(result);
>  end:
>  
>  p2 := proc(N)
>       local i, result;
>  
>       for i from 1 to N
>       do
>           result[i] := 2*i;
>       end;
>  
>       op(result);
>  end:
>  
>  st:=time[real]():
>  Threads:-Task:-Start( passed, Task=[p1,10^7], Task=[p2,10^7] ):
memory used=11.4MB, alloc=37.1MB, time=0.18
memory used=25.3MB, alloc=53.1MB, time=0.41
memory used=52.7MB, alloc=85.1MB, time=0.91
memory used=88.7MB, alloc=125.1MB, time=1.40
memory used=189.3MB, alloc=253.9MB, time=3.31
memory used=331.4MB, alloc=381.9MB, time=5.09
memory used=470.8MB, alloc=381.9MB, time=6.81
memory used=610.5MB, alloc=512.0MB, time=8.61
memory used=757.0MB, alloc=768.0MB, time=12.52
memory used=1172.7MB, alloc=900.0MB, time=16.42
>  time[real]()-st;
                                         10.863

>
> restart;
> st := time[real]():
> for i to 10^7 do results[i] := 2*i-1 end do:
memory used=1307.9MB, alloc=18.0MB, time=19.21
memory used=1321.7MB, alloc=42.0MB, time=19.36
memory used=1344.0MB, alloc=66.0MB, time=19.65
memory used=1388.1MB, alloc=122.0MB, time=20.25
memory used=1447.3MB, alloc=253.7MB, time=20.81
memory used=1536.9MB, alloc=253.7MB, time=22.10
memory used=1627.1MB, alloc=382.7MB, time=23.39
memory used=1685.0MB, alloc=383.0MB, time=24.65
memory used=1807.8MB, alloc=511.0MB, time=25.96
> for i from 10^7+1 to 2*10^7 do results[i] := 2*i end do:
memory used=1983.9MB, alloc=513.0MB, time=28.66
memory used=2092.6MB, alloc=513.0MB, time=31.21
memory used=2216.3MB, alloc=641.0MB, time=32.89
memory used=2420.8MB, alloc=773.0MB, time=34.17
memory used=2573.0MB, alloc=773.0MB, time=36.93
> op(results):
> time[real]()-st;
                                         18.309

Darin
