Question: How to get Threads to use all available cores?


I recently got access to a 12-core Intel Mac Pro with 64 GB of main memory. This motivated me to try parallel programming again, even though I have never been successful with it in prior years. The project is particle tracking through a circular accelerator, an embarrassingly parallel problem in the sense that you can track n particles in parallel through the machine for many turns and then gather up the results for analysis. The function describing the tracking is a 6-component polynomial function acting on a 6-vector and yielding a 6-vector as a result. Each accelerator component (magnet, drift section, rf, ...) is described by such a function. I am simplifying a bit here, but this is essentially what I am doing in the problem at hand. The point is that each particle is treated independently of the others, hence parallelization should be trivial.

Using an existing package (Lattice, which I published with Maple) as a framework, I set this up such that the tracking proc for n turns of one particle in the accelerator is a member of a module. This module is in the body of a proc and is returned when the proc is called, essentially instantiating the tracking object, which is then assigned to a Vector with as many elements as I have particles to track. The tracking function returns a 6-vector with the coordinates after all turns are complete.
A separate proc instantiates all tracking objects, gives each one its particle number (from a Beam object it is given) and sends it off using Threads:-Create. It then waits until every task is done (using Threads:-Wait) and assembles the results in another Beam object, which it returns. Please refer to the enclosed Maple worksheet for how it is done.

This all actually appears to work. macOS is 15.7.2 (Sequoia); Maple is 2023.2. The parallel and serial runs give identical results.

The "interesting" result, however, is that usage of the available CPU cores saturates at about 4. The CPU usage of the mserver process levels off between 400 and 500% and actually drops to 360% as more particles are added. 100% is one core, so I never get more than about 4 cores working for me. Correspondingly, the time per particle goes up from about 3 s (particles 1 to 4) to about 15 s/particle, settling at about 10 s/particle as 12 particles are approached. Below is a graph, against the number of particles (n), of running time (red), CPU time (dark blue), CPU usage (yellow) and number of mkernel threads (green).
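For anyone who wants to reproduce this kind of measurement inside Maple, here is a minimal sketch. It assumes that the ratio of CPU time to wall-clock time is a fair proxy for the number of busy cores, and `work` is a hypothetical placeholder workload, not the tracking code from the worksheet.

```maple
# Hypothetical placeholder workload; replace with the call that runs the tracking.
work := proc(n::posint)
    local k;
    add(evalf(sin(k)), k = 1 .. n);
end proc:

cpu0  := time():         # CPU time used by the kernel so far, in seconds
wall0 := time[real]():   # wall-clock time, in seconds

work(10^5):

cpu  := time() - cpu0:
wall := time[real]() - wall0:

# A ratio near 1 suggests one busy core; a ratio near 4 would correspond
# roughly to the 400% I see for the mserver process (assumption, see above).
if wall > 0 then
    printf("CPU %.2f s, wall %.2f s, CPU/wall %.2f\n", cpu, wall, cpu/wall);
end if;
```

The placeholder is purely serial, so on its own it should give a ratio close to 1; only a threaded run would push it higher.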

Bottom line: I am only getting 4 cores out of the 12. The process limits of macOS (ulimit -a) do not indicate any limit that would cause this (and I have had build jobs that would merrily use all 12 cores).

Is there a limit in Maple that prevents using all available cores? Am I doing something inefficient that could cause this? This is the first time I have actually got parallel operations in Maple to work, so I am happy about that, but my happiness is tempered by not getting it to work at the level I was aiming for. I did google around a bit and found some prior conversations on MaplePrimes (mostly involving @acer and @Carl Love) about parallel threads, which indicated that (a) the environment variable OMP_NUM_THREADS should be set and (b) the kernelopts setting numcpus can only be set at the very beginning of a Maple session (which I interpret as "right after firing up Maple"). I did both (and verified that the settings had taken effect), but there was no change in the behaviour of this code; I still only get four CPU cores to work.
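For reference, a sketch of the kind of check meant by "verified the settings" (this is not copied from the worksheet; the value 12 is simply the core count of this machine):

```maple
# Run as the very first commands of a fresh session.
kernelopts(numcpus = 12);     # request 12 CPUs; returns the previous value
kernelopts(numcpus);          # should now report 12
kernelopts(multithreaded);    # true if this kernel supports multithreading
getenv("OMP_NUM_THREADS");    # the variable as seen by the Maple kernel
```

If the environment variable was set in a shell but Maple was started from the Finder, the last call may not show the value, so it is worth checking from inside the session rather than in a terminal.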

Thanks,

Mac Dude

Parallel_tracking_attempt.mw

Edit: Added graph, fixed up graph.
