@Ronan So it seems you got 400 sec for nelems=100mil with this code.
Not a bad speedup from approx. 1530 sec.
One aspect to note is that 300 of those 400 seconds are spent in the garbage collector. I wonder how much allocated memory it would take to hold any of these examples if gc were (somehow) effectively turned off. I suppose the answer to that is the final value of kernelopts(bytesused), which is likely prohibitively high.
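For what it's worth, that figure is easy to read off: kernelopts(bytesused) reports the cumulative bytes allocated by the kernel, so bracketing the run gives the total allocation that gc would otherwise have to reclaim. A sketch only; the commented line stands in for the actual computation:

    # bytesused is cumulative allocation, so the delta over the run
    # approximates the memory needed if gc reclaimed nothing.
    b0 := kernelopts(bytesused):
    # ... run the actual computation here ...
    printf("bytes allocated during the run: %d\n", kernelopts(bytesused) - b0);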
If we subtract the gc real time from the total real time for your 10mil and 100mil runs with the (so far) optimal code, the following observation can be made.
10mil:  gc real time = 7 sec,   (total real time - gc real time) = 10 sec
100mil: gc real time = 300 sec, (total real time - gc real time) = 100 sec
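For anyone who wants to reproduce that split, here is a minimal sketch, assuming a Maple version whose kernelopts supports the gcrealtime query; the commented line is again a placeholder for the actual computation:

    t0 := time[real]():
    g0 := kernelopts(gcrealtime):  # cumulative real time spent in gc so far (assumed to be in seconds)
    # ... run the actual computation here ...
    total := time[real]() - t0:
    gct := kernelopts(gcrealtime) - g0:
    printf("total: %.1f sec, gc: %.1f sec, compute-only: %.1f sec\n", total, gct, total - gct);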
So the real time to do the actual computation went up by a factor of 10 (from 10 sec to 100 sec) as the problem size went from 10mil to 100mil. That's even better than I'd have guessed, since the maximal exponent goes up with the problem size in your example. But the gc real time went up by a factor of roughly 43 (from 7 sec to 300 sec), which is disproportionately higher. I don't know whether that could be mitigated.