@acer: Thank you for the additional comments, and no need to apologize. I've just now installed Maple 16 (on Win 7, AMD 6-way, MKL); as long as I am investing this time, I may as well do it with current software. Was Maple 16 assumed in your first snippets of code?
While I chase that, I have a somewhat similar problem involving Vectors, but here the operation is not plain addition: it adds the element-wise exponential of an element-wise sum. If G and H are float Vectors indexed 1..Last, and c is a float scalar, then for a changing intermediate index Mid the operation would be
G[Mid..Last] := G[Mid..Last] + exp~( c +~ H[Mid..Last] ).
It eventually dawned on me that this is a natural application for LinearAlgebra[Zip]. I tried
LinearAlgebra[Zip] (proc(a,b) options operator, arrow; a + exp(b + c) end proc,
G[Mid..Last], H[Mid..Last], inplace):
with no luck: I got back floating-point zeros, and it took 70% more time than the element-wise operator version.
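A possible explanation for the Zip result, offered as a guess rather than a diagnosis: indexing a Vector with a range such as G[Mid..Last] produces a new Vector, so the inplace option would update that temporary copy and leave G itself untouched. The sketch below sidesteps this by letting Zip return a new Vector and assigning it back to the slice. The sizes and sample values are hypothetical, chosen only for illustration, and nothing here has been timed against the element-wise version.

```maple
# Hypothetical sample data, for illustration only.
Last := 6: Mid := 4: c := 0.5:
G := Vector(Last, i -> 1.0*i, datatype = float[8]):
H := Vector(Last, i -> 0.1*i, datatype = float[8]):

# G[Mid..Last] is a copy, so do not rely on inplace here.
# Let Zip build the updated tail, then assign it back into G.
T := LinearAlgebra[Zip]((a, b) -> a + exp(b + c), G[Mid..Last], H[Mid..Last]):
G[Mid..Last] := T:
```

The assignment G[Mid..Last] := T writes into the original storage of G, which is the same mechanism the element-wise statement relies on.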
I also tried to avoid range operators altogether with an awkward combination of three steps: a zero-Fill and Copy, a Map2 with a filter to limit the computational effort, and a vector addition. This uses a temporary float Vector C in addition to the scalar c mentioned earlier:
Offset := Mid - 1:
Length := Last - Offset:
ArrayTools[Fill](Offset, 0.0, C):
ArrayTools[Copy](Length, H, Offset, C, Offset):
LinearAlgebra[Map2] [proc (ii) options operator, arrow; evalb(Mid< ii) end proc]
( proc (x, a) options operator, arrow; exp(x+a) end proc, c, C ):
LinearAlgebra[VectorAdd] (G, C, inplace):
This took as long as the Zip experiment did, but it failed differently -- the filter seems to always have evaluated to true. I suppose these could be Maple bugs, but it is much more likely that my inexperience with these Maple functions is revealing itself.
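For comparison, an explicit loop over Mid..Last needs no range operators, no temporary Vector, and no filter. The sketch below is a minimal version under stated assumptions: the procedure name UpdateTail and the sample data are hypothetical, and running it through evalhf is only a guess at what might help with speed, not something that has been timed. Note that the loop includes index Mid itself, whereas the filter condition Mid < ii used above would skip it.

```maple
# Hypothetical sample data, for illustration only.
Last := 6: Mid := 4: c := 0.5:
G := Vector(Last, i -> 1.0*i, datatype = float[8]):
H := Vector(Last, i -> 0.1*i, datatype = float[8]):

# UpdateTail is a hypothetical helper: it updates G[Mid..Last] in place.
UpdateTail := proc(G, H, c, Mid, Last)
    local i;
    for i from Mid to Last do
        G[i] := G[i] + exp(c + H[i]);
    end do;
end proc:

# var(G) tells evalhf that G is modified and must be passed back.
evalhf(UpdateTail(var(G), H, c, Mid, Last)):
```

The same procedure can also be called directly as UpdateTail(G, H, c, Mid, Last), without evalhf, since Vectors are mutable and the loop writes into G either way.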
So for the moment I am stuck with the element-wise implementation. I too was disappointed to find that MKL did not make any use of the multiple cores on the AMD processor during any of these experiments. Perhaps there are special versions of these libraries that I need to find. But in this case, I also believe that the exp() function is throttling parallelism. I was surprised to find that exp() and ln() aren't thread-safe, so hand-programming threads isn't going to help unless I write my own expansions for these functions. I suppose the functions use shared intermediate storage, but so would the series approximations for sin(), cos(), etc.
Any suggestions on this twist of the original question are equally welcome, along with insight on why such elementary functions would not be thread-safe.
With thanks and every good wish,