@longrob : I'll answer your questions in random order.

- Determining the time spent in different parts of a program is called profiling, and in Maple this functionality is provided by the CodeTools:-Profiling package - in particular, the CodeTools:-Profiling:-Profile and showstat commands. I think the help pages probably have enough detail to get you started (now that you know where to look! :) ).
- The differences between versions 2 and 3 are twofold: first, the RandomVariables are created only once (creating such a random variable is not a big computation, but the administrative overhead is substantial) and then remembered. This is accomplished with the variableStorage Vector and the getVariables procedure. Secondly, using the profiling mentioned in the previous bullet point, I saw that the line defining
*expr* also took substantial time. That time could really only come from the *sqrt* function call. By enclosing *sqrt* in forward quotes (Maple's unevaluation quotes), I prevented Maple from doing any actual computation there; once the expression enters *Sample*, and via that the external code, all computation happens there anyway.
- I think leaving out *uses S = Statistics;* was simply a copy-and-paste error - I'm pretty sure I had it in my original version 3 code as well. But it doesn't matter - thanks for catching it!
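The create-once-then-reuse pattern behind variableStorage and getVariables is essentially memoisation. A rough Python sketch of the same idea (the names and the make_variable constructor here are mine, purely to illustrate; the real code stores Maple RandomVariables in a Vector):

```python
_cache = []

def get_variables(n, make_variable):
    # Grow the cache only when more variables are requested than have
    # ever been created; otherwise reuse the already-constructed ones.
    # make_variable stands in for the (administratively expensive)
    # creation of a RandomVariable.
    while len(_cache) < n:
        _cache.append(make_variable(len(_cache)))
    return _cache[:n]
```

The point is that the expensive constructor runs at most once per variable, no matter how many times the procedure is called.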

Finally, I realized late last night that part of the evaluation still happens inside Maple even when using external C code - the external code calls back into the interpreted Maple environment to evaluate parts of the expression. So I came up with the following:

ncubem4 := module()

local ModuleApply, normalSample, sampleStorage, processor;

normalSample := Statistics:-Sample('Uniform(0, 1)');
sampleStorage := Vector(0, 'datatype = float');

processor := proc(s :: Vector(datatype = float[8]),
                  n :: integer,
                  N :: integer)
local i :: integer,
      intermediate :: float[8],
      j :: integer,
      result :: float[8];
    result := 0.;
    for i to 2*n*N by 2*n do
        intermediate := 0.;
        for j from i to i + 2*n - 1 by 2 do
            intermediate := intermediate + (s[j] - s[j+1])^2;
        end do;
        result := result + sqrt(intermediate);
    end do;
    return result / N;
end proc;
processor := Compiler:-Compile(processor);

ModuleApply := proc(n :: posint, N :: posint, m)
    if numelems(sampleStorage) < 2*n*N then
        # Sample storage is too small to contain sufficiently many
        # sampled numbers: allocate a new Vector.
        sampleStorage := normalSample(2*n*N);
    elif numelems(sampleStorage) > 2*n*N then
        # Sample storage is bigger than needed - fill only an
        # initial segment.
        normalSample(ArrayTools:-Alias(sampleStorage, [2*n*N]));
    else
        # Sample storage is exactly the right size - fill it up
        # with fresh samples.
        normalSample(sampleStorage);
    end if;
    return m * processor(sampleStorage, n, N);
end proc;

end module;

CodeTools:-Usage(ncubem4(15, 1000, 1), iterations=1000):
memory used=1.20KiB, alloc change=0.75MiB, cpu time=2.30ms, real time=2.31ms
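For readers following along without Maple, here is a rough Python equivalent of what the compiled processor computes, together with the sampling step (the function name is mine, and Python's random module stands in for Statistics:-Sample; this is a sketch for illustration, not the benchmarked code):

```python
import math
import random

def ncube_mean_distance(n, N, m=1, rng=random.random):
    # Draw 2*n*N uniform samples on [0, 1): each trial consumes 2*n
    # numbers, i.e. the coordinates of two random points in the n-cube,
    # interleaved so that s[j] and s[j+1] are matching coordinates.
    s = [rng() for _ in range(2 * n * N)]
    total = 0.0
    for i in range(0, 2 * n * N, 2 * n):
        # Squared distance between the two points of this trial.
        intermediate = 0.0
        for j in range(i, i + 2 * n, 2):
            intermediate += (s[j] - s[j + 1]) ** 2
        total += math.sqrt(intermediate)
    # Average over the N trials, scaled by m as in ModuleApply.
    return m * total / N
```

For n = 15 the estimate clusters around the mean distance between two uniform random points in the 15-cube (a little below sqrt(n/6)), and it fluctuates on the order of 1/sqrt(N).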

One caveat is that this will only work in hardware float mode; if you set Digits higher than evalhf(Digits) (which is currently 15 on all platforms) or you set UseHardwareFloats to false, the code will die miserably. However, you shouldn't run this at high precision in the first place: the Monte Carlo scheme destroys any extra precision anyway.
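To see why extra digits buy you nothing here: the standard error of a Monte Carlo mean shrinks only like 1/sqrt(N), so even millions of samples leave you far short of 15 significant digits. A quick Python illustration (the helper function is mine; it estimates the mean of sqrt(U) for uniform U repeatedly and measures the spread of the estimates):

```python
import math
import random

def mc_error(N, trials=200, rng=random.random):
    # Run the same N-sample Monte Carlo estimate `trials` times and
    # return the standard deviation of the resulting estimates.
    estimates = [sum(math.sqrt(rng()) for _ in range(N)) / N
                 for _ in range(trials)]
    mean = sum(estimates) / trials
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / trials)

# Quadrupling N roughly halves the error: precision grows like 1/sqrt(N),
# so no amount of Digits can rescue the result.
```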

The numbers here are a bit skewed since this is my work machine rather than my home machine - for the other versions I get 200.64ms for ncubem, 31.55ms for ncubem2, and 7.10ms for ncubem3. So here we gain another factor of 3 over ncubem3.

Erik.