(This is a reply to PatrickT, who asked about a certain ArrayTools:-Copy example.)
Suppose that you need to do a computation which requires, in part, a certain subportion of a Vector V. Let's call the routine which does the work as `f`. Let's suppose that `f` is a system command and not something that we write ourselves. One natural way to call `f` and supply the subvector is to call `f` like so:
f( V[a..b] );
Here the inner range a..b denotes which subrange of entries of V you want `f` to receive. There is an efficiency concern which arises immediately if `f` is to be called a very large number of times with such a subvector argument which might differ with each call. The unnamed subvectors arguments to `f` are temporary objects, which will get used by `f` and then become collectible garbage. But garbage collection and memory management can be expensive, and this can be especially true for non-scalar objects such as Vectors. There is also the significant cost of creating the individual subvector objects to consider.
If we had written `f` outselves then of course we could alter `f` to expect all of V as its argument. We could simply call f(V), produce no subvector object, merely access the sub-range of V inside `f` with more judicious element access, and incur no temporary object memory management cost. And of course we could alter `f` so that we only had to call it once, and then put a for-loop inside it! But we have supposed that `f` is a system command and so not available for a re-write.
Many years ago I asked the head of Maplesoft's Kernel Development group how I could best handle the problem of a garbage collector's cost, especially in my Library-size computations. The short response was "Produce less garbage!". At first I was a little disconcerted, as I'd hoped for a brilliant insight into Maple's memory management schemes and only got back a curt, off-the-cuff answer. Over the years the importance of the answer I received has, I hope, sunk in. In particular, I have seen many examples in which "in-place" operations involving rtables (Matrix, Vector, and Array) can be the key to better performance. It's not always the answer, but it's good to keep in mind. Let's see if how it may apply here.
It's not that Maple will run out of memory, if I make that call above to `f` with a million different subvectors (with different `a` and `b`, say). No, Maple has a memory management system which will dispose of the no-longer-wanted subvectors once each call to `f` is returned. The memory used by the subvectors gets recollected and reused, as the temporary Vector objects are found during garbage collections "sweeps" to be found to be no longer referenced or in use.
Now, regardless of how good Maple's memory management system is, it may be possible to attain better performance by side-stepping the temporary object creation altogether! Less garbage produced means less memory management cost.
Let's re-do the above short example, while utilizing a re-usable container `C` to pass the subvector data into `f`. I'll add a little more detail, to make it as example for which we examine timing and memory "use". My computation is such that I have to pass a 10000-element subvector of my large V into `f`. And I have to do it 100000 times. Below, I'll just pass the very same subvector V[3..10003] each time for simplicity, while acknowledging that in the real world example the subrange would differ and we'd need the various different subvectors.
> for i from 1 to 10^5 do
> end do:
That took about 5.6 seconds. And it "used" about 8GB of memory, by which I roughly mean that Maple's kernel had to assign and then garbage collect that much memory over the entire computation. (Maple only had to allocate about 4.5MB in total at any given moment.)
Now let's do it using a re-usable container Vector `C`. Before calling `f`, we'll bother to actually copy the desired subvector portion of V into `C`. We'll do the copying using a routine that is very fast for our example's hardware double-precision float data (it uses a precompiled "external" function to do the copying). Notice that what routine `f` receives as its actual argument is effectively just like what it receives in the previous code.
> for i from 1 to 10^5 do
> end do:
So this only took 1.5 seconds. That's much better, even though it had to copy 10000 entries from V into C, 100000 times. The savings was in terms of the reduction in memory management (creation and garbage collection) of 100000 different temporary 10000-entry float Vectors. The total memory allocatd for the entire computation was pretty much the same, about 4.4MB.
Depending on the datatype, and on the size of the objects and the number or repetitions, the relative performance of the two approaches above will vary. For datatype=anything the equivalent last pair of computations above took 3.3 and 1.3 seconds respectively.
As the size of C gets much smaller there is a point at which the savings are outweighed by the added cost of all the copying. For example when C has only 10 entries then the copying approach takes longer.
All computations above were in Maple 14 on 32bit Windows XP, on a Core2 Duo.
Mathematical Software, Maplesoft