I took a look at this code the last time you submitted it. The big problem is memory usage and garbage collection. If you can reduce the memory used by the code, then it will probably parallelize better.
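One way to check whether a change actually reduces memory use is to wrap the call in CodeTools:-Usage, which reports the memory used and the time taken by an expression. The sketch below is only an illustration: the procedure work is a hypothetical stand-in for your real computation.

```maple
# Hypothetical workload standing in for the real computation.
work := proc(n)
    local k;
    # add accumulates the sum without building an intermediate list
    add( exp(-k/1000.0), k=1..n );
end proc:

# Prints memory and timing figures for the call, then returns its value.
CodeTools:-Usage( work(10^5) );
```

Running the same measurement before and after a change gives a rough idea of whether the garbage collector has less work to do.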
That said, there is still some room for general improvements. I cleaned up your MakeV0A function and was able to speed it up a bit, in both the single-threaded and multi-threaded cases.
Xpre := [ seq( (X[i,k] + X[j,k])/2.0, k=1..Xcdim ) ];
tempdiffs := [ seq( add( (Xpre[k]-X[l,k])^2, k=1..Xcdim )/den, l=1..Xrfull) ];
pot := add( k*exp(-k), k in tempdiffs );
psi := add( exp(-k), k in tempdiffs );
if pot=0.0 and psi=0.0 then
V0 := Nij*pot/psi;
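For reference, here is a sketch of how those lines might sit inside a complete procedure. The name MakeV0A_sketch, the parameter list, passing den as an argument, and the 0.0 fallback in the guarded branch are all assumptions made for illustration; they are not taken from your code.

```maple
# Hypothetical wrapper: name, parameters and the 0.0 fallback are assumptions.
MakeV0A_sketch := proc(i, j, Nij, X, Xcdim, Xrfull, den)
    local Xpre, tempdiffs, pot, psi, k, l;
    # midpoint of rows i and j of X
    Xpre := [ seq( (X[i,k] + X[j,k])/2.0, k=1..Xcdim ) ];
    # scaled squared distance from the midpoint to every row of X
    tempdiffs := [ seq( add( (Xpre[k]-X[l,k])^2, k=1..Xcdim )/den, l=1..Xrfull ) ];
    pot := add( k*exp(-k), k in tempdiffs );
    psi := add( exp(-k), k in tempdiffs );
    if pot = 0.0 and psi = 0.0 then
        return 0.0;   # assumed fallback when every weight underflows to zero
    end if;
    Nij*pot/psi;
end proc:

# Small made-up data set: three points in the plane.
X := Matrix([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]], datatype=float[8]):
MakeV0A_sketch(1, 2, 1.0, X, 2, 3, 1.0);
```

Building Xpre and tempdiffs with seq and add avoids growing lists element by element, which keeps the amount of temporary memory down.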
Also, I think there might be a bug in your original code. When you call MakeV0 in the single-threaded case, you pass 1 for istart and jstart, which means that when you call MakeV0A for (i,j) you actually pass (i+1, j+1, N[i+1,j+1] ... ). Is that what you wanted?
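To make the offset concrete, here is a small sketch that lists the index pairs a loop would visit if it adds istart and jstart to the loop counters. The procedure ShowIndices is hypothetical, and the assumption that your MakeV0 forms its indices as i+istart and j+jstart is mine.

```maple
# Hypothetical illustration; this is not the original MakeV0.
ShowIndices := proc(istart, jstart, nrows, ncols)
    local i, j;
    # collect every (row, column) pair the loop would pass on
    [ seq( seq( [i + istart, j + jstart], j=1..ncols ), i=1..nrows ) ];
end proc:

ShowIndices(1, 1, 2, 2);   # offsets of 1: the first pair is [2, 2]
ShowIndices(0, 0, 2, 2);   # offsets of 0: the first pair is [1, 1]
```

If the single-threaded call is meant to cover the matrix starting at (1,1), then under this assumption the offsets would need to be 0 rather than 1.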
-- Kernel Developer Maplesoft