# OMP_NUM_THREADS and 64bit Maple on Windows

December 10 2011 by
false
Maple

5

Yesterday I wrote a post that began,

"I realized recently that, while 64bit Maple 15 on Windows (XP64, 7) is now using accelerated BLAS from Intel's MKL, the Operating System environment variable OMP_NUM_THREADS is not being set automatically."

But that first sentence is about where it stopped being correct, as far as how I was interpreting the performance on 64bit Maple on Windows. So I've rewritten the whole post, and this is the revision.

I concluded that, by setting the Windows operating system environment variable OMP_NUM_THREADS to 4, performance would double on a quad core i7. I even showed timings to help establish that. And since I know that memory management and dynamic linking can cause extra overhead, I re-ran all my examples in freshly launched GUI sessions, with the user-interface completely closed between examples. But I got caught out in a mistake, nonetheless. The problem was that there is extra real-time cost to having my machine's Windows operating system dynamically open the MKL dll the very first time after bootup.

So my examples done first after bootup were at a disadvantage. I knew that I could not look just at measured cpu time, since for such threaded applications that reports as some kind of sum of cycles for all threads. But I failed to notice the real-time measurements were being distorted by the cost of loading the dlls the first time. And that penalty is not necessarily paid for each freshly launched, completely new Maple session. So my measurements were not fair.

Here is some illustration of the extra real-time cost, which I was not taking into account. I'll do Matrix-Matrix multiplication for a 1x1 example, to try and show just how much this extra cost is unrelated to the actual computation. In these examples below, I've done a full reboot on Windows 7 where so annotated. The extra time cost for the very first load of the dynamic MKL libraries can be from 1 to over 3 seconds. That's about the same as the cpu time this i7 takes to do the full 3000x3000 Matrix multiplication! Hence the confusion.

Roman brought up hyperthreading in his comment on the original post. So part of redoing all these examples, with full restarts between them, is testing each case both with and without hyperthreading enabled (in the BIOS).

Quad core Intel i7. (four physical cores)

Hyperthreading disabled in BIOS
-------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);   # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.18KiB, alloc change=127.98KiB, cpu time=219.00ms, real time=3.10s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns

> restart: # actual OS reboot
"4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.91KiB, alloc change=127.98KiB, cpu time=140.00ms, real time=2.81s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns

Hyperthreading enabled in BIOS
------------------------------

> restart: # actual OS reboot
> getenv(OMP_NUM_THREADS);    # NULL, unset in OS

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.00KiB, alloc change=127.98KiB, cpu time=202.00ms, real time=2.84s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns

> restart: # actual OS reboot
"4"

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=215.56KiB, alloc change=127.98KiB, cpu time=187.00ms, real time=1.12s

> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ):
memory used=9.46KiB, alloc change=0 bytes, cpu time=0ns, real time=0ns



Having established that the first use after reboot was incurring a real time penalty of a few seconds, I redid the timings in order to gauge the benefit of having OMP_NUM_THREADS set appropriately. These too were done with and without hyperthreading enabled. The timings below appear to indicate that slightly bettern performance can be had for this example in the case that hyperthreading is disabled. The timings also appear to indicate that having OMP_NUM_THREADS unset results in performance competitive with having it set to the number of physical cores.

Hyperthreading disabled in BIOS
-------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=142.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.50s, real time=1.92s

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.84KiB, alloc change=127.98KiB, cpu time=141.00ms, real time=141.00ms

"1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.38s, real time=7.38s

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=217.11KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

"4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.57s, real time=1.94s

Hyperthreading enabled in BIOS
------------------------------

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.57KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.46s, real time=2.15s

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

"1"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=7.35s, real time=7.35s

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

> getenv(OMP_NUM_THREADS);  # NULL, unset in OS
"4"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.56s, real time=2.15s

> restart:
> CodeTools:-Usage( Matrix([[3.]]) . Matrix([[3.]]) ): # initialize external libs
memory used=216.80KiB, alloc change=127.98KiB, cpu time=125.00ms, real time=125.00ms

"8"

> M:=LinearAlgebra:-RandomMatrix(3000,datatype=float[8]):
> CodeTools:-Usage( M . M ):
memory used=68.67MiB, alloc change=68.74MiB, cpu time=8.69s, real time=2.23s


With all those new timing measurements it appears that having to set the global environment variable OMP_NUM_THREADS to the number of physical cores may not be necessary. The performance is comparable, when that variable is left unset. So, while this post is now a non-story, it's interesting to know.

And the lesson about comparitive timings is also useful. Sometimes, even complete GUI/kernel relaunch is not enough to get a level and fair field for comparison.

Dave Linder
Mathematical Software, Maplesoft

[Roman submitted a comment on the earlier version of this post. I'll leave it to him, whether he'd prefer to edit or delete it, not that I've rewritten the post.]