Items tagged with parallel parallel Tagged Items Feed

Hello wizards,

I'm given to understand that using add() or if possible Task:-Add() is more efficient than a FOR...DO loop. Today I'm asking about the limits of this generalization. My illustration is probably missing some evalf's, but hopefully the concept is clear:

A computation I'm working on involves between 10E4 and 10E6 computations like

Consider the following C code:

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

long *data;

void *thread_function( void *args )
{
long i;
long j = (long)args;

for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j] = i;
}

return NULL;
}

int main( int argc, char **argv )
{
long i, j;
pthread_t ids[NUM_THREADS-1];
struct timeval start, end;
double d;

data = (long*)malloc( sizeof(long)*NUM_THREADS );

for ( j = 1; j <= NUM_THREADS; j++ )
{
for ( i = 0; i < NUM_THREADS; i++ )
{
data[i] = 0;
}

gettimeofday( &start, NULL );
for ( i = 0; i < j-1; i++ )
{
pthread_create( ids+i, NULL, thread_function, (void*)i );
}

thread_function( (void*)i );

for ( i = 0; i < j-1; i++ )
{
pthread_join( ids[i], NULL );
}

gettimeofday( &end, NULL );

d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
}

return 0;
}

Basically this code

   for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j] = i;
}

runs in parallel for 1 to NUM_THREADS threads and prints how long it takes to run and the number of iterations executed per second. data is an array and j is a thread specific index into the array. Thus each thread is given its own memory address. This is important to recognize, the threads are not sharing data.

As we increase the number of threads (up to the number of cores), the running time should stay the same, but the number of iterations per second should increase linearly (remember each thread does NUM_ITERATIONS, so adding a new thread adds NUM_ITERATIONS of work). Let's see what happens when we actually run this code:

1) 15.758 2.72558e+08 ips
2) 28.8921 2.97311e+08 ips
3) 40.1948 3.20561e+08 ips
4) 46.6236 3.6848e+08 ips

Well, that is nothing like what we were expecting. The running time increases significantly and the iterations per second increase only slightly. It would appear as though the threads are interfering with each other, however we can see that they don't actually share any data. What's happening?

This is an example of false sharing. False sharing occurs when multiple threads modify data that share the same cache line. I'm not going to go into too much detail about caching, see this wikipedia article for more details. Briefly, each core has a small private cache that stores a copy of a region of main memory (accessing the local copy is faster than accessing main memory). When the corresponding region in main memory changes, the value stored in the cache is no longer correct. Thus if the core wants to use anything in that memory region, it needs to make a new copy. When multiple threads share a single cache line, modification of the memory in one thread will force the other threads to refresh their caches. This refresh forces the threads to wait for memory transfers far more then they would if the thread was running exclusively. If there are enough threads with enough memory requests it is also possible to saturate the memory bus, which can delay memory transfers even more.

To fix this we need to make sure that the threads aren't sharing cache lines. The easiest way to fix a problem like this is to add padding around the data.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

#define CACHE_SIZE 128

typedef struct {
long d;
char buffer[CACHE_SIZE-sizeof(long)];
} Data;

Data *data;

void *thread_function( void *args )
{
long i;
long j = (long)args;

for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j].d = i;
}

return NULL;
}

int main( int argc, char **argv )
{
long i, j;
pthread_t ids[NUM_THREADS-1];
struct timeval start, end;
double d;

data = (Data*)malloc( sizeof(Data)*NUM_THREADS );

for ( j = 1; j <= NUM_THREADS; j++ )
{
for ( i = 0; i < NUM_THREADS; i++ )
{
data[i].d = 0;
}

gettimeofday( &start, NULL );
for ( i = 0; i < j-1; i++ )
{
pthread_create( ids+i, NULL, thread_function, (void*)i );
}

thread_function( (void*)i );

for ( i = 0; i < j-1; i++ )
{
pthread_join( ids[i], NULL );
}

gettimeofday( &end, NULL );

d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
}

return 0;
}

We have changed our array from type long, to type Data, a struct that contains padding. This padding guarantees that one struct's data won't share a cache line with other structs, in particular structs used in other threads. Now let's run this code:

1) 15.6487 2.74462e+08 ips
2) 15.0241 5.71744e+08 ips
3) 13.5294 9.52364e+08 ips
4) 12.5411 1.36989e+09 ips

This is what we were expecting (perhaps even a little better).

So how does this effect Maple? I originally tried to write these examples in Maple, but I didn't get significant problems. This is because the overhead of Maple evaluating a statement probably clobbers the shared cache line so it will need to be reloaded in either case. However, as we make Maple faster, it may be possible for false sharing to become a problem in your Maple code.

If you want to download and try executing this code for yourself, be careful of compiler optimizations.  The function that we execute in the threads can easily be optimized into one that does not use the shared memory in each iteration.

I built this example on Linux, using

gcc falsesharing.c -lpthread -o falsesharing

A user recently asked how to find the set of indices corresponding to a given value of a two-dimensional Array.

There are several ways to handle this problem.  For  a single value, a simple scan through the Array suffices:

FindIndices1 := proc(A :: Array, val)
local i,j,irng,jrng;
    (irng,jrng) := rtable_dims(A);
    {seq(seq(`if`(A[i,j]=val, [i,j], NULL), i=irng), j=jrng)};
end proc:

With a multi-core machine, the ...

It has been a while since my last post, mostly because of a combination of getting Maple 14 ready to ship and a lack of meaty topics to write about. I am trying to get back into the habit of posting more regularly. You can help me achieve my goal by posting questions about parallel programming. I'll do my best to answer. However for now, I'll give a brief overview of the new parallel programming features in Maple 14.

A new function has been added to the Task Programming Model. The Threads:-Task:-Return function allows a parallel algorithm implemented on top of the Task Programming Model to perform an early bail out. Lets imagine that you have implemented a parallel search. You are looking for a particular element in a large set of data. Using the Task Programming Model, you've created a bunch of tasks, each searching a particular subset of the data. However one of the first tasks to execute finds the element you are looking for. In Maple 13, there was no built in way of telling the other tasks that the result have been found and they they should not execute. In Maple 14, the Return function allows one task to specify a return value (which will be returned from Threads:-Task:-Start) and signal the other tasks that the algorithm is complete and that additional tasks should not be executed. Tasks that are already running will still run to completion, but tasks that have not started executing will not be started.

You may have noticed that there is a race condition with Return. What happens if two tasks both call Return in parallel? Only one of the values will become the value that is passed to Threads:-Task:-Start. I suppose I could say the "first" value is the one that is used, but really, what does that mean? If you call Return, then the value passed to Return should be an acceptable result for the algorithm.  If you call Return more than once, any of those values should be valid, thus it shouldn't matter which one becomes the return value.  That said, the Return function does give some feedback. In the task that succeeds in having its value accepted, Return will return true. In all other tasks that call Return, it will return false. This allows the code to know if a particular result was or was not accepted.

Maple 14 also adds the Task Programming Model to the C External Calling API. This means that you can write your algorithms in C and make use of the Task Programming Model. The C API is similar to the Maple API, with a few differences. In particular, you need to create each child task individually, instead of as a single call to Continue, as you would in Maple. As well, because it is C code, you need to worry about a few details like memory management that are handled automatically in Maple.  Using External Call is fairly advanced, so I won't go into too much detail here.  If you'd like to see more details of using the Task Programming Model in External Calling, I can write a seperated post dedicated to that.

As with every release of Maple, we spent some time trying to make our existing functionality faster and more stable. For parallel programming, we reduced the overhead of using the Task Programming Model, as well as reducing the locking in the kernel (which should help improve parallelism). Of course many bugs have been fixed, which should make parallel programming more reliable in Maple 14.

Our previous article described the design of fast algorithms for multiplying and dividing sparse polynomials. We have integrated these algorithms into the expand and divide commands of Maple 14. In this post I want to talk a bit about what you might see when you try Maple 14. Keep in mind that the product isn't released yet and I don't work for Maplesoft, so general disclaimers apply. Nevertheless, one of the first things you may notice is this.

task manager with maple 14

Hello all, I am about to write a simulation which is remarkably redundant and suits itself quite well to parallelization on my multi-core, shared memory machine. Also, I am new to OpenMaple. How does OpenMaple behave in parallelized situations? Is it possible to have multiple threads with independent Maple kernels? Could multiple threads coded with OpenMP or the fork command find uncorrupted solutions? Thank you for any help you may offer! -asn

Page 1 of 1