Maplesoft Blogger Profile: Darin Ohashi

Maplesoft Employee

Senior Software Developer

I am a Senior Software Developer in the Kernel Group, working on the Maple language interpreter. I have been working at Maplesoft since 2001 on many aspects of the Kernel, however recently I have been focusing on enabling parallel programming in Maple. I have added various parallel programming tools to Maple, and have been trying to teaching parallel programming techniques to Maple programmers. I have a Master's degree in Mathematics (although really Computer Science) from the University of Waterloo. My research focused on Algorithm and Data Structure, Design and Analysis.

Posts by Darin Ohashi

Consider the following C code:

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

long *data;

void *thread_function( void *args )
{
long i;
long j = (long)args;

for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j] = i;
}

return NULL;
}

int main( int argc, char **argv )
{
long i, j;
pthread_t ids[NUM_THREADS-1];
struct timeval start, end;
double d;

data = (long*)malloc( sizeof(long)*NUM_THREADS );

for ( j = 1; j <= NUM_THREADS; j++ )
{
for ( i = 0; i < NUM_THREADS; i++ )
{
data[i] = 0;
}

gettimeofday( &start, NULL );
for ( i = 0; i < j-1; i++ )
{
pthread_create( ids+i, NULL, thread_function, (void*)i );
}

thread_function( (void*)i );

for ( i = 0; i < j-1; i++ )
{
pthread_join( ids[i], NULL );
}

gettimeofday( &end, NULL );

d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
}

return 0;
}

Basically this code

   for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j] = i;
}

runs in parallel for 1 to NUM_THREADS threads and prints how long it takes to run and the number of iterations executed per second. data is an array and j is a thread specific index into the array. Thus each thread is given its own memory address. This is important to recognize, the threads are not sharing data.

As we increase the number of threads (up to the number of cores), the running time should stay the same, but the number of iterations per second should increase linearly (remember each thread does NUM_ITERATIONS, so adding a new thread adds NUM_ITERATIONS of work). Let's see what happens when we actually run this code:

1) 15.758 2.72558e+08 ips
2) 28.8921 2.97311e+08 ips
3) 40.1948 3.20561e+08 ips
4) 46.6236 3.6848e+08 ips

Well, that is nothing like what we were expecting. The running time increases significantly and the iterations per second increase only slightly. It would appear as though the threads are interfering with each other, however we can see that they don't actually share any data. What's happening?

This is an example of false sharing. False sharing occurs when multiple threads modify data that share the same cache line. I'm not going to go into too much detail about caching, see this wikipedia article for more details. Briefly, each core has a small private cache that stores a copy of a region of main memory (accessing the local copy is faster than accessing main memory). When the corresponding region in main memory changes, the value stored in the cache is no longer correct. Thus if the core wants to use anything in that memory region, it needs to make a new copy. When multiple threads share a single cache line, modification of the memory in one thread will force the other threads to refresh their caches. This refresh forces the threads to wait for memory transfers far more then they would if the thread was running exclusively. If there are enough threads with enough memory requests it is also possible to saturate the memory bus, which can delay memory transfers even more.

To fix this we need to make sure that the threads aren't sharing cache lines. The easiest way to fix a problem like this is to add padding around the data.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 4
#define NUM_ITERATIONS (1L << 32)

#define CACHE_SIZE 128

typedef struct {
long d;
char buffer[CACHE_SIZE-sizeof(long)];
} Data;

Data *data;

void *thread_function( void *args )
{
long i;
long j = (long)args;

for ( i = 0; i < NUM_ITERATIONS; i++ )
{
data[j].d = i;
}

return NULL;
}

int main( int argc, char **argv )
{
long i, j;
pthread_t ids[NUM_THREADS-1];
struct timeval start, end;
double d;

data = (Data*)malloc( sizeof(Data)*NUM_THREADS );

for ( j = 1; j <= NUM_THREADS; j++ )
{
for ( i = 0; i < NUM_THREADS; i++ )
{
data[i].d = 0;
}

gettimeofday( &start, NULL );
for ( i = 0; i < j-1; i++ )
{
pthread_create( ids+i, NULL, thread_function, (void*)i );
}

thread_function( (void*)i );

for ( i = 0; i < j-1; i++ )
{
pthread_join( ids[i], NULL );
}

gettimeofday( &end, NULL );

d = end.tv_sec - start.tv_sec + (1.0e-6)*end.tv_usec - (1.0e-6)*start.tv_usec;

printf( "%ld) %g %g ips\n", j, d, ((double)j*NUM_ITERATIONS)/d );
}

return 0;
}

We have changed our array from type long, to type Data, a struct that contains padding. This padding guarantees that one struct's data won't share a cache line with other structs, in particular structs used in other threads. Now let's run this code:

1) 15.6487 2.74462e+08 ips
2) 15.0241 5.71744e+08 ips
3) 13.5294 9.52364e+08 ips
4) 12.5411 1.36989e+09 ips

This is what we were expecting (perhaps even a little better).

So how does this effect Maple? I originally tried to write these examples in Maple, but I didn't get significant problems. This is because the overhead of Maple evaluating a statement probably clobbers the shared cache line so it will need to be reloaded in either case. However, as we make Maple faster, it may be possible for false sharing to become a problem in your Maple code.

If you want to download and try executing this code for yourself, be careful of compiler optimizations.  The function that we execute in the threads can easily be optimized into one that does not use the shared memory in each iteration.

I built this example on Linux, using

gcc falsesharing.c -lpthread -o falsesharing

It has been a while since my last post, mostly because of a combination of getting Maple 14 ready to ship and a lack of meaty topics to write about. I am trying to get back into the habit of posting more regularly. You can help me achieve my goal by posting questions about parallel programming. I'll do my best to answer. However for now, I'll give a brief overview of the new parallel programming features in Maple 14.

A new function has been added to the Task Programming Model. The Threads:-Task:-Return function allows a parallel algorithm implemented on top of the Task Programming Model to perform an early bail out. Lets imagine that you have implemented a parallel search. You are looking for a particular element in a large set of data. Using the Task Programming Model, you've created a bunch of tasks, each searching a particular subset of the data. However one of the first tasks to execute finds the element you are looking for. In Maple 13, there was no built in way of telling the other tasks that the result have been found and they they should not execute. In Maple 14, the Return function allows one task to specify a return value (which will be returned from Threads:-Task:-Start) and signal the other tasks that the algorithm is complete and that additional tasks should not be executed. Tasks that are already running will still run to completion, but tasks that have not started executing will not be started.

You may have noticed that there is a race condition with Return. What happens if two tasks both call Return in parallel? Only one of the values will become the value that is passed to Threads:-Task:-Start. I suppose I could say the "first" value is the one that is used, but really, what does that mean? If you call Return, then the value passed to Return should be an acceptable result for the algorithm.  If you call Return more than once, any of those values should be valid, thus it shouldn't matter which one becomes the return value.  That said, the Return function does give some feedback. In the task that succeeds in having its value accepted, Return will return true. In all other tasks that call Return, it will return false. This allows the code to know if a particular result was or was not accepted.

Maple 14 also adds the Task Programming Model to the C External Calling API. This means that you can write your algorithms in C and make use of the Task Programming Model. The C API is similar to the Maple API, with a few differences. In particular, you need to create each child task individually, instead of as a single call to Continue, as you would in Maple. As well, because it is C code, you need to worry about a few details like memory management that are handled automatically in Maple.  Using External Call is fairly advanced, so I won't go into too much detail here.  If you'd like to see more details of using the Task Programming Model in External Calling, I can write a seperated post dedicated to that.

As with every release of Maple, we spent some time trying to make our existing functionality faster and more stable. For parallel programming, we reduced the overhead of using the Task Programming Model, as well as reducing the locking in the kernel (which should help improve parallelism). Of course many bugs have been fixed, which should make parallel programming more reliable in Maple 14.

Maplesoft Employee

Atomic Operations

November 27 2009 by dohashi 802 Maple

Atomic operations are CPU instructions that are guaranteed to execute in a single CPU cycle. This means that the operation is guaranteed to complete without being interrupted by the actions of another thread. Although this may not sound too exciting, careful programming using these instructions can lead to algorithms and data structures that can be used in parallel without needing locks. Maple currently does not support atomic operations, however they are an interesting tool and are used in the kernel to help improve Maple's parallelism in general.

In this post I'll take a closer look at the ways in which Maple code can be thread unsafe. If you have not already seen my post on Thread Safety, consider reading that post first. As a brief review, a procedure is thread safe if it works correctly when run in parallel.

The most obvious way in which procedures can be thread unsafe is if they share data without synchronizing access (using a Mutex, for example). So how can two threads share data?

I'd like to start by thanking all those readers who left feedback on my last post. It was good hear that most of you enjoy reading my posts and that they are generally helpful. I would like to encourage you to continue posting feedback, especially questions or comments about anything that I fail to explain sufficiently.

The following is a discussion of the limitations of parallel programming in Maple. These are the issues that we are aware of and are hoping to fix in future releases.

In my previous posts I have discussed various difficulties encountered when writing parallel algorithms. At the end of the last post I concluded that the best way to solve some of these problems is to introduce a higher level programming model. This blog post will discuss the Task Programming Model, the high level parallel programming model introduced in Maple 13.

In my previous posts I discussed the basic difference between parallel programming and single threaded programming. I also showed how controlling access to shared variables can be used to solve some of those problems. For this post, I am going to discuss more difficulties of writing good parallel algorithms.

Here are some definitions used in this post:

  • scale: the ability of a program to get faster as more cores are available
  • load balancing: how effectively work is distributed over the available cores
  • coarse grained parallelism: parallelizing routines at a high level
  • fine grained parallelism: parallelizing routines at a low level

Consider the following example

In the previous post, I described why parallel programming is hard. Now I am going to start describing techniques for writing parallel code that works correctly.

First some definitions.

  • thread safe: code that works correctly even when called in parallel.
  • critical section: an area of code that will not work correctly if run in parallel.
  • shared: a resource that can be accessed by more than one thread.
  • mutex: a programming tool that controls access to a section of code

In my previous post, I tried to convince you that going parallel is necessary for high performance applications. In this post, I am going to show what makes parallel programming different, and why that makes it harder.

Here are some definitions used in this post:

  • process: the operating system level representation of a running executable. Each process has memory that is protected from other processes.
  • thread: within a single process each thread represents an independent, concurrent path of execution. Threads within a process share memory.

We think of a function as a series of discrete steps. Lets say a function f, is composed of steps f1, f2, f3... e.g.

Maplesoft Employee

Why Go Parallel?

September 14 2009 by dohashi 802 Maple

Computers with multiple processors have been around for a long time and people have been studying parallel programming techniques for just as along. However only in the last few years have multi-core processors and parallel programming become truly mainstream. What changed?

Here are some definitions for terms used in this post:

  • core: the part of a processor responsible for executing a single series of instructions at a time.
  • processor: the physical chip that plugs into a motherboard. A computer can have multiple processors, and each processor can have multiple cores
  • process: a running instance of a program. A process's memory is usually protected from access by other processes.
  • thread: a running instance of a process's code. A single process can have multiple threads, and multiple threads can be executing at the same on multiple cores
  • parallel: the ability to utilize more than one processor at a time to solve problems more quickly, usually by being multi-threaded.

For years, processors designers had been able to increase the performance of processors by increasing their clock speeds. However a few years ago they ran into a few serious problems. RAM access speeds were not able to keep up with the increased speed of processors, causing processors to waste clock cycles waiting for data. The speed at which electrons can flow through wires is limited, leading to delays within the chip itself. Finally, increasing a processor's clock speed also increases its power requirements. Increased power requirements leads to the processor generating more heat (which is why overclockers come up with such ridiculous cooling solutions). All of these issues meant that is was getting harder and harder to continue to increase clock speeds.  The designers realized that instead of increasing the core's clock speed, they could keep the clock speed fairly constant, but put more cores on the chip. Thus was born the multi-core revolution.

My name is Darin Ohashi and I am a senior kernel developer at Maplesoft. For the last few years I have been focused on developing tools to enable parallel programming in Maple. My background is in Mathematics and Computer Science, with a focus on algorithm and data structure design and analysis. Much of my experience with parallel programming has been acquired while working at Maplesoft, and it has been a very interesting ride.

In Maple 13 we added the Task Programming Model, a high level parallel programming system. With the addition of this feature, and a few significant kernel optimizations, useful parallel programs can now be written in Maple. Although there are still limitations and lots more work to be done on our side, adventurous users may want to try writing parallel code for themselves.

To encourage those users, and to help make information about parallel programming more available, I have decided to write a series of blog posts here at Maple Primes. My hope is that I can help explain parallel programming in general terms, with a focus on the tools available in Maple 13. Along the way I may post links to sites, articles and blogs that discuss parallel programming issues, as well as related topics, such as GPU programming (CUDAOpenCL, etc).

My next post, the first real one, I am going to explain why parallel programming has suddenly become such an important topic.

Page 1 of 1