Consider the following situation. A thread acquires a mutex and enters a critical section. While executing in the critical section, the thread accesses a memory location. If that memory location is not in cache, the thread will wait a few hundred cycles. If the access causes a page fault (the memory was swapped out to disk), the thread may need to wait a few million cycles. All the while, the thread is still holding the mutex, so other threads cannot enter the critical section and may be forced to sleep while trying to acquire it. Thus the behaviour of one thread stops other threads from making progress.
This is known as blocking. When implementing a parallel algorithm, blocking can cause serious reductions in parallelism, so reducing the amount of blocking is important. This has led people to design non-blocking algorithms: algorithms where the delay of one thread will not stop other threads from executing.
There are further sub-divisions of non-blocking algorithms. A lock-free algorithm guarantees that at least one thread will complete the algorithm in a finite number of steps. Other threads may suffer from starvation (get stuck in the algorithm), but at least one thread will always get through. Stricter than lock-free is wait-free: in a wait-free algorithm, all threads are guaranteed to complete in a finite number of steps. Stronger still is bounded wait-free, which guarantees that all threads progress through the algorithm in a bounded number of steps.
It is interesting to note that although lock-free algorithms can suffer from thread starvation, in practice starvation is often rare. Thus a faster lock-free algorithm may be preferable to a slower wait-free one.
Now a short example. Imagine we have an atomic increment instruction, incr. Then incr is a bounded wait-free algorithm for adding a value to a shared variable: multiple threads can all call incr in parallel, and none of them will ever need to execute more than that one instruction. Now, if we did not have a true atomic increment instruction, but we did have a compare-and-swap function, cas, we could implement incr like this:
incr := proc( a::evaln, b )
    local tmp;
    tmp := eval(a);
    while not cas( a, tmp, tmp+b ) do
        tmp := eval(a);
    end do;
end proc;
This implementation is lock-free, but not wait-free: one thread could get stuck in the while loop while other threads repeatedly slip through, modifying a between its read and its cas.