Now that processor manufacturers have largely given up trying to squeeze more performance out of individual processors and have refocused their attention on providing us with more processors instead, the biggest gains in performance are to be had by using parallel techniques in our programs so as to make use of these extra cores. Parallel Haskell is aimed at providing access to multiple processors in a natural and robust way.

You might wonder whether the compiler could automatically parallelize programs for us. After all, it should be easier to do this in a purely functional language, where the only dependencies between computations are data dependencies, which are mostly perspicuous and thus readily analyzed. However, even in a purely functional language, automatic parallelization is thwarted by an age-old problem: To make the program faster, we have to gain more from parallelism than we lose due to the overhead of adding it, and compile-time analysis cannot make good judgments in this area. An alternative approach is to use runtime profiling to find good candidates for parallelization and to feed this information back into the compiler. Even this, however, has not been terribly successful in practice.

Fully automatic parallelization is still a pipe dream. However, the parallel programming models provided by Haskell do succeed in eliminating some mundane or error-prone aspects traditionally associated with parallel programming:

• Parallel programming in Haskell is deterministic: The parallel program always produces the same answer, regardless of how many processors are used to run it. So parallel programs can be debugged without actually running them in parallel. Furthermore, the programmer can be confident that adding parallelism will not introduce lurking race conditions or deadlocks that would be hard to eliminate with testing.
• Parallel Haskell programs are high-level and declarative and do not explicitly deal with concepts like synchronization or communication. The programmer indicates where the parallelism is, and the details of actually running the program in parallel are left to the runtime system. This is both a blessing and a curse:

• By embodying fewer operational details, parallel Haskell programs are abstract and are therefore likely to work on a wide range of parallel hardware.
• Parallel Haskell programs can take advantage of existing highly tuned technology in the runtime system, such as parallel garbage collection. Furthermore, the program gets to benefit from future improvements made to the runtime with no additional effort.
• Because a lot of the details of execution are hidden, performance problems can be hard to understand. Moreover, the programmer has less control than he would in a lower-level programming language, so fixing performance problems can be tricky. Indeed, this problem is not limited to Parallel Haskell: It will be familiar to anyone who has tried to optimize Haskell programs at all. In this book, I hope to demonstrate how to identify and work around the most common issues that can occur in practice.

The main thing that the parallel Haskell programmer has to think about is partitioning: dividing up the problem into pieces that can be computed in parallel. Ideally, you want to have enough tasks to keep all the processors busy continuously. However, your efforts may be frustrated in two ways:

Granularity
If you make your tasks too small, the overhead of managing the tasks outweighs any benefit you might get from running them in parallel. So granularity should be large enough to dwarf overhead, but not too large, because then you risk not having enough work to keep all the processors busy, especially toward the end of the execution when there are fewer tasks left.
Data dependencies
When one task depends on another, they must be performed sequentially. The first two programming models we will be encountering in this book take different approaches to data dependencies: In Chapter 3, data dependencies are entirely implicit, whereas in Chapter 4 they are explicit. Programming with explicit data dependencies is less concise, but it can be easier to understand and fix problems when the data dependencies are not hidden.

In the following chapters, we will describe the various parallel programming models that Haskell provides:

• Chapters 2 and 3 introduce the Eval monad and Evaluation Strategies, which are suitable for expressing parallelism in Haskell programs that are not heavily numerical or array-based. These programming models are well established, and there are many good examples of using them to achieve parallelism.
• Chapter 4 introduces the Par monad, a more recent parallel programming model that also aims at parallelizing ordinary Haskell code but with a different trade-off: It affords the programmer more control in exchange for some of the conciseness and modularity of Strategies.
• Chapter 5 looks at the Repa library, which provides a rich set of combinators for building parallel array computations. You can express a complex array algorithm as the composition of several simpler operations, and the library automatically optimizes the composition into a single-pass algorithm using a technique called fusion. Furthermore, the implementation of the library automatically parallelizes the operation using the available processors.
• Chapter 6 discusses programming with a graphics processing unit (GPU) using the Accelerate library, which offers a similar programming model to Repa but runs the computation directly on the GPU.

Parallelizing Haskell code can be a joyful experience: Adding a small annotation to your program can suddenly make it run several times faster on a multicore machine. It can also be a frustrating experience. As we’ll see over the course of the next few chapters, there are a number of pitfalls waiting to trap you. Some of these are Haskell-specific, and some are part and parcel of parallel programming in any language. Hopefully by the end you’ll have built up enough of an intuition for parallel programming that you’ll be able to achieve decent parallel speedups in your own code using the techniques covered.

Keep in mind while reading this part of the book that obtaining reliable results with parallelism is inherently difficult because in today’s complex computing devices, performance depends on a vast number of interacting components. For this reason, the results I get from running the examples on my computers might differ somewhat from the results you get on your hardware. Hopefully the difference isn’t huge—if it is, that might indicate a problem in GHC that you should report. The important thing is to be aware that performance is fragile, especially where parallelism is concerned.