The problem
Back in 1996 I was working for the Symbolic Computation Group at the University of Waterloo, developing algorithms and code for the arbitrary precision evaluation of special functions. I got an email from Peter Borwein at Simon Fraser University reporting a possible bug in Maple's code for the Riemann zeta function, which I had written for the thencurrent release. Peter was correct and I fixed the bug, but I was curious about how he had discovered it in the first place, as it did not show up until several thousand digits of some zeta function values were computed.
Peter was trying to compute a highprecision value for the LandauRamanujan constant, which is defined as
, where is the number of positive integers less than or equal to which can be written as the sum of exactly 2 squares. The convergence of this formula is extremely slow, yielding 1 digit at . Ramanujan computed K ≈.764. In 1908, Landau proved
, where is the set of primes. This is better, but still very slow: using primes < 100000 gives 6 digits for K.
In 1996, Flajolet & Vardi [1] derived the remarkable formula
where is the Riemann zeta function and is the Dirichlet beta function. This formula is extremely rapidly converging: 9 terms give nearly 500 digits, 10 terms nearly 1000 digits, 11 terms nearly 2000 digits, and so on. It was the publication of this formula which led Peter Borwein to try the computation of in Maple, and led to his discovery of the zeta code bug.
The Dirichlet beta function has a representation in terms of Hurwitz (or generalized) zeta functions:
where . Note the relationship between the Riemann and Hurwitz zeta functions: . For even integers , the Riemann zeta function has a simple form in terms of Bernoulli numbers:
for even n
Now, having found and fixed the problem that Peter reported, I decided to see just how far I could push this computation, that is, how many digits of I could compute in Maple. Flajolet had used his new formula to obtain 1024 digits.
The algorithm
The bulk of the work is the Hurwitz zeta function computations, which are done in Maple using the EulerMacLaurin summation formula:
This formula requires Bernoulli numbers, lots of them for high precision. The standard formula for Bernoulli numbers is
Note for odd . , even, is rational (with typically a small denominator), not integer. This formula is highly recursive, with each nonzero Bernoulli number defined in terms of all previous ones. It is really only usable up to about . Using all of the above, in 1996 I was able to compute 10,000 digits of . It took a full week of cpu time on what was, for its day, a pretty powerful computer.
The next technological advance which allowed higher precision computation of occurred in 2007, with the publication by Fee & Plouffe [2] of a fast, direct algorithm for the computation of Bernoulli numbers. This algorithm is derived from the clever observation that the integer and fractional parts of the Bernoulli numbers can be computed separately, by different algorithms, each of which is very efficient. First, we turn around the formula above for the Riemann zeta function at even integer values:
Next, we use the classical formula for in terms of prime numbers:
This representation converges slowly to the exact value of , but only a very few terms of the product are required to get the integer part of .
A different formula very quickly gives the fractional part of :
Using this approach, we can compute any Bernoulli number very quickly, and independently of any other Bernoulli number, which means that the algorithm is completely parallelizable.
Putting it all together, we need to compute
For each term in the product we need 2 Hurwitz zeta computations, one extra Bernoulli number and a factorial. The zeta function computations will need lots of Bernoulli numbers, but for the most part the different zeta calculations will use the same Bernoulli numbers, so the first step is to precompute and store these. Extrapolating from lower precision computations, I determined that I would need no more than about the first 57,000 nonzero Bernoulli numbers to compute to about 128,000 digits, plus a few more for the factors in the formula, so I set a Maple job running to collect all these while I worked on the code for the main computation.
With the Bernoulli numbers now essentially free, all the cost of computing is in the Hurwitz zeta computations. Notice that these computations are independent of each other.
The technology
Maplesoft has a 24node system running Microsoft's HPC Server 2008 R2 operating system. Using the Maple Grid Computing Toolbox, we can submit a job  a file of Maple commands  to this system and get (almost) 24X (coarsegrained) parallelization for very little effort.
Nodes communicate by sending messages using the MPI system. The Maple Grid Computing Toolbox commands in Maple needed for this project are:
MPI:Init() 
 
Setup 
MPI:Comm_Rank( MPI:COMM_WORLD ) 
 
Who am I? 
MPI:Send( MPI:COMM_WORLD, dest, msg ) 
 
Send msg to node dest 
MPI:Recv( MPI:COMM_WORLD, src ) 
 
Receive a message from src (which can be MPI:ANY_SOURCE) 
MPI:Comm_Size( MPI:COMM_WORLD ) 
 
Number of nodes available to computation 
MPI:Finalize() 
 
Node is done 
Handling the Bernoulli numbers turned out to be slightly tricky, due to the hardware configuration of the system. Specifically, the 24 nodes of the grid platform are organized into 3 "planes" of 8 nodes each. The 8 nodes on a plane share 8G of RAM. When the memory required by the zeta computations and the memory already in use by the o/s is accounted for, there isn't enough room left to hold all the 57,000 precomputed Bernoulli numbers  stored in Maples .m format, they require more than 6G of memory. This means that one node can't serve out all the required Bernoulli numbers (well, it could if it didn't keep them all in memory, but since the different nodes will typically be at different points in their zeta calculations it is hard to envision how this could be done efficiently).
Instead, I experimented with two different methods:
1. 
Select one node on each plane as a Bernoulli number server, with the node on plane 1 serving out the first 1000 nonzero Bernoulli numbers, the node on plane 2 the next 1000, the node on plane 3 the next 1000, and back to the first for the next 1000, and so on; in consequence, only 21 nodes are available for zeta calculations.

2. 
Have each node read the Bernoulli numbers it needs (again in batches of 1000, as that's how I chose to store them on disk), and then discard those (allowing the memory to be reclaimed) when a new batch is needed. In this design, 23 nodes are available for zeta calculations (node 0 manages the overall job).

One optimization was derived from observations of the computation time for each zeta value. Specifically, the calculations for terms near the middle of the set required generally took the longest, so those were handed out to the compute nodes earliest.
The code for method (2) is attached. Observe that the design is a fairly simple message pump: Node 0, the master node, continuously runs a "receive  respond" loop, listening for and interpreting messages from the other nodes. The different messages each consist of a message code (e.g., 1 = "I need more work") and whatever other information the master node requires in order to respond. Similarly, the worker nodes (nodes 123) request work from the master node, carry out the task they are given, send the result back and ask for more work. The point here is that getting access to a grid system may be the hard part. Using the toolbox to deploy jobs on the grid is very straightforward.
The results
This table summarizes the results. The first column gives the setting of Digits for the computation. The second column gives the number of terms of the FlajoletVardi product formula used for the computation. The third column indicates the number of digits deemed correct. This value is obtained by comparison with the value obtained for the next row down in the table. This is why the last calculation, at 128,000 digits, includes one extra term. The fourth column specifies the highest index Bernoulli number required by the Hurwitz zeta computations, with any higher index Bernoulli numbers required by the product formula given in parentheses. Note that the last result requires two Bernoulli numbers which are not among the precomputed 57,000 nonzero values. These were also precomputed using the direct FeePlouffe formula, requiring very little time (relatively speaking).
It turned out that Bernoulli number method (1) required the least amount of total memory, about 5G from each plane. However, method (2), which required close to the maximum RAM available on each plane, was noticeably faster for the higher precision computations. The last column of the table gives the time required by the method (2) version.
Digits

# Terms

Accuracy of result

Max. (non0) Bernoulli number

Time

1000

10

979

1200

1.0 sec

2000

11

1957

2272

1.8 sec

4000

12

3911

4262

8.5 sec

8000

13

7820

< 8000 (8192)

46 sec

16000

14

15637

< 16000 (16384)

4.5 min

32000

15

31272

< 28000 (32768)

33 min

64000

16

62541

< 54000 (65536)

3.4 hr

128000

18

125079

< 108000 (131072, 262144)

15.3 hr

Here are the previous computation records. The fifth entry is an uncredited value which was found on the OnLine Encyclopedia of Integer Sequences, where the current record of 125,079 digits is now stored.
Ramanujan

3

w1910

Flajolet

1024

1996

Hare

10,000

1996

Sebah

30,010

2002

?

65,536

?

Hare

125,079

2010

Download the Maple Document
Download the Maple Code File