BLOKPI: Yet Another Program to Calculate Pi Introduction ------------ This documentation explains how blokpi.c works and what you can do with it. Like the source code, this file is placed in the public domain by its author, Jason Papadopoulos. Give it to anyone you want, but please try and keep it bundled together with blokpi.c so that future generations will have a chance to decipher what I did. What's so special about this particular program? A couple of things. First, it's extremely fast; on modern processors it averages three times faster than all the other public domain calculators that add up series. On a 200MHz Pentium it can add up a million digits of pi in under 2.5 hours! Second, it's versatile: I've written it to compile portably on an ANSI C compiler, and also included options to tailor performance to a specific machine and FPU. For my favorite combo (gcc on a Pentium) there's some inline assembly code for better performance. I also made it generic enough to sum any series that has a special form, so you can use it to compute other constants to monster decimal places, at breakneck speed. Third, it's new. The formula used is an unusual one I haven't seen used elsewhere, and the digit hunting tricks that make it so fast don't appear anywhere else. I've seen lots of pi-crunching code over the years, and it's always a treat to stumble onto something different. I've verified that it produces correct results up to 20,000 digits of pi. Configuring the Thing --------------------- Basically, the only part of the code you have to worry about is at the very beginning. You should tweak the #defines for your particular compiler and system. Brief explanations follow: BLOCKPI: This should be set to (at most) the number of doubles that will fit into half your cache. For a Pentium it should be large but less than 512; with MMX it can go up to 1024, but the exact number isn't critically important. Bigger=better. BLOCKTERM: This should be set to (at most) the number of doubles that will fit into 1/8 of your cache. For a Pentium it should be large but less than 128; with MMX it can go up to 256. Bigger=better, but unlike BLOCKPI the exact number you choose is critical for the program to work correctly. See the code for details. INTEL: Intel FPU registers have 64 bits, and rounding a floating point value to the nearest whole number requires a little finesse. For any other machine, don't define this unless you know for sure that its FPU registers are 64 bits long. BASE: The base you want to work in. Good choices are 1.0e10 for Intel stuff, and 1.0e7 for everybody else. If you know for a fact that your FPU registers have more than 53 bits of precision you can set this higher, but unless the mantissa is 64 bits long you have to fiddle with the value of ROUND yourself (it's in the code). The higher you make BASE, the faster the program will run but the fewer terms you can add up. Details are in the code; you can always make it lower, and every power of ten will multiply by ten the number of terms you can sum up. BASEDIGITS: the power of 10 in BASE GCCDOS: compile assembly code optimized for DJGPP for DOS. The program will run about 20% faster if you do this. Everything else allows you to change formulas at compile time; an expla- nation appears a little later. How it Works ------------ The idea is to take infinite series of a certain form and rearrange them into another form, suitable for compuation. As an example, take the usual series for "e": 1 1 1 1 e = 1 + - + --- + ----- + ------- + ... 2 2*3 2*3*4 2*3*4*5 The traditional method of adding up such a series is to have two arrays, one to hold the sum (starts with 1) and the other to hold one term (starts with 1/2). Divide term by 3, add to sum; divide term by 4, add to sum; divide term by 5, add to sum; and so on. There are advantages to this method (hopefully obvious): first, as the computation progresses the array "term" becomes smaller and smaller because you're dividing by integers over and over again. Thus the program will speed up as time goes by. Second, you don't have to figure out in advance how many terms to add up, just keep going until the term array is all zeros (this is a real plus in the case of this series because its convergenece is superlinear, i.e. faster than "x digits per term"). It also has disadvantages, though: every time you add on a term you have to be careful that the elements of "sum" don't overflow, so either the elements in the sum have to have lots of unnecessary extra digits, or you "carry the one" with every term and pay a substantial loss in per- formance. There's another way to compute this series, though. It first appeared officially in a paper by Rabinowitz and Wagon in the American Mathematical Monthly of about 1995 but the idea is much older; the first mention of it in the literature was from a little paper in an ACM journal in the 60s. 1 1 1 1 e = 1 + - ( 1 + - ( 1 + - ( 1 + - ( 1 + ... )))...) 2 3 4 5 Adding this thing up is a completely different matter from the first method. You have one array, initialized to 1. Divide by huge number, add 1; divide by huge number minus 1, add one; divide by huge number minus 2, add 1; and so on. After dividing by 2 and adding one the result is e. First the advantages compared to the standard way: the memory needed is cut in half; you only need one array, for the answer. Second, instead of a multiple-precision add (i.e. adding one array to another) you only have a single-precision add between divisions (i.e. add 1). This cuts lots of complexity from the code for a significant time savings. The disadvantages are hopefully obvious too: the worst is that the array being worked on *never gets smaller* so the computation will never speed up. This amounts to a large increase in computation time. Note that the nested method above involves less arithmetic per term than before but requires full precision in every term. We'll see in a minute for a simpler series whether time is wasted or saved from these two issues. There are other, comparatively minor disadvantages to worry about too: first, you have to start at the end and work backwards, meaning you have to know in advance how many terms to add up. For the e series this is a problem but for the series we'll look at the convergence rate is very regular. Also, you get no correct digits of the answer at all, until the very end when you get them all at once. This isn't a problem either since you'd have to be nuts to print out partial results while the computation is in progress. For the curious, the series I use is 8 1 2 3 4 5 6 pi = -- (6 - -- * -- (13 - -- * -- (20 - -- * -- (26 - ... 15 21 11 39 17 51 23 first discovered by RW Gosper. See RW Gosper, "Acceleration of Series", Memo #304, MIT AI Lab, 1973. It's way more complicated than the series for e, but it converges at a nice regular 1.43 digits per term, about as fast as Machin's arctangent formula. Incidentally, Gosper's paper is wonderful, you'll want to spend hours with Mathematica after reading it (that is, if you're very sick like me). Changing Formulas ----------------- Do you have another series you can rewrite in nested form? blokpi.c will add it up too, if you change a few things at compile time. Suppose you wanted to use another pi series I discovered: 2 1 1 2 3 3 5 pi = - (5 - - * - (11 - - * -- (17 - -- * -- (23 - ... 3 5 7 9 11 13 17 Here each term is 1/8 the size of the previous one, so it's slower than the default series the prorgram comes with. You only have to change a few things: FRACTIONS_PER_TERM: How many fractions are in front of each parenthesis? CORRECT_DIGITS_PER_TERM: This will be log10(8); for the default series the program comes with, this is log10(27). Finally, you have to describe the series in the stuff[] array. To do this, you have to look at the first term in the series that has all its fractions. Here it's the one with the 1/5 and the 1/7. Then the stuff[] array is { ignore, ignore, the number after the ( in the first term (11 here), how much that number increases with every term (6 here), numerator of the fraction in front (2), denominator of the fraction in front (3), numerator of first fraction (1), how much it increases with every term (1), denominator of first fraction (-5), how much it increases with every term (-4), numerator of second fraction (1), how much it increases with every term (2), denominator of second fraction (7), how much it increases with every term (4) }; If there are more fractions, just enter them in the same fashion. Note that I made one of the fractions negative, because there are minus signs in the series. Everything would be positive if there were all plus signs. Also, if you had to add more fractions make sure stuff[] is declared big enough to hold them. If you think pi is passe' you can do a lot of good with this program. Here are some series you can crunch away on: 5 1 1 1 2 2 2 3 3 3 Zeta(3) = - (1 - - * - * - (1 - -- * - * - ( 1 - -- * - * - (1 - ... 4 6 2 2 10 3 3 14 4 4 1 3 3 7 5 11 Lemniscate A = 1 + - * -- (4 + - * -- (7 + -- * -- (10 + ... 5 10 9 18 13 26 The first has been known for a while, the second I found. In both cases the world record is only a few million digits, and a fast Pentium can get that many in a few days with this program. See the CECM (the Center for Experimental and Constructive Mathematics at Simon Fraser U) for up-to-date reports on the present records. Some Analysis ------------- Here's a simpler series as another example, found by using Euler's transformation used on Gregory's series ( pi = 4 - 4/3 + 4/5 - 4/7 + ...): 1 1 2 1 2 3 1 2 3 4 pi = 2 ( 1 + - + -*- + -*-*- + -*-*-*- + ... ) (1) 3 3 5 3 5 7 3 5 7 9 1 2 3 4 = 2 + - ( 2 + - ( 2 + - ( 2 + - ( 2 + ... )))...) (2) 3 5 7 9 Every term in (1) is about half the size of the previous term, so (1) converges at at about log10(.5) = .301 correct digits per term. Like the series for e, (2) saves one full-size addition. You don't have to do full size addition, and pretend for the moment that you don't have to propagate carries when multiplying by a fraction. Is equation (2) going to be faster or slower than equation (1)? A back of the envelope calculation shows the following: time for (1) 1 extra work in (1) ------------ ~= - ( 1 + ----------------- ) time for (2) 2 work in (2) For the Euler series the work in (2) involes multiplying by a single fraction, and the extra work for (1) is a full size add and carry adjust. Suppose the extra work comes out to 1/5 of the work in (2) (not unrea- sonable, since division takes a long time); then equation (1) will finish in only 3/5 the time that (2) needs! If all the work you do for one term can carry over to the next term the standard method of summing series runs almost twice as fast as the nested method! Do you have to carry any 1's in formulas like (2)? Suppose you had to multiply the array by the fraction num/den; in C the code would look like this: remainder = 0; for(i=0; i