Number Representations & States

"how numbers are stored and used in computers"

Rounding halfway

Rounding is straightforward, with the exception of how to round halfway cases - for example, should round to or ? One school of thought is that half the digits () should round down, and the other half () should round up, so would round to .

Another school of thought says that since numbers ending in are exactly halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior to determine the rounding direction by the cardinality of the least significant digit, so rounds to rather than because its least significant digit is even.

Which of these methods is best, round up or round to even? Reiser and Knuth [2] offer the following reason for preferring round to even.

Theorem

Let and be floating-point numbers, and define , . If and are exactly rounded using round to even, then either for all or for all .

Proof

TODO

Discussion

To clarify this result, consider the floating point parameters , with . When rounding up, the sequence becomes



Each successive value of increases by until . Under round to even, is always . This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen.

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product as a sum, where each summand has the same precision as and . This can be done by splitting and . Writing and , the exact product is

If and have bit significands, the summands will also have bit significands provided that can be represented using bits. When is even, it is easy to find a splitting. The number can be written as the sum of and . When is odd, this simple splitting method does not work. However, an extra bit can be gained by using negative numbers. For example if , can be split as . There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

References

  1. Goldberg D.. (1991). What Every Computer Scientist Should Know About Floating-Point Arithmetic. Computing Surveys.