Number Representations & States

"how numbers are stored in computers"

Splitting floating point numbers

This theorem presents a powerful technique in floating-point arithmetic: any number can be split into two parts — a high part and a low part — such that , and each part is representable using only half the original precision. This decomposition is possible under the assumption of exact rounding, and it's especially useful for reducing rounding errors in sensitive calculations.

To perform the split, we define , where , and is the number of bits of floating-point precision. Then, the high part is computed as , and the low part as . When rounding is exact, both and retain -bit precision, making them safely usable in subsequent computations.

Consider a calculation of the quadratic formula with (i.e. 4 digits of precision), with values like , , and . A naive calculation of using rounded values yields , which is significantly different from the correctly rounded result . However, using this theorem, we can justify splitting the numbers around a common base (even a fractional base like ), performing algebraic expansion (e.g. ), and evaluating each term exactly. Subtracting the expanded forms for and yields a result that matches the correctly rounded value, with no significant loss of precision.

The theorem critically depends on exact rounding - otherwise, the decomposition may fail. For instance, if and only one guard digit is used, the calculation of and can yield parts that aren't even representable in the required bit-width, invalidating the method. This limitation underlines why floating-point standards, such as IEEE 754, mandate exact rounding — it is essential for accurate and reliable numerical algorithms.

Theorem

Let be the floating-point precision, with the restriction that is even when , and assume that floating-point operations are exactly rounded. Then if is half the precision (rounded up) and , can be split as , where

and each is representable using bits of precision.

Proof

By Theorem 14, is rounded to places. If there is no carry out, then certainly can be represented with significant digits.

Suppose there is a carry-out. If , then rounding adds to , and the only way there can be a carry-out is if , but then the low order digit of is , and so again is representable in digits.

To deal with , scale to be an integer satisfying . Let , where is the high order digits of , and is the low order digits.

Consider three cases. If , then rounding to places is the same as chopping and , and .

Since has at most digits, if is even, then has at most digits. Otherwise, and is representable with significant bits.

The second case is when , and then computing involves rounding up, so , and . Once again, has at most digits, so is representable with digits.

Finally, if , then or depending on whether there is a round up. Therefore, is either or , both of which are represented with 1 digit.

References

1. What Every Computer Scientist Should Know About Floating-Point Arithmetic David Goldberg