"how numbers are stored in computers""hownumbersarestoredincomputers"
Precise rounding with multiplication and subtraction
This theorem provides a clever method for rounding a floating-point number to significant digits using only basic floating-point operations—multiplication and subtraction—with exact rounding. By defining , the expression produces rounded to digits.
The key insight is that multiplying by shifts the digits, effectively exposing the lower bits to rounding, and the subsequent subtraction removes the influence of these bits, recovering the correctly rounded value. The proof shows that this method captures the high part of accurately under all rounding cases—chopping, rounding down, or rounding up—and that any rounding error is carefully controlled and representable within the precision bounds.
Theorem
Let , and define , where floating-point operations are exactly rounded. Then:
equals exactly, rounded to significant digits. More precisely, this operation rounds the significand of by imagining a radix point placed just to the left of the least significant digits and rounding to the nearest integer.
Example
Consider a floating point system where (binary), (8-bit significands), and (i.e. we will round to significant bits). If we define , the theorem proposes that gives you rounded to 5 significant bits.
Let's take a floating-point number (using 8 bits of precision, i.e. the full ). We want to round this to 5 significant bits, so
Because the 6th bit is a 1, we will round up. Now we compute the expression :
However, since we are rounding and therefore only keeping only 8 significant bits, . Next, we compute , again rounded to 8 bits:
Which yields a final result of , which is rounded to 5 bits.
Proof
Let denote rounded to digits. We'll show that the expression:
computes exactly .
High part ()
Assume has the normalized form .
Rounding to digits affects only digit . If there's no carry-out during rounding, then the result fits within digits and is representable.
If there is a carry-out (i.e., rounding causes a chain of 9s to roll over), then the low-order digit of becomes 0, and still has only nonzero digits. So in all cases, is representable.
Low part ()
To isolate the rounding behavior, we scale to be an integer such that . Let , where is the high part (rounded to digits), and is the low part (remainder).
Now consider .
Then:
So:
But since all operations are rounded to digits, what we actually get is:
The high part is rounded to remove the least significant digits.
Subtracting that from recovers —i.e., rounded to digits.
Accuracy of the low part
Depending on the exact value of , there are three cases:
No rounding required (chopping): If the digits beyond are zero, then rounding is the same as chopping and .
Rounding down: Then is simply the chopped value and has at most nonzero digits, all at the lowest significance—so it's representable.
Rounding up: Then and
Again, this is small and precisely representable.
So in all cases, the expression computes , i.e., rounded to digits.