"how numbers are stored in computers"
Relative error of addition
When adding two non-negative floating-point numbers x and y, the relative error in computing x ⊕ y is at most 2ε (twice the machine epsilon), even without using extra guard digits to protect against rounding errors.
This proof addresses the precision loss caused by adding two numbers, where the smaller number y potentially loses digits when it is shifted to align with the larger one x. Even without guard digits, the shifted-off digits of y contribute only a very small relative error to the full result, especially when x is much larger than y. The truncation error from this shift, plus the final rounding to p digits, leads to a maximum relative error of 2ε.
This theorem illustrates the surprisingly good accuracy of floating-point addition, even in the absence of guard digits, and reinforces the idea that the structure of a formula greatly affects its numerical stability.
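The claim is easy to probe empirically. The sketch below (my own check, assuming IEEE 754 binary64, where β = 2, p = 53, and ε = 2⁻⁵³) measures the exact relative error of hardware addition using rational arithmetic; since IEEE addition is correctly rounded, the observed error stays well inside the 2ε bound:

```python
from fractions import Fraction
import random

# Goldberg's machine epsilon for IEEE binary64 (beta = 2, p = 53):
# eps = (beta/2) * beta**-p = 2**-53.
EPS = 2.0 ** -53

def rel_error_of_add(x, y):
    """Exact relative error of the rounded sum x + y, via rationals."""
    exact = Fraction(x) + Fraction(y)   # floats convert to Fraction exactly
    computed = Fraction(x + y)          # one rounded floating-point addition
    return abs(computed - exact) / exact

random.seed(0)
worst = max(
    rel_error_of_add(random.uniform(1.0, 1e10), random.uniform(1e-12, 1.0))
    for _ in range(10_000)
)
# IEEE 754 addition is correctly rounded, so the observed error stays at or
# below eps -- comfortably inside the theorem's 2*eps bound.
print(worst <= 2 * EPS)   # True
```

The interesting content of the theorem is that the 2ε bound survives even on a machine with no guard digits, which the proof below establishes.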
Theorem
If x ≥ 0 and y ≥ 0, the relative error in computing x ⊕ y is at most 2ε, even when no guard digits are used.
Proof
The algorithm for floating-point addition with k guard digits closely resembles the one used for subtraction. When x ≥ y, we right-shift y until its radix point aligns with that of x. Any digits shifted beyond the (p + k)-th place are discarded. We then compute the exact sum of these two truncated numbers (each with p + k digits), and round the result to p digits.
To analyze the worst-case scenario, it suffices to consider the case with no guard digits (k = 0). Without loss of generality, we may assume x ≥ y ≥ 0, and scale x to the normalized form x = x₀.x₁…x₍p₋₁₎, so that 1 ≤ x < β, where β is the base of the floating-point system.
Case 1: No carry-out during addition
In this case, the digits of y that are shifted out contribute a value less than β^(1−p). Since there is no carry-out, the sum of the two truncated operands already fits in p digits, so no further rounding occurs, and the sum is at least x ≥ 1. The relative error due to truncation is therefore bounded by

β^(1−p) / 1 = β^(1−p) = 2ε,

using ε = (β/2)β^(−p).
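To make Case 1 concrete, here is a hypothetical worked instance (the numbers are mine, not from the paper) in base β = 10 with p = 3, so ε = 0.005 and 2ε = β^(1−p) = 0.01, checked with exact rational arithmetic:

```python
from fractions import Fraction

# Hypothetical 3-digit decimal system: beta = 10, p = 3, eps = 0.005.
x = Fraction("1.00")
y = Fraction("0.00999")         # must be shifted right to align with x

# With no guard digits, every digit of y beyond the 3rd place is discarded:
y_truncated = Fraction("0.00")
computed = x + y_truncated      # = 1.00; no carry-out, already p digits

exact = x + y                   # = 1.00999
rel_err = abs(computed - exact) / exact
print(rel_err < Fraction("0.01"))   # True: error ~0.0099, below 2*eps = 0.01
```

This is nearly the worst case for truncation: almost a full unit in x's last place is discarded, yet the relative error stays under 2ε.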
Case 2: Carry-out occurs
Here, we must account for both the truncation error and the rounding error. The carry-out shifts the result's last place up to β^(2−p), so the final rounding error is at most (1/2)β^(2−p) = εβ, and the truncation error is again at most β^(1−p) = 2ε. Since the sum must now be at least β, the relative error is bounded by

(2ε + εβ) / β = 2ε/β + ε ≤ 2ε (since β ≥ 2).
In both cases, the relative error is bounded by .
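The shift–truncate–round procedure from the proof can also be simulated directly. The sketch below (my own construction, not code from the reference) models unguarded addition of normalized p-digit numbers in base β with exact rationals, and checks the 2ε = β^(1−p) bound empirically:

```python
import math
import random
from fractions import Fraction

def unguarded_add(x, y, beta, p):
    """Add x, y >= 0 with NO guard digits: align y with x, discard the
    shifted-off digits, add exactly, then round to p digits (half up)."""
    b = Fraction(beta)
    if y > x:
        x, y = y, x
    if x == 0:
        return Fraction(0)
    # Find e with beta**e <= x < beta**(e+1), i.e. x = d0.d1...d(p-1) * beta**e.
    e = 0
    while x >= b ** (e + 1):
        e += 1
    while x < b ** e:
        e -= 1
    ulp = b ** (e - p + 1)                   # x's last digit position
    y_trunc = (y // ulp) * ulp               # digits below it are discarded
    s = x + y_trunc                          # exact sum of truncated operands
    se = e + 1 if s >= b ** (e + 1) else e   # carry-out bumps the exponent
    q = b ** (se - p + 1)
    return math.floor(s / q + Fraction(1, 2)) * q   # round to p digits

def random_p_digit(beta, p):
    m = random.randrange(beta ** (p - 1), beta ** p)    # p-digit significand
    return Fraction(m) * Fraction(beta) ** (random.randrange(-6, 6) - p + 1)

beta, p = 10, 4
two_eps = Fraction(beta) ** (1 - p)          # 2*eps = beta**(1-p)
random.seed(1)
worst = Fraction(0)
for _ in range(2000):
    x, y = random_p_digit(beta, p), random_p_digit(beta, p)
    err = abs(unguarded_add(x, y, beta, p) - (x + y)) / (x + y)
    worst = max(worst, err)
print(worst <= two_eps)   # True: no trial exceeds the theorem's bound
```

Rounding half up rather than half to even does not affect the bound: any round-to-nearest rule contributes at most half a unit in the last place, which is what Case 2 assumes.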
Discussion
This theorem provides part of the proof for Theorem 2, which establishes bounds for relative error in a single operation. These theorems collectively demonstrate why computing an expression like (x + y)(x − y) is generally more accurate than computing x² − y², especially when x and y are close together. The latter can suffer large relative errors due to catastrophic cancellation, while the former maintains more numerical stability.
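As a quick illustration (the specific values are mine, chosen for IEEE binary64, and are not from the reference), compare the two forms when x and y are nearly equal:

```python
from fractions import Fraction

# x and y differ only in a low-order bit, so the squares nearly cancel.
x, y = 1.0 + 2.0 ** -29, 1.0

naive = x * x - y * y            # x*x is rounded first, then cancellation
factored = (x + y) * (x - y)     # x - y is exact here (Sterbenz's lemma)
exact = Fraction(x) ** 2 - Fraction(y) ** 2

err_naive = abs(Fraction(naive) - exact) / exact
err_factored = abs(Fraction(factored) - exact) / exact
print(err_factored < err_naive)  # True: the factored form is more accurate
```

The damage in the naive form is done before the subtraction: x·x must be rounded, and the subsequent cancellation magnifies that rounding error into the leading digits of the result.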
References
1. David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys, 23(1), 1991.