Number Representations & States

"how numbers are stored in computers"

Rounding error of subtraction

When subtracting two positive floating-point numbers using one guard digit, the relative error in the result is very small - specifically, less than about twice the normal rounding error, or .

This can be proven by examining three different cases depending on how the two numbers and are represented. If they are close together, or if one has fewer digits than the other, the guard digit ensures that the subtraction is either exact or accurately rounded. In each case, a careful estimation of how much error is introduced supports the assertion that the relative error always stays under the allowed limit. While this is a generalized proof for any base representation, in binary (base 2) this bound is exactly .

Overall, this provides a reassuring guarantee that subtraction with a guard digit will be reliably accurate, even in worst-case scenarios.

Theorem

If and are positive floating-point numbers in a format with parameters and , and if subtraction is done with digits (i.e. one guard digit), then the relative rounding error in the result is less than .

Proof

Interchange and if necessary so that . It is also harmless to scale and so that is represented by . If is represented as , then the difference is exact. If is represented as , then the guard digit ensures that the computed difference will be the exact difference rounded to a floating-point number, so the rounding error is at most . In general, let and be truncated to digits. Then

From the definition of guard digit, the computed value of is rounded to a floating-point number , where the rounding error satisfies

The exact difference is , so the error is .

Consider three possible cases:

  1. If , then the relative error is bounded as

  1. If , then . Since the smallest that can be is

in this case the relative error is bounded by

  1. If and , which implies , in which case . If , then the same bound as above applies, so the relative error is also bounded by .

When , the bound is exactly , and this bound is achieved for and in the limit as .

Related theorems

When adding numbers of the same sign, a guard digit is not necessary to achieve good accuracy, as shown in Theorem 10.

References

1. David Goldberg