"how numbers are stored and used in computers"

Floating Point Numerical Analysis

If you work on building computer systems, chances are you'll need to understand how floating-point arithmetic works at a deeper level. Surprisingly, there aren't many clear and detailed resources out there on the topic.

The theorems below are a practical introduction to the numerical analysis behind floating-point arithmetic. They are written for a technical audience, but should be comprehensible to anyone, even those less mathematically inclined. The purpose of these proofs is to show how floating-point calculations can be reasoned about in general, especially with respect to rounding and exactness.

Numerical analysis of floating point arithmetic

Detailed explanations of the theorems presented in David Goldberg's seminal 1991 paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic", with code examples and visualizations for the less mathematically inclined.

Digits and relative error

Establishes an upper bound on relative error as a function of total digits.

Theorem 1

Relative rounding error

Establishes the upper bound on relative rounding error.

Theorem 2

Rounding error of triangle area

An illustrative example of how rounding error bounds might be established for a formula.

Theorem 3

Catastrophic cancellation

Explores the implications of rounding errors and cancellation when computing ln(1 + x).

Theorem 4

Rounding halfway

Considers the implications of rounding at halfway points like 0.5.

Theorem 5

Splitting floating point numbers

How exact rounding enables precise arithmetic through high-low decomposition.

Theorem 6

Exact multiplications and divisions

Behavior of floating point arithmetic when the base is 2 and operations are exactly rounded.

Theorem 7

Kahan Summation Formula

Reducing numerical error from adding a sequence of numbers by tracking total accumulated error.

Theorem 8

Rounding error of subtraction

Establishes bounds on relative error for certain subtraction operations.

Theorem 9

Relative error of addition

Establishes bounds on relative error for certain addition operations.

Theorem 10

Performing exact subtractions

Establishes conditions under which a subtraction is performed exactly in floating-point arithmetic.

Theorem 11

Numerically stable triangle area

Examines the numerical stability of a formula for calculating the area of a triangle.

Theorem 12

The logarithmic mean function

Bounding values and derivatives of the logarithmic mean function.

Theorem 13

Precise rounding with multiplication and subtraction

Rounding to fewer significant digits using exact operations.

Theorem 14

Decimal precision

Round-trip conversion between binary floating-point numbers and their decimal representations.

Theorem 15

Exactness

In floating-point arithmetic, exact results are rare because most operations round. But knowing when you can subtract without error is useful for designing robust numerical algorithms, such as square root calculations, computing differences, or detecting small perturbations.
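One way to see when a subtraction is error-free is to compare the floating-point result against the exact rational difference. The sketch below is illustrative only: it assumes Python floats are IEEE 754 doubles (true on common platforms), the helper name `subtraction_is_exact` is hypothetical, and it relies on the well-known result (often called Sterbenz's lemma) that x - y is exact when y/2 <= x <= 2y.

```python
from fractions import Fraction


def subtraction_is_exact(x: float, y: float) -> bool:
    """Return True if the float result x - y equals the true difference.

    Hypothetical helper for illustration; assumes finite inputs.
    """
    # Fraction(float) converts a binary double to an exact rational,
    # so the right-hand side is the mathematically exact difference.
    return Fraction(x - y) == Fraction(x) - Fraction(y)


# Operands within a factor of two of each other (0.35 <= 1.1 <= 1.4):
# the difference is representable, so no rounding occurs.
print(subtraction_is_exact(1.1, 0.7))     # True

# Operands of very different magnitude: the small operand is lost
# entirely when the result is rounded back to a double.
print(subtraction_is_exact(1.0, 1e-20))   # False
```

Note that the check is about the subtraction step alone: 1.1 and 0.7 are themselves rounded decimal inputs, but the difference of the two stored doubles is computed without any further error.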