Programmers use numbers every day, but the precision and limitations of those numbers are often treated as an esoteric topic. This is a free and open-source guide to binary representations, numerical analysis, floating point numbers, and character encodings, designed to be approachable by a wide audience.
This is essential information for all software developers, and particularly for those working on the internals of LLMs, machine learning systems, and database technology. You can safely use any of the code and information on this website in any commercial setting, with no licensing requirements.
Interactive lessons
Floating point formats
Double precision
A 64-bit format offering very high precision and a wide dynamic range, commonly used in scientific computing.
FP64
Single precision
A 32-bit format balancing precision and performance, widely used in general-purpose computing and graphics.
FP32
Half precision
Optimized for memory efficiency and speed with lower precision and dynamic range, used in AI and graphics.
FP16
Brain Floating Point
Same exponent size as FP32 but reduced mantissa, enabling faster computation while retaining dynamic range.
BF16
8-bit floating point
A low precision format for maximum efficiency, mainly used in deep learning models for training and inference.
FP8
4-bit floating point
An extremely compact format used in specialized machine learning and graphics applications.
FP4
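The lessons above each explore one format interactively. As a quick illustration of the sign/exponent/mantissa layout they all share, here is a small Python sketch (not part of the lessons) that unpacks the bit fields of the two formats the standard library handles natively, FP64 and FP32:

```python
import struct

def fp64_fields(x: float) -> tuple[int, int, int]:
    """Unpack an IEEE 754 double into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                   # 1 sign bit
    exponent = (bits >> 52) & 0x7FF     # 11 exponent bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)   # 52 mantissa bits, implicit leading 1
    return sign, exponent, mantissa

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Unpack the IEEE 754 single-precision encoding of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    mantissa = bits & ((1 << 23) - 1)   # 23 mantissa bits
    return sign, exponent, mantissa

# 0.1 has no exact binary representation, so both formats store a nearby value.
s, e, m = fp64_fields(0.1)
print(f"FP64: sign={s} exponent={e - 1023:+d} mantissa=0x{m:013x}")
s, e, m = fp32_fields(0.1)
print(f"FP32: sign={s} exponent={e - 127:+d} mantissa=0x{m:06x}")
```

The narrower formats follow the same three-field layout, just with fewer bits in the exponent and mantissa.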
Floating point theorems
Digits and relative error
Establishes an upper bound on relative error as a function of total digits.
Theorem 1
Relative rounding error
Establishes the upper bound on relative rounding error.
Theorem 2
Rounding error of triangle area
An illustrative example of how rounding error bounds might be established for a formula.
Theorem 3
Catastrophic cancellation
Explores the implications of rounding errors and cancellation when computing ln(1 + x).
Theorem 4
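A concrete taste of this result, sketched in Python (assuming the classical formulation, in which ln(1 + x) is rewritten as x · ln(1 + x) / ((1 + x) − 1)): the naive computation loses about half of its digits for tiny x, while the rewritten one stays accurate.

```python
import math

def ln1p_naive(x: float) -> float:
    # For small x, computing 1 + x rounds away most of x's bits,
    # and the logarithm amplifies that loss into a large relative error.
    return math.log(1.0 + x)

def ln1p_stable(x: float) -> float:
    # The rounding error committed in forming w = 1 + x is compensated
    # by dividing by w - 1, which is computed exactly.
    w = 1.0 + x
    if w == 1.0:
        return x  # ln(1 + x) is approximately x when x is tiny
    return x * math.log(w) / (w - 1.0)

x = 1e-10
print(ln1p_naive(x))   # inaccurate: roughly half the digits are wrong
print(ln1p_stable(x))  # accurate
print(math.log1p(x))   # the standard library's reference implementation
```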
Rounding halfway
Considers the implications of rounding at halfway points like 0.5.
Theorem 5
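Assuming this theorem concerns the round-half-to-even tie-breaking rule (the IEEE 754 default), Python's built-in round makes the behavior easy to observe:

```python
# Round-half-to-even ("banker's rounding"): exact halfway cases round to the
# neighbor with an even last digit, so ties do not drift systematically upward.
# 0.5, 1.5, 2.5, and 3.5 are all exactly representable in binary.
print([round(x) for x in (0.5, 1.5, 2.5, 3.5)])  # prints [0, 2, 2, 4]
```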
Splitting floating point numbers
How exact rounding enables precise arithmetic through high-low decomposition.
Theorem 6
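A minimal sketch of the technique, assuming the classical Veltkamp splitting by 2^s + 1 (the split studied in the lesson may differ in detail):

```python
def split(x: float, s: int = 27) -> tuple[float, float]:
    """Split a double into hi + lo, exactly, using only rounded operations.

    With a 53-bit significand and s = 27, each half fits in at most 26
    significant bits, so products of halves can later be computed exactly.
    """
    factor = float((1 << s) + 1)  # 2**s + 1
    c = factor * x                # may overflow for very large x; ignored here
    hi = c - (c - x)              # the high-order bits of x
    lo = x - hi                   # the remaining low-order bits (may be negative)
    return hi, lo

hi, lo = split(0.1)
print(hi + lo == 0.1)  # True: the decomposition is exact
print(hi, lo)
```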
Exact multiplications and divisions
Behavior of floating point arithmetic when the base is 2 and operations are exactly rounded.
Theorem 7
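One well-known consequence of this setting, shown here as an illustrative check rather than the lesson's own example: dividing an integer m by n = 2^i + 2^j and multiplying back recovers m exactly, despite two rounded operations. With n = 10 = 8 + 2:

```python
# For integers |m| < 2**52 and n of the form 2**i + 2**j (here 10 = 8 + 2),
# exact rounding in base 2 guarantees that (m / n) * n == m.
print(all((m / 10.0) * 10.0 == m for m in range(1, 1_000_000)))  # True
```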
Kahan Summation Formula
Reducing numerical error from adding a sequence of numbers by tracking total accumulated error.
Theorem 8
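The algorithm itself fits in a few lines; this is a plain transcription of the standard compensated summation loop, not this site's lesson code:

```python
def kahan_sum(values) -> float:
    """Compensated summation: carry each addition's rounding error forward."""
    total = 0.0
    compensation = 0.0               # running estimate of the lost low-order bits
    for x in values:
        y = x - compensation         # re-inject the error lost on earlier steps
        t = total + y                # low-order bits of y may be rounded away...
        compensation = (t - total) - y   # ...but are recovered algebraically here
        total = t
    return total

values = [1.0] + [1e-16] * 4         # the tiny terms vanish in a naive sum
print(sum(values))                   # 1.0
print(kahan_sum(values))             # 1.0000000000000004
```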
Rounding error of subtraction
Establishes bounds on relative error for certain subtraction operations.
Theorem 9
Relative error of addition
Establishes bounds on relative error for certain addition operations.
Theorem 10
Performing exact subtractions
Establishes conditions under which an exact subtraction may be performed in floating point arithmetic.
Theorem 11
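Assuming this is the classical condition y/2 ≤ x ≤ 2y (Sterbenz's lemma), the difference x − y is then computed without any rounding error at all. Python's fractions module can verify the exactness directly:

```python
from fractions import Fraction

def difference_is_exact(x: float, y: float) -> bool:
    """True when the rounded difference x - y equals the exact real difference."""
    return Fraction(x) - Fraction(y) == Fraction(x - y)

# Within the condition y/2 <= x <= 2y, the subtraction is always exact.
print(difference_is_exact(0.7, 0.6))    # True (0.3 <= 0.7 <= 1.2)
# Far outside it, low-order bits of the smaller operand are lost.
print(difference_is_exact(1.0, 1e-17))  # False
```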
Numerically stable triangle area
Examines the numerical stability of a formula for calculating the area of a triangle.
Theorem 12
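As a taste of the problem, here is Kahan's well-known rearrangement of Heron's formula (the formula examined in the lesson may differ in detail): for needle-like triangles, the textbook version loses accuracy to cancellation, while a careful reordering of the same quantities does not.

```python
import math

def area_heron(a: float, b: float, c: float) -> float:
    # Textbook Heron's formula: for needle-like triangles, s - a suffers
    # catastrophic cancellation and the result can be badly wrong.
    s = (a + b + c) / 2.0
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

def area_stable(a: float, b: float, c: float) -> float:
    # Kahan's rearrangement: sort so that a >= b >= c and keep the
    # parentheses exactly as written; the cancellation disappears.
    a, b, c = sorted((a, b, c), reverse=True)
    return math.sqrt((a + (b + c)) * (c - (a - b))
                     * (c + (a - b)) * (a + (b - c))) / 4.0

# A needle: two unit sides and one tiny side. The true area is about 5e-16.
a, b, c = 1.0, 1.0, 1e-15
print(area_heron(a, b, c))   # off by roughly 11%
print(area_stable(a, b, c))  # accurate
```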
The logarithmic mean function
Bounding values and derivatives of the logarithmic mean function.
Theorem 13
Precise rounding with multiplication and subtraction
Rounding to fewer significant digits using exact operations.
Theorem 14
Decimal precision
Conversion between numbers of different decimal precisions.
Theorem 15
Contact
You can contact the author at keshav@keshavsaharia.com