Programmers use numbers every day, but the precision and limitations of those numbers are often treated as an esoteric topic. This is a free and open-source guide to binary representations, numerical analysis, floating point numbers, and character encodings, designed to be approachable by a wide audience.
This is essential information for all software developers, and particularly for those working on the internals of LLMs, machine learning systems, and database technology. You can safely use any of the code and information on this website in any commercial setting, with no licensing requirements.
Interactive lessons
Floating point formats
Double precision
A 64-bit format offering very high precision and a wide dynamic range, commonly used in scientific computing.
FP64
Single precision
A 32-bit format balancing precision and performance, widely used in general-purpose computing and graphics.
FP32
Half precision
Optimized for memory efficiency and speed with lower precision and dynamic range, used in AI and graphics.
FP16
Brain Floating Point
Same exponent size as FP32 but reduced mantissa, enabling faster computation while retaining dynamic range.
BF16
8-bit floating point
A low precision format for maximum efficiency, mainly used in deep learning models for training and inference.
FP8
4-bit floating point
An extremely compact format used in specialized machine learning and graphics applications.
FP4
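The lessons above each explore one format interactively. As a quick illustration of the sign/exponent/mantissa layout they all share, here is a small Python sketch (not part of the lessons) that unpacks the bit fields of the two formats the standard library handles natively, FP64 and FP32:

```python
import struct

def fp64_fields(x: float) -> tuple[int, int, int]:
    """Unpack an IEEE 754 double into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                   # 1 sign bit
    exponent = (bits >> 52) & 0x7FF     # 11 exponent bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)   # 52 mantissa bits, implicit leading 1
    return sign, exponent, mantissa

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Unpack the IEEE 754 single-precision encoding of x."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    mantissa = bits & ((1 << 23) - 1)   # 23 mantissa bits
    return sign, exponent, mantissa

# 0.1 has no exact binary representation, so both formats store a nearby value.
s, e, m = fp64_fields(0.1)
print(f"FP64: sign={s} exponent={e - 1023:+d} mantissa=0x{m:013x}")
s, e, m = fp32_fields(0.1)
print(f"FP32: sign={s} exponent={e - 127:+d} mantissa=0x{m:06x}")
```

The narrower formats follow the same three-field layout, just with fewer bits in the exponent and mantissa.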
Floating point theorems
Digits and relative error
Establishes an upper bound on relative error as a function of total digits.
Theorem 1
Relative rounding error
Establishes the upper bound on relative rounding error.
Theorem 2
Rounding error of triangle area
An illustrative example of how rounding error bounds might be established for a formula.
Theorem 3
Catastrophic cancellation
Explores the implications of rounding errors and cancellation when computing ln(1 + x).
Theorem 4
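A concrete taste of this result, sketched in Python (assuming the classical formulation, in which ln(1 + x) is rewritten as x · ln(1 + x) / ((1 + x) − 1)): the naive computation loses about half of its digits for tiny x, while the rewritten one stays accurate.

```python
import math

def ln1p_naive(x: float) -> float:
    # For small x, computing 1 + x rounds away most of x's bits,
    # and the logarithm amplifies that loss into a large relative error.
    return math.log(1.0 + x)

def ln1p_stable(x: float) -> float:
    # The rounding error committed in forming w = 1 + x is compensated
    # by dividing by w - 1, which is computed exactly.
    w = 1.0 + x
    if w == 1.0:
        return x  # ln(1 + x) is approximately x when x is tiny
    return x * math.log(w) / (w - 1.0)

x = 1e-10
print(ln1p_naive(x))   # inaccurate: roughly half the digits are wrong
print(ln1p_stable(x))  # accurate
print(math.log1p(x))   # the standard library's reference implementation
```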
Rounding halfway
Considers the implications of rounding at halfway points like 0.5.
Theorem 5
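Assuming this theorem concerns the round-half-to-even tie-breaking rule (the IEEE 754 default), Python's built-in round makes the behavior easy to observe:

```python
# Round-half-to-even ("banker's rounding"): exact halfway cases round to the
# neighbor with an even last digit, so ties do not drift systematically upward.
# 0.5, 1.5, 2.5, and 3.5 are all exactly representable in binary.
print([round(x) for x in (0.5, 1.5, 2.5, 3.5)])  # prints [0, 2, 2, 4]
```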
Splitting floating point numbers
How exact rounding enables precise arithmetic through high-low decomposition.
Theorem 6
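A minimal sketch of the technique, assuming the classical Veltkamp splitting by 2^s + 1 (the split studied in the lesson may differ in detail):

```python
def split(x: float, s: int = 27) -> tuple[float, float]:
    """Split a double into hi + lo, exactly, using only rounded operations.

    With a 53-bit significand and s = 27, each half fits in at most 26
    significant bits, so products of halves can later be computed exactly.
    """
    factor = float((1 << s) + 1)  # 2**s + 1
    c = factor * x                # may overflow for very large x; ignored here
    hi = c - (c - x)              # the high-order bits of x
    lo = x - hi                   # the remaining low-order bits (may be negative)
    return hi, lo

hi, lo = split(0.1)
print(hi + lo == 0.1)  # True: the decomposition is exact
print(hi, lo)
```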
Exact multiplications and divisions
Behavior of floating point arithmetic when the base is 2 and operations are exactly rounded.
Theorem 7
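One well-known consequence of this setting, shown here as an illustrative check rather than the lesson's own example: dividing an integer m by n = 2^i + 2^j and multiplying back recovers m exactly, despite two rounded operations. With n = 10 = 8 + 2:

```python
# For integers |m| < 2**52 and n of the form 2**i + 2**j (here 10 = 8 + 2),
# exact rounding in base 2 guarantees that (m / n) * n == m.
print(all((m / 10.0) * 10.0 == m for m in range(1, 1_000_000)))  # True
```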
Kahan Summation Formula
Reducing numerical error from adding a sequence of numbers by tracking total accumulated error.
Theorem 8
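The algorithm itself fits in a few lines; this is a plain transcription of the standard compensated summation loop, not this site's lesson code:

```python
def kahan_sum(values) -> float:
    """Compensated summation: carry each addition's rounding error forward."""
    total = 0.0
    compensation = 0.0               # running estimate of the lost low-order bits
    for x in values:
        y = x - compensation         # re-inject the error lost on earlier steps
        t = total + y                # low-order bits of y may be rounded away...
        compensation = (t - total) - y   # ...but are recovered algebraically here
        total = t
    return total

values = [1.0] + [1e-16] * 4         # the tiny terms vanish in a naive sum
print(sum(values))                   # 1.0
print(kahan_sum(values))             # 1.0000000000000004
```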
Rounding error of subtraction
Establishes bounds on relative error for certain subtraction operations.
Theorem 9
Relative error of addition
Establishes bounds on relative error for certain addition operations.
Theorem 10
Performing exact subtractions
Establishes conditions under which an exact subtraction may be performed in floating point arithmetic.
Theorem 11
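Assuming this is the classical condition y/2 ≤ x ≤ 2y (Sterbenz's lemma), the difference x − y is then computed without any rounding error at all. Python's fractions module can verify the exactness directly:

```python
from fractions import Fraction

def difference_is_exact(x: float, y: float) -> bool:
    """True when the rounded difference x - y equals the exact real difference."""
    return Fraction(x) - Fraction(y) == Fraction(x - y)

# Within the condition y/2 <= x <= 2y, the subtraction is always exact.
print(difference_is_exact(0.7, 0.6))    # True (0.3 <= 0.7 <= 1.2)
# Far outside it, low-order bits of the smaller operand are lost.
print(difference_is_exact(1.0, 1e-17))  # False
```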
Numerically stable triangle area
Examines the numerical stability of a formula for calculating the area of a triangle.
Theorem 12
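As a taste of the problem, here is Kahan's well-known rearrangement of Heron's formula (the formula examined in the lesson may differ in detail): for needle-like triangles, the textbook version loses accuracy to cancellation, while a careful reordering of the same quantities does not.

```python
import math

def area_heron(a: float, b: float, c: float) -> float:
    # Textbook Heron's formula: for needle-like triangles, s - a suffers
    # catastrophic cancellation and the result can be badly wrong.
    s = (a + b + c) / 2.0
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

def area_stable(a: float, b: float, c: float) -> float:
    # Kahan's rearrangement: sort so that a >= b >= c and keep the
    # parentheses exactly as written; the cancellation disappears.
    a, b, c = sorted((a, b, c), reverse=True)
    return math.sqrt((a + (b + c)) * (c - (a - b))
                     * (c + (a - b)) * (a + (b - c))) / 4.0

# A needle: two unit sides and one tiny side. The true area is about 5e-16.
a, b, c = 1.0, 1.0, 1e-15
print(area_heron(a, b, c))   # off by roughly 11%
print(area_stable(a, b, c))  # accurate
```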
The logarithmic mean function
Bounding values and derivatives of the logarithmic mean function.
Theorem 13
Precise rounding with multiplication and subtraction
Rounding to fewer significant digits using exact operations.
Theorem 14
Decimal precision
Conversion between numbers of different decimal precisions.
Theorem 15
Contact
You can contact the author at keshav@keshavsaharia.com