"how numbers are stored and used in computers"
Floating point numbers represent numeric values in a form of binary scientific notation ($\text{significand} \times 2^{\text{exponent}}$).
Try editing the 8 bit floating point number below.
Let's first try to understand this intuitively. The exponent really just selects an interval between two successive powers of 2, like $[2, 4)$ or $[4, 8)$, and the mantissa picks out a value within that interval.
Try moving the slider along the number line below to see how the sign, exponent, and mantissa change as the number crosses from one interval to the next.
Try editing the 8 bit floating point number below.
Just as binary integers can be incremented to the next binary integer, adding one to a floating point number's binary representation produces the next floating point number, which is not necessarily the number plus one.
As a floating point number gets larger, it "floats" to the next interval, and as it gets smaller, it "floats" to the previous interval. Intervals closer to zero are "more dense", in the sense that the mantissa can provide a more precise number along that interval. Intervals farther from zero are "more sparse", where it becomes increasingly likely that the mantissa will not provide enough resolution to precisely represent a particular number.
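This behavior is easy to observe in JavaScript by reinterpreting a float's bits as an integer. Below is a minimal sketch using the browser's 64 bit floats (the `nextFloat` helper is hypothetical, and only handles positive, finite values):

```js
// Reinterpret a 64 bit float's bits as an unsigned integer, add one,
// and reinterpret back to get the next representable float
function nextFloat(x) {
  const buffer = new ArrayBuffer(8)
  const f64 = new Float64Array(buffer)
  const u64 = new BigUint64Array(buffer)
  f64[0] = x
  u64[0] += 1n // increment the raw binary representation
  return f64[0]
}

console.log(nextFloat(1))       // 1.0000000000000002 - a step of 2 ** -52
console.log(nextFloat(2 ** 52)) // 4503599627370497 - a step of exactly 1
```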
The floating point format shown here is named E4M3, because it allocates 4 bits for the exponent and 3 bits for the mantissa. This format is usually only found in highly specialized machine learning applications, but it is conceptually similar to the floating point formats that software developers typically interact with (32 bit and 64 bit).
Since there are only 256 possible values for an E4M3 number, it is easy to visualize the distribution of representable values in a table.
E4M3: Each row is the first four bits, and each column is the second four bits.
Notice how the representable values for a floating point number are "denser" around 0, and exponentially less dense in higher intervals. This observation is equally applicable to larger floating point formats.
Since the best achievable precision depends on the magnitude of the number, the accuracy of floating point computations is often expressed in terms of ulps (units in the last place) - an error metric which accounts for the density of representable numbers around a computed result. For example, if you were to multiply two E4M3 floating point numbers, there is often no possible binary sequence to provide the exact answer; the result must be rounded to a nearby representable value, and the size of that rounding error is measured in ulps.
Log-scaled distribution of positive values for E4M3 floating point numbers.
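To make the "density" idea concrete, the same bit-reinterpretation trick from earlier can measure the gap between a 64 bit float and its next representable neighbor - the size of one ulp around that value. A rough sketch (the `ulp` helper is hypothetical, positive finite values only):

```js
// One ulp around x: the gap between x and the next representable float
function ulp(x) {
  const buffer = new ArrayBuffer(8)
  const f64 = new Float64Array(buffer)
  const u64 = new BigUint64Array(buffer)
  f64[0] = x
  u64[0] += 1n
  return f64[0] - x
}

console.log(ulp(1))    // 2.220446049250313e-16 - dense near zero
console.log(ulp(1e15)) // 0.125 - much sparser at large magnitudes
```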
This particular format also allows two possible representations of zero (00000000 and 10000000) and two possible representations of NaN (not a number). Some floating point formats also define special values to represent positive and negative Infinity. One criterion for considering a floating point format to be IEEE 754 compliant is whether it can represent NaN, +Infinity, and -Infinity.
E5M2: Follows IEEE 754 conventions for representation of special values.
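JavaScript's ordinary 64 bit numbers are IEEE 754 compliant, so these special values can be observed directly in the browser console:

```js
console.log(1 / 0)            // Infinity
console.log(-1 / 0)           // -Infinity
console.log(0 / 0)            // NaN
console.log(NaN === NaN)      // false - NaN is unequal to everything
console.log(0 === -0)         // true  - the two zeroes compare equal...
console.log(Object.is(0, -0)) // false - ...despite different bit patterns
```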
There are certain small integer values, like 17, that cannot be represented by this floating point format. We can represent 16 as 01011000, but the next binary sequence, 01011001, has a floating point value of 18. Even if you use a larger floating point format, like the 64 bit format in your web browser, you will eventually reach a maximum safe integer.
```js
console.log(2 ** 53)     // 9007199254740992
console.log(2 ** 53 + 1) // 9007199254740992 - same!
```
Notice that JavaScript exposes these limits as named constants:
```js
// The largest and smallest integers that can be safely represented
console.log(Number.MAX_SAFE_INTEGER) // 9007199254740991
console.log(Number.MIN_SAFE_INTEGER) // -9007199254740991

// The maximum representable value is many orders of magnitude larger,
// and the minimum positive value is many orders of magnitude closer to zero
console.log(Number.MAX_VALUE) // 1.7976931348623157e+308
console.log(Number.MIN_VALUE) // 5e-324
```
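JavaScript also provides `Number.isSafeInteger` for checking whether a given integer falls inside that safe range:

```js
console.log(Number.isSafeInteger(2 ** 53 - 1)) // true
console.log(Number.isSafeInteger(2 ** 53))     // false
```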
Depending on the application, a machine learning developer might choose a more specialized 8 bit floating point format like E5M2FNUZ, which uses 5 bits for the exponent, 2 bits for the mantissa, and defines itself as finite (FN) with unsigned zeroes (UZ).
E5M2FNUZ: Each row is the first four bits, and each column is the second four bits.
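A decoder makes the FN and UZ rules concrete. The sketch below assumes the common E5M2FNUZ conventions (a bias of 16, NaN encoded as 10000000, a single zero at 00000000, and no infinities); `decodeE5M2FNUZ` is a hypothetical helper, not a standard API:

```js
function decodeE5M2FNUZ(byte) {
  if (byte === 0b00000000) return 0   // the single, unsigned zero
  if (byte === 0b10000000) return NaN // the single NaN encoding

  const sign = (byte >> 7) & 0b1
  const exponent = (byte >> 2) & 0b11111
  const mantissa = byte & 0b11

  const bias = 16 // FNUZ formats shift the bias up by one
  if (exponent === 0) {
    // Subnormal: no implicit leading 1, exponent fixed at 1 - bias
    return (-1) ** sign * 2 ** (1 - bias) * (mantissa / 4)
  }
  return (-1) ** sign * 2 ** (exponent - bias) * (1 + mantissa / 4)
}

console.log(decodeE5M2FNUZ(0b01111111)) // 57344 - the largest finite value
console.log(decodeE5M2FNUZ(0b00000001)) // 2 ** -17, the smallest subnormal
```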
It is helpful to understand how to express floating point numbers in mathematical notation if you plan to formally reason about floating point computations.
The sign bit (written as $S$) determines whether the number is positive or negative: $0$ is positive and $1$ is negative. This is usually expressed as the factor $(-1)^S$.
The exponent (written as $E$) is stored as an unsigned integer spanning $k$ bits. When the exponent is $0$, the value is referred to as a subnormal number, and is computed in a slightly different way.
The mantissa (written as $M$, spanning $m$ bits) represents the fractional digits that follow an implicit leading 1, so a normal number has the value $(-1)^S \times 2^{E - b} \times \left(1 + \frac{M}{2^m}\right)$.
For subnormal numbers (where $E = 0$), there is no implicit leading 1 and the exponent is fixed at $1 - b$, giving the value $(-1)^S \times 2^{1 - b} \times \frac{M}{2^m}$.
The bias (written as $b$) is $2^{k-1} - 1$, where $k$ is the number of exponent bits; it shifts the unsigned exponent field so that both very large and very small magnitudes can be expressed. For this particular format, where $k = 4$, the bias is $2^3 - 1 = 7$.
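As a worked example, the bit pattern 01011000 from earlier has $S = 0$, $E = 1011_2 = 11$, and $M = 000_2 = 0$, so its value is $(-1)^0 \times 2^{11 - 7} \times \left(1 + \frac{0}{8}\right) = 16$. The sketch below implements the same formula in JavaScript, assuming the usual E4M3 convention that NaN is encoded with all exponent and mantissa bits set and that there are no infinities; `decodeE4M3` is a hypothetical helper:

```js
function decodeE4M3(byte) {
  const sign = (byte >> 7) & 0b1
  const exponent = (byte >> 3) & 0b1111
  const mantissa = byte & 0b111

  if (exponent === 0b1111 && mantissa === 0b111) return NaN

  const bias = 7 // 2 ** (4 - 1) - 1
  if (exponent === 0) {
    // Subnormal: no implicit leading 1, exponent fixed at 1 - bias
    return (-1) ** sign * 2 ** (1 - bias) * (mantissa / 8)
  }
  // Normal: implicit leading 1 ahead of the mantissa bits
  return (-1) ** sign * 2 ** (exponent - bias) * (1 + mantissa / 8)
}

console.log(decodeE4M3(0b01011000)) // 16
console.log(decodeE4M3(0b01011001)) // 18 - the next representable value
console.log(decodeE4M3(0b01111110)) // 448 - the largest finite value
```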
The widely-used 64 bit floating point format allocates 11 bits for the exponent and 52 bits for the mantissa, with a bias of $2^{10} - 1 = 1023$.
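The three fields can be pulled out of a 64 bit float with a little BigInt arithmetic; a sketch (the `fields` helper is hypothetical):

```js
function fields(x) {
  const buffer = new ArrayBuffer(8)
  new Float64Array(buffer)[0] = x
  const bits = new BigUint64Array(buffer)[0]
  return {
    sign: bits >> 63n,                 // 1 bit
    exponent: (bits >> 52n) & 0x7ffn,  // 11 bits, biased by 1023
    mantissa: bits & 0xfffffffffffffn, // 52 bits
  }
}

console.log(fields(1))  // { sign: 0n, exponent: 1023n, mantissa: 0n }
console.log(fields(-2)) // { sign: 1n, exponent: 1024n, mantissa: 0n }
```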
In numerical analysis of floating point numbers, the base (written as $\beta$) is $2$ for the binary formats described here, and the machine epsilon (written as $\epsilon$) is the maximum relative error introduced by rounding a result to the nearest representable value.
An example of a numerical analysis theorem from David Goldberg's seminal work What Every Computer Scientist Should Know About Floating-Point Arithmetic, published in 1991.
The rounding error incurred when using the following formula to compute the area of a triangle:

$$A = \frac{\sqrt{(a + (b + c))\,(c - (a - b))\,(c + (a - b))\,(a + (b - c))}}{4}, \qquad a \ge b \ge c$$

is at most $11\epsilon$, provided that subtraction is performed with a guard digit, $\epsilon \le .005$, and square roots are computed to within half an ulp.
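To see why the rearrangement matters, here is a sketch comparing it against the textbook Heron's formula ($A = \sqrt{s(s-a)(s-b)(s-c)}$ with $s = \frac{a+b+c}{2}$) on a needle-like triangle, where computing $s - a$ suffers catastrophic cancellation; the input values are illustrative:

```js
// Textbook Heron's formula - loses precision when a ≈ b + c
function heron(a, b, c) {
  const s = (a + b + c) / 2
  return Math.sqrt(s * (s - a) * (s - b) * (s - c))
}

// The rearranged formula above - requires a >= b >= c
function kahan(a, b, c) {
  return Math.sqrt(
    (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))
  ) / 4
}

// A needle-like triangle: the two printed results disagree, and the
// rearranged version is the one accurate to within a few ulps
console.log(heron(100000, 99999.99979, 0.00029))
console.log(kahan(100000, 99999.99979, 0.00029))
```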
You can edit the bits, exponent, and mantissa of a 64 bit floating point number below to get a sense of how much precision this actually provides.
Try editing the underlying numbers or bits of the floating point number.
This is often referred to as a double precision floating point number. The single precision floating point format allocates 8 bits for the exponent and 23 bits for the mantissa, which still provides fairly high precision with half the memory usage.
The significand is being displayed as a number, but formal mathematical proofs often reason about the significand as a sequence of binary digits of a normalized value, which takes the form $1.f_1 f_2 \ldots f_{52} \times 2^{E - 1023}$, where each $f_i$ is a single binary digit.
You will often see floating point numbers written in scientific notation (like $6.022 \times 10^{23}$), with a significand scaled by a power of ten. In floating point arithmetic and associated proofs, a floating point number is typically written as $\pm\, d_0.d_1 d_2 \cdots d_{p-1} \times \beta^{e}$, where $p$ is the precision, $\beta$ is the base, and $e$ is the exponent.
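Concretely, following Goldberg's definition, the digit string denotes a sum of digits scaled by powers of the base:

$$\pm\, d_0.d_1 d_2 \cdots d_{p-1} \times \beta^{e} \;=\; \pm \left( d_0 + d_1 \beta^{-1} + \cdots + d_{p-1} \beta^{-(p-1)} \right) \beta^{e}, \qquad 0 \le d_i < \beta$$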
Understanding the nuances of floating point numbers is especially helpful when designing machine learning systems.