Number Representations & States

"how numbers are stored and used in computers"

Floating point numbers

Floating point numbers represent numeric values in a form of scientific notation (a significand multiplied by a power of 2). Each number is actually three numbers (a sign, an exponent, and a mantissa) stored within a certain bit length.

Try editing the 8 bit floating point number below.

[Interactive editor: sign 0 | exponent 0000 | mantissa 000 = (0/2^3) × 2^(1 - 7) = 0 × 2^-7 = 0]

Let's first try to understand this intuitively. The exponent really just selects an interval between two successive powers of 2, like [1, 2), [2, 4), [4, 8), and so on. The mantissa (also called the significand) is the relative offset within this interval, and the sign bit controls the presence of a negative sign (-).

Try moving the slider along the number line below to see how the sign, exponent, and mantissa change for numbers in a given interval.

[Interactive slider: sign 0 | exponent 0000 | mantissa 000 = (0/2^3) × 2^(1 - 7) = 0 × 2^-7 = 0]

Just as a binary integer can be incremented to the next binary integer, adding one to a floating point number's binary representation produces the next floating point number, which is not necessarily the number plus one.
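This is easy to demonstrate in JavaScript. The following sketch (nextFloat is our own helper name) shares one buffer between a float view and an integer view, so incrementing the integer steps the float to its next representable value:

```javascript
// Sketch: reinterpret a 64 bit float's bits as a 64 bit integer, add one,
// and reinterpret back. Valid for positive, finite, non-maximal inputs.
const f64 = new Float64Array(1)
const u64 = new BigUint64Array(f64.buffer)

function nextFloat(x) {
  f64[0] = x
  u64[0] += 1n // increment the raw bit pattern
  return f64[0]
}

console.log(nextFloat(1))     // 1.0000000000000002 - not 2!
console.log(nextFloat(1) - 1) // 2.220446049250313e-16
```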

As a floating point number gets larger, it "floats" to the next interval, and as it gets smaller, it "floats" to the previous interval. Intervals closer to zero are "more dense", in the sense that the mantissa can provide a more precise number along that interval. Intervals farther from zero are "more sparse", where it becomes increasingly likely that the mantissa will not provide enough resolution to precisely represent a particular number.

Floating point formats

The floating point format shown here is named E4M3, because it allocates 4 bits for an exponent and 3 bits for a mantissa. This format is usually only found in highly specialized machine learning applications, but is conceptually similar to the floating point formats that software developers typically interact with (32 bit and 64 bit).

Since there are only 256 possible values for any E4M3 number, it is easy to visualize the distribution of representable values in a table.

E4M3: each column gives the first four bits of the number, and each row gives the last four bits.

|         | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| ...0000 | +0 | 0.03125 | 0.125 | 0.5 | 2 | 8 | 32 | 128 | -0 | -0.03125 | -0.125 | -0.5 | -2 | -8 | -32 | -128 |
| ...0001 | 0.001953 | 0.035156 | 0.140625 | 0.5625 | 2.25 | 9 | 36 | 144 | -0.001953 | -0.035156 | -0.140625 | -0.5625 | -2.25 | -9 | -36 | -144 |
| ...0010 | 0.003906 | 0.039063 | 0.15625 | 0.625 | 2.5 | 10 | 40 | 160 | -0.003906 | -0.039063 | -0.15625 | -0.625 | -2.5 | -10 | -40 | -160 |
| ...0011 | 0.005859 | 0.042969 | 0.171875 | 0.6875 | 2.75 | 11 | 44 | 176 | -0.005859 | -0.042969 | -0.171875 | -0.6875 | -2.75 | -11 | -44 | -176 |
| ...0100 | 0.007813 | 0.046875 | 0.1875 | 0.75 | 3 | 12 | 48 | 192 | -0.007813 | -0.046875 | -0.1875 | -0.75 | -3 | -12 | -48 | -192 |
| ...0101 | 0.009766 | 0.050781 | 0.203125 | 0.8125 | 3.25 | 13 | 52 | 208 | -0.009766 | -0.050781 | -0.203125 | -0.8125 | -3.25 | -13 | -52 | -208 |
| ...0110 | 0.011719 | 0.054688 | 0.21875 | 0.875 | 3.5 | 14 | 56 | 224 | -0.011719 | -0.054688 | -0.21875 | -0.875 | -3.5 | -14 | -56 | -224 |
| ...0111 | 0.013672 | 0.058594 | 0.234375 | 0.9375 | 3.75 | 15 | 60 | 240 | -0.013672 | -0.058594 | -0.234375 | -0.9375 | -3.75 | -15 | -60 | -240 |
| ...1000 | 0.015625 | 0.0625 | 0.25 | 1 | 4 | 16 | 64 | 256 | -0.015625 | -0.0625 | -0.25 | -1 | -4 | -16 | -64 | -256 |
| ...1001 | 0.017578 | 0.070313 | 0.28125 | 1.125 | 4.5 | 18 | 72 | 288 | -0.017578 | -0.070313 | -0.28125 | -1.125 | -4.5 | -18 | -72 | -288 |
| ...1010 | 0.019531 | 0.078125 | 0.3125 | 1.25 | 5 | 20 | 80 | 320 | -0.019531 | -0.078125 | -0.3125 | -1.25 | -5 | -20 | -80 | -320 |
| ...1011 | 0.021484 | 0.085938 | 0.34375 | 1.375 | 5.5 | 22 | 88 | 352 | -0.021484 | -0.085938 | -0.34375 | -1.375 | -5.5 | -22 | -88 | -352 |
| ...1100 | 0.023438 | 0.09375 | 0.375 | 1.5 | 6 | 24 | 96 | 384 | -0.023438 | -0.09375 | -0.375 | -1.5 | -6 | -24 | -96 | -384 |
| ...1101 | 0.025391 | 0.101563 | 0.40625 | 1.625 | 6.5 | 26 | 104 | 416 | -0.025391 | -0.101563 | -0.40625 | -1.625 | -6.5 | -26 | -104 | -416 |
| ...1110 | 0.027344 | 0.109375 | 0.4375 | 1.75 | 7 | 28 | 112 | 448 | -0.027344 | -0.109375 | -0.4375 | -1.75 | -7 | -28 | -112 | -448 |
| ...1111 | 0.029297 | 0.117188 | 0.46875 | 1.875 | 7.5 | 30 | 120 | +NaN | -0.029297 | -0.117188 | -0.46875 | -1.875 | -7.5 | -30 | -120 | -NaN |

Notice how the representable values for a floating point number are "denser" around 0, and exponentially less dense in higher intervals. This observation is equally applicable to larger floating point formats.
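This density can be quantified as the gap between adjacent representable values. Here is a rough sketch in JavaScript (spacing is a hypothetical helper, valid for positive finite inputs away from interval boundaries):

```javascript
// For a 64 bit float, values in the interval [2^k, 2^(k+1)) are spaced
// 2^(k - 52) apart: the 52 mantissa bits subdivide each interval evenly.
function spacing(x) {
  return 2 ** (Math.floor(Math.log2(x)) - 52)
}

console.log(spacing(1))       // 2.220446049250313e-16 (machine epsilon)
console.log(spacing(2 ** 53)) // 2 - only even integers are representable here
```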

Since the best achievable precision depends on the magnitude of the number, the accuracy of floating point computations is often expressed in terms of ulps (units in the last place) - an error metric which accounts for the density of representable numbers around a computed result. For example, if the exact result of an E4M3 computation falls between two representable values, no binary sequence can provide the exact answer - so how would you express the accuracy of the two nearest approximations?

[Figure: log-scaled distribution of positive values for E4M3 floating point numbers, spanning 0 to 448.]

This particular format also allows two possible representations of zero (00000000 and 10000000) and two possible representations of NaN (not a number). Some floating point formats also define special values to represent positive and negative Infinity. One criterion for considering a floating point format to be IEEE 754 compliant is whether it can represent NaN, +Infinity, and -Infinity.
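The encoding rules above are compact enough to write out in full. Here is a sketch of an E4M3 decoder in JavaScript (decodeE4M3 is our own name, following the conventions used in this article: exponent 0000 is subnormal, and only the all-ones pattern after the sign bit is NaN):

```javascript
// Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
function decodeE4M3(byte) {
  const sign = byte >> 7 ? -1 : 1
  const exponent = (byte >> 3) & 0b1111
  const mantissa = byte & 0b111
  if (exponent === 0b1111 && mantissa === 0b111) return NaN
  if (exponent === 0) return sign * (mantissa / 8) * 2 ** (1 - 7) // subnormal
  return sign * (1 + mantissa / 8) * 2 ** (exponent - 7)          // normal
}

console.log(decodeE4M3(0b01011000)) // 16
console.log(decodeE4M3(0b01011001)) // 18
console.log(decodeE4M3(0b01111110)) // 448, the largest finite value
console.log(decodeE4M3(0b01111111)) // NaN
```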

E5M2: follows IEEE 754 conventions for the representation of special values. Each column gives the first four bits of the number, and each row gives the last four bits.

|         | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| ...0000 | +0 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0001 | 0.000015 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000015 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0010 | 0.000031 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000031 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0011 | 0.000046 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000046 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...0100 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...0101 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...0110 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...0111 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1000 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1001 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1010 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1011 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |
| ...1100 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | +Infinity | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 | -Infinity |
| ...1101 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | +NaN | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 | -NaN |
| ...1110 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | +NaN | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 | -NaN |
| ...1111 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | +NaN | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 | -NaN |

There are certain small integer values, like 17, that cannot be represented by the E4M3 format. We can represent 16 as 01011000, but the next binary sequence, 01011001, decodes to 18. Even if you use a larger floating point format, like the 64 bit format in your web browser, you will eventually reach a maximum safe integer.

code.js
console.log(2 ** 53)     // 9007199254740992
console.log(2 ** 53 + 1) // 9007199254740992 - same!

Notice that 2^53 - 1 is the maximum safe integer for a floating point format that allocates 52 bits for the mantissa. This lends itself to the intuition that 2^(m + 1) is the approximate threshold beyond which a format with an m bit mantissa will lose integer precision.

code.js
console.log(Number.MAX_SAFE_INTEGER) //  9007199254740991
console.log(Number.MIN_SAFE_INTEGER) // -9007199254740991

// The largest finite value and the smallest positive value
// lie many orders of magnitude beyond the safe integer range
console.log(Number.MAX_VALUE) // 1.7976931348623157e+308
console.log(Number.MIN_VALUE) // 5e-324
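The safe integer threshold follows directly from the mantissa width, which we can verify for the 64 bit format (m = 52):

```javascript
// With m stored mantissa bits (plus the hidden bit), every integer up to
// 2^(m + 1) is exact; 2^(m + 1) + 1 is the first integer that rounds away.
const m = 52
console.log(2 ** (m + 1) === 2 ** (m + 1) + 1)            // true - the + 1 is lost
console.log(Number.MAX_SAFE_INTEGER === 2 ** (m + 1) - 1) // true
```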

Depending on the application, a machine learning developer might choose a more specialized 8 bit floating point format like E5M2FNUZ, which uses 5 bits for the exponent and 2 bits for the mantissa, and is finite (FN, no infinities) with an unsigned zero (UZ).

E5M2FNUZ: each column gives the first four bits of the number, and each row gives the last four bits.

|         | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| ...0000 | 0 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | NaN | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 |
| ...0001 | 0.000008 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | -0.000008 | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 |
| ...0010 | 0.000015 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | -0.000015 | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 |
| ...0011 | 0.000023 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | -0.000023 | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 |
| ...0100 | 0.000031 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0.000031 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0101 | 0.000038 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000038 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0110 | 0.000046 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000046 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0111 | 0.000053 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000053 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...1000 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...1001 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...1010 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...1011 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1100 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1101 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1110 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1111 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |


Mathematical notation

It is helpful to understand how to express floating point numbers in mathematical notation if you plan to formally reason about floating point computations.

TODO: number line intuition for sign bit

TODO: number line intuition for exponent

For this particular format (E4M3), where the exponent bias is 7, the represented value of a normalized number might be formally expressed as

(-1)^sign × (1 + mantissa/2^3) × 2^(exponent - 7)

The widely-used 64 bit floating point format allocates 11 bits for the exponent and 52 bits for the mantissa, with an exponent bias of 1023. The represented value is calculated as:

(-1)^sign × (1 + mantissa/2^52) × 2^(exponent - 1023)

In numerical analysis of floating point numbers, the base (written as β) usually refers to 2, but IEEE 854 also allows for base 10 to be used. There is also a special constant ε, which refers to machine epsilon - a worst case relative error for a single rounding operation.
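For the 64 bit format, machine epsilon is exposed in JavaScript as Number.EPSILON, and a quick sketch shows a computed result staying within a few epsilons of the exact answer:

```javascript
// Number.EPSILON is the gap between 1 and the next representable float,
// so a single rounding near 1 errs by at most half of it.
console.log(Number.EPSILON) // 2.220446049250313e-16

// 0.1, 0.2, and the sum are each rounded, but the accumulated
// relative error is still only about one epsilon.
const relativeError = Math.abs((0.1 + 0.2) - 0.3) / 0.3
console.log(relativeError < 2 * Number.EPSILON) // true
```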

An example of a numerical analysis theorem from David Goldberg's seminal work What Every Computer Scientist Should Know About Floating-Point Arithmetic, published in 1991:

The rounding error incurred when using the following formula to compute the area of a triangle:

A = √((a + (b + c)) (c - (a - b)) (c + (a - b)) (a + (b - c))) / 4,  where a ≥ b ≥ c

is at most 11ε, provided that subtraction is performed with a guard digit, ε ≤ .005, and square roots are computed to within 1/2 ulp.

Relative error

You can edit the sign, exponent, and mantissa bits of a 64 bit floating point number below to get a sense of how much precision this actually provides.

Try editing the underlying numbers or bits of the floating point number.

[Interactive editor: sign | exponent | mantissa]

This is often referred to as a double precision floating point number. The single precision floating point format allocates 8 bits for the exponent and 23 bits for the mantissa, which still provides fairly high precision with half the memory usage.
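JavaScript numbers are always double precision, but Math.fround rounds a value to its nearest single precision neighbor, which makes the difference in precision easy to observe:

```javascript
// Rounding 0.1 to single precision keeps only 23 mantissa bits, so the
// representation error becomes visible around the 8th significant digit.
console.log(0.1)              // 0.1
console.log(Math.fround(0.1)) // 0.10000000149011612
```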

Floating point boundaries

The significand is being displayed as a number, but formal mathematical proofs often reason about the significand as a sequence of binary digits of a normalized value, which takes the form 1.xxx... × 2^e, or 0.xxx... × 2^e for subnormal values. Since the first digit of a normalized value is always a 1, it does not need to be stored, and we get an extra bit of precision for free - this is called the hidden bit trick.
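The stored fields can be extracted directly. Here is a sketch (fields is our own helper) using a DataView; note that the mantissa field contains only the stored digits - the hidden bit never appears:

```javascript
// Extract the raw sign / exponent / mantissa fields of a 64 bit float.
function fields(x) {
  const view = new DataView(new ArrayBuffer(8))
  view.setFloat64(0, x) // big-endian by default, so the bit layout is predictable
  const hi = view.getUint32(0)
  const lo = view.getUint32(4)
  return {
    sign: hi >>> 31,
    exponent: (hi >>> 20) & 0x7ff,
    // 52 stored mantissa bits: top 20 from hi, bottom 32 from lo
    mantissa: (BigInt(hi & 0xfffff) << 32n) | BigInt(lo),
  }
}

console.log(fields(1))    // { sign: 0, exponent: 1023, mantissa: 0n }
console.log(fields(-1.5)) // { sign: 1, exponent: 1023, mantissa: 2251799813685248n }
```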

You will often see floating point numbers written in scientific notation (significand × 2^exponent), but as you can see, this does not indicate a standard multiplication operation - it describes how the stored bits are interpreted.

Floating point proofs

In floating point arithmetic and associated proofs, a floating point number is typically written as ±d.dd...d × β^e, where d.dd...d represents the digits of the significand, β is the base, and e is the exponent.

For engineers

Understanding the nuances of floating point numbers is especially helpful when designing machine learning systems.