Number Representations & States

"how numbers are stored and used in computers"

Floating point numbers

Floating point numbers represent numeric values in a form of scientific notation ($\pm\,\text{significand} \times 2^{\text{exponent}}$). Each number is actually three numbers stored within a fixed bit length.

Try editing the 8 bit floating point number below.

[Interactive: an editable 8 bit float with a sign bit (0), a 4 bit exponent (0000), and a 3 bit mantissa (000), currently evaluating to $\frac{0}{2^3} \times 2^{1 - 7} = 0 \times 2^{-7} = 0$.]

Let's first try to understand this intuitively. The exponent really just selects an interval between two successive powers of 2, like $[1, 2)$, $[2, 4)$, and so on. The mantissa (also called the significand) is the relative offset within this interval, and the sign bit controls the presence of a negative sign ($-$).
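
To make that concrete, here is a minimal sketch of how such an 8 bit value might be decoded in JavaScript, assuming the bias of 7 used above; the helper name decodeE4M3 is our own:

```js
// Decode an 8-bit E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits.
function decodeE4M3(byte) {
  const sign = byte >> 7 ? -1 : 1
  const exponent = (byte >> 3) & 0b1111
  const mantissa = byte & 0b111
  if (exponent === 0b1111 && mantissa === 0b111) return NaN
  return exponent === 0
    ? sign * (mantissa / 8) * 2 ** (1 - 7)            // subnormal: 0.m
    : sign * (1 + mantissa / 8) * 2 ** (exponent - 7) // normal: 1.m (hidden bit)
}

console.log(decodeE4M3(0b00000000)) // 0
console.log(decodeE4M3(0b01011000)) // 16
console.log(decodeE4M3(0b01111110)) // 448 (the largest E4M3 value)
```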

Try moving the slider along the number line below to see how the sign, exponent, and mantissa change for numbers in a given interval.

Try editing the 8 bit floating point number below.

[Interactive: a slider along the number line, paired with the same editable sign, exponent, and mantissa bit fields shown above.]

Just as binary integers can be incremented to the next binary integer, adding one to a floating point number's binary representation produces the next representable floating point number, which is not necessarily the number plus one. As a floating point number gets larger it "floats" up to the next interval, and as it gets smaller it "floats" down to the previous interval.
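
In JavaScript, we can observe this directly by reinterpreting a 64 bit float's bits as an integer and adding one. A minimal sketch using typed arrays:

```js
// Reinterpret a float64's bits as a 64-bit unsigned integer and back.
const float = new Float64Array(1)
const bits = new BigUint64Array(float.buffer) // shares the same memory

float[0] = 1.5
bits[0] += 1n // add one to the raw binary representation

// The next representable float, not 2.5:
console.log(float[0]) // 1.5000000000000002
```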

Intervals closer to zero are "more dense", in the sense that the mantissa can provide a more precise fraction of that interval. Intervals farther from zero are "more sparse", where it becomes increasingly likely that the mantissa will not provide enough resolution to precisely represent a particular number.
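
You can measure this growing sparsity with the same bit-reinterpretation trick. A sketch; gapAbove is our own helper:

```js
// Measure the gap to the next representable float64 at different magnitudes.
function gapAbove(x) {
  const float = new Float64Array([x])
  const bits = new BigUint64Array(float.buffer)
  bits[0] += 1n
  return float[0] - x
}

console.log(gapAbove(1))    // 2.220446049250313e-16
console.log(gapAbove(1e6))  // 1.1641532182693481e-10
console.log(gapAbove(1e15)) // 0.125
```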

Floating point formats

The floating point format shown here is named E4M3, because it allocates 4 bits for an exponent and 3 bits for a mantissa. This format is usually only found in highly specialized machine learning applications, but is conceptually similar to the floating point formats that software developers typically interact with (32 bit and 64 bit).

Since there are only 256 possible values for any E4M3 number, it is easy to visualize the distribution of representable values in a table.

E4M3: each column gives the first four bits of the number, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | +0 | 0.03125 | 0.125 | 0.5 | 2 | 8 | 32 | 128 | -0 | -0.03125 | -0.125 | -0.5 | -2 | -8 | -32 | -128 |
| ...0001 | 0.001953 | 0.035156 | 0.140625 | 0.5625 | 2.25 | 9 | 36 | 144 | -0.001953 | -0.035156 | -0.140625 | -0.5625 | -2.25 | -9 | -36 | -144 |
| ...0010 | 0.003906 | 0.039063 | 0.15625 | 0.625 | 2.5 | 10 | 40 | 160 | -0.003906 | -0.039063 | -0.15625 | -0.625 | -2.5 | -10 | -40 | -160 |
| ...0011 | 0.005859 | 0.042969 | 0.171875 | 0.6875 | 2.75 | 11 | 44 | 176 | -0.005859 | -0.042969 | -0.171875 | -0.6875 | -2.75 | -11 | -44 | -176 |
| ...0100 | 0.007813 | 0.046875 | 0.1875 | 0.75 | 3 | 12 | 48 | 192 | -0.007813 | -0.046875 | -0.1875 | -0.75 | -3 | -12 | -48 | -192 |
| ...0101 | 0.009766 | 0.050781 | 0.203125 | 0.8125 | 3.25 | 13 | 52 | 208 | -0.009766 | -0.050781 | -0.203125 | -0.8125 | -3.25 | -13 | -52 | -208 |
| ...0110 | 0.011719 | 0.054688 | 0.21875 | 0.875 | 3.5 | 14 | 56 | 224 | -0.011719 | -0.054688 | -0.21875 | -0.875 | -3.5 | -14 | -56 | -224 |
| ...0111 | 0.013672 | 0.058594 | 0.234375 | 0.9375 | 3.75 | 15 | 60 | 240 | -0.013672 | -0.058594 | -0.234375 | -0.9375 | -3.75 | -15 | -60 | -240 |
| ...1000 | 0.015625 | 0.0625 | 0.25 | 1 | 4 | 16 | 64 | 256 | -0.015625 | -0.0625 | -0.25 | -1 | -4 | -16 | -64 | -256 |
| ...1001 | 0.017578 | 0.070313 | 0.28125 | 1.125 | 4.5 | 18 | 72 | 288 | -0.017578 | -0.070313 | -0.28125 | -1.125 | -4.5 | -18 | -72 | -288 |
| ...1010 | 0.019531 | 0.078125 | 0.3125 | 1.25 | 5 | 20 | 80 | 320 | -0.019531 | -0.078125 | -0.3125 | -1.25 | -5 | -20 | -80 | -320 |
| ...1011 | 0.021484 | 0.085938 | 0.34375 | 1.375 | 5.5 | 22 | 88 | 352 | -0.021484 | -0.085938 | -0.34375 | -1.375 | -5.5 | -22 | -88 | -352 |
| ...1100 | 0.023438 | 0.09375 | 0.375 | 1.5 | 6 | 24 | 96 | 384 | -0.023438 | -0.09375 | -0.375 | -1.5 | -6 | -24 | -96 | -384 |
| ...1101 | 0.025391 | 0.101563 | 0.40625 | 1.625 | 6.5 | 26 | 104 | 416 | -0.025391 | -0.101563 | -0.40625 | -1.625 | -6.5 | -26 | -104 | -416 |
| ...1110 | 0.027344 | 0.109375 | 0.4375 | 1.75 | 7 | 28 | 112 | 448 | -0.027344 | -0.109375 | -0.4375 | -1.75 | -7 | -28 | -112 | -448 |
| ...1111 | 0.029297 | 0.117188 | 0.46875 | 1.875 | 7.5 | 30 | 120 | +NaN | -0.029297 | -0.117188 | -0.46875 | -1.875 | -7.5 | -30 | -120 | -NaN |

Notice how the representable values of a floating point number are most dense around 0, and exponentially less dense in higher intervals. This observation applies equally to any floating point format, regardless of bit length.

Since the achievable precision depends on the magnitude of the number, the accuracy of floating point computations is often expressed in terms of ulps (units in the last place) - an error metric which accounts for the density of representable numbers around a computed result. For example, if you were to compute $5 \times 5$ with E4M3 floating point numbers, there is no possible sequence of 8 bits to represent the exact answer $25$ - but how would you express the accuracy of the nearest approximations in this case ($24$ or $26$)?

There are simpler ways to describe error, like relative error and absolute error, but in the case of floating point arithmetic it's often useful to account for the density of representable numbers around an expected result, because that density grows exponentially sparse for larger numbers.
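
As an illustration, the distance in ulps between two float64 values can be computed by comparing their bit patterns, since adjacent representable values have adjacent integer bit patterns. A sketch that assumes non-negative, finite inputs; ulpDistance is our own helper:

```js
// Distance in ulps between two non-negative finite float64 values.
function ulpDistance(a, b) {
  const toBits = (x) => new BigUint64Array(new Float64Array([x]).buffer)[0]
  const diff = toBits(a) - toBits(b)
  return diff < 0n ? -diff : diff
}

console.log(ulpDistance(1, 1 + Number.EPSILON))  // 1n
console.log(ulpDistance(1, 1.0000000000000004)) // 2n
```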

[Figure: log-scaled distribution of positive values for E4M3 floating point numbers, with axis ticks at 0, 112, 224, 336, and 448.]

Zeroes, NaN and Infinity

The E4M3 format also allows two possible representations of zero (00000000 and 10000000) and two possible representations of NaN (not a number). Some floating point formats also define special values to represent positive and negative Infinity. One criterion for considering a floating point format to be IEEE 754 compliant is whether it can represent NaN, +Infinity, and -Infinity.

E5M2: follows IEEE 754 conventions for the representation of special values. Each column gives the first four bits, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | +0 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0001 | 0.000015 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000015 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0010 | 0.000031 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000031 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0011 | 0.000046 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000046 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...0100 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...0101 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...0110 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...0111 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1000 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1001 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1010 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1011 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |
| ...1100 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | +Infinity | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 | -Infinity |
| ...1101 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | +NaN | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 | -NaN |
| ...1110 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | +NaN | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 | -NaN |
| ...1111 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | +NaN | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 | -NaN |

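JavaScript's 64 bit numbers follow these same IEEE 754 conventions for special values, which is easy to check directly:

```js
// IEEE 754 special values in JavaScript's float64.
console.log(1 / 0)            // Infinity
console.log(-1 / 0)           // -Infinity
console.log(0 === -0)         // true  (comparison treats the zeroes as equal)
console.log(Object.is(0, -0)) // false (but the bit patterns differ)
console.log(NaN === NaN)      // false (NaN compares unequal to everything)
```
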
There are certain small integer values, like 17, that a format like E4M3 cannot represent. We can represent 16 as 01011000, but the next bit pattern, 01011001, decodes to 18, so 17 has to round to one of its neighbors. Even if you use a larger floating point format, like the 64 bit format in your web browser, you will eventually reach a maximum safe integer.

```js
console.log(2 ** 53)     // 9007199254740992
console.log(2 ** 53 + 1) // 9007199254740992 - same!
```

Notice that $2^{53} - 1$ is the maximum safe integer for a floating point format which allocates 52 bits for the mantissa (the hidden bit supplies the 53rd). This lends to the intuition that $2^{m+1}$, where $m$ is the number of mantissa bits, is an approximate threshold beyond which a floating point format will lose integer precision.

```js
// The largest and smallest integers that can be used safely
console.log(Number.MAX_SAFE_INTEGER) // 9007199254740991
console.log(Number.MIN_SAFE_INTEGER) // -9007199254740991

// The representable range extends many orders of magnitude further
console.log(Number.MAX_VALUE) // 1.7976931348623157e+308
console.log(Number.MIN_VALUE) // 5e-324 (smallest positive value)
```

Depending on the application, a machine learning developer might choose a more specialized 8 bit floating point format like E5M2FNUZ, which uses 5 bits for the exponent, 2 bits for the mantissa, and defines itself as finite (FN, no infinities) with unsigned zero (UZ, no negative zero).

E5M2FNUZ: each column gives the first four bits, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | 0 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | NaN | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 |
| ...0001 | 0.000008 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | -0.000008 | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 |
| ...0010 | 0.000015 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | -0.000015 | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 |
| ...0011 | 0.000023 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | -0.000023 | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 |
| ...0100 | 0.000031 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0.000031 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0101 | 0.000038 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000038 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0110 | 0.000046 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000046 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0111 | 0.000053 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000053 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...1000 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...1001 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...1010 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...1011 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1100 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1101 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1110 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1111 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |

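The values in this table can be reproduced with a small decoder. The sketch below assumes the usual FNUZ conventions, which are consistent with the table above: an exponent bias of 16, no infinities, and the single bit pattern 10000000 reserved for NaN. The helper name is ours:

```js
// Decode an 8-bit E5M2FNUZ value from its integer bit pattern (0-255).
function decodeE5M2FNUZ(byte) {
  if (byte === 0b10000000) return NaN // the only NaN; there are no infinities
  const sign = byte >> 7 ? -1 : 1
  const exponent = (byte >> 2) & 0b11111
  const mantissa = byte & 0b11
  return exponent === 0
    ? sign * (mantissa / 4) * 2 ** (1 - 16)            // subnormal
    : sign * (1 + mantissa / 4) * 2 ** (exponent - 16) // normal
}

console.log(decodeE5M2FNUZ(0b01000000)) // 1
console.log(decodeE5M2FNUZ(0b01111111)) // 57344 (the largest value)
```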

Transcendental functions

Suppose you are building a system that calculates a transcendental function - one whose output generally cannot be expressed precisely, like $e^x$ or $\ln(x)$.

Suppose we want to calculate $\ln(x)$ - we could use the Taylor expansion of $\ln(x)$ around $x = 1$:

$$\ln(x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} (x - 1)^n, \quad 0 < x \le 2$$

```js
function ln(x, terms = 100) {
  // Boundary conditions
  if (x < 0)
    throw new Error('complex numbers not supported')
  if (x > 2)
    throw new Error('only works for 0 < x <= 2 for now')

  // Special condition
  if (x == 0)
    return -Infinity

  let sum = 0
  const z = x - 1

  for (let n = 1; n <= terms; n++) {
    const term = ((-1) ** (n + 1)) * (z ** n) / n
    sum += term
  }

  return sum
}
```

We can then employ the logarithm identity $\ln(x) = \ln\left(\frac{x}{c}\right) + \ln(c)$ to generalize this solution for all possible values of $x$.

If we choose a value of $c$ which is the $k$-th power of 2 (i.e. $c = 2^k$), then we can simplify this further into an algorithm that only needs to store the precomputed value of $\ln(2)$:

$$\ln(x) = \ln\left(\frac{x}{2^k}\right) + k \ln(2)$$

```js
const LN2 = 0.6931471805599453

function ln(x, terms = 100) {
  // Boundary conditions
  if (x < 0)
    throw new Error('complex numbers not supported')
  if (x > 2) {
    // Halve x until it falls within (0, 2], then add back k * ln(2)
    let k = 0
    while (x > 2) {
      x /= 2
      k++
    }
    return ln(x, terms) + k * LN2
  }

  // Special condition
  if (x == 0)
    return -Infinity

  let sum = 0
  const z = x - 1

  for (let n = 1; n <= terms; n++) {
    const term = ((-1) ** (n + 1)) * (z ** n) / n
    sum += term
  }

  return sum
}
```

This implementation is intentionally imprecise. Can you improve it?
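
One quick way to gauge its accuracy is to compare against the built-in Math.log:

```js
// The 100-term series converges well for this input
console.log(ln(10))       // roughly 2.302585...
console.log(Math.log(10)) // 2.302585092994046
```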

Floating Point Number System

A floating point number system is a subset of the real numbers (i.e. $F \subset \mathbb{R}$) whose elements have the form $y = \pm m \times \beta^{e-t}$. There are four integer parameters here which characterize any floating point number system:

  • the base $\beta$ (sometimes called the radix), which in this case is $2$
  • the exponent $e$, which lies in an exponent range $e_{\min} \le e \le e_{\max}$
  • the precision $t$, which is subtracted from the exponent
  • the significand $m$, where $0 \le m \le \beta^t - 1$

To ensure a unique representation for each nonzero $y \in F$, it is assumed that $m \ge \beta^{t-1}$ if $y \ne 0$, so that the system is normalized. The number $0$ is a special case in that it does not have a normalized representation.

The range of the nonzero floating point numbers in $F$ is given by $\beta^{e_{\min} - 1} \le |y| \le \beta^{e_{\max}}(1 - \beta^{-t})$.
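
To make these parameters concrete, the sketch below enumerates every positive normalized value of a toy system directly from the definition. The choice of $\beta = 2$, $t = 3$, $e \in [-1, 3]$ is our own:

```js
// Enumerate the positive normalized values of a toy system:
// beta = 2, t = 3, emin = -1, emax = 3, y = m * beta^(e - t)
const beta = 2, t = 3, emin = -1, emax = 3
const values = []

for (let e = emin; e <= emax; e++) {
  // Normalized: beta^(t-1) <= m <= beta^t - 1
  for (let m = beta ** (t - 1); m <= beta ** t - 1; m++) {
    values.push(m * beta ** (e - t))
  }
}

console.log(values)
// [0.25, 0.3125, 0.375, 0.4375, 0.5, ..., 4, 5, 6, 7]
// The smallest is beta^(emin - 1) = 0.25 and the largest is
// beta^emax * (1 - beta^(-t)) = 7, matching the range formula above.
```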

Note that an alternative (and more common) way of expressing $y$ is:

$$y = \pm \beta^e \times .d_1 d_2 \ldots d_t$$

where each digit satisfies $0 \le d_i \le \beta - 1$, and $d_1 \ne 0$ for normalized numbers. While this "positional" representation is more common, we prefer the more concise first representation, which is entirely integer-based and therefore simpler to grasp. In the positional representation, $d_1$ is called the most significant digit and $d_t$ the least significant digit.

Mathematical notation

It is helpful to understand how to express floating point numbers in mathematical notation if you plan to formally reason about floating point computations.

  • The sign ($s$) is a single bit which indicates whether the number is positive or negative, where 0 is positive and 1 is negative. This is usually expressed as $(-1)^s$, since $(-1)^0 = 1$ and $(-1)^1 = -1$.

TODO: number line intuition for sign bit

  • The exponent ($e$) determines the scale of the number (how big or small it is). When the exponent is 0, the value is referred to as a subnormal number, and is computed in a slightly different way.

TODO: number line intuition for exponent

  • The mantissa ($m$) determines the precision of the number (the actual digits), and is often referred to as the significand. In numerical analysis, it is often expressed as the sequence of binary digits after the binary point.

    For subnormal numbers ($e = 0$), the mantissa represents the fractional part of a significand of the form $0.m$, which lies somewhere in the interval $[0, 1)$. For all other exponents ($e > 0$), the mantissa represents the fractional part of a significand of the form $1.m$, which lies in the interval $[1, 2)$. One reason for this is a hidden bit trick that will be elaborated on later.

  • The bias (written as $b$) is subtracted from the exponent, so that positive and negative exponents can both be stored as an unsigned integer.

For this particular format, where $b = 7$, the represented value (for normalized numbers) might be formally expressed as

$$(-1)^s \times 2^{e - 7} \times \left(1 + \frac{m}{2^3}\right)$$

The widely-used 64 bit floating point format allocates 11 bits for the exponent and 52 bits for the mantissa, with $b = 1023$. The represented value is calculated as:

$$(-1)^s \times 2^{e - 1023} \times \left(1 + \frac{m}{2^{52}}\right)$$
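
We can check this formula against an actual float64 by extracting the three fields with typed arrays. A sketch, valid for normal numbers only; the helper name is ours:

```js
// Extract the sign, exponent, and mantissa fields of a float64.
function decompose(x) {
  const bits = new BigUint64Array(new Float64Array([x]).buffer)[0]
  const s = Number(bits >> 63n)
  const e = Number((bits >> 52n) & 0x7ffn)
  const m = Number(bits & 0xfffffffffffffn)
  // Reassemble the value using the formula above (normal numbers only)
  return { s, e, m, value: (-1) ** s * 2 ** (e - 1023) * (1 + m / 2 ** 52) }
}

console.log(decompose(1.5))
// { s: 0, e: 1023, m: 2251799813685248, value: 1.5 }
```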

In numerical analysis of floating point numbers, the base (written as $\beta$) usually refers to $2$, but IEEE 854 also allows for base 10 to be used. There is also a special constant $\epsilon$ which refers to machine epsilon - a worst case relative error for a single rounding operation.
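
In JavaScript, machine epsilon for the 64 bit format is exposed as Number.EPSILON - the gap between 1 and the next representable number:

```js
console.log(Number.EPSILON)              // 2.220446049250313e-16
console.log(Number.EPSILON === 2 ** -52) // true
```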

An example of a numerical analysis theorem, from David Goldberg's seminal work What Every Computer Scientist Should Know About Floating-Point Arithmetic, published in 1991:

The rounding error incurred when using the following formula to compute the area of a triangle with side lengths $a \ge b \ge c$:

$$A = \frac{\sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}}{4}$$

is at most $11\epsilon$, provided that subtraction is performed with a guard digit, $\epsilon \le 0.005$, and square roots are computed to within $\frac{1}{2}$ ulp.
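
Here is a sketch of that formula in code, next to the naive Heron's formula it improves upon. The needle-like test triangle is Kahan's classic example; the function names are ours:

```js
// Numerically stable triangle area (Kahan's formula, as given by Goldberg).
function triangleArea(a, b, c) {
  // Sort so that x >= y >= z, as the theorem requires
  const [x, y, z] = [a, b, c].sort((p, q) => q - p)
  return Math.sqrt((x + (y + z)) * (z - (x - y)) * (z + (x - y)) * (x + (y - z))) / 4
}

// Naive Heron's formula, which loses precision for needle-like triangles
function heron(a, b, c) {
  const s = (a + b + c) / 2
  return Math.sqrt(s * (s - a) * (s - b) * (s - c))
}

// The two results differ noticeably for this needle-like triangle
console.log(triangleArea(100000, 99999.99979, 0.00029))
console.log(heron(100000, 99999.99979, 0.00029))
```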

Relative error

You can edit the bits, exponent, and mantissa of a 64 bit floating point number below to get a sense of how much precision this actually provides.

Try editing the underlying numbers or bits of the floating point number.

[Interactive: an editable 64 bit floating point number with sign, exponent, and mantissa fields.]

This is often referred to as a double precision floating point number. The single precision floating point format allocates 8 bits for the exponent and 23 bits for the mantissa, which still provides fairly high precision with half the memory usage.
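
JavaScript exposes single precision rounding through Math.fround, which makes the difference in precision easy to observe:

```js
// Round a float64 to the nearest float32 value.
console.log(Math.fround(0.1)) // 0.10000000149011612
console.log(Math.fround(1.5)) // 1.5 (exactly representable in both formats)
```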

Floating point boundaries

The significand is being displayed as a number, but formal mathematical proofs often reason about the significand as a sequence of binary digits, where a normalized value takes the form $1.d_1 d_2 \ldots$, or $0.d_1 d_2 \ldots$ for subnormal values. Since the first digit of a normalized value is always 1, we can actually get an extra bit of precision for free - this is called the hidden bit trick.

You will often see floating point numbers written in scientific notation (like $\pm 1.m \times 2^e$), but as you can see, this does not indicate a standard multiplication operation.

Floating point proofs

In floating point arithmetic and associated proofs, a floating point number is typically written as $\pm d_0 . d_1 d_2 \ldots d_{p-1} \times \beta^e$, where $d_0 . d_1 d_2 \ldots d_{p-1}$ represents the digits of the significand, $\beta$ is the base, and $e$ is the exponent.

For engineers

Understanding the nuances of floating point numbers is especially helpful when designing machine learning systems.