Number Representations & States

"how numbers are stored and used in computers"

Floating point numbers

Floating point numbers represent numeric values in a form of scientific notation ($\pm\,\text{significand} \times 2^{\text{exponent}}$). Each number is actually three numbers stored within a fixed bit length.

Try editing the 8 bit floating point number below.

[Interactive: an editable 8 bit float with a sign bit (0), a 4 bit exponent (0000), and a 3 bit mantissa (000), currently evaluating to $\frac{0}{2^3} \times 2^{1 - 7} = 0 \times 2^{-7} = 0$.]

Let's first try to understand this intuitively. The exponent really just selects an interval between two successive powers of 2, like $[1, 2)$, $[2, 4)$, and so on. The mantissa (also called the significand) is the relative offset within this interval, and the sign bit controls the presence of a negative sign ($-$).
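
To make that concrete, here is a minimal sketch of how such an 8 bit value might be decoded in JavaScript, assuming the bias of 7 used above; the helper name decodeE4M3 is our own:

```js
// Decode an 8-bit E4M3 value: 1 sign bit, 4 exponent bits, 3 mantissa bits.
function decodeE4M3(byte) {
  const sign = byte >> 7 ? -1 : 1
  const exponent = (byte >> 3) & 0b1111
  const mantissa = byte & 0b111
  if (exponent === 0b1111 && mantissa === 0b111) return NaN
  return exponent === 0
    ? sign * (mantissa / 8) * 2 ** (1 - 7)            // subnormal: 0.m
    : sign * (1 + mantissa / 8) * 2 ** (exponent - 7) // normal: 1.m (hidden bit)
}

console.log(decodeE4M3(0b00000000)) // 0
console.log(decodeE4M3(0b01011000)) // 16
console.log(decodeE4M3(0b01111110)) // 448 (the largest E4M3 value)
```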

Try moving the slider along the number line below to see how the sign, exponent, and mantissa change for numbers in a given interval.

Try editing the 8 bit floating point number below.

[Interactive: a slider along the number line, paired with the same editable sign, exponent, and mantissa bit fields shown above.]

Just as binary integers can be incremented to the next binary integer, adding one to a floating point number's binary representation produces the next representable floating point number, which is not necessarily the number plus one. As a floating point number gets larger it "floats" up to the next interval, and as it gets smaller it "floats" down to the previous interval.
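
In JavaScript, we can observe this directly by reinterpreting a 64 bit float's bits as an integer and adding one. A minimal sketch using typed arrays:

```js
// Reinterpret a float64's bits as a 64-bit unsigned integer and back.
const float = new Float64Array(1)
const bits = new BigUint64Array(float.buffer) // shares the same memory

float[0] = 1.5
bits[0] += 1n // add one to the raw binary representation

// The next representable float, not 2.5:
console.log(float[0]) // 1.5000000000000002
```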

Intervals closer to zero are "more dense", in the sense that the mantissa can provide a more precise fraction of that interval. Intervals farther from zero are "more sparse", where it becomes increasingly likely that the mantissa will not provide enough resolution to precisely represent a particular number.
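
You can measure this growing sparsity with the same bit-reinterpretation trick. A sketch; gapAbove is our own helper:

```js
// Measure the gap to the next representable float64 at different magnitudes.
function gapAbove(x) {
  const float = new Float64Array([x])
  const bits = new BigUint64Array(float.buffer)
  bits[0] += 1n
  return float[0] - x
}

console.log(gapAbove(1))    // 2.220446049250313e-16
console.log(gapAbove(1e6))  // 1.1641532182693481e-10
console.log(gapAbove(1e15)) // 0.125
```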

Floating point formats

The floating point format shown here is named E4M3, because it allocates 4 bits for an exponent and 3 bits for a mantissa. This format is usually only found in highly specialized machine learning applications, but is conceptually similar to the floating point formats that software developers typically interact with (32 bit and 64 bit).

Since there are only 256 possible values for any E4M3 number, it is easy to visualize the distribution of representable values in a table.

E4M3: each column gives the first four bits of the number, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | +0 | 0.03125 | 0.125 | 0.5 | 2 | 8 | 32 | 128 | -0 | -0.03125 | -0.125 | -0.5 | -2 | -8 | -32 | -128 |
| ...0001 | 0.001953 | 0.035156 | 0.140625 | 0.5625 | 2.25 | 9 | 36 | 144 | -0.001953 | -0.035156 | -0.140625 | -0.5625 | -2.25 | -9 | -36 | -144 |
| ...0010 | 0.003906 | 0.039063 | 0.15625 | 0.625 | 2.5 | 10 | 40 | 160 | -0.003906 | -0.039063 | -0.15625 | -0.625 | -2.5 | -10 | -40 | -160 |
| ...0011 | 0.005859 | 0.042969 | 0.171875 | 0.6875 | 2.75 | 11 | 44 | 176 | -0.005859 | -0.042969 | -0.171875 | -0.6875 | -2.75 | -11 | -44 | -176 |
| ...0100 | 0.007813 | 0.046875 | 0.1875 | 0.75 | 3 | 12 | 48 | 192 | -0.007813 | -0.046875 | -0.1875 | -0.75 | -3 | -12 | -48 | -192 |
| ...0101 | 0.009766 | 0.050781 | 0.203125 | 0.8125 | 3.25 | 13 | 52 | 208 | -0.009766 | -0.050781 | -0.203125 | -0.8125 | -3.25 | -13 | -52 | -208 |
| ...0110 | 0.011719 | 0.054688 | 0.21875 | 0.875 | 3.5 | 14 | 56 | 224 | -0.011719 | -0.054688 | -0.21875 | -0.875 | -3.5 | -14 | -56 | -224 |
| ...0111 | 0.013672 | 0.058594 | 0.234375 | 0.9375 | 3.75 | 15 | 60 | 240 | -0.013672 | -0.058594 | -0.234375 | -0.9375 | -3.75 | -15 | -60 | -240 |
| ...1000 | 0.015625 | 0.0625 | 0.25 | 1 | 4 | 16 | 64 | 256 | -0.015625 | -0.0625 | -0.25 | -1 | -4 | -16 | -64 | -256 |
| ...1001 | 0.017578 | 0.070313 | 0.28125 | 1.125 | 4.5 | 18 | 72 | 288 | -0.017578 | -0.070313 | -0.28125 | -1.125 | -4.5 | -18 | -72 | -288 |
| ...1010 | 0.019531 | 0.078125 | 0.3125 | 1.25 | 5 | 20 | 80 | 320 | -0.019531 | -0.078125 | -0.3125 | -1.25 | -5 | -20 | -80 | -320 |
| ...1011 | 0.021484 | 0.085938 | 0.34375 | 1.375 | 5.5 | 22 | 88 | 352 | -0.021484 | -0.085938 | -0.34375 | -1.375 | -5.5 | -22 | -88 | -352 |
| ...1100 | 0.023438 | 0.09375 | 0.375 | 1.5 | 6 | 24 | 96 | 384 | -0.023438 | -0.09375 | -0.375 | -1.5 | -6 | -24 | -96 | -384 |
| ...1101 | 0.025391 | 0.101563 | 0.40625 | 1.625 | 6.5 | 26 | 104 | 416 | -0.025391 | -0.101563 | -0.40625 | -1.625 | -6.5 | -26 | -104 | -416 |
| ...1110 | 0.027344 | 0.109375 | 0.4375 | 1.75 | 7 | 28 | 112 | 448 | -0.027344 | -0.109375 | -0.4375 | -1.75 | -7 | -28 | -112 | -448 |
| ...1111 | 0.029297 | 0.117188 | 0.46875 | 1.875 | 7.5 | 30 | 120 | +NaN | -0.029297 | -0.117188 | -0.46875 | -1.875 | -7.5 | -30 | -120 | -NaN |

Notice how the representable values of a floating point number are most dense around 0, and exponentially less dense in higher intervals. This observation applies equally to any floating point format, regardless of bit length.

Since the achievable precision depends on the magnitude of the number, the accuracy of floating point computations is often expressed in terms of ulps (units in the last place) - an error metric which accounts for the density of representable numbers around a computed result. For example, if you were to compute $5 \times 5$ with E4M3 floating point numbers, there is no possible sequence of 8 bits to represent the exact answer $25$ - but how would you express the accuracy of the nearest approximations in this case ($24$ or $26$)?

There are simpler ways to describe error, like relative error and absolute error, but in the case of floating point arithmetic it's often useful to account for the density of representable numbers around an expected result, because that density grows exponentially sparse for larger numbers.
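
As an illustration, the distance in ulps between two float64 values can be computed by comparing their bit patterns, since adjacent representable values have adjacent integer bit patterns. A sketch that assumes non-negative, finite inputs; ulpDistance is our own helper:

```js
// Distance in ulps between two non-negative finite float64 values.
function ulpDistance(a, b) {
  const toBits = (x) => new BigUint64Array(new Float64Array([x]).buffer)[0]
  const diff = toBits(a) - toBits(b)
  return diff < 0n ? -diff : diff
}

console.log(ulpDistance(1, 1 + Number.EPSILON))  // 1n
console.log(ulpDistance(1, 1.0000000000000004)) // 2n
```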

[Figure: log-scaled distribution of positive values for E4M3 floating point numbers, with axis ticks at 0, 112, 224, 336, and 448.]

Zeroes, NaN and Infinity

The E4M3 format also allows two possible representations of zero (00000000 and 10000000) and two possible representations of NaN (not a number). Some floating point formats also define special values to represent positive and negative Infinity. One criterion for considering a floating point format to be IEEE 754 compliant is whether it can represent NaN, +Infinity, and -Infinity.

E5M2: follows IEEE 754 conventions for the representation of special values. Each column gives the first four bits, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | +0 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0001 | 0.000015 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000015 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0010 | 0.000031 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000031 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0011 | 0.000046 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000046 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...0100 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...0101 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...0110 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...0111 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1000 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1001 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1010 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1011 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |
| ...1100 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | +Infinity | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 | -Infinity |
| ...1101 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | +NaN | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 | -NaN |
| ...1110 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | +NaN | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 | -NaN |
| ...1111 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | +NaN | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 | -NaN |

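JavaScript's 64 bit numbers follow these same IEEE 754 conventions for special values, which is easy to check directly:

```js
// IEEE 754 special values in JavaScript's float64.
console.log(1 / 0)            // Infinity
console.log(-1 / 0)           // -Infinity
console.log(0 === -0)         // true  (comparison treats the zeroes as equal)
console.log(Object.is(0, -0)) // false (but the bit patterns differ)
console.log(NaN === NaN)      // false (NaN compares unequal to everything)
```
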
There are certain small integer values, like 17, that a format like E4M3 cannot represent. We can represent 16 as 01011000, but the next bit pattern, 01011001, decodes to 18, so 17 has to round to one of its neighbors. Even if you use a larger floating point format, like the 64 bit format in your web browser, you will eventually reach a maximum safe integer.

```js
console.log(2 ** 53)     // 9007199254740992
console.log(2 ** 53 + 1) // 9007199254740992 - same!
```

Notice that $2^{53} - 1$ is the maximum safe integer for a floating point format which allocates 52 bits for the mantissa (the hidden bit supplies the 53rd). This lends to the intuition that $2^{m+1}$, where $m$ is the number of mantissa bits, is an approximate threshold beyond which a floating point format will lose integer precision.

```js
// The largest and smallest integers that can be used safely
console.log(Number.MAX_SAFE_INTEGER) // 9007199254740991
console.log(Number.MIN_SAFE_INTEGER) // -9007199254740991

// The representable range extends many orders of magnitude further
console.log(Number.MAX_VALUE) // 1.7976931348623157e+308
console.log(Number.MIN_VALUE) // 5e-324 (smallest positive value)
```

Depending on the application, a machine learning developer might choose a more specialized 8 bit floating point format like E5M2FNUZ, which uses 5 bits for the exponent, 2 bits for the mantissa, and defines itself as finite (FN, no infinities) with unsigned zero (UZ, no negative zero).

E5M2FNUZ: each column gives the first four bits, and each row gives the last four bits.

| | 0000... | 0001... | 0010... | 0011... | 0100... | 0101... | 0110... | 0111... | 1000... | 1001... | 1010... | 1011... | 1100... | 1101... | 1110... | 1111... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ...0000 | 0 | 0.000244 | 0.003906 | 0.0625 | 1 | 16 | 256 | 4096 | NaN | -0.000244 | -0.003906 | -0.0625 | -1 | -16 | -256 | -4096 |
| ...0001 | 0.000008 | 0.000305 | 0.004883 | 0.078125 | 1.25 | 20 | 320 | 5120 | -0.000008 | -0.000305 | -0.004883 | -0.078125 | -1.25 | -20 | -320 | -5120 |
| ...0010 | 0.000015 | 0.000366 | 0.005859 | 0.09375 | 1.5 | 24 | 384 | 6144 | -0.000015 | -0.000366 | -0.005859 | -0.09375 | -1.5 | -24 | -384 | -6144 |
| ...0011 | 0.000023 | 0.000427 | 0.006836 | 0.109375 | 1.75 | 28 | 448 | 7168 | -0.000023 | -0.000427 | -0.006836 | -0.109375 | -1.75 | -28 | -448 | -7168 |
| ...0100 | 0.000031 | 0.000488 | 0.007813 | 0.125 | 2 | 32 | 512 | 8192 | -0.000031 | -0.000488 | -0.007813 | -0.125 | -2 | -32 | -512 | -8192 |
| ...0101 | 0.000038 | 0.00061 | 0.009766 | 0.15625 | 2.5 | 40 | 640 | 10240 | -0.000038 | -0.00061 | -0.009766 | -0.15625 | -2.5 | -40 | -640 | -10240 |
| ...0110 | 0.000046 | 0.000732 | 0.011719 | 0.1875 | 3 | 48 | 768 | 12288 | -0.000046 | -0.000732 | -0.011719 | -0.1875 | -3 | -48 | -768 | -12288 |
| ...0111 | 0.000053 | 0.000854 | 0.013672 | 0.21875 | 3.5 | 56 | 896 | 14336 | -0.000053 | -0.000854 | -0.013672 | -0.21875 | -3.5 | -56 | -896 | -14336 |
| ...1000 | 0.000061 | 0.000977 | 0.015625 | 0.25 | 4 | 64 | 1024 | 16384 | -0.000061 | -0.000977 | -0.015625 | -0.25 | -4 | -64 | -1024 | -16384 |
| ...1001 | 0.000076 | 0.001221 | 0.019531 | 0.3125 | 5 | 80 | 1280 | 20480 | -0.000076 | -0.001221 | -0.019531 | -0.3125 | -5 | -80 | -1280 | -20480 |
| ...1010 | 0.000092 | 0.001465 | 0.023438 | 0.375 | 6 | 96 | 1536 | 24576 | -0.000092 | -0.001465 | -0.023438 | -0.375 | -6 | -96 | -1536 | -24576 |
| ...1011 | 0.000107 | 0.001709 | 0.027344 | 0.4375 | 7 | 112 | 1792 | 28672 | -0.000107 | -0.001709 | -0.027344 | -0.4375 | -7 | -112 | -1792 | -28672 |
| ...1100 | 0.000122 | 0.001953 | 0.03125 | 0.5 | 8 | 128 | 2048 | 32768 | -0.000122 | -0.001953 | -0.03125 | -0.5 | -8 | -128 | -2048 | -32768 |
| ...1101 | 0.000153 | 0.002441 | 0.039063 | 0.625 | 10 | 160 | 2560 | 40960 | -0.000153 | -0.002441 | -0.039063 | -0.625 | -10 | -160 | -2560 | -40960 |
| ...1110 | 0.000183 | 0.00293 | 0.046875 | 0.75 | 12 | 192 | 3072 | 49152 | -0.000183 | -0.00293 | -0.046875 | -0.75 | -12 | -192 | -3072 | -49152 |
| ...1111 | 0.000214 | 0.003418 | 0.054688 | 0.875 | 14 | 224 | 3584 | 57344 | -0.000214 | -0.003418 | -0.054688 | -0.875 | -14 | -224 | -3584 | -57344 |

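The values in this table can be reproduced with a small decoder. The sketch below assumes the usual FNUZ conventions, which are consistent with the table above: an exponent bias of 16, no infinities, and the single bit pattern 10000000 reserved for NaN. The helper name is ours:

```js
// Decode an 8-bit E5M2FNUZ value from its integer bit pattern (0-255).
function decodeE5M2FNUZ(byte) {
  if (byte === 0b10000000) return NaN // the only NaN; there are no infinities
  const sign = byte >> 7 ? -1 : 1
  const exponent = (byte >> 2) & 0b11111
  const mantissa = byte & 0b11
  return exponent === 0
    ? sign * (mantissa / 4) * 2 ** (1 - 16)            // subnormal
    : sign * (1 + mantissa / 4) * 2 ** (exponent - 16) // normal
}

console.log(decodeE5M2FNUZ(0b01000000)) // 1
console.log(decodeE5M2FNUZ(0b01111111)) // 57344 (the largest value)
```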

Transcendental functions

Suppose you are building a system that calculates a transcendental function - one whose output generally cannot be expressed precisely, like $e^x$ or $\ln(x)$.

Suppose we want to calculate $\ln(x)$ - we could use the Taylor expansion of $\ln(x)$ around $x = 1$:

$$\ln(x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} (x - 1)^n, \quad 0 < x \le 2$$

```js
function ln(x, terms = 100) {
  // Boundary conditions
  if (x < 0)
    throw new Error('complex numbers not supported')
  if (x > 2)
    throw new Error('only works for 0 < x <= 2 for now')

  // Special condition
  if (x == 0)
    return -Infinity

  let sum = 0
  const z = x - 1

  for (let n = 1; n <= terms; n++) {
    const term = ((-1) ** (n + 1)) * (z ** n) / n
    sum += term
  }

  return sum
}
```

We can then employ the logarithm identity $\ln(x) = \ln\left(\frac{x}{c}\right) + \ln(c)$ to generalize this solution for all possible values of $x$.

If we choose a value of $c$ which is the $k$-th power of 2 (i.e. $c = 2^k$), then we can simplify this further into an algorithm that only needs to store the precomputed value of $\ln(2)$:

$$\ln(x) = \ln\left(\frac{x}{2^k}\right) + k \ln(2)$$

```js
const LN2 = 0.6931471805599453

function ln(x, terms = 100) {
  // Boundary conditions
  if (x < 0)
    throw new Error('complex numbers not supported')
  if (x > 2) {
    // Halve x until it falls within (0, 2], then add back k * ln(2)
    let k = 0
    while (x > 2) {
      x /= 2
      k++
    }
    return ln(x, terms) + k * LN2
  }

  // Special condition
  if (x == 0)
    return -Infinity

  let sum = 0
  const z = x - 1

  for (let n = 1; n <= terms; n++) {
    const term = ((-1) ** (n + 1)) * (z ** n) / n
    sum += term
  }

  return sum
}
```

This implementation is intentionally imprecise. Can you improve it?
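
One quick way to gauge its accuracy is to compare against the built-in Math.log:

```js
// The 100-term series converges well for this input
console.log(ln(10))       // roughly 2.302585...
console.log(Math.log(10)) // 2.302585092994046
```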

Floating Point Number System

A floating point number system is a subset of the real numbers (i.e. $F \subset \mathbb{R}$) whose elements have the form $y = \pm m \times \beta^{e-t}$. There are four integer parameters here which characterize any floating point number system:

  • the base $\beta$ (sometimes called the radix), which in this case is $2$
  • the exponent $e$, which lies in an exponent range $e_{\min} \le e \le e_{\max}$
  • the precision $t$, which is subtracted from the exponent
  • the significand $m$, where $0 \le m \le \beta^t - 1$

To ensure a unique representation for each nonzero $y \in F$, it is assumed that $m \ge \beta^{t-1}$ if $y \ne 0$, so that the system is normalized. The number $0$ is a special case in that it does not have a normalized representation.

The range of the nonzero floating point numbers in $F$ is given by $\beta^{e_{\min} - 1} \le |y| \le \beta^{e_{\max}}(1 - \beta^{-t})$.
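
To make these parameters concrete, the sketch below enumerates every positive normalized value of a toy system directly from the definition. The choice of $\beta = 2$, $t = 3$, $e \in [-1, 3]$ is our own:

```js
// Enumerate the positive normalized values of a toy system:
// beta = 2, t = 3, emin = -1, emax = 3, y = m * beta^(e - t)
const beta = 2, t = 3, emin = -1, emax = 3
const values = []

for (let e = emin; e <= emax; e++) {
  // Normalized: beta^(t-1) <= m <= beta^t - 1
  for (let m = beta ** (t - 1); m <= beta ** t - 1; m++) {
    values.push(m * beta ** (e - t))
  }
}

console.log(values)
// [0.25, 0.3125, 0.375, 0.4375, 0.5, ..., 4, 5, 6, 7]
// The smallest is beta^(emin - 1) = 0.25 and the largest is
// beta^emax * (1 - beta^(-t)) = 7, matching the range formula above.
```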

Note that an alternative (and more common) way of expressing $y$ is:

$$y = \pm \beta^e \times .d_1 d_2 \ldots d_t$$

where each digit satisfies $0 \le d_i \le \beta - 1$, and $d_1 \ne 0$ for normalized numbers. While this "positional" representation is more common, we prefer the more concise first representation, which is entirely integer-based and therefore simpler to grasp. In the positional representation, $d_1$ is called the most significant digit and $d_t$ the least significant digit.

Mathematical notation

It is helpful to understand how to express floating point numbers in mathematical notation if you plan to formally reason about floating point computations.

  • The sign ($s$) is a single bit which indicates whether the number is positive or negative, where 0 is positive and 1 is negative. This is usually expressed as $(-1)^s$, since $(-1)^0 = 1$ and $(-1)^1 = -1$.

TODO: number line intuition for sign bit

  • The exponent ($e$) determines the scale of the number (how big or small it is). When the exponent is 0, the value is referred to as a subnormal number, and is computed in a slightly different way.

TODO: number line intuition for exponent

  • The mantissa ($m$) determines the precision of the number (the actual digits), and is often referred to as the significand. In numerical analysis, it is often expressed as the sequence of binary digits after the binary point.

    For subnormal numbers ($e = 0$), the mantissa represents the fractional part of a significand of the form $0.m$, which lies somewhere in the interval $[0, 1)$. For all other exponents ($e > 0$), the mantissa represents the fractional part of a significand of the form $1.m$, which lies in the interval $[1, 2)$. One reason for this is a hidden bit trick that will be elaborated on later.

  • The bias (written as $b$) is subtracted from the exponent, so that positive and negative exponents can both be stored as an unsigned integer.

For this particular format, where $b = 7$, the represented value (for normalized numbers) might be formally expressed as

$$(-1)^s \times 2^{e - 7} \times \left(1 + \frac{m}{2^3}\right)$$

The widely-used 64 bit floating point format allocates 11 bits for the exponent and 52 bits for the mantissa, with $b = 1023$. The represented value is calculated as:

$$(-1)^s \times 2^{e - 1023} \times \left(1 + \frac{m}{2^{52}}\right)$$
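
We can check this formula against an actual float64 by extracting the three fields with typed arrays. A sketch, valid for normal numbers only; the helper name is ours:

```js
// Extract the sign, exponent, and mantissa fields of a float64.
function decompose(x) {
  const bits = new BigUint64Array(new Float64Array([x]).buffer)[0]
  const s = Number(bits >> 63n)
  const e = Number((bits >> 52n) & 0x7ffn)
  const m = Number(bits & 0xfffffffffffffn)
  // Reassemble the value using the formula above (normal numbers only)
  return { s, e, m, value: (-1) ** s * 2 ** (e - 1023) * (1 + m / 2 ** 52) }
}

console.log(decompose(1.5))
// { s: 0, e: 1023, m: 2251799813685248, value: 1.5 }
```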

In numerical analysis of floating point numbers, the base (written as $\beta$) usually refers to $2$, but IEEE 854 also allows for base 10 to be used. There is also a special constant $\epsilon$ which refers to machine epsilon - a worst case relative error for a single rounding operation.
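
In JavaScript, machine epsilon for the 64 bit format is exposed as Number.EPSILON - the gap between 1 and the next representable number:

```js
console.log(Number.EPSILON)              // 2.220446049250313e-16
console.log(Number.EPSILON === 2 ** -52) // true
```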

An example of a numerical analysis theorem, from David Goldberg's seminal work What Every Computer Scientist Should Know About Floating-Point Arithmetic, published in 1991:

The rounding error incurred when using the following formula to compute the area of a triangle with side lengths $a \ge b \ge c$:

$$A = \frac{\sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}}{4}$$

is at most $11\epsilon$, provided that subtraction is performed with a guard digit, $\epsilon \le 0.005$, and square roots are computed to within $\frac{1}{2}$ ulp.
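
Here is a sketch of that formula in code, next to the naive Heron's formula it improves upon. The needle-like test triangle is Kahan's classic example; the function names are ours:

```js
// Numerically stable triangle area (Kahan's formula, as given by Goldberg).
function triangleArea(a, b, c) {
  // Sort so that x >= y >= z, as the theorem requires
  const [x, y, z] = [a, b, c].sort((p, q) => q - p)
  return Math.sqrt((x + (y + z)) * (z - (x - y)) * (z + (x - y)) * (x + (y - z))) / 4
}

// Naive Heron's formula, which loses precision for needle-like triangles
function heron(a, b, c) {
  const s = (a + b + c) / 2
  return Math.sqrt(s * (s - a) * (s - b) * (s - c))
}

// The two results differ noticeably for this needle-like triangle
console.log(triangleArea(100000, 99999.99979, 0.00029))
console.log(heron(100000, 99999.99979, 0.00029))
```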

Relative error

You can edit the bits, exponent, and mantissa of a 64 bit floating point number below to get a sense of how much precision this actually provides.

Try editing the underlying numbers or bits of the floating point number.

[Interactive: an editable 64 bit floating point number with sign, exponent, and mantissa fields.]

This is often referred to as a double precision floating point number. The single precision floating point format allocates 8 bits for the exponent and 23 bits for the mantissa, which still provides fairly high precision with half the memory usage.
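
JavaScript exposes single precision rounding through Math.fround, which makes the difference in precision easy to observe:

```js
// Round a float64 to the nearest float32 value.
console.log(Math.fround(0.1)) // 0.10000000149011612
console.log(Math.fround(1.5)) // 1.5 (exactly representable in both formats)
```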

Floating point boundaries

The significand is being displayed as a number, but formal mathematical proofs often reason about the significand as a sequence of binary digits, where a normalized value takes the form $1.d_1 d_2 \ldots$, or $0.d_1 d_2 \ldots$ for subnormal values. Since the first digit of a normalized value is always 1, we can actually get an extra bit of precision for free - this is called the hidden bit trick.

You will often see floating point numbers written in scientific notation (like $\pm 1.m \times 2^e$), but as you can see, this does not indicate a standard multiplication operation.

Floating point proofs

In floating point arithmetic and associated proofs, a floating point number is typically written as $\pm d_0 . d_1 d_2 \ldots d_{p-1} \times \beta^e$, where $d_0 . d_1 d_2 \ldots d_{p-1}$ represents the digits of the significand, $\beta$ is the base, and $e$ is the exponent.

For engineers

Understanding the nuances of floating point numbers is especially helpful when designing machine learning systems.