"how numbers are stored and used in computers"
The bfloat16 (BF16) format is a 16-bit floating-point representation designed for machine learning and numerical computing. Unlike the IEEE 754 half-precision (FP16) format, BF16 prioritizes range over precision, allocating a much larger share of its bits to the exponent than FP16 does.
BF16 was originally developed by Google for use in Tensor Processing Units (TPUs). It has since gained widespread support across CPUs, GPUs, and AI accelerators, and is close to being a drop-in replacement for FP32 in specialized machine learning applications.
BF16 is not formally defined by IEEE 754, though it is closely aligned with IEEE formats. It uses the same exponent width (8 bits) and bias (127) as FP32, enabling a comparable dynamic range, while reducing the mantissa from FP32's 23 bits to 7 bits (plus 1 sign bit, for 16 bits in total).
For a normalized BF16 number with sign bit S, 8-bit biased exponent E, and 7-bit mantissa M, the value is computed as

(-1)^S × 2^(E - 127) × (1 + M/128)
Consider the decimal number 1.5. Its IEEE 754 FP32 binary encoding is:

0 01111111 10000000000000000000000 (sign | 8-bit exponent | 23-bit mantissa), i.e. 0x3FC00000.

To obtain the BF16 representation, we simply take the upper 16 bits:

0 01111111 1000000 (sign | 8-bit exponent | 7-bit mantissa)

This yields a BF16 hex encoding of 0x3FC0.
In general, this truncation introduces a small rounding error compared to FP32 (none in this case, since 1.5 is exactly representable in BF16), but it preserves the scale of the number and its position in exponent space.
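To make the example concrete, here is a minimal decoding sketch in plain Python (no dependencies). The helper decode_bf16 is illustrative rather than a standard library function; it applies the sign/exponent/mantissa formula given above.

```python
def decode_bf16(bits: int) -> float:
    """Decode a 16-bit BF16 pattern into a Python float (illustrative helper)."""
    sign = (bits >> 15) & 0x1          # 1 sign bit
    exponent = (bits >> 7) & 0xFF      # 8 exponent bits, bias 127
    mantissa = bits & 0x7F             # 7 mantissa bits
    if exponent == 0:                  # subnormals: implicit leading 0
        value = (mantissa / 128) * 2.0 ** (-126)
    elif exponent == 0xFF:             # infinities and NaNs
        value = float("nan") if mantissa else float("inf")
    else:                              # normal numbers: implicit leading 1
        value = (1 + mantissa / 128) * 2.0 ** (exponent - 127)
    return -value if sign else value

print(decode_bf16(0x3FC0))  # 1.5: sign=0, exponent=127, mantissa=0b1000000
```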
The development of BF16 was driven by empirical insights from deep learning workloads. Many neural network operations—such as convolution, matrix multiplication, and activation functions—are tolerant to precision loss in individual elements, but highly sensitive to range errors (e.g., overflow during training). By matching the exponent range of FP32, BF16 enables computations involving large gradients, scaled learning rates, and exponential moving averages, all without suffering from numerical instability. This allows models to train using low-precision data representations without requiring careful tuning of input magnitudes or loss scaling.
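The range argument is easy to observe directly. The following sketch (assuming PyTorch is available) compares the representable maxima of the three formats and shows a value that overflows FP16 but survives in BF16.

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, the same order as float32
print(torch.finfo(torch.float32).max)   # ~3.40e38

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf: exceeds FP16's maximum
print(x.to(torch.bfloat16))  # finite (~7.0e4): only low-order mantissa bits are lost
```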
BF16 was designed with hardware simplification in mind. Conversion between BF16 and FP32 is straightforward: an FP32 number can be converted to BF16 by dropping the lowest 16 bits of the mantissa while preserving the sign and exponent bits. This enables high-throughput mixed-precision execution pipelines where storage and communication are in BF16, arithmetic is performed in FP32 or FP64, and downcasting is done via simple bit manipulation.
For example, Intel's AVX-512 BF16 instructions and ARM's BFloat16 extension support fast accumulation and conversion pathways to minimize accuracy loss in training and inference.
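In software, the same truncation can be sketched with NumPy bit manipulation. The helpers truncate_to_bf16_bits and bf16_bits_to_float32 below are illustrative, not NumPy APIs, and real hardware and frameworks typically apply round-to-nearest-even rather than plain truncation.

```python
import numpy as np

def truncate_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Return 16-bit BF16 patterns obtained by dropping the low 16 bits of FP32."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_bits_to_float32(bf16_bits: np.ndarray) -> np.ndarray:
    """Widen BF16 bit patterns back to FP32 by appending 16 zero bits."""
    return (bf16_bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.5, 3.141592653589793, 1e30], dtype=np.float32)
bf16 = truncate_to_bf16_bits(x)
print([hex(b) for b in bf16])      # 1.5 -> 0x3fc0, as in the example above
print(bf16_bits_to_float32(bf16))  # mantissas truncated toward zero
```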
Major software frameworks expose BF16 directly:

- PyTorch: torch.bfloat16, with autocast via torch.cuda.amp.autocast()
- TensorFlow: tf.bfloat16 and the mixed-precision API
- NumPy: numpy.dtype('bfloat16') in third-party extensions (e.g., Intel Extension for NumPy)

Many frameworks automatically cast parameters to BF16 where supported and fall back to FP32 where not, enabling portability and performance without sacrificing compatibility.
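As a brief illustration of the PyTorch entry points listed above (a sketch assuming a recent PyTorch build; the CUDA branch only runs where a GPU is present):

```python
import torch

a = torch.randn(4, 4, dtype=torch.bfloat16)   # allocate directly in BF16
b = torch.randn(4, 4).to(torch.bfloat16)      # or downcast an FP32 tensor
print((a @ b).dtype)                          # torch.bfloat16 (BF16 matmul on recent CPU builds)

if torch.cuda.is_available():
    x = torch.randn(8, 16, device="cuda")
    layer = torch.nn.Linear(16, 4).cuda()     # parameters remain FP32
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        y = layer(x)                          # the matmul runs in BF16
    print(y.dtype)                            # torch.bfloat16
```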
BF16 is particularly effective in scenarios such as neural-network training and inference on hardware with native BF16 support, and in mixed-precision pipelines where data is stored and communicated in BF16 while accumulation is performed in FP32.
During training it is common to keep a master copy of the weights in FP32 to prevent degradation during gradient descent; unlike FP16, BF16 generally does not require loss scaling thanks to its FP32-matched exponent range. Most frameworks provide automatic tools to handle these transitions, as in the sketch below.
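A rough sketch of that pattern in PyTorch follows; the model, optimizer, and synthetic data are placeholders. Parameters and optimizer state stay in FP32 while the forward pass runs under a BF16 autocast context.

```python
import torch

model = torch.nn.Linear(32, 1)                        # weights kept in FP32 ("master weights")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

inputs = torch.randn(64, 32)
targets = torch.randn(64, 1)

for _ in range(10):
    optimizer.zero_grad()
    # Forward pass in BF16; gradients land back in the FP32 parameters.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()        # FP32 update; no loss scaling needed, unlike FP16
```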
While BF16 provides excellent dynamic range, it is unsuitable for algorithms requiring high relative accuracy, such as linear solvers for ill-conditioned matrices, eigenvalue decomposition, and scientific simulations that must respect conservation laws. Developers should be aware that BF16 can yield noticeably different results from FP32, even when convergence is preserved, and debugging numerical issues can be harder with truncated formats because the low-order mantissa bits are absent.
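The precision ceiling is easy to demonstrate (a small sketch assuming PyTorch): with 7 mantissa bits, the spacing between adjacent BF16 values near 1.0 is 2**-7 ≈ 0.008.

```python
import torch

print(torch.tensor(1.001, dtype=torch.bfloat16))  # rounds to 1.0 (spacing near 1.0 is 2**-7)
print(torch.tensor(1.001, dtype=torch.float32))   # ~1.0010000467

x = torch.tensor(1000.0, dtype=torch.bfloat16)
print(x + 1.0)   # still 1000.0: the increment is below half the mantissa step (4.0 here)
```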