"how numbers are stored and used in computers"
Block Floating Point (BFP) is a hybrid numerical format that blends the efficiency of fixed-point arithmetic with the dynamic range of floating point. Unlike traditional floating-point representations, where each value stores its own exponent and mantissa, BFP assigns a shared exponent to a group (or "block") of values; each individual value stores only a sign bit and a mantissa. This format is especially valuable in low-precision deep learning inference, where reducing memory footprint and improving compute throughput matter most.
In BFP, a vector or matrix is divided into blocks of fixed size, such as 8, 16, or 32 elements. Each block stores a single shared exponent (usually 4 to 8 bits) that applies uniformly to all of its elements, along with a low-precision mantissa per element (typically 2 to 8 bits) and, depending on the application and format, a sign bit per element.
This reduces the per-element bit count significantly. For example, an 8-element BFP4 block with 3-bit mantissas, 1-bit signs, and a single 5-bit shared exponent would use just 37 bits total, versus 64 bits for 8 FP8 values.
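As a quick check on that arithmetic, here is a tiny Python sketch tallying the bits; the field widths are the ones assumed in the example above, not a fixed standard:

```python
# Bit budget of one BFP block versus the same number of FP8 values.
# Layout from the example: 1 sign + 3 mantissa bits per element,
# one 5-bit shared exponent per 8-element block.
block_size = 8
sign_bits, mantissa_bits, exponent_bits = 1, 3, 5

bfp_bits = block_size * (sign_bits + mantissa_bits) + exponent_bits
fp8_bits = block_size * 8

print(bfp_bits, fp8_bits)  # 37 64
```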
Each value in a BFP block is reconstructed as

$$x_i = m_i \times 2^{e}$$

where $x_i$ is the reconstructed value, $m_i$ is the signed mantissa stored for element $i$, and $e$ is the block's shared exponent. The mantissas are typically interpreted as fixed-point numbers (e.g., with implicit normalization), and each value is then scaled by the common power of two.
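To make the reconstruction concrete, here is a minimal Python sketch of a block decoder. The function name and the signed-integer mantissa convention are illustrative assumptions, not part of any specific hardware format:

```python
def decode_bfp_block(mantissas, shared_exponent):
    """Reconstruct the real values of one BFP block.

    `mantissas` holds signed integers (sign and magnitude folded together);
    `shared_exponent` is the block's common power-of-two scale.
    Each value is x_i = m_i * 2**e.
    """
    return [m * 2.0 ** shared_exponent for m in mantissas]
```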
Let's say we have a block of 4 values whose stored mantissas are:

[+3, -2, +1, 0]

With a shared exponent of, say, $e = -1$, each value becomes $m_i \times 2^{-1}$, so the block decodes to 1.5, -1.0, 0.5, and 0.
These decoded values are quantized approximations of floating-point numbers that are close in magnitude. The shared exponent keeps them all on a consistent scale while costing only a few bits per block.
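Going the other way, a minimal encoder can derive the shared exponent from the largest magnitude in the block and round every value to a small signed mantissa. The sketch below uses assumed conventions (signed-integer mantissas with a separate sign, round-to-nearest); real formats differ in block size, rounding mode, and bit layout. It reproduces the worked example above, including the assumed exponent of -1:

```python
import math

def encode_bfp_block(values, mantissa_bits=3):
    """Quantize floats into (shared_exponent, signed mantissas).

    `mantissa_bits` counts magnitude bits only; the sign is stored
    separately, so mantissas lie in [-(2**mantissa_bits - 1), 2**mantissa_bits - 1].
    """
    max_mag = max(abs(v) for v in values)
    if max_mag == 0.0:
        return 0, [0] * len(values)
    max_mant = 2 ** mantissa_bits - 1
    # Smallest exponent for which the largest value still fits in a mantissa.
    shared_exp = math.ceil(math.log2(max_mag / max_mant))
    mantissas = [
        max(-max_mant, min(max_mant, round(v / 2.0 ** shared_exp)))
        for v in values
    ]
    return shared_exp, mantissas

# 2 magnitude bits (plus sign) are enough for the example block above.
exp, mants = encode_bfp_block([1.5, -1.0, 0.5, 0.0], mantissa_bits=2)
print(exp, mants)                        # -1 [3, -2, 1, 0]
print([m * 2.0 ** exp for m in mants])   # [1.5, -1.0, 0.5, 0.0]
```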
| Feature        | Traditional FP (e.g., FP32) | BFP                        |
|----------------|-----------------------------|----------------------------|
| Exponent scope | Per value                   | Per block                  |
| Precision      | High (per value)            | Lower, block-specific      |
| Range          | Wide                        | Medium (shared exponent)   |
| Memory usage   | High (e.g., 32 bits/value)  | Low (e.g., 4–8 bits/value) |
| Speed on SIMD  | Lower                       | Higher (aligned ops)       |
BFP is widely used in AI inference accelerators, where compute throughput and memory bandwidth are constrained. Because every element in a block shares one exponent, the bulk of a matrix multiplication can run as integer multiply-accumulates over the mantissas, with the exponents handled once per block, as sketched below.
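A hypothetical sketch of that idea (the function name and calling convention are invented for illustration):

```python
def bfp_dot(mants_a, exp_a, mants_b, exp_b):
    """Dot product of two BFP blocks of equal length.

    The inner loop is pure integer multiply-accumulate; the shared
    exponents are folded in with a single scale at the end.
    """
    acc = 0
    for ma, mb in zip(mants_a, mants_b):
        acc += ma * mb                      # integer MAC, SIMD-friendly
    return acc * 2.0 ** (exp_a + exp_b)     # one scale per block pair

# [1.5, -1.0, 0.5, 0.0] . [1, 1, 1, 1] = 1.0
print(bfp_dot([3, -2, 1, 0], -1, [1, 1, 1, 1], 0))  # 1.0
```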
Pros:

- Much lower memory footprint and bandwidth than FP32/FP16 (often 4–8 bits per value).
- Block-aligned integer arithmetic maps well to SIMD units and accelerator MAC arrays.
- Retains far more dynamic range than plain fixed point, thanks to the per-block exponent.
Cons:

- Precision is block-specific: a single large value forces a large shared exponent, so small values in the same block lose precision or round to zero.
- Accuracy depends on block size and on how similar in magnitude the grouped values are.
- Encoding and requantization add overhead when converting to and from standard floating point.
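The block-specific precision issue is easy to demonstrate: one outlier drags the shared exponent up and flattens the small values that share its block. A standalone snippet under the same assumed layout as the BFP4 example above (3 mantissa bits plus sign):

```python
import math

# One large value in the block forces a large shared exponent,
# so the small values round to zero after quantization.
values = [100.0, 0.3, -0.2, 0.1]
max_mant = 2 ** 3 - 1                                            # 7
shared_exp = math.ceil(math.log2(max(abs(v) for v in values) / max_mant))
mants = [max(-max_mant, min(max_mant, round(v / 2.0 ** shared_exp)))
         for v in values]

print(shared_exp, mants)                       # 4 [6, 0, 0, 0]
print([m * 2.0 ** shared_exp for m in mants])  # [96.0, 0.0, 0.0, 0.0]
```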
Block Floating Point (BFP) is a practical compromise between fixed-point efficiency and floating-point flexibility. By applying a shared exponent to groups of low-precision mantissas, BFP enables high-throughput, low-memory computations well suited for modern ML hardware. As neural networks continue to push toward quantization-aware training and ultra-low-precision inference, BFP and its variants (like FP4 with shared exponent) play a key role in scaling AI efficiently.