"how numbers are stored in computers"
While the IEEE 754 single-precision (32-bit) and double-precision (64-bit) formats are familiar to most developers, the 16-bit floating-point representation — commonly referred to as FP16 or half-precision — has gained prominence in recent years due to its efficiency advantages in specialized computing domains.
The FP16 format is defined by the IEEE 754-2008 standard. It encodes a real number using 16 bits, divided into three fields:

- Sign: 1 bit
- Exponent: 5 bits (stored with a bias of 15)
- Fraction (significand): 10 bits
The value represented by an FP16 number is determined as follows:
For normalized numbers (where the exponent field $E$ satisfies $1 \le E \le 30$), the value is $(-1)^s \times 2^{E-15} \times (1 + F/1024)$, where $s$ is the sign bit and $F$ is the 10-bit fraction field read as an integer.

For subnormal numbers (when $E = 0$ and $F \neq 0$), the implicit leading 1 is dropped and the value is $(-1)^s \times 2^{-14} \times (F/1024)$.

For special values: $E = 31$ with $F = 0$ encodes $\pm\infty$, $E = 31$ with $F \neq 0$ encodes NaN, and $E = 0$ with $F = 0$ encodes $\pm 0$.

The exponent field enables a range of approximately $6.1 \times 10^{-5}$ (the smallest normalized value, $2^{-14}$) to $65504$ (the largest finite value), with subnormals extending down to about $6.0 \times 10^{-8}$.
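To make these rules concrete, here is a minimal Python sketch of a decoder; the helper name fp16_decode and the test values are illustrative, not part of any standard API:

```python
def fp16_decode(bits: int) -> float:
    """Decode a raw 16-bit integer as an IEEE 754 half-precision value."""
    s = (bits >> 15) & 0x1     # 1-bit sign
    e = (bits >> 10) & 0x1F    # 5-bit exponent field
    f = bits & 0x3FF           # 10-bit fraction field
    sign = -1.0 if s else 1.0

    if e == 0x1F:              # special values: infinities and NaNs
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                 # zeros and subnormals: no implicit leading 1
        return sign * (f / 1024) * 2.0 ** -14
    return sign * (1 + f / 1024) * 2.0 ** (e - 15)   # normalized numbers


print(fp16_decode(0x3E00))  # 1.5
print(fp16_decode(0x0001))  # smallest subnormal, ~5.96e-08
print(fp16_decode(0x7BFF))  # largest finite value, 65504.0
```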
To encode the decimal number 1.5 in FP16:

1.5 in binary is $1.1_2 \times 2^0$, so the sign bit is 0, the stored exponent is $0 + 15 = 15$, encoded as 01111, and the fraction field is 1000000000 (representing the .1 after the leading implicit 1). The complete bit pattern is:

0 01111 1000000000 (0x3E00)
This demonstrates the normalized encoding format. Casting from FP32 to FP16, by contrast, is often lossy: values can overflow or underflow the narrower exponent range, and excess significand bits must be rounded away.
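A quick NumPy check of the worked example and of the lossy cast (a sketch assuming NumPy is installed; exact warnings and formatting may vary by version):

```python
import numpy as np

# The worked example: 1.5 is represented exactly, with bit pattern 0x3e00.
print(hex(np.float16(1.5).view(np.uint16)))   # 0x3e00

# Overflow: 1e5 exceeds the largest finite FP16 value (65504) and becomes inf.
print(np.float32(1e5).astype(np.float16))     # inf

# Underflow: 1e-8 is below half the smallest subnormal (~5.96e-8) and rounds to 0.
print(np.float32(1e-8).astype(np.float16))    # 0.0

# Rounding of significand bits: FP32's approximation of 0.1 loses further digits.
print(float(np.float16(np.float32(0.1))))     # 0.0999755859375
```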
The resurgence of interest in FP16 arises from its compelling trade-offs in memory footprint, computational throughput, and energy efficiency — all critical considerations in large-scale or resource-constrained environments.
Many modern GPUs (e.g., NVIDIA GPUs with Tensor Cores) and dedicated ML accelerators (e.g., Google TPUs) provide native support for reduced-precision arithmetic, allowing massively parallel computation with lower latency and higher throughput. In machine learning, this enables mixed-precision training, where most operations are performed in FP16 while FP32 master copies of the weights are maintained to mitigate precision loss.
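The pattern is sketched below for PyTorch on a CUDA device; the linear model, random batches, and SGD optimizer are placeholders for illustration only:

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()      # master weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss so FP16 gradients do not underflow
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    inputs = torch.randn(32, 1024, device="cuda")         # stand-in for a real batch
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)             # eligible ops run in FP16

    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then updates the FP32 weights
    scaler.update()
```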
FP16 consumes only 2 bytes per value, compared to 4 bytes for FP32 and 8 bytes for FP64. This reduction significantly decreases memory bandwidth usage, enabling larger batch sizes or models to be processed in the same memory space—a major advantage in deep learning training and inference pipelines.
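The footprint difference is easy to confirm directly; the array length below is arbitrary:

```python
import numpy as np

n = 1_000_000  # one million values, chosen only for illustration
for dtype in (np.float16, np.float32, np.float64):
    arr = np.zeros(n, dtype=dtype)
    print(f"{arr.dtype}: {arr.itemsize} bytes/value, {arr.nbytes / 1e6:.0f} MB total")
```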
Lower-precision arithmetic needs simpler hardware (fewer transistors per arithmetic unit) and moves less data, both of which reduce power consumption. This is particularly valuable in mobile and embedded systems where battery life is critical.
Despite its advantages, FP16 should be applied with care due to its narrow dynamic range and limited precision. The reduced exponent width (5 bits) means that FP16 cannot represent very large or very small values. In practical terms, underflow and overflow become significantly more likely, particularly in algorithms involving accumulations or divisions.
For instance, the smallest normalized value is $2^{-14} \approx 6.1 \times 10^{-5}$ and the largest finite value is 65504; intermediate results outside this window overflow to infinity or underflow into the subnormal range and, eventually, to zero.
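The sketch below (NumPy assumed, values chosen for illustration) shows these limits, the loss of small increments during accumulation, and overflow of a modest sum:

```python
import numpy as np

info = np.finfo(np.float16)
print(info.tiny, info.max)   # smallest normalized value (~6.1e-05) and largest finite value (65504)

# Precision loss: at magnitude 2048 the spacing between FP16 values is 2,
# so adding 1.0 has no effect at all.
print(np.float16(2048.0) + np.float16(1.0))   # 2048.0

# Overflow during accumulation: the FP16 running sum exceeds 65504 and becomes inf
# (NumPy may emit an overflow RuntimeWarning here).
v = np.full(20000, 10.0, dtype=np.float16)
print(v.sum(dtype=np.float16))                # inf
```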
In many scientific and engineering applications, FP16 is unsuitable as a standalone representation. Instead, it is often paired with higher-precision formats in hybrid computation schemes, such as FP16 arithmetic with FP32 accumulation, to reduce error propagation. Selective-precision strategies, such as loss scaling and keeping critical quantities (e.g., gradients, loss values) in FP32, further limit degradation.
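A sketch of the accumulation idea in NumPy (the vector length and random data are arbitrary): the elementwise products are computed in FP16, but the running sum is carried in a wider format.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100_000).astype(np.float16)    # values in [0, 1)
b = rng.random(100_000).astype(np.float16)

products = a * b                              # elementwise FP16 arithmetic
fp16_sum = products.sum(dtype=np.float16)     # every partial sum rounded to FP16
fp32_sum = products.sum(dtype=np.float32)     # partial sums carried in FP32

reference = (a.astype(np.float64) * b.astype(np.float64)).sum()
# Near a total of ~25000 the FP16 spacing is 16, so the FP16 accumulator cannot
# even resolve differences smaller than 16; the FP32 accumulator stays close to
# the float64 reference.
print(fp16_sum, fp32_sum, reference)
```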
Support for FP16 is now widely available across modern platforms:
- Python: numpy.float16, struct, and external IEEE 754 conversion libraries.
- Deep learning frameworks: PyTorch (torch.float16, torch.autocast) and TensorFlow (tf.float16, tf.keras.mixed_precision).

In many of these systems, FP16 support is tightly integrated with kernel-level optimizations and vectorized operations.
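For example, CPython's struct module has included the half-precision format code 'e' since Python 3.6, and NumPy exposes the type as numpy.float16; a small sketch:

```python
import struct
import numpy as np

# Pack a Python float into two bytes of IEEE 754 half precision and back.
raw = struct.pack("<e", 65504.0)              # largest finite FP16 value
print(raw.hex())                              # ff7b: little-endian bytes of 0x7bff
print(struct.unpack("<e", raw)[0])            # 65504.0

# The same bit pattern via NumPy (native byte order, little-endian on most platforms).
print(np.float16(65504.0).tobytes().hex())    # ff7b
```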