"how numbers are stored and used in computers"
While the IEEE 754 single-precision (32-bit) and double-precision (64-bit) formats are familiar to most developers, the 16-bit floating-point representation — commonly referred to as FP16 or half-precision — has gained prominence in recent years due to its efficiency advantages in specialized computing domains.
In 2018, a paper titled Mixed Precision Training, authored by researchers from NVIDIA and Baidu Research, introduced a methodology for training deep neural networks using half-precision (FP16) floating-point numbers. The approach aimed to reduce memory usage and accelerate computation without compromising model accuracy or requiring changes to hyperparameters.
Micikevicius, P., Narang, S., et al. (2018). Mixed Precision Training. ICLR 2018.

Modern deep learning training systems use the single-precision (FP32) format. In this paper, we address training with reduced precision while maintaining model accuracy. Specifically, we train various neural networks using IEEE half-precision format (FP16). Since FP16 format has a narrower dynamic range than FP32, we introduce three techniques to prevent model accuracy loss: maintaining a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros, and FP16 arithmetic with accumulation in FP32. Using these techniques we demonstrate that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training. Experimental results include convolutional and recurrent network architectures, trained for classification, regression, and generative tasks. Applications include image classification, image generation, object detection, language modeling, machine translation, and speech recognition. The proposed methodology requires no changes to models or training hyper-parameters.
The FP16 format is defined by the IEEE 754-2008 standard. It encodes a real number using 16 bits, divided into three fields:

- Sign ($S$): 1 bit
- Exponent ($E$): 5 bits, stored with a bias of 15
- Significand/mantissa ($M$): 10 bits, with an implicit leading 1 for normalized numbers
The value represented by an FP16 number is determined as follows:

For normalized numbers (where $0 < E < 31$):

$$(-1)^S \times 2^{E-15} \times \left(1 + \frac{M}{1024}\right)$$

For subnormal numbers (when $E = 0$ and $M \neq 0$):

$$(-1)^S \times 2^{-14} \times \frac{M}{1024}$$

For special values: $E = 31$ with $M = 0$ encodes $\pm\infty$; $E = 31$ with $M \neq 0$ encodes NaN; $E = 0$ with $M = 0$ encodes signed zero.

The exponent field enables a range from approximately $2^{-14} \approx 6.1 \times 10^{-5}$ (smallest normalized value) up to $65504$ (largest finite value).
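To make these rules concrete, here is a small Python sketch that decodes a raw 16-bit pattern according to the cases above. The helper decode_fp16 is ours, purely for illustration, and is not part of any library:

```python
def decode_fp16(bits: int) -> float:
    """Decode a raw 16-bit pattern as an IEEE 754 half-precision value."""
    S = (bits >> 15) & 0x1     # sign bit
    E = (bits >> 10) & 0x1F    # 5-bit biased exponent
    M = bits & 0x3FF           # 10-bit significand field

    sign = -1.0 if S else 1.0
    if E == 0x1F:                                   # special values
        return sign * float("inf") if M == 0 else float("nan")
    if E == 0:                                      # zero and subnormals
        return sign * 2.0**-14 * (M / 1024)
    return sign * 2.0**(E - 15) * (1 + M / 1024)    # normalized numbers


print(decode_fp16(0x3E00))   # 1.5
print(decode_fp16(0x0001))   # ~5.96e-08, the smallest subnormal
print(decode_fp16(0x7BFF))   # 65504.0, the largest finite value
```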
To encode the decimal number 1.5 in FP16:

1.5 in binary is $1.1_2 \times 2^0$, so the sign bit is 0, the stored exponent is $0 + 15 = 15$, encoded as 01111, and the significand field is 1000000000 (representing the .1 after the leading implicit 1). The complete bit pattern is:

0 01111 1000000000

which is 0x3E00 in hexadecimal.
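As a quick sanity check, the same bit pattern can be recovered with numpy (assuming it is installed) by reinterpreting the stored bytes as an integer:

```python
import numpy as np

x = np.array(1.5, dtype=np.float16)
bits = x.view(np.uint16)           # reinterpret the same 16 bits as an unsigned integer
print(f"0x{int(bits):04X}")        # prints 0x3E00
```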
The example above uses the normalized encoding. Lossy conversion is often necessary when casting from FP32 to FP16, due to overflow, underflow, or truncation of significand bits.
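The short numpy sketch below illustrates all three failure modes when casting down to FP16:

```python
import numpy as np

print(np.float16(1e5))           # inf -- overflow: 100000 exceeds the FP16 maximum of 65504
print(np.float16(1e-8))          # 0.0 -- underflow: below the smallest FP16 subnormal (~6e-8)
print(float(np.float16(0.1)))    # 0.0999755859375 -- only 10 significand bits survive the cast
```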
The resurgence of interest in FP16 arises from its compelling trade-offs in memory footprint, computational throughput, and energy efficiency — all critical considerations in large-scale or resource-constrained environments.
Many modern GPUs (e.g., NVIDIA's Tensor Cores) provide native support for FP16 arithmetic, and dedicated ML accelerators such as Google's TPUs offer similar reduced-precision arithmetic via the related bfloat16 format, allowing for massively parallel computation with lower latency and higher throughput. In machine learning, this enables mixed-precision training, where most operations are performed in FP16 while master copies of the weights are maintained in FP32 to mitigate precision loss.
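A minimal PyTorch sketch of this pattern, assuming a CUDA-capable GPU; the linear model, layer sizes, and random data are placeholders of ours, not anything prescribed by the framework:

```python
import torch
from torch import nn

# Toy model and data; the weights (and optimizer state) stay in FP32 as the master copy.
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()          # loss scaling keeps small gradients representable in FP16

x = torch.randn(32, 1024, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), target)   # forward pass runs largely in FP16
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then updates the FP32 weights
    scaler.update()
```

The weights remain in FP32 throughout, while torch.autocast runs eligible operations in FP16 and the gradient scaler applies loss scaling so that small gradients survive the narrower dynamic range.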
FP16 consumes only 2 bytes per value, compared to 4 bytes for FP32 and 8 bytes for FP64. This reduction significantly decreases memory bandwidth usage, enabling larger batch sizes or models to be processed in the same memory space—a major advantage in deep learning training and inference pipelines.
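A quick numpy illustration of the per-value footprint (assuming numpy is available):

```python
import numpy as np

n = 1_000_000
for dtype in (np.float16, np.float32, np.float64):
    a = np.ones(n, dtype=dtype)
    print(a.dtype, a.nbytes // 1_000_000, "MB")   # float16 2 MB, float32 4 MB, float64 8 MB
```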
Arithmetic at lower precision requires fewer transistors to execute and less data movement, both of which reduce power consumption. This is particularly valuable in mobile and embedded systems where battery life is critical.
Despite its advantages, FP16 should be applied with care due to its narrow dynamic range and limited precision. The reduced exponent width (5 bits) means that FP16 cannot represent very large or very small values. In practical terms, underflow and overflow become significantly more likely, particularly in algorithms involving accumulations or divisions.
For instance, the smallest normalized value is $2^{-14} \approx 6.1 \times 10^{-5}$ and the largest finite value is 65504; results outside this window overflow to infinity or underflow toward zero (subnormals extend the low end only to about $6.0 \times 10^{-8}$, with rapidly shrinking precision).
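These limits are easy to probe with numpy:

```python
import numpy as np

info = np.finfo(np.float16)
print(info.tiny, info.max)                      # ~6.104e-05 (smallest normal), 65504.0 (largest finite)

# numpy may also emit an overflow RuntimeWarning for the first product below.
print(np.float16(300.0) * np.float16(300.0))    # inf -- the product 90000 overflows
print(np.float16(1e-4) * np.float16(1e-4))      # 0.0 -- the product ~1e-8 underflows past the subnormals
```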
In many scientific and engineering applications, FP16 is unsuitable as a standalone representation. Instead, it is often paired with higher-precision formats in hybrid computation schemes, such as FP16 arithmetic with FP32 accumulation, to reduce error propagation. Careful casting strategies, such as keeping FP32 master copies and applying loss scaling, are also used to minimize degradation of critical quantities (e.g., gradients, loss values).
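A toy numpy sketch of the hybrid idea; the workload, summing 10,000 copies of 0.01, is purely illustrative:

```python
import numpy as np

values = np.full(10_000, 0.01, dtype=np.float16)   # inputs stored compactly in FP16 (2 bytes each)

# Naive running sum kept in FP16: the accumulator stalls far below the true total of ~100.
acc16 = np.float16(0.0)
for v in values:
    acc16 = acc16 + v
print(acc16)                                       # 32.0 -- adding 0.01 to 32 no longer changes the FP16 sum

# Same FP16 inputs, accumulated in FP32 (the hybrid scheme described above).
acc32 = np.float32(0.0)
for v in values:
    acc32 = acc32 + np.float32(v)
print(acc32)                                       # ~100.02 -- only the FP16 storage error of 0.01 remains
```

Storing the inputs in FP16 keeps the memory savings, while the FP32 accumulator absorbs the rounding error of the long reduction.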
Support for FP16 is now widely available across modern platforms:
- Python: numpy.float16, struct, and external IEEE 754 conversion libraries.
- Deep learning frameworks: PyTorch (torch.float16, torch.autocast) and TensorFlow (tf.float16, tf.keras.mixed_precision).

In many of these systems, FP16 support is tightly integrated with kernel-level optimizations and vectorized operations.