"how numbers are stored and used in computers"
Dropout is a stochastic technique where individual neurons are randomly deactivated (set to zero) during training, each with a fixed probability $p$ (the dropout rate).
During training, given a layer's vector of activations $\mathbf{y}$ (equivalently, the weights leaving its units), dropout multiplies each element by an independent Bernoulli mask: $\tilde{\mathbf{y}} = \mathbf{m} \odot \mathbf{y}$, where each $m_i \sim \mathrm{Bernoulli}(1-p)$ is resampled at every training step.
Dropout is not applied at inference time; instead, the activations are scaled by $1-p$ so that their expected value matches what the network saw during training.
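As a concrete illustration, here is a minimal NumPy sketch of this behavior. The function name `dropout_forward` is illustrative (not from any particular library), and $p$ denotes the deactivation probability as above.

```python
import numpy as np

def dropout_forward(y, p, training, rng=None):
    """Dropout with deactivation probability p, scaling at inference."""
    rng = rng or np.random.default_rng()
    if training:
        # Keep each unit with probability 1 - p, zero it otherwise.
        mask = (rng.random(y.shape) >= p).astype(y.dtype)
        return y * mask
    # At inference, keep every unit but scale by the retention
    # probability so expected activations match the training regime.
    return y * (1.0 - p)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(y, p=0.5, training=True))   # random units zeroed
print(dropout_forward(y, p=0.5, training=False))  # all units scaled by 0.5
```

Many modern implementations instead use "inverted" dropout, which scales the kept activations by $1/(1-p)$ during training so that no scaling is needed at inference; the two conventions are equivalent in expectation.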
Dropout can be viewed as a form of ensemble learning: at each training step, a different sub-network (a subset of the full model) is trained. The full network used at test time effectively averages the predictions of exponentially many such sub-networks (one for each possible mask over the units). This leads to better generalization and mitigates overfitting, especially in larger networks.
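This averaging claim can be checked numerically for a single linear layer, where the test-time scaling rule reproduces the average over masks exactly (with nonlinearities it is only an approximation). The toy layer sizes and sample count below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer with dropout applied to its input.
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
p = 0.5  # deactivation probability

# Monte Carlo average over many sampled sub-networks (random masks).
n_samples = 200_000
mc_avg = np.zeros(3)
for _ in range(n_samples):
    mask = rng.random(x.shape) >= p
    mc_avg += W @ (x * mask)
mc_avg /= n_samples

# Deterministic test-time forward pass: scale the input by 1 - p.
scaled = W @ (x * (1.0 - p))

print(mc_avg)
print(scaled)
print(np.allclose(mc_avg, scaled, atol=0.05))  # True: the averages agree
```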
Unlike L1/L2 regularization, dropout does not modify the loss function explicitly. However, it introduces noise into the optimization process, which acts as an implicit regularizer. It can reduce co-adaptation between neurons, forcing each one to learn more robust features that don't depend on others being present.
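In practice this is reflected in how dropout is wired into a model: it appears as an ordinary layer and leaves the loss untouched. A short PyTorch sketch is below; note that PyTorch's `nn.Dropout` uses the inverted convention, scaling kept activations by $1/(1-p)$ during training and acting as the identity in evaluation mode, which is equivalent in expectation to the scaling described earlier. The layer sizes here are arbitrary.

```python
import torch
from torch import nn

# Dropout is just another layer; the loss function is unchanged.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # zeroes activations at random in training mode
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()  # no dropout-related penalty term

x = torch.randn(8, 20)
targets = torch.randint(0, 2, (8,))

model.train()                     # dropout active: stochastic forward pass
loss = loss_fn(model(x), targets)
loss.backward()

model.eval()                      # dropout disabled: deterministic outputs
with torch.no_grad():
    predictions = model(x)
```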