Number Representations & States

"how numbers are stored and used in computers"

Regularization

Regularization improves generalization by constraining the hypothesis space. It's especially critical in overparameterized settings, where many different parameter configurations can perfectly fit the training data, but only a subset of them will generalize well.

Regularization refers to a set of techniques used to prevent overfitting by adding a penalty term to the loss function. This discourages the model from fitting the training data too closely by controlling the complexity of the model, particularly the size of its parameters.

Standard loss function

Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$ and a prediction function $f_\theta(x)$ parameterized by $\theta$, we aim to minimize a loss function $\mathcal{L}(\theta)$. In regression, we often use the Mean Squared Error (MSE):

$$\mathcal{L}_{\text{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2$$
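To make this concrete, here is a minimal NumPy sketch of the MSE loss for a linear model $f_\theta(x) = \theta^\top x$; the function name mse_loss and the toy dataset are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean Squared Error of a linear model f_theta(x) = x @ theta."""
    residuals = X @ theta - y        # prediction errors, shape (N,)
    return np.mean(residuals ** 2)   # average squared error over the dataset

# Toy dataset: N = 100 examples, 3 features, noisy linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
print(mse_loss(theta, X, y))  # loss of the all-zero parameter vector
```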

With regularization, we modify the loss to include a penalty term $\lambda R(\theta)$:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}_{\text{MSE}}(\theta) + \lambda R(\theta)$$

Where $\lambda$ is the regularization strength, a hyperparameter that controls the trade-off between data fit and parameter size, and $R(\theta)$ is a regularization function, commonly based on a norm of $\theta$.
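Continuing the sketch above, the regularized loss simply adds the $\lambda R(\theta)$ penalty to the data-fit term; here $R$ is assumed to be the (halved) squared L2 norm, matching the update rule used later in this section:

```python
def l2_penalty(theta):
    """R(theta) = 0.5 * ||theta||_2^2, a common choice of regularizer."""
    return 0.5 * np.sum(theta ** 2)

def regularized_loss(theta, X, y, lam):
    """Data-fit term plus lambda-weighted penalty term."""
    return mse_loss(theta, X, y) + lam * l2_penalty(theta)

print(regularized_loss(theta, X, y, lam=0.01))
```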

Regularization techniques

Impact

The addition of a regularization term modifies the gradient during optimization. For example, in gradient descent with L2 regularization, taking $R(\theta) = \tfrac{1}{2}\|\theta\|_2^2$, the weight update becomes:

$$\theta \leftarrow \theta - \eta \left( \nabla_\theta \mathcal{L}_{\text{MSE}}(\theta) + \lambda \theta \right)$$

where $\eta$ is the learning rate.

Here, the $\lambda \theta$ term acts like a "friction" that resists the growth of the parameters. A range of other techniques can achieve the same outcome, or can optimize model weights for a particular floating-point format.
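The update rule above can be sketched as a small gradient-descent loop reusing the toy data from earlier; the learning rate, regularization strength, and step count are arbitrary illustrative choices:

```python
def grad_mse(theta, X, y):
    """Gradient of the MSE term: (2/N) * X^T (X @ theta - y)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ theta - y)

def gradient_descent(X, y, lam=0.01, lr=0.1, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # L2 regularization adds lam * theta to the gradient (the "friction" term).
        grad = grad_mse(theta, X, y) + lam * theta
        theta -= lr * grad
    return theta

theta_hat = gradient_descent(X, y)
print(theta_hat)  # close to true_theta, shrunk slightly toward zero by the penalty
```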