Number Representations & States

"how numbers are stored and used in computers"

Regularization

Regularization improves generalization by constraining the hypothesis space. It's especially critical in overparameterized settings, where many different parameter configurations can perfectly fit the training data, but only a subset of them will generalize well.

Regularization refers to a set of techniques used to prevent overfitting by adding a penalty term to the loss function. This discourages the model from fitting the training data too closely by controlling the complexity of the model, particularly the size of its parameters.

Standard loss function

Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$ and a prediction function $f_\theta(x)$ parameterized by $\theta$, we aim to minimize a loss function $\mathcal{L}(\theta)$. In regression, we often use the Mean Squared Error (MSE):

$$\mathcal{L}_{\text{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2$$
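To make this concrete, here is a minimal NumPy sketch of the MSE loss for a linear model $f_\theta(x) = \theta^\top x$; the function name mse_loss and the toy dataset are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean Squared Error of a linear model f_theta(x) = x @ theta."""
    residuals = X @ theta - y        # prediction errors, shape (N,)
    return np.mean(residuals ** 2)   # average squared error over the dataset

# Toy dataset: N = 100 examples, 3 features, noisy linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
print(mse_loss(theta, X, y))  # loss of the all-zero parameter vector
```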

With regularization, we modify the loss to include a penalty term $\lambda R(\theta)$:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}_{\text{MSE}}(\theta) + \lambda R(\theta)$$

Where $\lambda$ is the regularization strength, a hyperparameter that controls the trade-off between data fit and parameter size, and $R(\theta)$ is a regularization function, commonly based on a norm of $\theta$.
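Continuing the sketch above, the regularized loss simply adds the $\lambda R(\theta)$ penalty to the data-fit term; here $R$ is assumed to be the (halved) squared L2 norm, matching the update rule used later in this section:

```python
def l2_penalty(theta):
    """R(theta) = 0.5 * ||theta||_2^2, a common choice of regularizer."""
    return 0.5 * np.sum(theta ** 2)

def regularized_loss(theta, X, y, lam):
    """Data-fit term plus lambda-weighted penalty term."""
    return mse_loss(theta, X, y) + lam * l2_penalty(theta)

print(regularized_loss(theta, X, y, lam=0.01))
```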

Regularization techniques

Impact

The addition of a regularization term modifies the gradient during optimization. For example, in gradient descent with L2 regularization, taking $R(\theta) = \tfrac{1}{2}\|\theta\|_2^2$, the weight update becomes:

$$\theta \leftarrow \theta - \eta \left( \nabla_\theta \mathcal{L}_{\text{MSE}}(\theta) + \lambda \theta \right)$$

where $\eta$ is the learning rate.

Here, the $\lambda \theta$ term acts like a "friction" that resists the growth of the parameters. A range of other techniques can achieve the same outcome, or can optimize model weights for a particular floating-point format.
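The update rule above can be sketched as a small gradient-descent loop reusing the toy data from earlier; the learning rate, regularization strength, and step count are arbitrary illustrative choices:

```python
def grad_mse(theta, X, y):
    """Gradient of the MSE term: (2/N) * X^T (X @ theta - y)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ theta - y)

def gradient_descent(X, y, lam=0.01, lr=0.1, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # L2 regularization adds lam * theta to the gradient (the "friction" term).
        grad = grad_mse(theta, X, y) + lam * theta
        theta -= lr * grad
    return theta

theta_hat = gradient_descent(X, y)
print(theta_hat)  # close to true_theta, shrunk slightly toward zero by the penalty
```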