"how numbers are stored and used in computers"
This is a guide for those that are curious about the precise cost of training a large language model (LLM) or the memory requirements for hosting one.
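As a toy illustration of the kind of accounting this guide is about (the 70B parameter count and the dtype choices below are assumptions for the example, not figures from any particular model), the dominant term in serving memory is simply parameter count times bytes per parameter:

```python
# Back-of-the-envelope weight memory for hosting a model.
# The 70B parameter count and the dtype list are illustrative assumptions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in BYTES_PER_PARAM:
    print(f"70B params in {dtype:>8}: {weight_memory_gib(70e9, dtype):7.1f} GiB")
# float32 -> ~260.8 GiB, bfloat16 -> ~130.4 GiB, int8 -> ~65.2 GiB
```

Real deployments also need memory for activations, KV caches, and optimizer state during training, but the weights alone already show why the choice of number format matters.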
Many of these concepts are taken from Scaling LLMs.
Promising model architectures routinely fail either because they can’t run efficiently at scale or because no one puts in the work to make them do so.
The goal of "model scaling" is to increase the number of chips used for training or inference while achieving a proportional, linear increase in throughput; this is known as “strong scaling”.
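A minimal sketch of what "proportional" means here (the per-chip throughput number is a made-up assumption): under perfect strong scaling, doubling the chip count doubles total throughput, and scaling efficiency measures how far a real system falls short.

```python
# Strong scaling: ideal throughput grows linearly with chip count.
# per_chip_tokens_per_s is an illustrative number, not a measurement.
def ideal_throughput(n_chips: int, per_chip_tokens_per_s: float = 1_000.0) -> float:
    return n_chips * per_chip_tokens_per_s

def scaling_efficiency(measured_tokens_per_s: float, n_chips: int,
                       per_chip_tokens_per_s: float = 1_000.0) -> float:
    """1.0 means perfect strong scaling; real systems fall below this
    as communication overhead grows with chip count."""
    return measured_tokens_per_s / ideal_throughput(n_chips, per_chip_tokens_per_s)

# E.g. 8 chips delivering 7,200 tokens/s against an ideal 8,000 tokens/s:
print(scaling_efficiency(measured_tokens_per_s=7_200.0, n_chips=8))  # 0.9
```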