UTF-8: Unicode Transformation Format - 8-bit

UTF-8 is a variable-width character encoding that can represent any character in the Unicode standard. It was designed by Ken Thompson and Rob Pike in 1992 and has become the dominant character encoding for the World Wide Web, accounting for over 98% of all web pages.

History and Development

UTF-8 was developed as a solution to the problem of representing Unicode characters in a way that was both efficient and backward compatible with ASCII. The encoding was designed to be self-synchronizing, meaning that it's possible to start reading from any point in a UTF-8 sequence and correctly identify the start of the next character.

The design of UTF-8 was influenced by the need to maintain compatibility with existing ASCII-based systems while providing support for the full range of Unicode characters. This backward compatibility was crucial for its adoption, as it allowed systems to gradually transition from ASCII to Unicode without breaking existing functionality.

Technical Details

UTF-8 uses a variable number of bytes to represent each character, with the number of bytes depending on the character's Unicode code point. The encoding scheme is as follows:

Single-byte characters (0-127): Identical to ASCII; the byte starts with 0 and uses the same byte values
Two-byte characters (128-2047): First byte starts with 110, second byte starts with 10
Three-byte characters (2048-65535): First byte starts with 1110, the two following bytes start with 10; the surrogate range 55296-57343 (U+D800 to U+DFFF) is excluded and must not be encoded
Four-byte characters (65536-1114111): First byte starts with 11110, the three following bytes start with 10
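
The ranges and bit prefixes above can be turned into a small encoder. The following is a minimal sketch, not a reference implementation; the function name encode_code_point is a hypothetical helper introduced here for illustration. In practice, Python's built-in str.encode("utf-8") does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:
        # 0-127: one byte, 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 128-2047: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 2048-65535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 65536-1114111: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The euro sign U+20AC falls in the three-byte range.
print(encode_code_point(0x20AC))  # b'\xe2\x82\xac'
```

Comparing the result against chr(cp).encode("utf-8") for a few code points is a quick way to check the sketch.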

This design ensures that:

ASCII characters (0-127) are represented by a single byte
The first byte of a multi-byte sequence indicates how many bytes follow
All bytes after the first in a multi-byte sequence start with 10, making it easy to identify continuation bytes
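
These properties mean a decoder can classify any byte by looking at its high bits alone. A minimal sketch, assuming two hypothetical helper names (sequence_length and is_continuation); it checks only the bit prefix and does not reject lead bytes that can never appear in valid UTF-8, such as 0xC0, 0xC1, or 0xF5 and above.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total length of the sequence that starts with first_byte."""
    if first_byte & 0x80 == 0x00:   # 0xxxxxxx: ASCII, single byte
        return 1
    if first_byte & 0xE0 == 0xC0:   # 110xxxxx: one continuation byte follows
        return 2
    if first_byte & 0xF0 == 0xE0:   # 1110xxxx: two continuation bytes follow
        return 3
    if first_byte & 0xF8 == 0xF0:   # 11110xxx: three continuation bytes follow
        return 4
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True if the byte has the 10xxxxxx continuation prefix."""
    return byte & 0xC0 == 0x80


data = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
print(sequence_length(data[0]))      # 2
print(is_continuation(data[1]))      # True
```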

Advantages

UTF-8 has several key advantages that have contributed to its widespread adoption:

Backward Compatibility: Any valid ASCII text is also valid UTF-8 text
Self-Synchronizing: It's possible to find the start of a character from any point in the text
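Efficiency: ASCII characters take one byte each, and every character must be encoded in the shortest form that fits it; characters outside ASCII take two to four bytes
No Byte Order Issues: UTF-8 is defined as a sequence of single bytes, so unlike UTF-16 and UTF-32 it has no big-endian and little-endian variants and doesn't need a byte order mark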
Wide Support: It's supported by virtually all modern programming languages and systems
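
Two of these advantages are easy to demonstrate. The sketch below first checks that ASCII text produces the same bytes under both encodings, then shows self-synchronization by skipping continuation bytes until a character boundary is found. The function name next_char_start is a hypothetical helper, and the input is assumed to be valid UTF-8.

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Backward compatibility: ASCII text is byte-for-byte identical in UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")

# Self-synchronization: "a€b" is 61 E2 82 AC 62 in hex.
data = "a€b".encode("utf-8")
start = next_char_start(data, 2)     # index 2 is in the middle of the euro sign
print(start)                         # 4
print(data[start:].decode("utf-8"))  # b
```

At most three bytes need to be skipped, because a character has at most three continuation bytes.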

Implementation Considerations

When implementing UTF-8, several important considerations must be taken into account:

Validation: Ensuring that byte sequences are valid UTF-8
String Length: Counting characters rather than bytes
Substring Operations: Handling multi-byte characters correctly
Sorting: Implementing proper collation for different languages
Display: Handling characters that may require special rendering
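
The string length and substring points can be illustrated on raw bytes. This is a minimal sketch with two hypothetical helpers, count_code_points and truncate_utf8, assuming the input is already valid UTF-8 and max_bytes is non-negative. Note that it counts code points; a character as the user perceives it may consist of several code points, for example a letter followed by a combining accent.

```python
def count_code_points(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a multi-byte character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the cut would land on a continuation byte, back up to the lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


data = "héllo".encode("utf-8")       # "é" takes two bytes
print(len(data))                     # 6
print(count_code_points(data))       # 5
print(truncate_utf8(data, 2))        # b'h'
print(data[:2])                      # b'h\xc3' (naive slice splits the "é")
```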

Common Issues

Despite its advantages, UTF-8 can present some challenges:

String Length: The number of bytes may not equal the number of characters
Random Access: Finding the nth character requires scanning from the start, because characters vary in width
Validation: Invalid byte sequences must be handled appropriately
Performance: Processing multi-byte characters can be slower than single-byte encodings
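
The random access cost can be seen in a short sketch: to locate a character by its index, the code has to walk the bytes and count character starts. The function name byte_offset_of is a hypothetical helper, and the input is assumed to be valid UTF-8.

```python
def byte_offset_of(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th character (0-based) by linear scan."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:       # not a continuation byte: a character starts here
            if seen == n:
                return i
            seen += 1
    raise IndexError(f"no character at index {n}")


data = "a€b".encode("utf-8")         # 61 E2 82 AC 62 in hex
print(byte_offset_of(data, 1))       # 1
print(byte_offset_of(data, 2))       # 4 (the euro sign occupies bytes 1 to 3)
```

The scan is linear in the length of the text. Code that needs repeated indexed access may prefer to decode once into a fixed-width representation, or to keep a table of offsets.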

Best Practices

When working with UTF-8, it's important to follow these best practices:

Always validate input: Check that byte sequences are valid UTF-8
Use appropriate string functions: Avoid byte-based operations that might split characters
Handle errors gracefully: Provide appropriate error handling for invalid sequences
Consider performance: Use optimized UTF-8 processing libraries when available
Document encoding: Clearly specify UTF-8 encoding in file headers and protocols
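
The validation and error handling practices might look like this in Python, using the standard bytes.decode method. The wrapper names is_valid_utf8 and decode_lossy are hypothetical; whether to reject invalid input or substitute the replacement character U+FFFD depends on the application.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")         # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def decode_lossy(data: bytes) -> str:
    """Decode, substituting U+FFFD for bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


print(is_valid_utf8("héllo".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc0\x80"))              # False (overlong encoding of U+0000)
print(is_valid_utf8(b"\xe2\x82"))              # False (truncated three-byte sequence)
print(decode_lossy(b"ok\xff") == "ok\ufffd")   # True
```

When reading or writing text files, passing encoding="utf-8" explicitly to open() avoids depending on the platform default.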

References

Yergeau, F. (2003). "UTF-8, a transformation format of ISO 10646". RFC 3629.
Unicode Consortium. (2022). "The Unicode Standard".