"how numbers are stored and used in computers"
In PostgreSQL, indexes are auxiliary data structures used to speed up data retrieval operations. By maintaining separate structures that map key values to physical row locations (TIDs), indexes allow the query planner to avoid full table scans and instead locate matching rows more efficiently. PostgreSQL supports multiple index types, each optimized for different workloads and query patterns, with B-tree being the most common and versatile.
Indexes are automatically used by the planner based on query predicates, column statistics, and cost estimates. However, effective index usage depends heavily on schema design, workload characteristics, and query patterns.
The B-tree index is the default and most widely used index type in PostgreSQL. It maintains a balanced tree where keys are stored in sorted order, allowing efficient logarithmic-time lookups, range queries, and ordered scans. B-tree indexes also support equality (=
), range comparisons (<
, <=
, >
, >=
), and prefix matching for text types.
Internally, a B-tree index is structured as a hierarchy of pages: root, internal, and leaf nodes. Leaf nodes store key values alongside TID
s (tuple IDs) that point to the corresponding rows in the heap. Pages are automatically split as they fill up, maintaining balance and preserving order.
PostgreSQL B-tree indexes support multicolumn indexes, partial indexes (with WHERE
clauses), and unique constraints.
Each index is stored in its own physical file on disk, separate from the heap. Index pages, like heap pages, are typically 8KB in size and managed by the same buffer and write-ahead logging (WAL) infrastructure. Each index entry includes:
ctid
), encoded as a block number and offset.For multicolumn indexes, values are stored in the order specified at index creation. This impacts scan performance and determines which types of queries can benefit from index usage.
Unlike some MVCC-based systems with "index-only" tuple storage, PostgreSQL heaps and indexes are decoupled. Indexes do not store visibility metadata, so a lookup in an index does not guarantee that the pointed-to tuple is visible to the current transaction. This is why heap access is typically required to check visibility during a standard index scan.
PostgreSQL indexes must be manually kept in sync with the heap. Any INSERT
, DELETE
, or UPDATE
operation on a table results in corresponding changes to the relevant index(es). These updates are WAL-logged to ensure durability, and are handled transactionally.
Updates that change indexed columns must delete the old index entry and insert a new one. This contributes to index bloat over time, which can be mitigated using REINDEX
or VACUUM
.
PostgreSQL supports several specialized index types beyond B-tree, each with its own internal logic and applicable use cases.
Choosing the right index type is crucial. GIN, for example, is excellent for containment operations (@>
), while BRIN offers performance on huge append-only tables at a fraction of the storage cost.
PostgreSQL supports index-only scans, where the index itself can satisfy a query without visiting the heap, provided that all needed columns are present in the index and the heap page is marked all-visible in the visibility map.
To facilitate this, developers can use the INCLUDE
clause to add non-key columns to a B-tree index, allowing it to act as a covering index for a broader range of queries.
code.txt1CREATE INDEX idx_covering ON logs (user_id) INCLUDE (status, created_at);
Good index design involves tradeoffs. Indexes speed up reads but impose overhead on writes (INSERT
/UPDATE
/DELETE
). Each additional index must be updated during data changes, and unused indexes consume disk and memory resources. Effective indexing strategy involves:
pg_stat_user_indexes
and pg_stat_all_indexes
Developers should also monitor index bloat, which can accumulate over time and degrade performance. Tools like pgstattuple
, pg_repack
, or REINDEX
can help analyze and mitigate index fragmentation.