"how numbers are stored and used in computers"
A hash index in PostgreSQL is a specialized index type used primarily for fast equality comparisons. Unlike B-tree indexes, which support a wide range of comparison operations (e.g., less than, greater than, etc.), hash indexes are designed for use cases where only equality (=
) queries are required. They are most effective when queries frequently filter on a single column using equality conditions.
Historically, hash indexes were not WAL-logged (Write-Ahead Logging) and could be corrupted in the event of a crash. However, starting with PostgreSQL 10, hash indexes are WAL-logged and therefore crash-safe, which significantly improves their reliability and suitability for production workloads.
To create a hash index, use the CREATE INDEX
statement with the USING HASH
clause.
code.txt1CREATE INDEX idx_hash_email ON users USING HASH (email);
This index will now optimize queries that filter on the email
column using equality.
code.txt1SELECT * FROM users WHERE email = 'example@example.com';
In a hash index, each key value is passed through a hash function, which produces a fixed-size value called a hash value. This hash value is then used to determine the location of the data in the index. A good hash function should spread its hash values evenly across a relatively uniform set of slots in the index, allowing for quick lookups.
When a query is executed that uses a hash index, PostgreSQL applies the hash function to the search key and looks up the corresponding hash value in the index. If the hash value matches, it retrieves the tuple identifier (TID
) for the corresponding row in the table.
In terms of internal structure, hash indexes are organized as buckets where the hash values are stored, and each bucket contains a list of TIDs pointing to the actual rows in the heap. Hash indexes are typically faster for equality lookups than B-tree indexes because they avoid the need for comparisons between values and instead use direct hash-based lookups.
The hash function used by PostgreSQL's hash index is designed to distribute values uniformly across the hash table, minimizing the likelihood of "collisions" (when different input values hash to the same bucket). The goal is to ensure that index lookups remain efficient, with a relatively small number of keys per bucket, and that access time remains constant, ideally O(1).
Internally, a hash index is stored as a hashtable where each entry contains a hash value corresponding to a row's key, and a list of tuple IDs (TIDs) pointing to rows in the heap that match the hashed value. Hash index buckets are essentially linked lists, and each bucket can store multiple TIDs in case of hash collisions, though PostgreSQL tries to minimize these occurrences.
Before PostgreSQL 10, hash indexes were not WAL-logged, meaning they were not crash-safe. If the database crashed before an index operation was written to WAL, the hash index could become corrupted, leading to data inconsistency.
Starting with PostgreSQL 10, hash indexes are WAL-logged, which ensures that updates to the index are durable and consistent, even in the event of a crash. This improvement makes hash indexes safe for use in production environments, where durability is essential.
Hash indexes are particularly effective for equality queries on columns where you don't need to perform any ordering or range comparisons. For example, hash indexes work well with exact match lookups such as WHERE column = value
, and primary keys or unique constraints on columns where equality is the only relevant operation.
However, hash indexes are not suitable for range queries (<
, >
, BETWEEN
), as they only support equality comparisons. Nor are they suitable for sorting operations, since hash indexes do not maintain any specific ordering of the indexed values.
The only way to use a hash index in a multi-column query is if the hashed column is the only one used for equality filtering.
Hash indexes are efficient for equality queries because they allow direct access to the corresponding bucket without the need for traversing an ordered structure, as is the case with B-tree indexes. This can result in faster lookups for specific types of queries.
However, hash indexes have some limitations:
While both hash indexes and B-tree indexes are used for speeding up query performance, they have key differences:
<
, >
, etc.), and prefix matching. They are generally more versatile and suitable for a broader range of use cases.In practice, hash indexes can be an excellent choice for simple lookups where only equality comparisons are needed, especially when query performance is a critical concern.
While hash indexes can improve performance for specific queries, they come with some trade-offs:
ORDER BY
, GROUP BY
, or other operations that require sorted data.Thus, hash indexes are best used for highly specific cases where equality lookups dominate the workload, and range queries are not a concern.