
ClickHouse

ClickHouse is a high-performance, column-oriented database optimized for real-time analytics on large datasets.

This is a deep dive into the internals of ClickHouse. It traces a learner's path through the source code to show how the core components work. We'll explore the data structures, algorithms, and design principles that make it one of the fastest open-source OLAP databases. Whether you're a systems engineer, database developer, or open-source contributor, the goal is to give you a detailed understanding of how ClickHouse operates from the inside out.

Architecture

Introduction to ClickHouse Architecture
- High-level overview of ClickHouse
- Core principles and design goals
- Key components and how they interact
- Build system and codebase structure
Storage Engine Internals
- MergeTree family overview
- How data is written and stored on disk
- Indexing strategies
- Code walkthrough: StorageMergeTree.cpp, MergeTreeDataPart.cpp
Query Execution Pipeline
- Parsing, planning, and execution
- The pipeline of stages: AST → Interpreter → Plan → Pipeline
- Code walkthrough: InterpreterSelectQuery.cpp, QueryPipeline.cpp
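
To make the staged flow concrete before reading the real sources, here is a minimal sketch of a query moving from a parsed form, to a logical plan, to a chain of executable processors. Every name in it (ToyAst, ToyPlanStep, toyBuildPipeline, and so on) is hypothetical and invented for illustration; none of them are ClickHouse classes, and parsing is faked by constructing the AST directly.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical toy types; these are NOT ClickHouse classes.
struct ToyAst { std::string aggregate; int min_value; };   // parsed form of the query
struct ToyPlanStep { std::string name; int argument; };    // one logical step
using ToyPlan = std::vector<ToyPlanStep>;
using ToyChunk = std::vector<int>;                          // one column of values
using ToyProcessor = std::function<ToyChunk(const ToyChunk &)>;

// Parsing is faked: this stands in for `SELECT sum(x) FROM t WHERE x >= min_value`.
ToyAst toyParse(int min_value) { return ToyAst{"sum", min_value}; }

// AST -> logical plan: filter first, then aggregate.
ToyPlan toyBuildPlan(const ToyAst & ast)
{
    ToyPlan plan;
    plan.push_back(ToyPlanStep{"filter", ast.min_value});
    plan.push_back(ToyPlanStep{ast.aggregate, 0});
    return plan;
}

// Logical plan -> executable processors, one per step.
std::vector<ToyProcessor> toyBuildPipeline(const ToyPlan & plan)
{
    std::vector<ToyProcessor> pipeline;
    for (const auto & step : plan)
    {
        if (step.name == "filter")
        {
            const int min_value = step.argument;
            pipeline.push_back([min_value](const ToyChunk & in) {
                ToyChunk out;
                for (int v : in)
                    if (v >= min_value)
                        out.push_back(v);
                return out;
            });
        }
        else if (step.name == "sum")
        {
            pipeline.push_back([](const ToyChunk & in) {
                return ToyChunk{std::accumulate(in.begin(), in.end(), 0)};
            });
        }
    }
    return pipeline;
}

// Execution: push one chunk through every processor in order.
ToyChunk toyExecute(const std::vector<ToyProcessor> & pipeline, ToyChunk chunk)
{
    assert(!pipeline.empty() && "a pipeline needs at least one processor");
    for (const auto & processor : pipeline)
        chunk = processor(chunk);
    return chunk;
}
```

The point of the separation is that each stage can be inspected or rewritten on its own: the plan can be reordered before any processor exists, and the processors know nothing about query syntax.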
Data Compression and Encoding
- Supported codecs and when they’re used
- How compression interacts with performance
- Code walkthrough: CompressionCodec.cpp, CompressedReadBuffer.cpp
Vectorized Execution Engine
- Columnar processing model
- Vectorized operations on blocks
- Code walkthrough: IProcessor.h, Block.cpp, Column*.cpp
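
As a warm-up for the columnar model, the following sketch shows what "vectorized operations on blocks" means in the simplest terms: each function call handles a whole column in one tight loop instead of dispatching once per row. The types and functions (ToyColumn, ToyBlock, toyPlus, toyGreater, toyFilter) are hypothetical stand-ins, far simpler than the real column and block classes in the files listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy types for illustration; NOT ClickHouse's column or block classes.
using ToyColumn = std::vector<int64_t>;
using ToyMask = std::vector<uint8_t>;

struct ToyBlock
{
    std::unordered_map<std::string, ToyColumn> columns;  // column name -> values
};

// Element-wise addition over two whole columns in a single loop.
ToyColumn toyPlus(const ToyColumn & a, const ToyColumn & b)
{
    assert(a.size() == b.size());
    ToyColumn result(a.size());
    for (size_t i = 0; i < a.size(); ++i)
        result[i] = a[i] + b[i];
    return result;
}

// Comparison produces a 0/1 mask column rather than branching per row.
ToyMask toyGreater(const ToyColumn & a, int64_t threshold)
{
    ToyMask mask(a.size());
    for (size_t i = 0; i < a.size(); ++i)
        mask[i] = a[i] > threshold ? 1 : 0;
    return mask;
}

// Filtering keeps only the rows whose mask byte is non-zero.
ToyColumn toyFilter(const ToyColumn & a, const ToyMask & mask)
{
    assert(a.size() == mask.size());
    ToyColumn result;
    for (size_t i = 0; i < a.size(); ++i)
        if (mask[i])
            result.push_back(a[i]);
    return result;
}

// Adds a computed column `result_name = a_name + b_name` to the block.
void toyAddPlusColumn(ToyBlock & block, const std::string & a_name,
                      const std::string & b_name, const std::string & result_name)
{
    ToyColumn sum = toyPlus(block.columns.at(a_name), block.columns.at(b_name));
    block.columns[result_name] = std::move(sum);
}
```

Loops of this shape are easy for a compiler to auto-vectorize, and the per-call overhead is paid once per column rather than once per value.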
Merge and Mutation Mechanics
- Background merges and data consistency
- Mutations and TTL management
- Code walkthrough: MergeTreeDataMergerMutator.cpp, ReplicatedMergeTree*.cpp
Distributed Query Execution
- Cluster queries and coordination
- Role of RemoteQueryExecutor (RemoteBlockInputStream in older releases)
- Code walkthrough: Cluster.cpp, DistributedBlockInputStream.cpp
Caching and Memory Management
- Mark cache, uncompressed cache, and index caches
- Memory tracking and limits
- Code walkthrough: MarkCache.cpp, Arena.cpp, MemoryTracker.cpp
Replication and Fault Tolerance
- ZooKeeper-based coordination
- ReplicatedMergeTree and quorum writes
- Code walkthrough: ReplicatedMergeTreeLogEntry.cpp, ZooKeeper.cpp
Extensibility and Plugin Interfaces
- Adding new table engines or functions
- The role of factories in extensibility
- Code walkthrough: TableFunctionFactory.cpp, AggregateFunctionFactory.cpp
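
The factory idea behind this section can be sketched as a name-to-creator registry: an extension registers a creator under a string name, and the rest of the system looks implementations up by that name without knowing the concrete type. The classes below (ToyFunction, ToyNegate, ToyFunctionFactory) are hypothetical and deliberately minimal; the real factories in the files above carry more than this sketch does.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy interface; NOT a ClickHouse class.
struct ToyFunction
{
    virtual ~ToyFunction() = default;
    virtual int apply(int x) const = 0;
};

struct ToyNegate : ToyFunction
{
    int apply(int x) const override { return -x; }
};

class ToyFunctionFactory
{
public:
    using Creator = std::function<std::unique_ptr<ToyFunction>()>;

    // Single shared registry, so registrations from any translation unit land in one place.
    static ToyFunctionFactory & instance()
    {
        static ToyFunctionFactory factory;
        return factory;
    }

    // Registering the same name twice is treated as a programming error.
    void registerFunction(const std::string & name, Creator creator)
    {
        if (!creators.emplace(name, std::move(creator)).second)
            throw std::logic_error("function already registered: " + name);
    }

    // Look up by name and build a fresh instance; unknown names throw.
    std::unique_ptr<ToyFunction> get(const std::string & name) const
    {
        auto it = creators.find(name);
        if (it == creators.end())
            throw std::out_of_range("unknown function: " + name);
        return it->second();
    }

private:
    std::map<std::string, Creator> creators;
};
```

With this shape, adding a new function means writing one class and one registerFunction call; no lookup code has to change.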
Performance Tuning and Observability
- Profiling and tracing query execution
- System tables and logs
- Code walkthrough: SystemLog.cpp, TraceCollector.cpp
Contributing
- Contributing to ClickHouse
- Paths to explore more (e.g., internals of specific functions or engines)
- Open research questions in OLAP system design