Table indexes help databases retrieve data quickly: instead of scanning the whole table, the system follows the index straight to the rows it needs. For ordered access patterns like range scans, hash tables aren't effective because they keep keys in no particular order. This is where the B+ tree comes in. It's a standard structure for file organization and database indexing, providing an efficient, ordered layout for data management and index-based searches.
B+ trees manage sorted data efficiently, which is important in modern databases and file systems. They extend binary search trees: instead of at most two children, each node can hold many children within a set range, which keeps the tree balanced and shallow and reduces the disk reads and writes that dominate large-scale data storage. Search, insertion, and deletion all run in logarithmic time.
A B+ tree's structure depends on its branching factor, the number of keys and children each node holds. Nodes store keys that partition the data range, plus pointers to child nodes that guide searches. The tree adjusts dynamically, splitting or merging nodes as data is inserted or deleted, so it stays balanced and organized as the workload changes.
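To make that structure concrete, here is a minimal sketch of B+ tree nodes, a point lookup, and a range scan over the linked leaves. The node layout, the hand-built three-leaf tree, and the helper names are illustrative assumptions, not any particular engine's format.

```python
from bisect import bisect_right

class Node:
    """One B+ tree node; in a real engine this would map to one disk page."""
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []       # sorted separator keys (internal) or record keys (leaf)
        self.children = []   # child nodes, used by internal nodes only
        self.values = []     # records, parallel to keys, used by leaves only
        self.next = None     # leaf-to-leaf link that makes range scans cheap

def search(root, key):
    """Descend from the root to the one leaf that may contain `key`."""
    node = root
    while not node.is_leaf:
        # bisect_right picks the child whose key range covers `key`
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1
    if i >= 0 and node.keys[i] == key:
        return node.values[i]
    return None

def range_scan(root, lo, hi):
    """Collect values with lo <= key <= hi by walking the linked leaves."""
    node = root
    while not node.is_leaf:
        node = node.children[bisect_right(node.keys, lo)]
    out = []
    while node is not None:
        for k, v in zip(node.keys, node.values):
            if k > hi:
                return out
            if k >= lo:
                out.append(v)
        node = node.next
    return out

# A tiny tree built by hand: a root with separators [3, 5] over three leaves.
l1, l2, l3 = Node(True), Node(True), Node(True)
l1.keys, l1.values, l1.next = [1, 2], ["a", "b"], l2
l2.keys, l2.values, l2.next = [3, 4], ["c", "d"], l3
l3.keys, l3.values = [5, 6], ["e", "f"]
root = Node(False)
root.keys, root.children = [3, 5], [l1, l2, l3]

print(search(root, 4))         # d
print(range_scan(root, 2, 5))  # ['b', 'c', 'd', 'e']
```

Note how the range scan descends to the starting leaf once and then just follows the leaf chain, which is exactly why B+ trees handle range queries so well.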
B+ trees are a great fit for disk-based storage systems. They pack many keys into each node, typically sized to match a disk block, which reduces how often the disk is accessed. And because the sorted leaves can be read in sequence, a range of records can be fetched quickly, which makes B+ trees vital in modern databases where fast, efficient data handling is crucial.
B+ tree performance considerations
B+ tree performance depends a lot on how well its design matches the hardware, like disks and memory. A key factor is the branching factor, which sets how many children each node has, affecting tree depth and lookup speed. The main goal of B+ tree tuning is to minimize I/O operations and use CPU caches effectively, especially with large datasets.
Branching factor and node size
The branching factor affects a B+ tree's depth and how many I/O operations are needed for lookups. A larger branching factor makes the tree shallower but increases node size. The best setup depends on disk block size, CPU cache line size, and key size.
For example, with 16KB pages and 8-byte keys and pointers, a node can hold roughly 1,000 key/pointer entries, so a three-level tree can index on the order of a billion records. The goal is to match node size to the disk block size (usually 4KB, 8KB, or 16KB) to reduce I/O overhead, while also keeping CPU cache efficiency in mind for in-memory operations. Balancing the two optimizes both disk access and cache use.
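That arithmetic can be sketched directly. The numbers below ignore page headers, slot arrays, and per-record metadata, so real fan-out is somewhat lower than this idealized estimate.

```python
# Rough fan-out arithmetic for 16KB pages with 8-byte keys and pointers.
PAGE_SIZE = 16 * 1024   # bytes per node (one disk page)
KEY_SIZE = 8            # bytes per key
POINTER_SIZE = 8        # bytes per child pointer / record reference

fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)  # about 1,000 entries per node
levels = 3
print(f"entries per node: {fanout}")                                     # 1024
print(f"records a {levels}-level tree can reach: {fanout ** levels:,}")  # ~1.07 billion
```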
See this Blog article on B-trees for an interactive example.
Primary key choice
Your choice of primary key affects B+ tree structure and performance. Sequential keys, like auto-incrementing integers, allow predictable inserts and fast range queries, but they can create hotspots in multi-threaded systems. Random keys like UUIDs spread writes out better but may cause more fragmentation and deeper trees.
In high-concurrency systems, these trade-offs are more noticeable. Some systems use hybrid approaches, combining timestamps with sequential IDs to balance write distribution and range query speed.
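As a toy illustration of the trade-off, the simulation below uses a sorted list as a stand-in for the tree's key order and counts how many inserts land at the current tail, i.e. the single rightmost leaf. The key formats and counts are arbitrary assumptions.

```python
# Sequential keys always append at the tail (one hot leaf); UUID-like keys
# scatter inserts across the whole key space. Purely a sketch.
import uuid
from bisect import bisect_left, insort

def tail_insert_fraction(keys):
    """Fraction of inserts that land at the current end of the key order."""
    sorted_keys, tail_hits = [], 0
    for k in keys:
        if bisect_left(sorted_keys, k) == len(sorted_keys):
            tail_hits += 1
        insort(sorted_keys, k)
    return tail_hits / len(keys)

sequential = [f"{i:012d}" for i in range(10_000)]          # zero-padded auto-increment
random_ids = [str(uuid.uuid4()) for _ in range(10_000)]    # UUID primary keys

print(f"sequential keys, tail inserts: {tail_insert_fraction(sequential):.0%}")  # 100%
print(f"UUID keys, tail inserts:       {tail_insert_fraction(random_ids):.0%}")  # ~0%
```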
Write amplification
As B+ trees grow, nodes split to stay balanced. This causes write amplification, which can slow down writes, especially on SSDs. To reduce this, you can use strategies like fractional cascading, where parent nodes store some child keys, and adaptive node sizes that change with the workload.
Some systems use write-optimized B+ tree variants, like Bε-trees, which buffer updates to reduce write amplification.
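The sketch below illustrates the buffering idea only, not an actual Bε-tree: updates either rewrite their target leaf immediately, or accumulate in a buffer that is flushed in batches with one write per dirty leaf per flush. The leaf span, buffer size, and workload are arbitrary assumptions.

```python
import random

LEAF_SPAN = 100        # each leaf "owns" 100 consecutive keys (1,000 leaves total)
BUFFER_SIZE = 10_000   # updates held in the buffer before a flush

def page_writes(updates, buffered):
    """Count leaf-page writes for a stream of point updates."""
    writes, buffer = 0, []
    for key in updates:
        if not buffered:
            writes += 1                  # rewrite the target leaf right away
            continue
        buffer.append(key)
        if len(buffer) == BUFFER_SIZE:   # flush: one write per distinct dirty leaf
            writes += len({k // LEAF_SPAN for k in buffer})
            buffer.clear()
    if buffer:
        writes += len({k // LEAF_SPAN for k in buffer})
    return writes

random.seed(42)
updates = [random.randrange(100_000) for _ in range(100_000)]
print("leaf writes, unbuffered:", page_writes(updates, buffered=False))  # 100,000
print("leaf writes, buffered:  ", page_writes(updates, buffered=True))   # ~10,000
```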
Cache efficiency
Modern B+ tree designs use cache-conscious methods to improve in-memory performance. These include node compression to fit more keys in a single cache line, cache-oblivious algorithms that work well regardless of cache size, and SIMD-optimized searches for parallel key comparisons within a node.
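As a rough stand-in for SIMD, the snippet below uses NumPy's vectorized comparison to locate a key's slot within a node in one branch-free pass. The node size and key type are arbitrary; a real implementation would use explicit SIMD intrinsics over fixed-size key arrays.

```python
import numpy as np

rng = np.random.default_rng(7)
node_keys = np.sort(rng.integers(0, 1_000_000, size=256))  # keys of one node

def branchless_lower_bound(keys, target):
    # Compare every key in the node against `target` in one vectorized pass;
    # summing the boolean mask gives the index of the first key >= target.
    return int(np.sum(keys < target))

target = int(node_keys[100])
idx = branchless_lower_bound(node_keys, target)
assert node_keys[idx] == target
print(f"key {target} belongs at slot {idx} of the node")
```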
Resources
- Open Data Structures. Morin (2013)
Chapter 14, "External Memory Searching," discusses B-trees as an efficient data structure in the external memory model. B-trees are a generalization of binary trees, efficient for large datasets that can't fit into RAM and must be stored on disks or SSDs.
A B-tree, for an integer B ≥ 2, is a tree where all leaves are at the same depth. Each non-root internal node has between B and 2B children. This structure keeps the B-tree balanced, ensuring efficient operations. The root node can have between 2 and 2B children.
The height of a B-tree (h) is about log base B of the number of leaves, plus one: h ≈ logB(N) + 1. This shows how B-trees stay shallow even when storing lots of data, which is important to minimize disk accesses.
Each node in a B-tree holds an array of keys. An internal node with k children has k - 1 keys, stored in sorted order to allow efficient search, insertion, and deletion. As in a binary search tree, each key acts as a separator: every key in the child subtree to its left is smaller than it, and every key in the child subtree to its right is larger.
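As a quick sanity check on that height bound, this snippet evaluates h ≈ logB(N) + 1 in the worst case where every internal node has only B children; the chosen values of B are arbitrary.

```python
import math

def max_height(n_leaves, B):
    # Worst case for an Open-Data-Structures-style B-tree: every internal
    # node has the minimum of B children, so height grows like log_B(N) + 1.
    return math.ceil(math.log(n_leaves, B)) + 1

for B in (16, 256, 1024):
    print(f"B = {B:5d}: height for a billion leaves <= {max_height(10**9, B)}")
```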
- Andy Pavlo used this in the Lecture on Tree Indexes from Carnegie Mellon University's 15-445/645 Database Systems course. It's a neat and handy visualization of the data structure.
- Insertion Into a B-Tree, from UAlberta
It details a three-step method: first, find the appropriate leaf node for the new value using the search procedure; then add the value to that node; finally, handle any overflow. If a node overflows, it's split into a left part, a middle (median) key, and a right part, with the median moving up into the parent and, if needed, a new root being created. As a result, B-trees grow upward at the root rather than downward at the leaves, which maintains their perfect height balance.
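Below is a compact sketch of that three-step insert for a plain B-tree, assuming unique integer keys and a deliberately tiny node capacity so splits happen early; it follows the generic textbook procedure rather than the course's exact pseudocode.

```python
from bisect import bisect_left, insort

MAX_KEYS = 3  # deliberately tiny so splits (and root growth) happen quickly

class BNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # an empty list means this is a leaf

    def is_leaf(self):
        return not self.children

def _insert(node, key):
    """Insert `key` below `node`; return (median, right_sibling) if it split."""
    if node.is_leaf():
        insort(node.keys, key)                    # steps 1 + 2: reach the leaf, add the key
    else:
        i = bisect_left(node.keys, key)
        promoted = _insert(node.children[i], key)
        if promoted:                              # a child split: absorb its median key
            median, right = promoted
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    # Step 3: overflow -> split into left part, median key, right part.
    mid = len(node.keys) // 2
    median = node.keys[mid]
    right = BNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return median, right

def insert(root, key):
    """Insert and return the (possibly new) root: the tree grows at the top."""
    promoted = _insert(root, key)
    if promoted:
        median, right = promoted
        return BNode([median], [root, right])     # new root: height grows by one
    return root

root = BNode()
for k in range(1, 21):
    root = insert(root, k)
print("root keys after inserting 1..20:", root.keys)
```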
- This text provides a concise overview of B-trees and B+ trees, data structures optimized for storing and managing large datasets. It starts by discussing the inefficiency of updating sorted arrays and introduces dividing the data into fixed-size blocks to make insertions and deletions easier. The main focus is the structure of B-trees, where blocks either store data or reference other blocks. This hierarchical setup handles large datasets efficiently, including operations like splitting full blocks and maintaining tree balance. The text emphasizes the practicality of B-trees for both on-disk and in-memory storage, highlighting their advantages over simpler data structures for managing ordered key-value maps.
- The Ubiquitous B-Tree. Comer. ACM Computing Surveys (CSUR) 11 (1979)
"The Ubiquitous B-Tree" is a seminal work that played a big role in promoting B-trees. The paper helped make B-trees a go-to data structure for database management systems and file systems. It made the concept accessible and explained the advantages, contributing to B-trees becoming a standard choice for many storage systems.
The B-tree version in the paper includes data in the internal nodes. In modern implementations, it's common to store data only in the leaf nodes. This makes B-trees more efficient for range queries and better for systems where data is stored on disk. The internal nodes in a B+-tree contain only keys and guide the search to the leaf node with the desired data.
- Modern B-tree techniques. Graefe and others. Foundations and Trends in Databases 3 (2011)
This paper provides an updated look at B-tree optimizations and techniques that have emerged over the years. It's a must-read for understanding modern B-tree practices.
- More Modern B-tree techniques. Graefe. Foundations and Trends in Databases, vol. 13, issue 3 (2024)
This monograph complements the earlier survey and brings the content up to date. It adds new techniques for tree structures and query processing, covers log-structured merge-forests and stepped-merge forests, and discusses concurrency control and in-memory B-trees, from thread-private to persistent.
- Two is Better Than One: The Case for 2-Tree for Skewed Data Sets. Zhou, Yu, Graefe, and Stonebraker (2023) 📺
This paper proposes a 2-Tree architecture to improve memory use in databases with skewed data sets, which a single B-tree doesn't handle well. In a 2-Tree, frequently accessed records are stored in one tree and less-used records in another, and a migration protocol moves records between the trees as their access frequency changes (a toy sketch follows the notes below).
  - It's designed to be an on-disk index that handles indexes larger than memory effectively.
  - The record migration protocol can adapt to any B-tree.
  - The next step is to expand the approach to an N-Tree system with multiple storage levels. It's like an LSM of B-trees, but the levels are mutable and records can move between them.
  - The approach aims to greatly improve memory use and throughput without hurting range scan performance.
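Here is a toy sketch of the hot/cold idea described in this entry, with plain dicts standing in for the two B-trees and a naive count-based promotion rule; the paper's actual migration protocol is sampling-based, lighter-weight, and also moves cooled-off records back down.

```python
PROMOTE_AFTER = 3  # accesses before a record migrates into the hot tree

class TwoTree:
    """Hot store for frequently read records, cold store for everything else."""
    def __init__(self):
        self.hot, self.cold, self.hits = {}, {}, {}

    def put(self, key, value):
        self.cold[key] = value                   # new records start cold

    def get(self, key):
        if key in self.hot:                      # fast path: small, cache-friendly tree
            return self.hot[key]
        value = self.cold.get(key)
        if value is not None:
            self.hits[key] = self.hits.get(key, 0) + 1
            if self.hits[key] >= PROMOTE_AFTER:  # record became hot: migrate it up
                self.hot[key] = self.cold.pop(key)
        return value

store = TwoTree()
for k in range(1_000):
    store.put(k, f"record-{k}")
for _ in range(5):                               # a skewed workload hammering 3 keys
    for k in (1, 2, 3):
        store.get(k)
print("hot records:", len(store.hot), "| cold records:", len(store.cold))  # 3 | 997
```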
- Adaptive Hybrid Indexes. Anneser, Kipf, Zhang, Neumann, and Kemper (2022)
This paper addresses the large memory use of indexes in database systems. Traditional compact indexes reduce space but hurt performance because they're static. They can't adjust to changing workloads or balance memory efficiency and performance.
The goal is to optimize both space and performance in index structures by dynamically adjusting encodings at run-time based on workload patterns.
- Taming B-Trees (ScyllaDB)
This article discusses how ScyllaDB switched from Red-Black trees to B-trees to improve its in-memory cache, addressing issues like high memory use and CPU time spent in tree searches. While B-trees are often considered on-disk data structures, the article explains how they offer better cache locality and fewer branch mispredictions than binary trees, optimizing key searches and overall performance.
This article discusses how ScyllaDB switched from Red-Black trees to B-trees to improve its in-memory cache, addressing issues like high memory use and CPU usage during tree searches. While B-trees are often considered on-disk data structures, the article explains how B-trees offer better cache locality and reduce branch misprediction problems of binary trees, optimizing key searches and overall performance.