Notes on databases

Nov 19, 2023

References, articles, and information on various database systems and database internals. It serves as a reference and bookmarks for systems, architectures, and implementation details.

Systems

An outline of database systems and architecture. Work in progress. The goal is for me to structure my notes better and to have a reference for when I need to look up something.

Neon

Neon is a system that is both protocol and application-compatible with Postgres. Its architecture, which draws inspiration from Aurora's design, separates compute and storage and is managed by a control plane1.

At its core, Neon operates on the Page (8KB) abstraction level. It functions as a replacement for the Postgres storage engine without concern of the specific page content. This approach allows for a minimal integration surface with Postgres, enabling the use of only slightly modified forks of the database.

The storage layer is composed of Pageservers and Safekeepers. Pageservers have the responsibility of storing Postgres pages, whereas Safekeepers are tasked with persisting the write-ahead logs (WALs). When Postgres needs to read data, it retrieves pages directly from the Pageserver.

WAL records are streamed to the Safekeeper cluster. For reliability, a cluster of three Safekeeper nodes is maintained using Paxos. Following this, Pageservers ingest the WALs from Safekeepers and store the data both locally and on S3 as "layers". The layers enable point in time backups and efficient branching utilizing copy-on-write technique 2. To efficiently handle read requests, Pageservers compact these layers and create materialized pages from the WAL data 3.

Like Aurora, Neon adheres to the single-writer paradigm that is characteristic of Cloud-Native OLTP systems. It uses a shared-storage design that follows a primary-secondary paradigm 4 5.

YugabyteDB

YugabyteDB is a high-performance, distributed SQL database. It is built on a globally-distributed, strongly-consistent document store and is fully compatible with PostgreSQL. It offers horizontal scalability and resilience 6.

It uses a modified Raft algorithm for data replication and sharding for data distribution. It is optimized for cloud-native environments and supports various deployment options.

The storage engine, DocDB, is based on RocksDB. Unlike RocksDB it's a "key to object/document" instead of a "key to value" store 7.

CockroachDB

CockroachDB uses Raft and multi-version concurrency control (MVCC) to provide distributed transactions and strong consistency 8.

Aurora

Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. It uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB. It runs in a single-master configuration where only a single node can process write requests and all other nodes are read replicas. If the writer node becomes unavailable, a failover mechanism promotes one of the read-only nodes to be the new writer 4.

A recent addition to Aurora allows for multi-master configurations, albeit with limitations and new failure modes 9.

DuckDB

DuckDB is an in-memory analytical database that is designed for analytical SQL queries. DuckDB uses vectorized query execution and is optimized for OLAP workloads 10.

As a single-node system, DuckDB shares high-level architecture with Velox, HyPer, and Umbra. It uses a "morsel-driven" query execution framework where query plans are represented as trees of operators 11 12.

Postgres

Postgres, the battle-hardened descendant of Ingres 13. Built to prioritize extensibility and SQL compliance, the Swiss Army knife of the database world.

The architecture follows a process-per-connection model, offering a trade-off between fault isolation and resource consumption. Its MVCC implementation provides granular control over transactions, offering advanced isolation levels without lock contention.

Its write-ahead log serves as the cornerstone for data durability and integrity, and facilitates features like point-in-time recovery and various replication configurations.

The system's extensibility is evident through a decade of extensions for Postgres, systems that fork to build new systems on, or use the query engine as a library.

Transactions

Isolation levels

This is a good post on isolation levels for database transactions. The post looks at the isolation levels and their anomalies. Specifically how Postgres implements isolation levels, but it's a good overview nevertheless.

Hermitage by Martin Kleppmanns, is an attempt to nail down precisely what different database systems actually mean with their isolation levels. It's a suite of tests that simulates various concurrency issues — some common, some more obscure — and documents how different databases handle those situations.

Storage

The journey to understand disk storage could well start with Information Retrieval (CJ van Rijsbergen 1979), File Structures. This focuses on file structures for document retrieval but also reveals that many challenges we encounter today were already addressed five decades ago.

Despite this, storing data on disk securely and durably remains a complex task. The main challenges can be grouped into three key layers:

  1. File API: Complex and error-prone, making it difficult to enforce data ordering without expensive cache flushes.

  2. File Systems: Poor error-handling, leading to data corruption and loss. Even well-known file systems frequently ignore or mishandle errors.

  3. Disks: Unreliable flushing mechanisms and inconsistent error rates further compromise data integrity.

Overall, systemic weaknesses exist from the application layer down to the hardware, making data loss and corruption a lasting concern 14.

B-Tree

  • 📄 Ubiquitous B-tree. Comer. ACM Computing Surveys (CSUR) 11 (1979)

    "The Ubiquitous B-Tree" is considered a seminal work that played a significant role in promoting the understanding and adoption of B-trees. The paper helped to solidify the B-tree's position as a go-to data structure for database management systems and file systems. It made the concept accessible and laid out the advantages, contributing to B-trees becoming a standard choice for many storage systems.

    The B-tree version described in the paper includes data in the internal nodes. In modern implementations, it's common to only store data in the leaf nodes. This makes B-trees more efficient for range queries and better suited for systems where data is stored on disk. The internal nodes in a B+-tree contain only keys and act as a guide to locate the leaf node containing the desired data.

  • 📄 Modern B-tree techniques. Graefe, and others. Foundations and Trends in Databases 3 (2011)

    This paper provides an updated look at B-tree optimizations and techniques that have emerged over the years. It's a must-read for understanding contemporary B-tree practices.

  • 📄 Two is Better Than One: The Case for 2-Tree for Skewed Data Sets. Zhou, Yu, Graefe, and Stonebraker. memory 11 (2023) 📺

    This is a discussion on a 2-Tree architecture to improve memory utilization in databases with skewed data sets. B-trees don't handle skewed data well, leading to poor memory utilization. In a 2-Tree, frequently accessed records are stored in one tree, and cold records in another. A migration protocol moves records between the trees as their access frequency changes.

    • It's designed to be an on-disk index that handles larger-than-memory indexes effectively.

    • The record migration protocol adaptable to any B-tree.

    Next step would be to expand the approach for an N-Tree system with multiple storage levels. It's like an LSM of B-tree's except that the levels are mutable and records can move between them.

    The approach aims to significantly improve memory utilization and throughput without compromising range scan performance.

  • 📄 Adaptive Hybrid Indexes. Anneser, Kipf, Zhang, Neumann, and Kemper (2022)

    The problem this paper tries to address is the large memory usage of indices in database systems. Traditional compact indexes, while reducing space overhead, compromise on performance due to their static nature. Unable to to adjust to varying workload patterns and to trade-off between memory efficiency and performance is a challenge.

    The objective they try to achieve is is to optimize both space and performance in an index structures by dynamically adjusting encodings at run-time based on workload patterns.

  • 🏷️ Taming B-Trees (ScyllaDB)

    Is an article on how ScyllaDB transitioned from Red-Black to B-trees to enhance its in-memory cache, addressing issues like suboptimal memory consumption and CPU usage during tree searches. While B-trees are often thought of as an on-disk data structure the article goes into how B-trees offer better cache locality and mitigate branch misprediction problems associated with binary trees, optimizing key searches and overall performance.

LSM Tree

Implementations

Books

Tools

  • pgBadger - A fast PostgreSQL log analysis tool that generates detailed reports. It offers various graphs, charts, and counters that help in identifying slow queries and performance bottlenecks.

Footnotes


  1. Neon architecture overview↩︎

  2. A copy-on-write clone of a Neon project's primary branch or previously created child branch. A branch can be created from the current or past state of the parent branch. Neon uses the copy-on-write technique to copy data when creating a branch. Neon Branching ↩︎

  3. Neon on dbdb↩︎

  4. Is Scalable OLTP in the Cloud a Solved Problem? (CIDR'23, T Ziegler 2023). ↩︎ ↩︎

  5. Heikki Linnakangas's presentation for the "¡Databases! Seminar" series. He goes into the storage architecture for Neon and mentions how they don't try to solve the multi-writer problem. Neon is a conventional Postgres with a network storage interface that allow them to separate compute from storing files on disk. Video ↩︎

  6. Yugabyte docs↩︎

  7. Yugabyte on dbdb↩︎

  8. CockroachLabs architecture overview↩︎

  9. Aurora MySQL replication↩︎

  10. Persistent Storage of Adaptive Radix Trees in DuckDB. DuckDB uses Adaptive Radix Tree (ART) Indexes to enforce constraints and to speed up query filters. It persist ART Indexes to disk. ↩︎

  11. A shallow survey of OLAP and HTAP query engines↩︎

  12. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age (SIGMOD’14, V Leis 2014). ↩︎

  13. The INGRES relational database management system (DBMS) was implemented during 1975-1977 at the Univerisity of California. Since 1978 various prototype extensions have been made to support distributed databases.

     ↩︎
  14. Writing to a file safely—without data corruption—is more complex than commonly thought. File APIs are difficult to use correctly, file systems can drop critical errors, and disks can corrupt data at alarming rates. To write a file safely, one would typically use a system call like pwrite, which allows interaction with the file system. Despite high-level languages offering simpler interfaces, they ultimately rely on such system calls, and the complexity and risks remain. FILES↩︎