Lens — How
How the tool actually behaves — not how the docs say it should
Each page is a behavioral deep-dive into one system: the execution model you need to predict what it'll do, the failure modes you'll hit in production, and the operational characteristics the getting-started guide never mentions.
Apache Spark
Stages, tasks, shuffles, and spills. How Spark distributes work, where it breaks, and why the query plan matters more than the cluster size.
Apache Flink
Streaming-first execution with checkpoints, backpressure, and state. How Flink manages continuous load differently from batch engines.
Hadoop MapReduce
The batch execution model that shaped everything after it. Why modern engines still carry its trade-offs.
Apache Kafka
Partitions, consumer groups, offsets, retention — and where exactly-once delivery gets complicated.
RabbitMQ
Queue-based routing, acknowledgments, prefetch, and dead-letter exchanges. The queue model versus the log model — and why the distinction matters.
HDFS
Block replication, NameNode limits, and why this file system shaped the first generation of data lakes.
Object Storage
Consistency semantics that vary by provider, listing latency, and small-file costs: the cloud storage substrate under most modern data platforms.
Apache Iceberg
Snapshot isolation, partition evolution, metadata scaling — transactional guarantees on top of object storage, and the fan-out that grows as your table does.
Delta Lake
Transaction log mechanics, write conflicts, and the compaction trade-off — what ACID on object storage actually costs, and where it diverges from Iceberg.
Apache Hudi
Copy-on-Write vs Merge-on-Read — the write/read trade-off that determines performance at every layer, and why the incremental query model is different from the others.
PostgreSQL
Planner behavior, vacuum mechanics, connection limits — and where Postgres stops scaling for analytical workloads.
BigQuery
Slot-based execution, columnar scanning, and the cost model that makes full table scans expensive in ways you don't see until the bill.
Snowflake
Virtual warehouse behavior, auto-suspend, clustering, and where the separation of storage and compute leaks.
Redshift
Distribution styles, sort keys, WLM queues — the operational surface area that managed services don't fully manage.
Trino
Federation across heterogeneous sources, predicate pushdown limits, and why the coordinator becomes the bottleneck before anything else does.
Apache Airflow
Execution dates, catchup, task retries — and why the scheduler is often the bottleneck nobody suspects.
Dagster
Assets instead of tasks — why defining what your pipeline produces changes debugging, dependency tracking, and what breaks when you modify the graph.
dbt
Ref resolution, incremental models, test contracts — and where dbt's simplicity creates hidden coupling.
SQLMesh
State-aware execution that knows what changed and re-runs only what it must: a genuinely different behavioral contract from dbt's compile-and-run model.
Apache Atlas
Lineage tracking and classification — and where metadata graphs become stale faster than you expect.
DataHub
Search, discovery, and lineage. What it solves technically, and which organizational problems it can't.