The mental models docs don't teach
Docs tell you what. They skip why — the forces, trade-offs, and failure modes that determine how data systems actually behave.
Start with Foundations →
Four lenses
Every concept, four angles
Partitioning. State. Latency. Each concept lives in all four — each lens surfaces something the others don't.
Why does this keep happening?
How does this tool actually work?
What should I choose?
What went wrong?
Trace · Practice · Reflect · Navigate
Four more sections — each one a different mode of engaging with the knowledge.
Pipeline Seam Map
Find where a problem lives by locating the broken handoff — what each stage assumed it was receiving, and what it actually got.
Go to Lifecycle →
Hands-On Scenarios
Debug a slow Spark job, design a streaming pipeline, reason through a migration — with real decision points and consequences.
Go to Practice →
Today: 18 minutes. Same data. Same code.
What changed?
On the Craft
How engineers learn, how judgment forms, why some ideas spread faster than they deserve.
Go to Perspectives →
Most data quality problems are data modeling problems in disguise. The pipeline just makes them visible.
Guided Routes
Curated reading sequences by role, goal, or system — when you want structure, not a blank map.
Go to Learning Paths →
About the author
Aayush Sharma
Data engineer at a bank where a settlement error is a regulatory event. Five years working on the pipelines and reporting systems that correctness-critical teams depend on — trade settlement, regulatory reporting, real-time risk. The kind of systems where the cost of getting it wrong shows up in an audit, not just a Slack alert.
He built this site because the mental models that make data systems legible — the trade-offs, the failure modes, the decisions that look obvious in hindsight — were scattered across papers, talks, and hard-won experience. This is the guide he needed when he started.
Stay sharp
Signal, not schedule
These concepts come from building pipelines where errors reach regulators — where the same failure pattern repeats until you understand the actual cause, not just the fix. When something new is ready, you get one email. That's it.
No digest. No cadence. One email per article.
When Spark shuffles, it writes intermediate state to disk and reads it back. The disk I/O isn't the problem. The problem is that shuffle partitions are sized by count, not by the data they'll hold. When your data is skewed...
The cost shows up 10 minutes later, in a different stage, under a different metric name — and by then it looks like a memory problem.
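A minimal PySpark sketch of the point above, under stated assumptions: the dataset path and the customer_id column are hypothetical stand-ins. The shuffle partition count is a fixed setting (spark.sql.shuffle.partitions, default 200), so a hot key can pile most of the data into one partition; the last lines give a rough per-partition row count so you can see the skew directly. The adaptive settings are Spark 3.x's runtime mitigation, not a fix for the underlying key distribution.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("shuffle-skew-sketch")
    # Every shuffle produces this many partitions, regardless of how much
    # data each partition will actually hold. 200 is the default.
    .config("spark.sql.shuffle.partitions", "200")
    # Spark 3.x AQE: coalesce small shuffle partitions at runtime, and
    # split oversized partitions on the skewed side of sort-merge joins.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

# Hypothetical input path and schema.
events = spark.read.parquet("s3://bucket/events/")

# groupBy triggers a shuffle: rows are hashed by customer_id into the
# configured number of partitions. A hot customer lands in one partition.
per_customer = events.groupBy("customer_id").agg(F.count("*").alias("n"))

# Rough skew check: how many result rows landed in each partition.
# With one dominant key, the top partition dwarfs the rest.
sizes = per_customer.groupBy(F.spark_partition_id().alias("pid")).count()
sizes.orderBy(F.desc("count")).show(5)
```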