Systems Thinking for Data Engineers

The mental models docs don't teach

Docs tell you what. They skip why — the forces, trade-offs, and failure modes that determine how data systems actually behave.

Start with Foundations →
incident.log
SparkException: OutOfMemoryError
at sort phase — executor heap exhausted
Why it actually happened
Shuffle produces unbounded intermediate state
→ partition count too low for data skew
→ sort-merge join holds both sides in memory
→ spill threshold crossed on late-arriving data
The fix is in the model, not the config.
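
A minimal sketch of what that model points to in practice, assuming hypothetical trades and accounts inputs joined on account_id. The paths, names, and salt width are placeholders, not from the incident above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

trades = spark.read.parquet("/data/trades")        # hypothetical inputs
accounts = spark.read.parquet("/data/accounts")

# The model says: inspect the key distribution before blaming the heap.
# A handful of hot keys means one shuffle partition carries most of the bytes.
trades.groupBy("account_id").count().orderBy(F.desc("count")).show(5)

# If skew is confirmed, salt the hot side so the sort-merge join spreads
# each hot key across N partitions instead of concentrating it in one.
N = 16
salted_trades = trades.withColumn("salt", (F.rand() * N).cast("int"))
salted_accounts = accounts.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)]))
)
joined = salted_trades.join(salted_accounts, ["account_id", "salt"])

Salting works because it changes the join key's distribution, which is exactly the property the model says matters.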

Four lenses

Every concept, four angles

Partitioning. State. Latency. Each concept lives in all four lenses, and each lens surfaces something the others don't.

Trace · Practice · Reflect · Navigate

Four more sections, each one a different way of engaging with the material.

About the author

Aayush Sharma

Data engineer at a bank where a settlement error is a regulatory event. Five years working on the pipelines and reporting systems that correctness-critical teams depend on — trade settlement, regulatory reporting, real-time risk. The kind of systems where the cost of getting it wrong shows up in an audit, not just a Slack alert.

He built this site because the mental models that make data systems legible — the trade-offs, the failure modes, the decisions that look obvious in hindsight — were scattered across papers, talks, and hard-won experience. This is the guide he needed when he started.

Get in touch →

Open to data engineering roles, collaboration, and hard problems.

Stay sharp

Signal, not schedule

These concepts come from building pipelines where errors reach regulators — where the same failure pattern repeats until you understand the actual cause, not just the fix. When something new is ready, you get one email. That's it.

No digest. No cadence. One email per article.

From: The Data Engineering <aayush@thedataengineering.com>
Re: Why shuffle is never free

The cost shows up 10 minutes later, in a different stage, under a different metric name — and by then it looks like a memory problem.

When Spark shuffles, it writes intermediate state to disk and reads it back. The disk I/O isn't the problem. The problem is that shuffle partitions are sized by count, not by the data they'll hold. When your data is skewed...
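
A rough illustration of partitions sized by count, assuming a hypothetical events DataFrame keyed on customer_id. The 200 default is Spark's real spark.sql.shuffle.partitions setting; everything else here is a placeholder:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/events")    # hypothetical input

# Every wide transformation hashes rows into a fixed number of shuffle
# partitions: 200 by default, independent of how many bytes each will hold.
print(spark.conf.get("spark.sql.shuffle.partitions"))    # '200'

# Rows route by key hash, so one hot customer_id fills one partition,
# and the executor that sorts that partition is the one that exhausts its heap.
daily = events.groupBy("customer_id").agg(F.sum("amount").alias("total"))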

Read the full piece →