Systems Thinking for Data Engineers

The mental models docs don't teach

Docs tell you what. They skip why — the forces, trade-offs, and failure modes that determine how data systems actually behave.

Start with Foundations →
incident.log
SparkException: OutOfMemoryError
at sort phase — executor heap exhausted
Why it actually happened
Shuffle produces unbounded intermediate state
→ partition count too low for data skew
→ sort-merge join holds both sides in memory
→ spill threshold crossed on late-arriving data
The fix is in the model, not the config.
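
A minimal sketch of what that model points to in practice, assuming hypothetical trades and accounts inputs joined on account_id. The paths, names, and salt width are placeholders, not from the incident above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

trades = spark.read.parquet("/data/trades")        # hypothetical inputs
accounts = spark.read.parquet("/data/accounts")

# The model says: inspect the key distribution before blaming the heap.
# A handful of hot keys means one shuffle partition carries most of the bytes.
trades.groupBy("account_id").count().orderBy(F.desc("count")).show(5)

# If skew is confirmed, salt the hot side so the sort-merge join spreads
# each hot key across N partitions instead of concentrating it in one.
N = 16
salted_trades = trades.withColumn("salt", (F.rand() * N).cast("int"))
salted_accounts = accounts.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)]))
)
joined = salted_trades.join(salted_accounts, ["account_id", "salt"])

Salting works because it changes the join key's distribution, which is exactly the property the model says matters.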

Four lenses

Every concept, four angles

Partitioning. State. Latency. Each concept lives in all four lenses, and each lens surfaces something the others don't.

Trace · Practice · Reflect · Navigate

Four more sections, each one a different way of engaging with the material.

About the author

Aayush Sharma

Data engineer at a bank where a settlement error is a regulatory event. Five years working on the pipelines and reporting systems that correctness-critical teams depend on — trade settlement, regulatory reporting, real-time risk. The kind of systems where the cost of getting it wrong shows up in an audit, not just a Slack alert.

He built this site because the mental models that make data systems legible — the trade-offs, the failure modes, the decisions that look obvious in hindsight — were scattered across papers, talks, and hard-won experience. This is the guide he needed when he started.

Get in touch →

Open to data engineering roles, collaboration, and hard problems.

Stay sharp

Signal, not schedule

These concepts come from building pipelines where errors reach regulators — where the same failure pattern repeats until you understand the actual cause, not just the fix. When something new is ready, you get one email. That's it.

No digest. No cadence. One email per article.

From: The Data Engineering <aayush@thedataengineering.com>
Re: Why shuffle is never free

The cost shows up 10 minutes later, in a different stage, under a different metric name — and by then it looks like a memory problem.

When Spark shuffles, it writes intermediate state to disk and reads it back. The disk I/O isn't the problem. The problem is that shuffle partitions are sized by count, not by the data they'll hold. When your data is skewed...
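
A rough illustration of partitions sized by count, assuming a hypothetical events DataFrame keyed on customer_id. The 200 default is Spark's real spark.sql.shuffle.partitions setting; everything else here is a placeholder:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/events")    # hypothetical input

# Every wide transformation hashes rows into a fixed number of shuffle
# partitions: 200 by default, independent of how many bytes each will hold.
print(spark.conf.get("spark.sql.shuffle.partitions"))    # '200'

# Rows route by key hash, so one hot customer_id fills one partition,
# and the executor that sorts that partition is the one that exhausts its heap.
daily = events.groupBy("customer_id").agg(F.sum("amount").alias("total"))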

Read the full piece →