
Lazy vs Eager Evaluation

The code you wrote is not the code that ran.

from pyspark.sql.functions import col, sum

data.filter(col("status") == "active") \
  .join(table2, "user_id") \
  .groupBy("region") \
  .agg(sum("revenue")) \
  .write.parquet("output/")

That's PySpark. Don't worry if you haven't used it — the shape is what matters, not the syntax. The same pattern exists in Polars, Dask, dbt-compiled SQL, and most modern query engines. Read it as: filter, then join, then group, then write.

That's the order on the page. That's the order you reasoned about. That's the order you'd defend in code review.

It is almost certainly not the order it ran in.

Somewhere between you typing those four operations and the data being written, something else read your code, looked at it as a whole, decided your filter would be more efficient if it ran later, decided your join had a smaller side that should be broadcast, decided one of the columns you computed wasn't actually needed downstream and quietly skipped computing it. By the time the bytes hit disk, the work that happened bears a family resemblance to the work you wrote, but it is not the same work.

You did not authorize this. You also did not unauthorize it. The contract you signed allows it. You may not have realized you signed a contract.

What you wrote — code as sequence:

  .filter(col("status") == "active")
  .join(table2, "user_id")
  .groupBy("region")
  .agg(sum("revenue"))
  .write.parquet("output/")

What ran — the engine's compiled plan:

  scan + filter, pushed into the read (predicate pushdown)
  join, smaller side broadcast (broadcast join)
  group + write in a single pass (operation fusion)
  implicit column never computed (column pruning)

Same result. Different work. The engine decides what "equivalent" means.

The gap between those two panels is the entire subject of this article. Most production bugs in modern data pipelines live in that gap. Most engineers do not know the gap exists, because the API hides it. Their code looks like Python. Python runs the way you write it. Their code does not.


The Honest Lie

Consider how the code feels when you write it. The verbs are imperative. The order is sequential. The mental model is procedural. You are giving instructions, and the instructions will be followed.

Now imagine you put a print statement after the filter line. In a procedural system, you would see the filtered rows. In the system you are actually using, nothing prints — because the filter has not run yet. Nothing has run yet. Your "code" is not code; it is a message you are leaving for the engine.

The engine reads your message. It does not run it. It plans.

Only when you ask for something the engine cannot fake — a write to disk, a count, a display, a materialization into memory — does the engine compile your message into actual work. And the work it compiles is whatever the engine decided was the most efficient way to satisfy your request, given everything it now knows about the data, the cluster, and the operations you described.
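The lazy contract can be sketched in a few lines of plain Python. Nothing here is a real engine's API; `LazyFrame`, `filter`, `select`, and `collect` are invented names for this illustration, but the mechanism is the one the snippet at the top relies on: operations extend a plan, and only an action runs it.

```python
# Minimal sketch of the lazy contract in plain Python. Not a real engine:
# LazyFrame, filter, select, and collect are invented for illustration.

class LazyFrame:
    def __init__(self, rows, plan=None):
        self._rows = rows          # source data
        self._plan = plan or []    # list of (op_name, function) pairs

    def filter(self, pred):
        # Filters nothing. It only appends a node to the plan.
        step = ("filter", lambda rs: [r for r in rs if pred(r)])
        return LazyFrame(self._rows, self._plan + [step])

    def select(self, *cols):
        # Likewise: a description, not a projection.
        step = ("select", lambda rs: [{c: r[c] for c in cols} for r in rs])
        return LazyFrame(self._rows, self._plan + [step])

    def collect(self):
        # The action. Only here does the described work actually run.
        rows = self._rows
        for _, fn in self._plan:
            rows = fn(rows)
        return rows

rows = [{"status": "active", "revenue": 10},
        {"status": "inactive", "revenue": 7}]

df = LazyFrame(rows).filter(lambda r: r["status"] == "active").select("revenue")

# At this point nothing has executed. df is two plan nodes and a pointer to data:
print([name for name, _ in df._plan])  # ['filter', 'select']
print(df.collect())                    # [{'revenue': 10}]
```

A print statement placed between `filter` and `select` would show you plan nodes, not rows, which is exactly the surprise described above.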

How it reads:

  .filter(col("status") == "active")   — this filters
  .join(table2, "user_id")             — this joins
  .groupBy("region").agg(...)          — this groups
  .write.parquet("output/")            — this writes
  (sequential, top to bottom)

What's actually happening:

  .filter(col("status") == "active")   — describes a filter
  .join(table2, "user_id")             — describes a join
  .groupBy("region").agg(...)          — describes a group
  .write.parquet("output/")            — triggers compilation + execution
  (execution starts only at the last line)

The first three lines are not code. They are inputs to a planner.

This is not deferred execution. Deferred execution still runs the work as written, just later. This is something stranger: the work as written may never run at all. The engine will run equivalent work, and it gets to choose what equivalent means.

There is a name for this. We have not earned it yet. Keep reading.


The Engine's Defense

The engine is doing this for a reason, and the reason is not malice or carelessness. The reason is that the work you described, run literally, would often be infeasible.

You described filtering a billion rows, then joining with another table, then grouping, then writing the result. Run literally, that means: load a billion rows into memory, write the filtered subset somewhere, load that subset back, join it with the other table, write the joined result, load it back, group it, write the grouped result, load it back, write the final output. Four full reads, four full writes, every intermediate result materialized to disk or memory because the next step needs it.

I/O cost per step

  Step          | Eager (literal)                | Lazy (planned)
  Read input    | full scan                      | full scan
  Apply filter  | write intermediate, read back  | applied during scan — no intermediate
  Execute join  | write intermediate, read back  | in-memory, fused with scan
  Group         | write intermediate, read back  | same pass as join
  Write output  | final write                    | final write
  Total I/O     | 4 reads + 4 writes             | 1 read + 1 write
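To put rough numbers on that comparison: assume a 1,000 GB input, a filter that keeps 10%, a join that adds about 20%, and a group that collapses to 1 GB. All four sizes are invented for illustration; only the shape of the arithmetic matters.

```python
# Back-of-envelope I/O totals for the table above. Every size here is an
# invented assumption: 1,000 GB in, filter keeps 10%, join adds ~20%,
# group collapses to 1 GB.
input_gb, filtered_gb, joined_gb, grouped_gb = 1000, 100, 120, 1

# Eager: every intermediate is written out and read back by the next step.
eager_io = (input_gb            # read input
            + filtered_gb * 2   # write filtered rows, read them back
            + joined_gb * 2     # write joined rows, read them back
            + grouped_gb * 2    # write grouped rows, read them back
            + grouped_gb)       # final write

# Lazy: one scan in, one result out; intermediates never touch disk.
lazy_io = input_gb + grouped_gb

print(eager_io, lazy_io)  # 1443 1001
```

The ratio looks modest here because the filter is selective; with wider intermediates, the eager column grows with every step while the lazy column does not.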

When the data fits in memory and the operations are cheap, this distinction does not matter. You can run things literally and the cost is negligible. This is the world of pandas, plain SQL queries on small tables, scripts on local files. Every line is the work. The line you wrote is what ran.

When the data does not fit in memory, every materialized intermediate is a pipeline you cannot run. The engine cannot afford to honor your sequence. Honoring your sequence would mean the pipeline does not finish.

So the engine takes what you wrote and asks: what is the smallest amount of actual work that produces the result this person asked for? It pushes filters into scans so the data is filtered before it is even read into memory. It prunes columns nobody downstream uses, so they are never materialized. It fuses operations into single passes. It chooses join strategies based on the size of each side. It does all of this because, at scale, you would do it too if you were patient enough to do it by hand. The engine is just patient.
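Here is the pushdown idea as a toy plan rewrite in plain Python. The plan format and the `push_down_filters` function are invented for this sketch; real optimizers work over richer trees with cost models, but the move is the same: a filter written after a join migrates to just after the scan.

```python
# Toy plan rewrite: a planner moving a filter ahead of a join, in the
# spirit of predicate pushdown. The plan representation is invented.

plan = [
    ("scan", "events"),
    ("join", "users"),
    ("filter", {"table": "events", "col": "status"}),  # touches scan-side columns only
    ("group", "region"),
]

def push_down_filters(plan):
    """Move filters that touch only scan-side columns to just after the scan."""
    filters = [s for s in plan if s[0] == "filter" and s[1]["table"] == "events"]
    rest = [s for s in plan if s not in filters]
    scan_idx = next(i for i, s in enumerate(rest) if s[0] == "scan")
    # Re-insert the pushed filters immediately after the scan node.
    return rest[:scan_idx + 1] + filters + rest[scan_idx + 1:]

optimized = push_down_filters(plan)
print([op for op, _ in optimized])  # ['scan', 'filter', 'join', 'group']
```

Note what just happened: the author's order (filter after join) and the executed order (filter before join) disagree, and both are "correct" under the contract.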

The engine does not want to execute until it knows the whole plan. It is lazy. Every operation you write extends the plan; nothing is executed until the plan is forced. That is the name we have been earning. Lazy evaluation. Not "deferred." Not "asynchronous." Lazy — the engine refuses to do work until refusing is no longer an option, and when it finally works, it does the smallest amount it can.

The opposite is eager evaluation. Every line is the work. Every operation runs as written, in the order written, materializing as it goes. Pandas is eager. A regular Python list comprehension is eager. SQL on a single small table, evaluated row by row, is eager.
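You do not need a cluster to feel the difference; plain Python carries both contracts. A list comprehension is eager, a generator expression is lazy:

```python
# Eager vs lazy in stdlib Python: the same expression shape, two contracts.
def loud(x):
    print("computing", x)
    return x * 2

# Eager: the work happens on this line, side effects and all.
eager = [loud(x) for x in range(3)]   # prints immediately, three times

# Lazy: this line does no work at all. It is a description.
lazy = (loud(x) for x in range(3))    # prints nothing

# Work happens only when something forces the generator:
result = list(lazy)                   # now it prints, now it computes
```

A print placed after the `lazy = ...` line sees an unevaluated generator object, which is the small-scale version of the "nothing prints after the filter" experiment above.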

These are not better and worse. They are different contracts.


The Contract You Signed

The eager contract is simple and honest: what you wrote is what runs. You pay for that honesty in scaling — at large data, every materialized intermediate is a cost you cannot avoid. You pay nothing in mental overhead. The line is the work. If you want to know what happened, you look at the line. If you want to debug, you print the value. If something is slow, you put a timer on it.

The lazy contract is less honest, deliberately: what you wrote is a description; the engine decides the work. That flexibility is what buys you scaling — at large data, the engine's whole-pipeline view is the only way to make execution tractable. What you pay instead is mental overhead, and you pay heavily. The line is not the work. The line is a node in a graph the engine will rewrite. If you want to know what happened, you cannot look at the line; you have to look at the plan the engine compiled.

Most engineers do not realize they have signed the lazy contract until the contract bites them. The bite comes in a specific shape, and once you have seen it, you start seeing it everywhere.


When the Contract Bites

You wrote a transformation. You tested every step. The filter test passes. The join test passes. The aggregation test passes. CI is green. You ship.

In production, the pipeline fails — or worse, succeeds with wrong output. There is no broken line. Every operation, tested alone, is correct. The bug exists only when the engine fuses them together into a plan.

A few shapes this takes, all real, all seen in production:

The filter that ran first in development. At scale, the engine moved it past the join, because the join produced fewer rows on the test data and the optimizer's cost model said the rearrangement was equivalent. On production data, the join produced more rows, and the filter was now operating on data that was supposed to be eliminated before the join. The output is different. The code is identical.

The column added with a non-deterministic function — a timestamp, a random ID, a lookup against a service that returns slightly different values each call. The engine, optimizing, decided the column could be computed once and reused, or computed lazily at each consumer, or fused with another operation that called the function more times than the code suggests. The values are inconsistent across the pipeline. There is no "wrong line." The wrong thing is the engine's freedom to decide when the function gets called, granted by a contract the engineer did not realize they signed.
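That second hazard is easy to reproduce in miniature. In this sketch, the "plan" is just a function that builds the column when called; `make_plan` and the plan shape are invented here, but the failure mode is the real one: every action re-runs the description, and a non-deterministic description re-rolls its values.

```python
# Sketch of the non-determinism hazard. A lazy description that calls a
# non-deterministic function is re-executed on every action, so two reads
# of the "same" column can disagree. Pure Python; the shape is invented.
import random

def make_plan(rows):
    # The description: attach a random ID column. Nothing runs yet.
    return lambda: [{**r, "rid": random.random()} for r in rows]

plan = make_plan([{"user": "a"}, {"user": "b"}])

first_pass  = plan()   # action 1: the function runs here
second_pass = plan()   # action 2: the engine recomputes, rolling fresh values

# The "same column" differs between the two executions:
print(first_pass[0]["rid"] == second_pass[0]["rid"])  # almost certainly False
```

Real engines offer materialization points (caching, checkpointing) precisely to pin down values like this; the fix is to force one execution and reuse its result, not to hunt for a broken line.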

The operations written in an order that depends on side effects — a write to a log, an external API call, an update to a metrics counter. The optimizer does not see side effects. It sees a data dependency graph. Your operations got reordered to optimize the data flow, and the side effects fired in an order that no longer matches the logic you wrote. The data is fine. The world is wrong.

In all three: every line is correct. The plan is what failed.


The Instinct That Has to Change

In eager systems, when something is wrong, your first question is which line is broken. That instinct is correct, because the line is the unit of execution. Print the intermediate value. Step through with a debugger. Comment out lines until the bug disappears. The line is the artifact.

In lazy systems, that instinct is the bug.

When something is wrong in a lazy pipeline, the line is almost never broken. Every line, in isolation, is doing what you wrote. The first question has to be what plan did the engine compile? — because the plan is the artifact under test, the plan is what executed, and the plan is the only place a bug can live that is invisible to line-level inspection.

Debugging a lazy pipeline:

1. CI passes — every transformation tested in isolation.
2. Ship to production.
3. Production output is wrong — no exception; the pipeline "succeeded".
4. Print intermediates after each line — every value looks correct.
5. Search for the broken line — there isn't one.
6. Read the compiled plan — the filter ran after the join; the optimizer moved it.
7. Fix the description — not the line, the contract.

Every line was correct. The plan was broken. The plan was the artifact under test.

Debugging eager: output is wrong → print intermediate values → inspect the output after each line → find the broken line → fix the line → done.

Debugging lazy: output is wrong → print intermediate values → all intermediates look correct → realize the line is not the unit of execution → read the compiled plan → find the rewrite that broke the assumption → fix the description, not the line → done.

The artifact under test changed. The debugging muscle has to follow.

Most lazy systems give you a way to read the plan before it runs — explain() in Spark, LazyFrame.explain() in Polars, the EXPLAIN statement in most SQL dialects. Almost every engineer who works with these systems professionally has either learned to read plans or learned to suffer. There is no third option.


The Decision Is Not Lazy or Eager

The temptation, having read this far, is to conclude that lazy evaluation is dangerous and eager is safe. That conclusion is wrong, and it is wrong in the same way that "synchronous is safer than asynchronous" is wrong. They are different tools for different conditions.

Matching the contract to conditions

Eager fits when:
· Data fits in memory — intermediates are cheap
· Interactive exploration — seeing each step's output is the point
· Debugging sessions — you need to inspect state at each line
· Pipelines with side effects that must run in source order
· Predictability matters more than performance

Lazy fits when:
· Data exceeds memory — materializing intermediates is infeasible
· Batch pipelines where whole-view optimization saves orders of magnitude
· Repeated queries where caching compiled plans pays off
· Transformations where you trust the engine more than manual tuning
· Execution cost outweighs the cost of a wrong-plan assumption

There is no universal answer. There is only matching the contract to the conditions.

A 1 GB dataset on a laptop does not need a query optimizer; the planning overhead might exceed the execution time. A 1 TB pipeline running nightly cannot survive without one. Most modern data tools — DataFrames, query engines, transformation frameworks — are lazy by default because their target use case is the latter, not the former. But "lazy by default" does not mean "lazy is correct." It means the tool's authors made a choice for the conditions they expect, and you are inheriting that choice.

The work, when you sit down to write a pipeline, is not to pick a side. It is to know which contract you are operating under, and to debug, test, and reason in a way that matches that contract. Engineers who do not match get burned. Engineers who do, do not.


What This Means On Monday Morning

The next time you write a chain of operations on a DataFrame, query builder, or any system that lets you describe transformations before they run — pause for one second and ask: am I writing code or writing a description?

If you can put a print statement after a line and see the result of that line, you are writing code, and the line is the unit you debug.

If putting a print statement there would print nothing — or would print something only because you secretly triggered execution — you are writing a description, and the plan is the unit you debug. Find out how to read the plan in your tool. Read it. Compare it to what you thought you wrote. The gap between those two is where your bugs will come from, for the rest of your career, in this paradigm.

The code you wrote is not the code that ran. The sooner that stops surprising you, the sooner you start writing pipelines that survive contact with production.


Connections. The plan-as-runtime-artifact pattern shows up across foundations. Schema as runtime property — what you declare at deploy time is not what enforces at read time. Time and ordering — what you write as sequential is not what executes as sequential. Partitioning and shuffling — what you describe as "join" the engine implements as one of several physical strategies depending on data shape. The thread connecting them: in scaled data systems, the artifact you write and the artifact the system runs are different artifacts. Knowing which you are looking at is half the job.