Pipeline Seams

Where did the contract break?

Every pipeline is a chain of implicit contracts between stages. Locate the broken handoff — what each stage assumed it was receiving, and what it actually got.

Stage handoffs
Generate → Ingest: format, assumptions, unstated contracts
Ingest → Store: delivery guarantees, deduplication, ordering
Store → Process: layout, partition key, schema expectations
Process → Model: grain, nulls, duplicate rows at arrival
Model → Serve: freshness SLA, semantics, trust boundary
Orchestration ↔ all: timing, dependency, retry across every seam

Generation → Ingestion

The source makes assumptions the ingestor doesn't know about — client-side clocks, undeclared currencies, non-unique IDs. What the source owes the pipeline starts here.

source → ingestor contract:
timestamp: client-side clock, not UTC
currency: USD assumed, unstated
event_id: not globally unique
none of this is in the schema
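
One way to move these assumptions into the contract is to validate and normalize each event at the ingestion boundary. The following is a minimal sketch, not a definitive implementation: `RawEvent`, `source_id`, and `normalize` are hypothetical names, and it assumes the source can be asked to send ISO 8601 timestamps with an explicit offset and a stated currency.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RawEvent:
    source_id: str   # hypothetical: identifies the producing client
    event_id: str    # unique only within one source
    timestamp: str   # ISO 8601, expected to carry an explicit UTC offset
    amount: float
    currency: str    # must be stated, never assumed


def normalize(event: RawEvent) -> dict:
    """Validate one event against the declared contract and normalize it."""
    ts = datetime.fromisoformat(event.timestamp)
    if ts.tzinfo is None:
        raise ValueError("timestamp has no UTC offset; client clock is ambiguous")
    if not event.currency:
        raise ValueError("currency is missing; refusing to assume USD")
    return {
        # event_id is not globally unique, so qualify it with the source.
        "event_key": f"{event.source_id}:{event.event_id}",
        "timestamp_utc": ts.astimezone(timezone.utc).isoformat(),
        "amount": event.amount,
        "currency": event.currency.upper(),
    }
```

Rejecting an event with a naive timestamp or a missing currency turns a silent assumption into a visible failure at the seam where it originates.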

Ingestion → Storage

The delivery boundary. ACK doesn't mean committed. Committed doesn't mean durable. This is where duplicates, loss, and ordering violations first enter the system.

ingestor → storage contract:
event_id 8821 → storage buffer
→ ack sent to producer
→ crash before commit
→ producer retries: 8821 again
storage: duplicate row written
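
A common defense against the retry in the example above is to make the write idempotent, keyed on `event_id`. The sketch below uses an in-memory dictionary as a stand-in for the storage layer. `IdempotentStore` is a hypothetical class, and a real system would need the same check to be atomic with the commit.

```python
class IdempotentStore:
    """In-memory stand-in for a storage layer that deduplicates on event_id."""

    def __init__(self):
        self._rows = {}

    def write(self, event_id, payload):
        """Commit the row unless this event_id was already committed.

        Returns True for a new row, False for a retry that was ignored.
        """
        if event_id in self._rows:
            return False
        self._rows[event_id] = payload
        return True

    def row_count(self):
        return len(self._rows)


store = IdempotentStore()
first = store.write(8821, {"revenue": 45.00})  # original delivery
retry = store.write(8821, {"revenue": 45.00})  # producer retries 8821
```

With the key check in place, the retry is acknowledged but does not produce a second row.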

Storage → Processing

Processing inherits the layout decisions storage made. Partition key mismatches, format assumptions, and schema drift become cost and correctness problems at query time.

storage → processor contract:
partitioned by: ingestion_date
query filters on: event_date
partition pruning: none (full scan)
layout mismatch → $340/run instead of $12
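
The effect of the mismatch can be illustrated with a toy model of partitioned storage. This is a sketch under stated assumptions: the `partitions` dictionary, its dates, and the `scan` function are invented for illustration and do not represent any particular engine's planner.

```python
from datetime import date

# Hypothetical layout: one partition per ingestion_date, each holding rows
# that carry their own event_date.
partitions = {
    date(2024, 1, 1): [{"event_date": date(2023, 12, 31)}, {"event_date": date(2024, 1, 1)}],
    date(2024, 1, 2): [{"event_date": date(2024, 1, 1)}, {"event_date": date(2024, 1, 2)}],
    date(2024, 1, 3): [{"event_date": date(2024, 1, 2)}, {"event_date": date(2024, 1, 3)}],
}
PARTITION_KEY = "ingestion_date"


def scan(filter_column, value):
    """Return (matching rows, number of partitions read)."""
    if filter_column == PARTITION_KEY:
        # The filter matches the layout, so only one partition is opened.
        candidates = {value: partitions.get(value, [])}
        rows = [row for part in candidates.values() for row in part]
    else:
        # No pruning is possible: every partition is read, then filtered.
        candidates = partitions
        rows = [
            row
            for part in candidates.values()
            for row in part
            if row[filter_column] == value
        ]
    return rows, len(candidates)


_, pruned_reads = scan("ingestion_date", date(2024, 1, 1))
_, full_reads = scan("event_date", date(2024, 1, 1))
```

Filtering on the partition key reads a single partition, while filtering on `event_date` reads all of them. That difference is what shows up as cost per run.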

Processing → Modeling

What arrives at the model layer — duplicates, nulls, ambiguous grain — was decided upstream. The modeling layer can't fix what processing already delivered.

processor → model layer:
user_id | order_id | revenue
1001 | A | 45.00
1001 | A | 45.00 ← dup
SUM(revenue): 90.00 (wrong)
pipeline: SUCCESS — grain was never declared
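
Declaring the grain makes the duplicate detectable before it reaches an aggregate. The sketch below assumes the intended grain is one row per `(user_id, order_id)`, which the example above implies but does not state. `grain_violations` and `total_revenue` are hypothetical helper names.

```python
from collections import Counter

rows = [
    {"user_id": 1001, "order_id": "A", "revenue": 45.00},
    {"user_id": 1001, "order_id": "A", "revenue": 45.00},  # duplicate at arrival
]

GRAIN = ("user_id", "order_id")  # assumed: one row per user per order


def grain_violations(rows, grain):
    """Return the grain keys that appear more than once."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def total_revenue(rows, grain):
    """Sum revenue only if the declared grain holds; otherwise fail loudly."""
    violations = grain_violations(rows, grain)
    if violations:
        raise ValueError(f"grain {grain} violated for keys: {violations}")
    return sum(row["revenue"] for row in rows)
```

Calling `total_revenue(rows, GRAIN)` on the two rows above raises instead of returning 90.00, so the run fails rather than reporting SUCCESS with a wrong number.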

Modeling → Serving

Consumers assume freshness, stable semantics, and correctness the model layer never explicitly promised. Freshness SLAs and metric definitions are the contracts at this seam.

model → consumer contract:
model refreshes: every 4 hours
dashboard (batch): ok, 4h stale
fraud model (rt): miss, 4h late
freshness SLA was never written down
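
Writing the SLA down can be as simple as having each consumer declare the staleness it tolerates and comparing that to the refresh interval. In this sketch the 4-hour interval comes from the example above. The per-consumer tolerances, in particular the one-minute figure for the fraud model, are illustrative assumptions, as is the `unmet_slas` helper.

```python
from datetime import timedelta

MODEL_REFRESH_INTERVAL = timedelta(hours=4)  # from the example above

# Hypothetical consumer declarations: the oldest data each can tolerate.
CONSUMER_MAX_STALENESS = {
    "dashboard": timedelta(hours=4),
    "fraud_model": timedelta(minutes=1),  # assumed real-time tolerance
}


def unmet_slas(refresh_interval, max_staleness):
    """Return consumers whose tolerance is tighter than the refresh interval."""
    return sorted(
        name for name, limit in max_staleness.items() if refresh_interval > limit
    )


violations = unmet_slas(MODEL_REFRESH_INTERVAL, CONSUMER_MAX_STALENESS)
```

Running the check when a consumer is onboarded surfaces the mismatch before the consumer depends on data the model layer never promised to keep fresh.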

Orchestration ↔ All Stages

Orchestration manages timing and dependencies across every seam. Undeclared dependencies, schedule assumptions, and retry logic are what silently corrupt pipelines.

orchestrator → all stages:
scheduled start: 03:00 UTC
upstream done: 03:07 (still writing)
load_raw started: 03:00 ← too early
dependency not declared → silent data loss
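
The fix for the race above is to gate `load_raw` on upstream completion rather than on the clock. The following sketch shows the idea without any particular orchestrator. `can_start`, `DEPENDENCIES`, and the task name `upstream_export` are hypothetical.

```python
def can_start(task, dependencies, completed):
    """A task may start only when every declared upstream task has completed."""
    return all(dep in completed for dep in dependencies.get(task, []))


# Declared dependency: load_raw must wait for the upstream export.
DEPENDENCIES = {"load_raw": ["upstream_export"]}

completed = set()

# 03:00: upstream is still writing, so the gate holds load_raw back.
at_0300 = can_start("load_raw", DEPENDENCIES, completed)

# 03:07: upstream finishes and records completion; load_raw may now run.
completed.add("upstream_export")
at_0307 = can_start("load_raw", DEPENDENCIES, completed)

# With no declared dependency, nothing stops the 03:00 start.
undeclared = can_start("load_raw", {}, set())
```

The last line reproduces the failure mode: when the dependency is missing from the declaration, the gate has nothing to check and the task starts on schedule against incomplete data.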