DATA & ANALYTICS

Pipelines that don't break: a practical guide to data infrastructure.

BY GIGGLI LABS EDITORIAL

Most pipelines fail because no one owns them. Here's the operating model that keeps data flowing, and the one that doesn't.

9 MIN READ · PUBLISHED MAY 2026

A broken pipeline is rarely a code problem. It is an ownership problem. The job ran for two years, the engineer left, the source schema changed last Tuesday, and no one is paged when it fails. By the time anyone notices the dashboard is wrong, three weeks of decisions have been made on it.

We have inherited more of these than we can count. The pattern is always the same: a pipeline gets built in a sprint, ships once, and gets forgotten. It runs successfully for six months, then a vendor changes a column name. Nothing breaks loudly, because the SQL still executes; it just executes against a column that now means something different. The dashboards keep updating. The numbers keep being wrong. The CFO catches it by accident in a board pack, and the credibility cost takes a year to rebuild.

What changes when you start treating pipelines as products is the entire operating model around them. Every pipeline has an owner. Every pipeline has a service-level agreement. Every pipeline has a runbook in the same repo as the code, written by the engineer who built it, checked in before the merge. Every pipeline has a test suite that catches schema drift before it reaches the warehouse. Every pipeline emits a heartbeat that triggers a page when it stops.

This sounds like overhead. It is not. The overhead of a broken pipeline is six weeks of investigation and one quarter of decisions made on bad numbers. The overhead of doing it right is half a day per pipeline, written down once, never repeated.

Where to start when you are in the middle of an inherited mess: do not try to fix everything. Inventory every pipeline you have, and flag the ones with no owner, the ones with no documentation, and the ones the business depends on this quarter. The intersection of those three lists (important, undocumented, unowned) is what you fix first. Everything else can wait. We have never seen a team that tried to fix all of their pipelines at once and succeeded.
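
The triage step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Pipeline` record, its field names, and the sample inventory are all hypothetical, and a spreadsheet with the same three columns would do the same job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pipeline:
    """One row in a hypothetical pipeline inventory."""
    name: str
    owner: Optional[str]         # None means nobody owns it
    has_runbook: bool            # stands in for "documented"
    critical_this_quarter: bool  # the business depends on it this quarter


def fix_first(inventory: list[Pipeline]) -> list[Pipeline]:
    """Return the pipelines that are important, undocumented, and unowned."""
    return [
        p for p in inventory
        if p.critical_this_quarter and not p.has_runbook and p.owner is None
    ]


# Made-up sample data, for illustration only.
inventory = [
    Pipeline("revenue_daily", owner=None, has_runbook=False, critical_this_quarter=True),
    Pipeline("marketing_attribution", owner="dana", has_runbook=False, critical_this_quarter=True),
    Pipeline("legacy_export", owner=None, has_runbook=False, critical_this_quarter=False),
]

for p in fix_first(inventory):
    print(p.name)
```

Only `revenue_daily` meets all three conditions here, so it is the one to fix first. The other two stay in the inventory and wait.
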
The teams that pick three and ship them right always do better than the teams that try to fix thirty and ship none.

The data-product model has three deliverables per pipeline. The first is the pipeline itself, with tests. The second is the runbook: a single markdown file that says what the pipeline does, who owns it, what its SLA is, what the source schema looks like, what to check first if it fails, and what the recovery procedure is. The third is the on-call rotation. Someone gets paged when the heartbeat stops, and that someone knows where the runbook lives.

Tools matter less than people think. We have shipped this model on Airflow, Dagster, dbt, Fivetran, custom Python, and SQL stored procedures. The model works because of the discipline, not the framework. The framework just enforces the discipline. We default to dbt + GitHub Actions for small-cap clients because the cost is near zero and the audit trail is clean, but the same model ships on whatever stack is already in place. Do not let a tool migration block the operating-model work. Fix the model first on the tools you have, and migrate later if the math says to.

Schema drift is the single biggest cause of silent failure. Most pipelines do not test that the column they read still exists, still has the type they expect, and still contains the values they assume. We add three tests by default: a schema test, a freshness test, and a value-range test. If the source schema changes, the build fails before it overwrites the warehouse. If the source goes stale, the build fails before stakeholders see Tuesday’s numbers on Friday. If the values move outside the expected range, the build flags the anomaly before it lands in the dashboard. Three tests per pipeline. Half a day to write. Decades of time saved over the life of the system.

The weekly review ritual is what keeps the model healthy. Once a week, the data team reviews every red pipeline, every aging runbook, every owner change. Twenty minutes. The point is not to fix things in the meeting; it is to surface what needs fixing this week and assign it. Without the ritual, the inventory rots inside a quarter. With it, the system stays clean for years.

A note on AI in this stack. The temptation is to point an LLM at the pipeline output and ask it to flag anomalies. We have tried this. It works the first week, then drifts. The model gets used to the new normal, then the new abnormal, and stops flagging things that should be flagged. Use deterministic tests for schema and value range. Use the LLM only for the narrative summary that lands on the operator’s desk on Monday morning, where its job is to translate the test output into a sentence a human can act on.

Give it a quarter and the difference is hard to miss. The warehouse is clean and the dashboards are right. When a pipeline fails, the right person gets paged within five minutes, and the runbook tells them what to check first. New engineers can take over a pipeline by reading one document. The CFO trusts the numbers again. The data team stops being the team that gets blamed for things they never owned.

What we ship. A pipeline inventory, ownership model, runbook templates, schema-drift detection, and a weekly review ritual that keeps it healthy. Fixed-scope build. Your team owns the runbook afterward.
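
For readers who want to see the three default tests described earlier (schema, freshness, value range) in concrete form, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the tooling we ship: the column names, the 24-hour staleness limit, the amount range, and every function name are hypothetical. On a dbt stack, the same checks would normally be written as dbt tests instead.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expectations for one source table (assumed, not from a real system).
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "loaded_at": datetime}
MAX_STALENESS = timedelta(hours=24)
AMOUNT_RANGE = (0.0, 100_000.0)


def check_schema(rows):
    """Report expected columns that are missing or have the wrong type."""
    errors = []
    for i, row in enumerate(rows):
        for col, typ in EXPECTED_SCHEMA.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif not isinstance(row[col], typ):
                errors.append(f"row {i}: {col!r} is not {typ.__name__}")
    return errors


def check_freshness(rows, now):
    """Report a stale source: no rows, or newest row older than the limit."""
    newest = max((r["loaded_at"] for r in rows), default=None)
    if newest is None or now - newest > MAX_STALENESS:
        return [f"source is stale: newest row loaded at {newest}"]
    return []


def check_value_range(rows):
    """Report amounts outside the expected range."""
    low, high = AMOUNT_RANGE
    return [
        f"row {i}: amount {r['amount']} outside [{low}, {high}]"
        for i, r in enumerate(rows)
        if not low <= r["amount"] <= high
    ]


def run_checks(rows, now):
    """Run all three checks; an empty list means the load may proceed."""
    errors = check_schema(rows)
    if errors:
        return errors  # later checks assume the schema holds
    return check_freshness(rows, now) + check_value_range(rows)


# Made-up sample batches, for illustration only.
now = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(hours=2)
good = [{"order_id": 1, "amount": 19.99, "loaded_at": recent}]
drifted = [{"order_id": 1, "amt": 19.99, "loaded_at": recent}]  # renamed column
stale = [{"order_id": 1, "amount": 19.99, "loaded_at": now - timedelta(days=3)}]
odd = [{"order_id": 1, "amount": -5.0, "loaded_at": recent}]

for name, rows in [("good", good), ("drifted", drifted), ("stale", stale), ("odd", odd)]:
    errors = run_checks(rows, now)
    print(name, "PASS" if not errors else f"FAIL: {errors}")
```

A caller would abort the warehouse load whenever `run_checks` returns a non-empty list, which is the "build fails before it overwrites the warehouse" behaviour described above. The schema check runs first and short-circuits because the other two checks read columns that may no longer exist after a rename.
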

Giggli Labs · Calgary, AB