Notes | Apr 01, 2026 | 5 min read

How I Turned Data Pipeline Headaches Into a Python Library

There's a specific feeling that comes from solving the same problem twice.

The first time, it's a challenge worth your full attention. You're learning what the constraints actually are, where the edges are, why the obvious approaches fall short. The second time, something is different. You recognize the shape of the problem before you're halfway through it. You know what the traps look like. You solve it faster, but there's a quiet frustration underneath: this shouldn't require solving from scratch again.

After building a payroll calculation engine and then a tax calculation engine, both data-intensive, both running on Django, I had that feeling often enough to do something about it.

What both engines had in common

The payroll engine required a vectorized calculation core, a batching layer to manage memory under concurrent load, a job queue with a concurrency semaphore, and a polling interface for long-running asynchronous jobs. The tax calculation engine, which handled income tax with slab-based calculations, surcharge, marginal relief, and exemption comparisons across multiple components, required all of the same infrastructure, plus a modular pipeline architecture where each calculation stage was isolated and composable.

The domain logic was different. The infrastructure underneath was nearly identical.

How do you process large volumes of tabular records without per-row Python loops? How do you keep memory bounded under concurrent load? How do you accept a job that takes 10 seconds and hand back the result without blocking the request that submitted it? These questions have good answers. Working through them the first time produced knowledge. Working through them the second time produced the question: why is this floor not already built?
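To make the third question concrete, here is a minimal sketch of the accept-then-poll pattern using only the standard library. It is not the code from either engine and not django-mindoff's API; the `JobQueue` class, its method names, and the `slow_total` job are hypothetical names chosen for illustration. The semaphore caps how many jobs run at once, while `submit` returns a job id immediately so the caller can poll for the result.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Accept jobs immediately, run a bounded number at once, let callers poll."""

    def __init__(self, max_concurrent=2, workers=8):
        # The semaphore bounds how many jobs execute at the same time,
        # independently of how many have been accepted.
        self._semaphore = threading.Semaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, fn, *args):
        """Register the job and return its id without waiting for it to run."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None, "error": None}
        self._pool.submit(self._run, job_id, fn, args)
        return job_id

    def poll(self, job_id):
        """Return a snapshot of the job's current state."""
        with self._lock:
            return dict(self._jobs[job_id])

    def _set(self, job_id, **fields):
        with self._lock:
            self._jobs[job_id].update(fields)

    def _run(self, job_id, fn, args):
        with self._semaphore:  # wait here if too many jobs are already running
            self._set(job_id, status="running")
            try:
                self._set(job_id, status="done", result=fn(*args))
            except Exception as exc:
                self._set(job_id, status="failed", error=str(exc))


def slow_total(values):
    """Stand-in for a long-running calculation."""
    time.sleep(0.05)
    return sum(values)


queue = JobQueue(max_concurrent=1)
job_id = queue.submit(slow_total, [100, 250, 75])  # returns straight away

# The caller polls instead of blocking on the calculation itself.
while queue.poll(job_id)["status"] not in ("done", "failed"):
    time.sleep(0.01)

final = queue.poll(job_id)
print(final["status"], final["result"])
```

In a real Django deployment the job state would need to live somewhere that survives process restarts and is visible across workers, such as the database or a cache, rather than in an in-memory dict.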

The infrastructure tax 💸

Every data-intensive Django application is going to run into some version of these problems. The exact shape varies: maybe it's pandas, maybe it's SQL aggregations, maybe it's a different output format. The structural problems are consistent. You need efficient bulk processing. You need memory control. You need asynchronous job handling.

The time you spend building that infrastructure instead of the features that make your product specific is what I started thinking of as the infrastructure tax. It's not wasted time. The judgment you develop from working through these problems is real and hard to get any other way. But there's a difference between paying that tax because it teaches you something and paying it because nobody thought to package the solution.

Why Polars, not more pandas

During the memory problem phase on the payroll engine, Dask was the natural candidate for a solution, and it didn't deliver. That wasn't because of any fundamental flaw. The gap between "pandas-compatible" and "fully pandas-compatible" was simply too wide for our specific calculation patterns at the time.

While working through that, Polars was starting to get serious attention. The case for it was straightforward: a Rust-based execution engine that Python's GIL doesn't interfere with, a lazy evaluation mode that handles chunked processing natively without API compatibility gaps, and benchmarks in which column-level operations ran clearly faster than in pandas for our workload.

I spent several weeks testing Polars against the financial calculation workloads before drawing any conclusion. The lazy mode addressed the memory concern in a way that Dask had promised but hadn't delivered for these patterns. Nested conditional calculations (the kind that needed np.where chains in pandas) ran faster, and the API for expressing them was cleaner.

The payroll engine stayed with pandas and the batching layer; rebuilding a production system mid-flight isn't something you do for a benchmark result. But the conclusion was clear enough: if I were building the infrastructure layer from scratch today, Polars would be the foundation. That conclusion became the basis for django-mindoff.

What django-mindoff is 📦

The library packages what two years of financial engine work made reusable. The Polars integration provides a vectorized processing foundation with native memory management, without requiring you to build batching and chunking yourself. The job queue and concurrency layer handles multiple long-running jobs cleanly without custom threading code per project. The export layer covers rendering processed output to Excel and PDF through configurable templates, a consistent requirement across both engines that involved more bespoke work than it should have.

The goal was to remove the infrastructure tax as a problem. A team building a data-intensive Django application should start with the floor already present and spend their time on the logic that makes their product specific.

django-mindoff is on GitHub. The documentation covers how to get started. This post, and the two before it, cover why it was built the way it was.

What I'd tell someone starting the same kind of project

The payroll and tax engines both worked, held up under production load, and were maintained by teams who hadn't written any of the original code. None of that happened by accident. It happened because the methodology (think in columns not records, design for memory not just speed, make each calculation stage independently testable) held up at the scale and complexity those engines reached.
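One way to picture "each calculation stage independently testable" is a pipeline of small functions that each take a set of columns and return a new set with one column added. The sketch below uses plain dicts of lists as a stand-in for a DataFrame so that it runs without dependencies; the stage names, column names, and the 10% deduction rate are hypothetical. In real code each stage would apply a vectorized DataFrame expression rather than a list comprehension.

```python
from functools import reduce

# A "frame" here is a dict of column name -> list of values.
# Amounts are whole currency units so the arithmetic stays exact.


def add_gross(frame):
    """Stage 1: gross pay is basic pay plus allowance, column-wise."""
    gross = [b + a for b, a in zip(frame["basic"], frame["allowance"])]
    return {**frame, "gross": gross}


def add_deduction(frame, rate_percent=10):
    """Stage 2: a flat percentage deduction (hypothetical rule)."""
    deduction = [g * rate_percent // 100 for g in frame["gross"]]
    return {**frame, "deduction": deduction}


def add_net(frame):
    """Stage 3: net pay is gross pay minus deduction."""
    net = [g - d for g, d in zip(frame["gross"], frame["deduction"])]
    return {**frame, "net": net}


def run_pipeline(frame, stages):
    """Feed the output of each stage into the next one."""
    return reduce(lambda current, stage: stage(current), stages, frame)


payroll = {"basic": [1000, 2000], "allowance": [200, 500]}
result = run_pipeline(payroll, [add_gross, add_deduction, add_net])
print(result["net"])  # [1080, 2250]

# Because stages never mutate their input, each one can be tested in isolation.
assert add_gross({"basic": [1], "allowance": [2]})["gross"] == [3]
```

Stages that only depend on the columns they read can be reordered, swapped, or reused across engines, which is what made the same structure fit both payroll and tax.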

That methodology is the thing worth carrying forward. The specific tools will evolve. Pandas was the right choice at one point. Polars is the more defensible choice now. What doesn't change is the underlying thinking: data is a set of columns that transform in stages, and every part of your architecture should reflect that.

If you're starting a project with similar requirements, the technical walkthrough of the merge-and-reduce approach is the most practically useful place to begin. The two posts before this one give you the full story of where that approach was tested. What you build on top of it is up to you.