I Cut Tax Prep From 3 Hours To 30 Minutes With A Local LLM

I did not use AI to do my taxes. I used a local LLM to do the boring half of tax prep: the part that turns a folder of 150 randomly named expense PDFs into a package my accountant can actually read. Most marketing around AI tax preparation wants to sell you a one-click filing experience, but that is not where local models earn their keep in 2026. They earn it on the unglamorous batch work that sits between “I have receipts” and “my accountant has clean numbers.”

This year, that batch took me 30 minutes. Last year, the same volume of invoices was something like 3 hours of click, rename, tag, and spreadsheet drudgery. The frontier models did not close that gap; a local Qwen did.

What was actually running

Opus 4.7 (high) and GPT-5.5 (high) were eight hours into an autonomous run on something completely unrelated: migrating my desktop orchestrator from a failed Rust + iced GUI attempt to a Tauri build I’d finally conceded to. They were busy, my Claude and Codex subscriptions were getting maxed out, and my judgment was already tied up supervising that run.

But my 3090 was sitting idle.

So I pointed OpenCode at a local Qwen checkpoint, Qwen3.6-35B-A3B-UD-Q4_K_S.gguf, the Unsloth dynamic Q4_K_S quant, and handed it the invoices folder. A3B means it’s a mixture-of-experts model: 35B total parameters with only about 3B active per token, which is exactly why it ran fast on a single consumer GPU while my paid models were doing real work elsewhere. The same workflow runs fine on Qwen 3.5 35B if that’s what you already have on disk; the 3.x family has had the most usable local checkpoints for this kind of structured batch work for a while now.

Local models have gotten to the point where they can handle almost all tasks outside of complex coding, and they can definitely bring me from “I have my expense invoices” to “clean numbers and organized files for my accountant.”

The actual workflow

Four steps, all handled through Python tool use:

Read. All 150 invoices were text-based PDFs, not scans, so no OCR was needed; straight text extraction was enough.
Rename. Convert filenames like INV-7741-2.pdf into 2026-03-14-acme-supplies.pdf using the invoice date and vendor parsed from the document body.
Extract. Pull subtotal, sales tax, and total off each invoice into a structured row.
Organize. Drop renamed files into per-vendor folders and produce a single xlsx I could hand to my accountant.
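
As a rough illustration of the Rename and Extract steps above, here is a minimal sketch. It is not the code the model wrote on this run. It assumes the PDF text has already been extracted to a string (for example with pypdf's PdfReader and page.extract_text()), and the label patterns, the InvoiceRow class, and the function names are all hypothetical; real invoices vary by vendor and would need their own rules.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal

# Hypothetical label patterns: assumes invoices print fields as "Label: value".
DATE_RE = re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})")
VENDOR_RE = re.compile(r"Vendor:\s*(.+)")
AMOUNT_RES = {
    # Anchored at line start so "Total:" does not match inside "Subtotal:".
    "subtotal": re.compile(r"^Subtotal:\s*\$?([\d,]+\.\d{2})", re.MULTILINE),
    "sales_tax": re.compile(r"^Sales Tax:\s*\$?([\d,]+\.\d{2})", re.MULTILINE),
    "total": re.compile(r"^Total:\s*\$?([\d,]+\.\d{2})", re.MULTILINE),
}


@dataclass
class InvoiceRow:
    invoice_date: date
    vendor: str
    subtotal: Decimal
    sales_tax: Decimal
    total: Decimal


def slugify(name: str) -> str:
    """Lowercase a vendor name and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def parse_invoice(text: str) -> InvoiceRow:
    """Pull date, vendor, and the three amounts out of extracted invoice text."""
    date_match = DATE_RE.search(text)
    vendor_match = VENDOR_RE.search(text)
    if date_match is None or vendor_match is None:
        raise ValueError("missing invoice date or vendor")

    amounts = {}
    for field, pattern in AMOUNT_RES.items():
        match = pattern.search(text)
        if match is None:
            raise ValueError(f"missing {field}")
        # Decimal avoids float rounding on currency; strip thousands separators.
        amounts[field] = Decimal(match.group(1).replace(",", ""))

    return InvoiceRow(
        invoice_date=datetime.strptime(date_match.group(1), "%Y-%m-%d").date(),
        vendor=vendor_match.group(1).strip(),
        **amounts,
    )


def new_filename(row: InvoiceRow) -> str:
    """Build a date-first name like 2026-03-14-acme-supplies.pdf."""
    return f"{row.invoice_date.isoformat()}-{slugify(row.vendor)}.pdf"
```

Raising on a missing field, rather than guessing, keeps a bad parse from quietly turning into a wrong row in the spreadsheet.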

OpenCode drove the loop, the model wrote the small Python snippets it needed, and I watched context length and stepped in when it looked close to drifting.
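
For the Organize step in the list above, a sketch under stated assumptions might look like the following. The organize function and the row keys are hypothetical. It writes a CSV with the standard library as a stand-in for the xlsx (a real xlsx would need a library such as openpyxl), and it copies files rather than moving them, which is my choice for the sketch and not something the run above is known to have done.

```python
import csv
import shutil
from pathlib import Path

SUMMARY_FIELDS = ["new_name", "vendor_slug", "subtotal", "sales_tax", "total"]


def organize(rows, source_dir: Path, out_dir: Path) -> Path:
    """Copy each PDF into out_dir/<vendor_slug>/ and write a summary CSV.

    Each row is a dict with the hypothetical keys: original, new_name,
    vendor_slug, subtotal, sales_tax, total.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    for row in rows:
        vendor_dir = out_dir / row["vendor_slug"]
        vendor_dir.mkdir(exist_ok=True)
        # Copy instead of move so the untouched originals survive a bad run.
        shutil.copy2(source_dir / row["original"], vendor_dir / row["new_name"])

    summary_path = out_dir / "summary.csv"
    with summary_path.open("w", newline="") as handle:
        # extrasaction="ignore" drops the "original" key from the output.
        writer = csv.DictWriter(
            handle, fieldnames=SUMMARY_FIELDS, extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return summary_path
```

Keeping the originals in place means a wrong vendor or date parse costs a rerun, not a hunt for lost files.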

Why local for this specifically

Privacy was the obvious reason. Every invoice in that folder had a vendor name, an amount, and a payment method, and a handful had bank-routing details printed on payment confirmations. That is not data I want passing through a cloud API’s retention window just to save thirty seconds of orchestration. When people ask “can ChatGPT help with taxes,” the honest answer is “yes, and you’ll be uploading every line of your business spending to a third party in the process.” For accountant-package prep, that trade is a clear no for me.

A second, smaller reason was cost: I did not want to spend frontier credits on text extraction grunt work while two paid models were already eight hours into an autonomous job. Local Qwen on a 3090 is functionally free for this shape of task once the GPU is paid for. Privacy is the reason I’d recommend other operators do this; cost was the nudge that pushed me to do it that night.

What went wrong

Honestly, very little went wrong. Nothing broke structurally. The only thing I had to actively manage was context, because you cannot shove 150 PDFs of varying length into a single window and hope. I broke the batch into chunks, summarized state between them, and made sure the model wasn’t carrying noise from one invoice into the next. That is operator hygiene, not model failure.
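
One way to do the chunking described above, as a minimal sketch: group documents so each batch stays under a size budget. The chunk_by_budget name is hypothetical, and character count is only a crude stand-in for tokens; the actual run may have split the batch differently.

```python
from typing import Iterable, Iterator


def chunk_by_budget(
    docs: Iterable[tuple[str, str]], max_chars: int
) -> Iterator[list[tuple[str, str]]]:
    """Group (name, text) pairs so each group's combined text fits max_chars.

    A single document longer than the budget still gets a chunk of its own.
    """
    chunk: list[tuple[str, str]] = []
    used = 0
    for name, text in docs:
        # Close the current chunk before it would overflow the budget.
        if chunk and used + len(text) > max_chars:
            yield chunk
            chunk, used = [], 0
        chunk.append((name, text))
        used += len(text)
    if chunk:
        yield chunk
```

Between chunks, the only state worth carrying forward is the structured rows extracted so far, not the raw invoice text, which is what keeps one invoice's noise out of the next.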

I still spot-checked every extracted row against the source PDF before exporting the xlsx. I didn’t catch the model lying on this run; I checked because the cost of a wrong total reaching my accountant is asymmetric. Two minutes of review against a real risk of garbage-in-garbage-out is an obvious trade.
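
A cheap automated complement to that manual review, which is my suggestion and not part of the run described here, is an arithmetic check: flag any row where subtotal plus sales tax does not equal the total. The rows_needing_review function and the row keys are hypothetical, and the check assumes invoices with no other line items such as shipping or discounts.

```python
from decimal import Decimal


def rows_needing_review(rows, tolerance=Decimal("0.01")):
    """Return the file names of rows where subtotal + sales_tax != total.

    Each row is a dict with the hypothetical keys: file, subtotal,
    sales_tax, total (amounts as Decimal).
    """
    flagged = []
    for row in rows:
        difference = abs(row["subtotal"] + row["sales_tax"] - row["total"])
        # Allow a one-cent tolerance for vendor-side rounding.
        if difference > tolerance:
            flagged.append(row["file"])
    return flagged
```

A row that passes this check can still be wrong, for example if all three numbers were read from the wrong invoice, so it narrows the manual review without replacing it.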

What I would never trust it with

I would never let any model, local or frontier, make a filing decision, classify a deductible, choose between business and personal allocation, or write anything that sounds like tax advice. That is what I pay an accountant for. The pattern that works for me is strictly upstream: clean inputs, organized files, summed totals. After that, a human with credentials takes over, and the model never sees the return.

Where this goes next year

The slow human step right now is the part before the model starts: logging into vendor portals, pulling email PDFs, sorting receipts. Next year, I expect to push the local Qwen further. By then I should be comfortable enough with browser-use agents to pull invoices straight from portals and email, so the model can do the fetching as well as the structuring. That would cut another few hours from the workflow before any extraction starts.

The takeaway

A local Qwen 35B did not do my taxes. It cleaned the runway so my accountant could file from a finished package instead of a folder of garbage filenames. That is a smaller claim than most AI tax prep headlines make, and a more honest one. The frontier models were doing the eight-hour autonomous work elsewhere; the local model handled the 30-minute batch on the side. Both pulled weight, and neither one was allowed near my return.