RAG Systems: The Good, The Bad, and The "Why Is This My Problem Now?"
So there I was, staring at a terabyte of garbage. PDFs scanned sideways. CSVs that have been copy-pasted since 2015 and nobody knows where the original is. Technical docs written by people who clearly hated whoever would read them next.
The ask? Make it searchable. Natural language. Fast. With citations.
Cool. Cool cool cool.
The part nobody warns you about
Everyone talks about vector embeddings and retrieval like that's the hard part. It's not. The hard part is that your documents are a mess and no amount of clever engineering fixes bad data.
We spent three weeks just getting documents into a usable state. The actual RAG stuff? Four days.
Three weeks of writing OCR pipelines for scanned contracts (72 DPI, thanks, whoever did that), parsing legacy formats nobody remembers, and writing regex until I questioned every decision that led me to this career.
Chunking will haunt your dreams
Split your docs too small, you lose context. Too big, your retrieval sucks, because one embedding has to stand in for several topics at once. Wrong boundaries and you're citing half-sentences that mean nothing.
I tried like six different approaches. Ended up with something that works but I couldn't explain why if you asked me.
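To make the tradeoff concrete, here is a minimal sketch of one common approach: pack whole sentences into chunks up to a size budget, and repeat the last sentence of each chunk at the start of the next so context survives the boundary. The names `chunk_text`, `max_chars`, and `overlap_sentences` are made up for illustration, and the regex sentence splitter is deliberately naive (it will trip on abbreviations like "e.g."), so treat it as a starting point rather than a recommendation.

```python
import re


def chunk_text(text, max_chars=800, overlap_sentences=1):
    """Pack whole sentences into chunks of roughly max_chars.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one. A chunk can run over max_chars when a single
    sentence (plus the carried-over overlap) is longer than the budget.
    """
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    current = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk on a sentence boundary.
            chunks.append(" ".join(current))
            # Carry the tail forward so the next chunk keeps some context.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever break between sentences, a citation can't land on a half-sentence. The size and overlap numbers still have to be tuned against your own documents and queries.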
Security is bad everywhere
I've done pen tests on a bunch of these systems lately. Same problems every time:
Vector DB exposed with default creds. API keys in the frontend. No access controls so any user can query any document. Prompt injection that leaks the system prompt in one message.
Everyone says they'll fix it before launch. Nobody does.
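The missing access control is the easiest of these to show in code. A minimal sketch, assuming every chunk carries the set of groups allowed to read its source document: filter the retrieved chunks against the caller's groups before anything reaches the prompt. `Chunk`, `allowed_groups`, and `filter_by_access` are hypothetical names, not any particular vector DB's API. If your store supports metadata filters, pushing the same check into the query itself is usually the better option.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    # Groups allowed to read the source document (hypothetical schema).
    allowed_groups: frozenset = field(default_factory=frozenset)


def filter_by_access(chunks, user_groups):
    """Keep only chunks the user may see.

    Deny by default: a chunk with no allowed_groups is never returned.
    """
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("hr-1", "salary bands", frozenset({"hr"})),
    Chunk("eng-1", "deploy guide", frozenset({"eng", "hr"})),
    Chunk("orphan", "no ACL recorded"),
]
visible = filter_by_access(retrieved, ["eng"])
```

Here `visible` contains only the `eng-1` chunk. The point of the deny-by-default rule is that a document somebody forgot to label stays hidden instead of leaking.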
The stuff that actually helped
Preprocess everything before it hits the RAG pipeline. Metadata extraction, format normalization, all the boring stuff.
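As a rough illustration of the boring stuff, here is a sketch of a normalization pass plus a small metadata record, standard library only. The function names and record fields (`normalize_text`, `build_record`, `sha256`, `n_chars`) are assumptions for the example. The de-hyphenation rule is lossy: it will also glue together genuinely hyphenated words that happen to break across lines.

```python
import hashlib
import re
import unicodedata


def normalize_text(raw):
    """Normalize extracted text so downstream steps see one consistent format."""
    # NFKC folds ligatures, full-width characters, and non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break (lossy for real hyphens).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs and runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def build_record(source_path, raw):
    """Attach minimal metadata; the hash makes exact duplicates easy to spot."""
    text = normalize_text(raw)
    return {
        "source": source_path,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "n_chars": len(text),
        "text": text,
    }
```

Hashing the normalized text rather than the raw bytes means the copy-pasted variants of the same file collapse to one hash, as long as they differ only in whitespace or encoding quirks.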
Test with real messy queries. Not "what is machine learning" but "find that thing Sarah sent about the budget, I think it was March?"
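One way to make that testable is a small labeled set of messy queries paired with the document each should surface, scored by how often the right document shows up in the top results. A minimal sketch: `hit_rate_at_k` is a made-up helper, and `search_fn` stands in for whatever function your pipeline exposes that returns a ranked list of document IDs.

```python
def hit_rate_at_k(search_fn, labeled_queries, k=5):
    """Fraction of queries whose expected document appears in the top k results.

    labeled_queries: list of (query, expected_doc_id) pairs.
    search_fn: callable taking a query string, returning ranked doc IDs.
    """
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1
        for query, expected_doc_id in labeled_queries
        if expected_doc_id in search_fn(query)[:k]
    )
    return hits / len(labeled_queries)
```

Track this number across chunking and retrieval changes. It won't tell you whether the generated answer is good, but it does tell you whether the right document even made it into the context.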
Hybrid search. Vector + keyword + filters. Yes, it's complex. Yes, it works better.
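One common way to combine the two result lists is reciprocal rank fusion (RRF), which only needs each list's ranks, not comparable scores. The sketch below assumes you already have ranked doc IDs from a vector search and a keyword search. `hybrid_search`, `metadata`, and `filters` are hypothetical names, and `k=60` is just the conventional default. Filtering after fusion keeps the example short; in practice you would usually push the filters down into each search.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs: each list adds 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(vector_hits, keyword_hits, metadata, filters=None, top_n=5):
    """Fuse vector and keyword rankings, then apply exact-match metadata filters."""
    filters = filters or {}

    def allowed(doc_id):
        meta = metadata.get(doc_id, {})
        return all(meta.get(key) == value for key, value in filters.items())

    fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
    return [doc_id for doc_id in fused if allowed(doc_id)][:top_n]
```

A document that ranks near the top of both lists beats one that is first in only one of them. That is usually what you want when the vector side finds the topic and the keyword side finds the exact name or part number.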
Log everything because something will break and you'll want to know why.
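A sketch of what "everything" might look like per query: one structured JSON line with the query, what came back, and how long it took. The field names and the `log_retrieval` helper are assumptions for illustration. Queries can contain sensitive text, so decide up front who gets to read these logs.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("rag")


def log_retrieval(query, hits, started_at, user_id=None):
    """Emit one JSON log line per retrieval and return the logged record.

    hits: list of (doc_id, score) pairs.
    started_at: a time.perf_counter() reading taken before the search ran.
    """
    record = {
        "event": "retrieval",
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "query": query,
        "hits": [{"doc_id": doc_id, "score": round(score, 4)} for doc_id, score in hits],
        "latency_ms": round((time.perf_counter() - started_at) * 1000, 1),
    }
    logger.info(json.dumps(record))
    return record
```

With the doc IDs and scores in the log, "why did it cite that?" turns into a lookup instead of a guessing game.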
Anyway
RAG systems aren't going away. The idea is good. The execution is where everyone struggles because every org's data is different and weird in its own special way.
Budget more time than you think. Take security seriously from day one. Accept that it'll get messy.
Stuff I read this week: