west all insights
applied intelligence Feb 28, 2026 10 min read

llm guardrails, the version that actually runs in production

refusal policies, output classification, and the quiet infrastructure that separates a demo from a product.

jagadeesha

co-founder

When a team ships their first LLM-backed feature to production, they usually ship it with guardrails that exist in slide decks. Those guardrails are: a system prompt that forbids the model from behaving badly, an input-length truncation, and an optimism. Within six weeks — sometimes six days — the feature has produced something quotable, the quotable thing has reached a screenshot, and the screenshot has reached the CEO. Then we get called.

This post is the version of LLM guardrails that keeps the screenshot off the CEO's phone. It is not comprehensive. It is the subset we keep rebuilding at client after client, across six different model providers, because nobody else has bothered to ship it once and reuse it.

guardrails are a pipeline, not a prompt

A production LLM call is not a single request to a single model. It is a short pipeline, and every stage has a job:

  1. Input normaliser — strip or canonicalise input that shouldn't reach the model: PII, secrets, excessive whitespace, prompt-injection markers.
  2. Policy classifier — is this question even in scope? A lightweight classifier (small model, usually open-weight) that decides whether to route to the LLM at all.
  3. Model call — the expensive step. Always behind a provider-agnostic gateway so you can swap models without touching product code.
  4. Output classifier — is the response within policy? Offensive content, hallucinated URLs, code that exfiltrates secrets, answers outside the allowed domain.
  5. Post-processor — citation checking, format enforcement, redaction of anything that slipped through.

Skipping any of these stages is usually possible. Skipping stage 4 is never possible. The output classifier is the single highest-leverage component in the pipeline and the single most skipped component in every team we have walked into.

the input side

The input side of the pipeline has two real jobs: stop your users from accidentally sending secrets, and stop your users from deliberately attacking the model.

Accidental leakage is the common failure. An engineer pastes a log line into a support bot. The log line contains a production API key. The model helpfully summarises the log line — including the key — into its response, which is cached in the provider's logs. You now have a secret in a log you do not control, in a system your security team has never audited. The fix is a boring regex-plus-entropy scan at the pipeline boundary, running before the request leaves your VPC. It will flag some number of false positives and you should tune it, but the cost of a false positive is a polite retry; the cost of a miss is a rotation and a disclosure.

Deliberate attacks are the interesting failure. Prompt injection is not a solved problem and will not be solved at the prompt layer. The practical defences are architectural: never concatenate untrusted text with privileged instructions in the same prompt; use separate model calls for "read this untrusted content" and "decide what to do"; and never allow the model to take a privileged action (delete, email, charge) without a human-in-the-loop confirmation of the specific action. Treat the model as a very persuasive intern who has been given a list of instructions by a stranger; design the system accordingly.

the output side

Output classification is the hardest part of the pipeline to ship well because it is the part where most teams' instinct — write a system prompt that forbids bad behaviour — gets in the way of the correct architecture.

The correct architecture is a second model call, with a different system prompt, that looks at the first model's output and answers a specific question: is this response within our published policy? The second call is cheaper than the first because it only has to classify, not generate. It is a different model because you do not want a model marking its own homework. The policy it checks against is a version-controlled document; when policy changes, you update the file and the classifier picks it up on the next deploy.

What does this buy you? You get a logged, reviewable policy decision for every response that reaches a user. You get the ability to run that classifier offline against historical responses — so when your compliance team asks "have we ever told a user X?", you have a tractable answer. And you get a natural place to plug in the quality drift dashboard your ML ops person wants: if the classifier starts flagging more responses this week than last, something in the upstream model or the retrieval layer has changed, and you know before your users do.

hallucinated confidence, cited answers

There is a particular failure mode that is worth calling out: the LLM that invents a source and cites it. It is not lying; it is doing what it was trained to do, which is to produce text that looks like the text it saw. The model has no ground truth for the citation.

The cheapest defence is architectural. If your product includes citations — and if it is retrieval-augmented, it should — the citation should never be generated by the model. The retrieval step returns a set of documents with IDs. The model returns a response plus a list of document IDs it claims to be citing. Your post-processor verifies that every cited ID was in fact in the retrieval set, and that the claim being cited appears in the document text. IDs the model invented get stripped; the response is regenerated or, better, marked as low-confidence with an explicit "no source found" indicator.

This does not make hallucinated claims impossible. It makes hallucinated citations impossible. That is already a meaningful improvement.

what good looks like

A production LLM feature has, at minimum:

  • a versioned policy document
  • an input normaliser with PII + prompt-injection checks
  • a provider-agnostic model gateway with per-model cost ceilings
  • an output classifier running on every response, with its verdicts logged
  • citation verification for any feature that claims to cite
  • latency and cost dashboards broken down by pipeline stage
  • an eval harness that runs on every pull request touching prompts

Nothing in that list is optional. Every one of them has saved a client in the last twelve months. The last one — the eval harness — is usually the bridge too far. Teams resist it because writing evals is dull. Writing evals is dull. A production LLM feature without evals is a product you are shipping blind. Pick one.

keep reading

more from the arkavix practice on platform engineering, applied AI, and the unglamorous details of making systems endure.

all insights