Eight Things I Wish I Knew Before Shipping an LLM

AI
Engineering
Machine Learning
Practical lessons for ML engineers building LLM-powered systems
Author

Andrij David

Published

April 5, 2023

Modified

February 14, 2026

In 2023 Sam Bowman published a useful overview of what the research community knew about large language models (Bowman 2023). Since then, LLMs have moved from research curiosities into production systems that handle customer support tickets, generate reports, and make routing decisions that cost real money.

This post is not an update of that paper. It is a different list entirely: eight things I wish someone had told me before I shipped my first LLM-powered feature. Every point comes from a failure mode I have either hit myself or watched a team hit. Most of them have nothing to do with model architecture and everything to do with engineering discipline.

1. Prompts Are Not Code, They Break Silently

A one-word change in a prompt can flip a model’s output from correct to wrong. Unlike code, prompts do not throw exceptions when they break. They just return a different but often plausible-looking answer.

Consider a sentiment classifier built on an LLM. The table below shows five semantically equivalent ways to phrase the same instruction, and the model’s accuracy on the same 200-example test set.

Prompt                                                      Accuracy   Format Errors
Classify the sentiment as positive or negative.             94%        2
Is this review positive or negative?                        91%        5
Determine whether the sentiment is positive or negative.    88%        3
Label the following text: positive or negative.             86%        14
What is the sentiment? Answer positive or negative.         93%        4

In this experiment, rewording alone produced an 8 percentage point spread (94% vs. 86%) on the same 200-example test set. The fourth variant also produced 14 format errors (7%), where the model returned free text instead of one of the allowed labels. In production, those become silent failures unless we enforce schema validation before parsing.
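That validation can be a few lines. Below is a minimal sketch, assuming a binary sentiment label; the function name and normalization rules are illustrative, not taken from the experiment above.

```python
# Minimal output validation before parsing (illustrative names, not from the
# experiment above). The point: free-text responses fail loudly, not silently.
ALLOWED_LABELS = {"positive", "negative"}

def parse_sentiment(raw: str) -> str:
    """Normalize an LLM classification output and reject anything off-schema."""
    label = raw.strip().strip(".").lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Format error, not a valid label: {raw!r}")
    return label
```

Counting these exceptions per prompt variant is one way to produce a format-error rate like the one in the table above.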

I have seen this fragility in real systems, and the literature supports it. Sclar et al. showed that even small lexical prompt changes, such as whitespace, punctuation, synonym swaps, and typos, can cause significant accuracy shifts across tasks and model sizes (Sclar et al. 2024). PromptBench reports similar robustness failures under perturbations at the character, word, sentence, and semantic levels (Zhu et al. 2023).

Always treat prompts as production artifacts: version them, regression test them, and gate changes in CI rather than editing prompt strings ad hoc and deploying.

2. Temperature Is a Business Decision

Temperature controls the randomness of the model’s output. At 0 it picks the most likely token every time. At 1 it samples more broadly. We can think of it as a way to inject creativity into the model. Most tutorials treat this as a technical detail, but it directly affects the system’s behavior in production.

Let’s run the following prompt 50 times at each temperature setting on the same input.

Extract the invoice number from this email.

At temperature 0, the model returns the correct invoice number 96% of the time. At temperature 1.0, that drops to 62%, and 20% of responses are not even parseable. The wrong-but-parseable answers are the most dangerous: the system accepts the output, writes it to a database, and nobody notices until reconciliation fails days, weeks, or even months later. For a task like this, the degradation curve is steeper than for creative tasks because there is exactly one correct answer and any sampling diversity is noise. The effect of temperature on task performance varies by domain: Renze and Guven found that for problem-solving tasks, higher temperatures can sometimes improve performance on creative reasoning while degrading it on precise extraction (Renze and Guven 2024).
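A sketch of that sweep, using the OpenAI Python SDK as an example client; the model name, invoice-number regex, and temperature grid are assumptions, not details from the original experiment.

```python
# Temperature sweep sketch: run the same extraction n times per temperature
# and bucket the responses. Model name and invoice format are assumptions.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()
INVOICE_RE = re.compile(r"INV-\d{6}")  # assumed invoice number format

def sweep(email_text: str, expected: str, temperatures=(0.0, 0.5, 1.0), n=50):
    results = {}
    for temp in temperatures:
        counts = Counter()
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{
                    "role": "user",
                    "content": f"Extract the invoice number from this email.\n\n{email_text}",
                }],
                temperature=temp,
            )
            raw = resp.choices[0].message.content or ""
            match = INVOICE_RE.search(raw)
            if match is None:
                counts["unparseable"] += 1
            elif match.group() == expected:
                counts["correct"] += 1
            else:
                counts["wrong_but_parseable"] += 1
        results[temp] = counts
    return results
```

The wrong_but_parseable bucket is the one to watch: it is the output that sails through naive parsing and lands in the database.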

Always use temperature near 0 for structured extraction, classification, and anything that feeds into a downstream system. Reserve higher temperatures for creative generation tasks where variance is a feature, not a bug.

3. Evaluation Is Your Moat

Most teams evaluate their LLM system by looking at a handful of outputs. This is the equivalent of testing a website by clicking around for five minutes and shipping it.

A minimal evaluation harness needs at least 50 to 100 labeled examples, a scoring function, and a script that runs on every prompt change. It will catch more regressions than any amount of manual review. Below is an example of what a regression looks like when you have the data.

At version 8 someone rewrote the prompt to be “clearer” and accuracy dropped by 5 points. Without the eval harness this regression would have reached production. With it, the team caught the problem in CI before merging.
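A minimal sketch of such a harness as a CI gate. The file format, threshold, and the classify import are placeholders; the structure (labeled examples, a scoring function, a non-zero exit on regression) is the part that matters.

```python
# Minimal eval harness sketch: labeled examples in, accuracy out, CI fails on
# regression. `my_feature.classify` and eval_set.jsonl are hypothetical.
import json
import sys

from my_feature import classify  # the prompt under test (placeholder module)

THRESHOLD = 0.90  # tune to your baseline; a drop below this blocks the merge

def run_eval(path: str = "eval_set.jsonl") -> float:
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(1 for ex in examples if classify(ex["text"]) == ex["label"])
    return correct / len(examples)

if __name__ == "__main__":
    accuracy = run_eval()
    print(f"accuracy: {accuracy:.1%} (threshold: {THRESHOLD:.0%})")
    if accuracy < THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job
```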

LLM-as-judge evaluation uses a stronger model to grade a weaker model’s outputs. It has become a practical alternative to human labeling for many tasks (Zheng et al. 2023). However, Shankar et al. showed that LLM judges can disagree with human preferences in systematic ways, so you should validate your judge against human labels on a sample before trusting it (Shankar et al. 2024).

Start with 50 examples. Grow to 100. Automate the run. The eval set is the single most valuable artifact in your LLM project.

4. RAG Does Not Solve Hallucination

Retrieval-Augmented Generation (RAG) is the standard pattern for grounding LLM outputs in your own data (Lewis et al. 2020). The idea is simple: retrieve relevant documents, stuff them into the prompt, and let the model answer based on the retrieved context. This does reduce hallucination, but in production I repeatedly see three reliability traps that are easy to miss.

Right answer from imperfect or wrong context. This is the quiet failure mode. The output is correct, but the retrieved evidence is weak, partially irrelevant, or does not actually support the claim. Exact-match accuracy marks this as a success, but the system is effectively getting lucky. This is hard to measure and becomes a ticking bomb in production because luck does not survive distribution shift.

Wrong answer despite correct retrieval. In this category, the model had the right documents but still generated a wrong answer. This happens when the model paraphrases incorrectly, mixes up entities from multiple retrieved chunks, or ignores relevant context in favor of its parametric knowledge. This is dangerous because retrieval dashboards look healthy while user-visible quality drops.

Wrong answer due to wrong retrieval. The retrieval step returned irrelevant or misleading documents. The model then faithfully answers based on the wrong context. This is a retrieval problem, not a generation problem, but it shows up as hallucination to the end user.

A comprehensive survey of RAG systems and their failure modes is provided by Gao et al. (Gao et al. 2024). Chen et al. further benchmark LLM behavior in RAG settings and show that retrieval quality is often the binding constraint (J. Chen et al. 2024).

Always score retrieval and generation separately, then add support-aware checks (citation overlap, attribution, or human audit slices) so that a correct answer reached by luck does not pass silently. If retrieval recall is 70%, no amount of prompt engineering on the generation side will get you past 70% end-to-end accuracy.
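A sketch of what separate scoring can look like. The example records are assumed to carry retrieved document ids, gold document ids, the generated answer, and the gold answer; exact match and id overlap are stand-ins for whatever metrics you actually use.

```python
# Sketch of scoring retrieval and generation separately. Field names and the
# exact-match scorer are illustrative; swap in your own metrics.
def retrieval_recall(examples) -> float:
    """Fraction of questions where at least one gold document was retrieved."""
    hits = sum(
        1 for ex in examples
        if set(ex["retrieved_ids"]) & set(ex["gold_doc_ids"])
    )
    return hits / len(examples)

def answer_accuracy(examples) -> float:
    """Fraction of questions answered correctly, regardless of retrieval."""
    return sum(
        1 for ex in examples if ex["answer"].strip() == ex["gold_answer"].strip()
    ) / len(examples)

def lucky_correct_rate(examples) -> float:
    """Correct answers produced without a supporting document in context:
    the 'right answer from wrong context' bucket described above."""
    lucky = sum(
        1 for ex in examples
        if ex["answer"].strip() == ex["gold_answer"].strip()
        and not (set(ex["retrieved_ids"]) & set(ex["gold_doc_ids"]))
    )
    return lucky / len(examples)
```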

5. Token Economics Scale Faster Than You Think

A prototype that handles 100 requests per day and costs a few dollars can become a five-figure monthly bill at production scale. Most teams discover this after launch, not before. The math is straightforward, but the numbers surprise people.

Let’s imagine a SaaS company that decides to add an LLM-powered smart summary feature to its CRM to summarize customer interactions. Each interaction may involve email, chat, and call transcripts, all summarized via an LLM call. The average request is 800 input tokens and 200 output tokens. For pricing, we compare the published rates from OpenAI and Anthropic against an open-source option: a Llama-3-70B served on 2× A100s via a provider like Together AI or Fireworks, at roughly $0.20/$0.80 per 1M input/output tokens. The formula is simple:

monthly_cost = 30 days × requests_per_day × (input_tokens × input_price_per_1M + output_tokens × output_price_per_1M) / 1e6
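As a sketch, here is that formula as a script. The hosted-Llama rate is the one quoted above; the frontier rate of $3/$15 per 1M tokens is an assumption chosen to roughly match the figures that follow.

```python
# Monthly cost model for the summarization feature described above.
# Prices are illustrative per-1M-token rates, not current list prices.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "frontier (assumed $3/$15)": (3.00, 15.00),
    "hosted Llama-3-70B":        (0.20, 0.80),
}

def monthly_cost(requests_per_day, input_price, output_price,
                 input_tokens=800, output_tokens=200):
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1e6
    return 30 * requests_per_day * per_request

for name, (inp, out) in PRICES.items():
    for rpd in (100, 10_000, 100_000):
        print(f"{name:<28} {rpd:>7,} req/day: ${monthly_cost(rpd, inp, out):>9,.0f}/month")
```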

Below is the monthly cost of an LLM-powered feature across different traffic levels and model tiers.

At 10,000 requests per day a frontier model costs roughly $1,500 to $2,000 per month for a single feature. At 100,000 requests per day it crosses $15,000. And this is before retries (add ~5–10%), prompt chaining (which multiplies cost by chain length), batching discounts, or multi-turn conversations that multiply token usage. The real bill is typically 1.2–2× these estimates.

Chen et al. proposed FrugalGPT, a framework for reducing LLM costs by cascading from cheaper to more expensive models based on query difficulty (L. Chen, Zaharia, and Zou 2023). The core idea is to use a small model by default and only escalate to a frontier model when confidence is low.

Before you choose a model, multiply your expected traffic by 10 and check whether the bill is still acceptable. If not, build a routing strategy from day one.

6. Classical ML Still Wins on Structured Data

There is a temptation to use LLMs for everything once they are in the stack. I have seen teams serialize tabular features into text, feed them to an LLM, and ask it to predict churn. This is almost always worse than a gradient-boosted tree and orders of magnitude more expensive.

Let’s consider a telecom company that wants to predict churn using 20 features, including tenure, monthly charge, contract type, usage metrics, etc., and 10,000 historical customer records with known churn outcomes. One possible LLM approach is to serialize each row of the dataset into a natural language description, e.g.:

Customer with 14 months tenure, $72/month plan, fiber optic internet, no tech support, month-to-month contract. Will this customer churn?

Metric                     XGBoost       LLM (GPT-4o)        LLM (Haiku)
AUC-ROC                    0.87          0.84                0.81
Accuracy                   87%           79%                 76%
Latency (p50)              0.8 ms        850 ms              320 ms
Cost per 10K predictions   $0.00         $4.75               $1.60
Training time              39 seconds    N/A (prompt only)   N/A (prompt only)

The XGBoost model is more accurate, 400 times faster, and essentially free to run. LLM latency is dominated by network round-trip and token generation. GPT-4o p50 ≈ 800–1000ms, Haiku p50 ≈ 250–400ms for short completions. This is not a cherry-picked example. Grinsztajn et al. showed that tree-based models consistently outperform deep learning on typical tabular datasets, and this gap has not closed with scale (Grinsztajn, Oyallon, and Varoquaux 2022). Borisov et al. provide a broader survey confirming that for structured data the classical toolbox remains the default choice (Borisov et al. 2022).
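For reference, the classical baseline is only a few lines. This is a sketch under assumed column names and file path, not the exact pipeline behind the comparison above.

```python
# Minimal sketch of the XGBoost churn baseline. Column names and file path
# are assumptions; the point is the shape of the workflow, not the schema.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("churn.csv")  # assumed: 20 feature columns + "churn" label
X = pd.get_dummies(df.drop(columns=["churn"]))  # one-hot encode categoricals
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC-ROC: {auc:.2f}")  # compare against the ~0.87 reported above
```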

LLMs excel at unstructured text, ambiguous instructions, and tasks that require world knowledge. For structured data with clean features and historical labels, use the tools that were built for the job.

7. Guardrails Are Not Optional

An LLM without guardrails is a liability. Without input validation, your system accepts prompt injections. Without output validation, it returns hallucinated JSON, executes unintended tool calls, or leaks context from other users.

Below is a taxonomy of failure modes from a production customer support bot, categorized by where the guardrail should have caught the problem.

Input guardrails are topic classifiers, injection detectors, and input length limits; they catch problems before the LLM runs. Output guardrails are schema validation, content filters, and PII scrubbers; they catch problems after the LLM responds. Both are necessary.

Perez et al. demonstrated that LLMs can be systematically red-teamed using other LLMs to discover failure modes at scale (Perez et al. 2022). On the defensive side, Llama Guard provides an open-source input/output safety classifier (Inan et al. 2023), and NeMo Guardrails offers a programmable framework for defining conversational boundaries (Rebedea et al. 2023).

The minimum viable guardrail stack is to reject obviously off-topic inputs, validate the output schema before returning to the user, and log everything for review. You can add sophistication later, but in my experience these three catch the majority of production incidents.
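A sketch of that minimum stack in one wrapper. The off-topic markers, response schema, and the llm_fn hook are placeholders; a real system would use a proper classifier and structured logging.

```python
# Sketch of the three-part guardrail stack: input check, schema validation,
# logging. Markers, schema, and logger setup are all illustrative.
import json
import logging

logging.basicConfig(filename="llm_requests.log", level=logging.INFO)
OFF_TOPIC_MARKERS = ("ignore previous instructions", "system prompt")  # toy check
REQUIRED_KEYS = {"answer", "confidence"}  # assumed response schema

def guarded_call(user_input: str, llm_fn) -> dict:
    # 1. Input guardrail: reject obviously off-topic or injection-like inputs.
    lowered = user_input.lower()
    if any(marker in lowered for marker in OFF_TOPIC_MARKERS):
        logging.warning("rejected input: %r", user_input)
        raise ValueError("Input rejected by guardrail")

    # 2. Call the model (llm_fn is whatever client wrapper you already have).
    raw = llm_fn(user_input)

    # 3. Output guardrail: validate the schema before returning to the user.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as err:
        logging.error("unparseable output: %r", raw)
        raise ValueError("Model returned invalid JSON") from err
    if not REQUIRED_KEYS.issubset(parsed):
        logging.error("schema violation: %r", parsed)
        raise ValueError("Model output missing required fields")

    # 4. Log everything for later review.
    logging.info("ok input=%r output=%r", user_input, parsed)
    return parsed
```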

8. The Last 10% of Reliability Is 90% of the Work

Getting an LLM system to work 90% of the time is easy. Getting it to 99% is a fundamentally different engineering problem. The failures in the long tail are not random. They tend to cluster around specific input types that the model handles poorly.

Below is a performance analysis from a document classification system we built for a client. After reaching 92% overall accuracy, the team performed an error analysis to understand where the remaining 8% of failures concentrate. The first column shows the input category, the second shows its share of total traffic, and the third shows its share of total errors.

As the team expected, the errors are not uniformly distributed. Three categories (OCR scans, mixed-language docs, and heavily formatted tables) account for 23% of traffic but 68% of errors. These are the long-tail inputs. This pattern is consistent with the general finding that LLM failures cluster around distribution edges: inputs that are underrepresented in training data or that require capabilities (OCR cleanup, multilingual reasoning, spatial table understanding) that text-only models lack.

The practical response is not to make the model better at everything. It is to build category-specific handling: route OCR documents through a cleanup pipeline first, use a specialized model for mixed-language inputs, and add a confidence threshold that sends low-confidence predictions to a human reviewer.
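Structurally, that routing can be as simple as the sketch below. Every handler here is a placeholder for a component you would supply, and the threshold is an assumption.

```python
# Routing sketch for long-tail document categories. All handlers and the
# threshold are placeholders; the structure is the point, not the names.
from typing import Callable, Dict

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for automatic acceptance

def route_document(doc: str,
                   detect_category: Callable[[str], str],
                   handlers: Dict[str, Callable[[str], dict]],
                   default_handler: Callable[[str], dict],
                   human_review: Callable[[str, dict], dict]) -> dict:
    category = detect_category(doc)                    # e.g. "ocr_scan", "mixed_language"
    handler = handlers.get(category, default_handler)  # e.g. the "ocr_scan" handler runs
    result = handler(doc)                              # a cleanup pipeline before classifying
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return human_review(doc, result)               # low confidence goes to a reviewer
    return result
```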

This is the same principle as RLHF. The base model gets you most of the way, and careful targeted work on the tail distribution is what makes the system production-ready (Ouyang et al. 2022). In practice, that targeted work looks less like training and more like routing, fallbacks, and human-in-the-loop design.

Shipping an LLM is an engineering problem, not a modeling problem, and the modeling is the easy part.

Further Reading

  • (Sclar et al. 2024) conducted an empirical study showing how minor prompt variations cause significant accuracy changes across tasks and model sizes.
  • (Zhu et al. 2023) demonstrate how robustness degrades under controlled prompt perturbations.
  • (Zheng et al. 2023) introduces MT-Bench and the LLM-as-judge paradigm for automated evaluation.
  • (Gao et al. 2024) wrote a comprehensive survey of RAG architectures, failure modes, and mitigation strategies.
  • (L. Chen, Zaharia, and Zou 2023) propose a framework for reducing LLM inference costs through model cascading and query routing.
  • (Grinsztajn, Oyallon, and Varoquaux 2022) show that tree-based models consistently outperform deep learning on standard tabular benchmarks.

References

Borisov, Vadim, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2022. “Deep Neural Networks and Tabular Data: A Survey.” IEEE Transactions on Neural Networks and Learning Systems. http://arxiv.org/abs/2110.01889.
Bowman, Samuel R. 2023. “Eight Things to Know about Large Language Models.” http://arxiv.org/abs/2304.00612.
Chen, Jiawei, Hongyu Lin, Xianpei Han, and Le Sun. 2024. “Benchmarking Large Language Models in Retrieval-Augmented Generation.” http://arxiv.org/abs/2309.01431.
Chen, Lingjiao, Matei Zaharia, and James Zou. 2023. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” http://arxiv.org/abs/2305.05176.
Gao, Yunfan, Yun Xiong, Xinze Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” http://arxiv.org/abs/2312.10997.
Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. 2022. “Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data?” http://arxiv.org/abs/2207.08815.
Inan, Hakan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, et al. 2023. “Llama Guard: LLM-Based Input-Output Safeguard for Human-AI Conversations.” http://arxiv.org/abs/2312.06674.
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In Advances in Neural Information Processing Systems, 33:9459–74.
Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730–44. http://arxiv.org/abs/2203.02155.
Perez, Ethan, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, et al. 2022. “Red Teaming Language Models with Language Models.” http://arxiv.org/abs/2202.03286.
Rebedea, Traian, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails.” http://arxiv.org/abs/2310.10501.
Renze, Matthew, and Erhan Guven. 2024. “The Effect of Sampling Temperature on Problem Solving in Large Language Models.” http://arxiv.org/abs/2402.05201.
Sclar, Melanie, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design with a Focus on Lexical Differences.” http://arxiv.org/abs/2310.11324.
Shankar, Shreya, J. D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, and Ian Arawjo. 2024. “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” http://arxiv.org/abs/2404.12272.
Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” http://arxiv.org/abs/2306.05685.
Zhu, Kaijie, Qinlin Zhao, Hao Chen, Jindong Wang, Weixing Chen, Min Zheng, Bohan Yu, et al. 2023. “PromptBench: Towards Evaluating the Robustness of Large Language Models.” http://arxiv.org/abs/2306.04528.