Building AI Features Without Breaking the Hosting Budget
The GPU Bill Arrives — And It Is Not Pretty
Every product team wants AI features. The business case is compelling: smarter search, automated support, personalised recommendations, content generation. The prototype works beautifully in development. Then someone calculates the production cost: GPU hosting for model inference, API fees for external model providers, storage for embeddings, and compute for preprocessing pipelines. The number is often five to ten times higher than the team expected.
Building AI features that your hosting budget can sustain requires deliberate architectural choices — not after launch when the bills arrive, but during design when you can make the decisions that determine whether the feature is economically viable at scale.
Start with the Model Selection
The choice of model is the single biggest cost driver. A 70-billion-parameter model produces impressive outputs but costs twenty to fifty times more per inference than a 7-billion-parameter model. For many tasks — classification, summarisation, extraction, simple question answering — the smaller model performs nearly as well.
Right-Sizing the Model
- Classification and routing: A small fine-tuned model (under 1 billion parameters) outperforms a general-purpose large model at specific classification tasks while costing a fraction per inference.
- Structured extraction: Extracting entities, dates, and specific information from text does not require a frontier model. Smaller models with good instruction-following capability handle this efficiently.
- Conversational AI: For customer-facing chat, a 7-13 billion parameter model with good RAG grounding provides quality responses at manageable cost.
- Complex reasoning and generation: Only tasks requiring multi-step reasoning, nuanced analysis, or high-quality long-form generation truly benefit from the largest models.
Map each AI feature to the smallest model that meets the quality bar. A portfolio of small, task-specific models costs less than routing everything through a single large model.
Quantisation: The Free Performance Win
Quantisation reduces the numerical precision of model weights — from 16-bit floating point to 8-bit or 4-bit integers. The result: models need a half to a quarter of the GPU memory, run two to four times faster on memory-bound inference, and fit on proportionally cheaper hardware. The quality trade-off is usually minimal — for most inference tasks, quantised models produce outputs that are nearly indistinguishable from full-precision output.
Quantise aggressively. A 7-billion-parameter model quantised to 4-bit precision fits comfortably on a consumer-grade GPU with 8GB of VRAM — hardware that costs a fraction of the data-centre GPUs required for full-precision inference.
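The arithmetic behind that claim is simple enough to sketch. The function below estimates serving VRAM from parameter count and weight precision; the fixed overhead for KV cache and activations is an assumption that varies with batch size and context length.

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM needed to serve a model: weight storage plus a fixed
    allowance for KV cache and activations (assumed, workload-dependent)."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb

# A 7B model at full 16-bit precision versus 4-bit quantisation:
print(vram_estimate_gb(7, 16))  # 15.5 GB — needs a data-centre GPU
print(vram_estimate_gb(7, 4))   # 5.0 GB — fits on an 8GB consumer card
```

The 4-bit figure is why the consumer-GPU claim holds: 3.5 GB of weights leaves headroom for the KV cache even on an 8GB card.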
Caching: Do Not Compute What You Have Already Computed
Semantic Caching
Many AI queries are semantically identical even if the wording differs. "How do I reset my password?" and "I forgot my password, how do I change it?" should return the same response. Semantic caching stores responses keyed by the embedding of the query. When a new query is semantically similar to a cached query (above a similarity threshold), the cached response is returned without calling the model. For support chatbots and FAQ-style features, semantic caching can eliminate fifty to eighty percent of model calls.
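A minimal version of the idea fits in a few lines. This sketch assumes an `embed` callable standing in for your embedding model; the 0.92 threshold is illustrative and should be tuned against real traffic.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Semantic-cache sketch. `embed` is a placeholder for your embedding
    model (assumption): any callable mapping str -> vector works."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold  # similarity required to count as a hit
        self.entries = []           # (embedding, response) pairs

    def lookup(self, query):
        q = self.embed(query)
        best = max(((_cosine(q, v), r) for v, r in self.entries),
                   default=None)
        if best and best[0] >= self.threshold:
            return best[1]
        return None                 # miss: caller invokes the model

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

In production the linear scan over entries would be replaced by a vector index, but the contract is the same: check `lookup` first, and only call the model (then `store`) on a miss.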
Response Caching for Deterministic Inputs
For features where the same input always produces the same output (classification, extraction, structured analysis), cache the result directly. A product categorisation feature that processes the same catalogue entry twice should use the cached classification, not compute it again.
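For deterministic inputs no similarity threshold is needed — a stable hash of the input is the cache key. A sketch, where `classify` stands in for the model call (an assumption):

```python
import hashlib
import json

_cache: dict[str, str] = {}

def classify_cached(entry: dict, classify) -> str:
    """Cache deterministic model output keyed by a stable hash of the input.
    `classify` is a placeholder for the actual model call (assumption)."""
    key = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = classify(entry)   # computed only on a miss
    return _cache[key]
```

The `sort_keys=True` matters: it makes the hash independent of dictionary key order, so logically identical catalogue entries always map to the same cache key.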
Embedding Caching
If your RAG pipeline generates embeddings for user queries, cache the embeddings. Embedding generation adds latency and cost, and identical queries produce identical embeddings.
Batching: GPUs Reward Parallelism
A GPU processing a single inference request uses a small fraction of its computational capacity. Processing a batch of thirty-two requests simultaneously takes only marginally longer than processing one — but delivers thirty-two times the throughput. Continuous batching, where new requests are dynamically added to the current batch as slots become available, maximises GPU utilisation.
For non-interactive workloads (background processing, batch analysis, content generation queues), collect requests and process them in batches. For interactive workloads (chat, search), use serving frameworks that implement continuous batching automatically (vLLM, TGI).
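For the non-interactive case, the collection step can be as simple as draining a queue with a short straggler window. A sketch (the 50 ms wait and batch size of 32 are illustrative defaults):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_batch: int = 32,
                  max_wait_s: float = 0.05) -> list:
    """Micro-batching sketch: drain up to `max_batch` requests, waiting at
    most `max_wait_s` for stragglers, then hand the whole batch to the GPU."""
    batch = [q.get()]                    # block until at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

This trades a bounded amount of latency (at most `max_wait_s`) for much higher GPU throughput — the same trade continuous-batching servers make automatically for interactive traffic.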
Serverless GPU: Pay Only When Computing
If your AI feature handles variable traffic — high during business hours, minimal at night, spiky around product launches — serverless GPU endpoints eliminate the cost of idle GPUs. You pay per second of GPU time or per inference request. Cold starts (ten to sixty seconds for model loading) are the trade-off.
Mitigate cold starts with: warm pools that keep a minimum number of instances ready, model caching on the serverless platform's storage, and user-facing design that masks the cold start (show a "thinking" indicator, preload the model when the user navigates toward the AI feature).
External APIs vs Self-Hosting: The Cost Crossover
External AI API providers charge per token (input and output). At low volumes, APIs are cheaper than self-hosting because you pay nothing when the feature is idle. At high volumes, the per-token cost accumulates and self-hosting becomes cheaper.
The crossover point depends on your specific model, traffic volume, and utilisation pattern. As a rough guide: if you consistently process more than a few hundred thousand tokens per day, self-hosting a quantised open-weight model on a dedicated or reserved GPU instance is likely cheaper than an external API. Below that volume, the API is simpler and more cost-effective.
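The break-even arithmetic is worth running with your own numbers. The prices below are illustrative assumptions only — a small reserved GPU instance against frontier-model API pricing — not current quotes from any provider:

```python
def breakeven_tokens_per_day(gpu_cost_per_day: float,
                             api_cost_per_million: float) -> float:
    """Daily token volume at which a dedicated GPU costs the same as
    paying the API per token."""
    return gpu_cost_per_day / api_cost_per_million * 1_000_000

# Assumed prices: $0.35/hour reserved GPU instance, $30 per million
# tokens for a frontier-model API (both illustrative, not quotes).
gpu_per_day = 0.35 * 24   # paid whether the GPU is busy or idle
print(f"{breakeven_tokens_per_day(gpu_per_day, 30.0):,.0f}")  # roughly 280,000 tokens/day
```

Note how sensitive the crossover is to both inputs: a cheaper API or a pricier GPU pushes it up by orders of magnitude, which is why utilisation pattern matters as much as raw volume.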
Architecture Patterns That Control Cost
Tiered Model Routing
Route requests to different models based on complexity. A fast classifier determines whether the query is simple (handled by a small model or cached response), moderate (handled by a mid-size model), or complex (routed to a large model). Most requests are simple — handling them cheaply while reserving expensive models for the minority of complex queries dramatically reduces average cost per request.
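The dispatch itself is trivial once the complexity classifier exists. In this sketch, `classify_complexity` and the `models` mapping are both placeholders (assumptions) for your own classifier and model endpoints:

```python
def route(query: str, classify_complexity, models: dict) -> str:
    """Tiered-routing sketch: a cheap classifier picks the tier, so only
    'complex' queries ever reach the expensive model. `classify_complexity`
    and `models` are placeholders (assumptions)."""
    tier = classify_complexity(query)    # 'simple' | 'moderate' | 'complex'
    handler = models.get(tier, models["complex"])  # unknown tiers fail safe
    return handler(query)
```

Defaulting unknown tiers to the large model is a deliberate choice: a misrouted query costs a little more, rather than producing a bad answer from an underpowered model.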
Preprocessing to Reduce Token Count
LLM API costs are proportional to token count. Reduce input tokens by summarising long context before passing it to the model, trimming irrelevant content from retrieved documents, and using concise system prompts. A ten-percent reduction in average token count translates directly to a ten-percent cost reduction.
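Trimming retrieved context to a token budget is the easiest of these to automate. The sketch below assumes chunks arrive ranked best-first; the whitespace token count is a stand-in for your actual tokenizer:

```python
def trim_context(chunks: list[str], token_budget: int,
                 count_tokens=lambda s: len(s.split())) -> str:
    """Keep retrieved chunks (assumed ranked best-first) until the token
    budget is spent. Whitespace splitting is a stand-in for a real
    tokenizer (assumption)."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break                  # budget exhausted: drop the tail
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept)
```

Because the chunks are ranked, the tokens cut are the least relevant ones — cost falls with minimal impact on answer quality.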
Offline Computation Where Possible
Not every AI feature needs real-time inference. Product descriptions, content summaries, SEO metadata, and recommendation scores can be precomputed during off-peak hours and cached for instant delivery. Offline computation uses GPU resources more efficiently (batch processing, off-peak pricing) and eliminates inference latency from the user experience.
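A nightly precompute job combines this pattern with batching. In the sketch, `summarise_batch` is a placeholder for a batched model call and `store` for whatever cache or database serves the results (both assumptions):

```python
def precompute_summaries(items: dict, summarise_batch, store: dict,
                         batch_size: int = 64) -> None:
    """Offline-computation sketch: summarise content in GPU-friendly
    batches (e.g. from a nightly job) and store the results for instant
    serving. `summarise_batch` and `store` are placeholders (assumptions)."""
    ids = [i for i in items if i not in store]   # skip work already done
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        results = summarise_batch([items[i] for i in chunk])
        store.update(zip(chunk, results))
```

The skip-if-present check makes the job idempotent, so it can rerun safely after interruptions and only pays for genuinely new content.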
Monitoring AI Costs
Track these metrics for every AI feature:
- Cost per inference: Average cost of each model call, including GPU time or API fees.
- Cache hit ratio: Percentage of requests served from cache. Low ratios indicate optimisation opportunities.
- GPU utilisation: For self-hosted models, track GPU utilisation percentage. Below fifty percent suggests over-provisioning.
- Cost per user action: The business-relevant metric — how much does each AI-assisted search, support interaction, or recommendation cost?
- Quality vs cost trade-off: Track response quality scores alongside cost metrics. A cost reduction that degrades quality below the acceptable bar is not an optimisation — it is a regression.
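The first three metrics above can be derived from two counters and a running cost total. A minimal tracker sketch (field names are illustrative, not from any particular metrics library):

```python
class AiCostMetrics:
    """Per-feature cost tracking sketch; names are illustrative."""

    def __init__(self):
        self.calls = 0          # model invocations (cache misses)
        self.cache_hits = 0
        self.total_cost = 0.0   # GPU time or API fees, in currency units

    def record(self, cost: float, cached: bool) -> None:
        if cached:
            self.cache_hits += 1
        else:
            self.calls += 1
            self.total_cost += cost

    @property
    def cache_hit_ratio(self) -> float:
        total = self.calls + self.cache_hits
        return self.cache_hits / total if total else 0.0

    @property
    def cost_per_request(self) -> float:
        total = self.calls + self.cache_hits
        return self.total_cost / total if total else 0.0
```

Tracking one instance per AI feature (rather than one global total) is what makes the numbers actionable: a low cache-hit ratio or rising cost per request points at a specific feature to optimise.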
The Bottom Line
Building AI features on a budget is not about skimping — it is about engineering efficiency. Choose the smallest model that meets the quality bar. Quantise aggressively. Cache everything you can. Batch inference requests. Use serverless for variable workloads and self-host for sustained volume. Route simple requests to simple models. The teams that build sustainable AI features are the ones that treat inference cost as a first-class engineering constraint — not an afterthought that surprises them in the monthly hosting bill.