One of the biggest surprises for teams building with AI is not that it works.
It's how quickly it becomes expensive, slow, and difficult to scale.
What begins as a promising prototype often turns into a constrained system. Latency creeps in. Costs rise. Concurrency becomes limited. And suddenly, something that felt like a breakthrough is hard to roll out broadly across a product.
At a recent AIConf in Ahmedabad, Rajiv Mehta, a Machine Learning Specialist at Bacancy Technology and AWS Certified ML Specialist, explained why this happens. Getting a model to run is trivial. Getting it to run efficiently, at scale, and in a way that makes economic sense is where the real work begins.
For growth-stage companies, that distinction is everything.
Why the First Version Is Misleading
The reason this catches teams off guard is simple. The first version of any AI system usually works. It works in a notebook, in a demo, and often even with a handful of users. That early success creates a false sense of readiness.
What's invisible at that stage are the constraints that show up later. Memory limits, latency, concurrency, and cost all begin to compound as usage increases. What looked like a breakthrough quickly becomes a bottleneck.
Rajiv Mehta illustrated this with a simple but powerful comparison. The same 4B-parameter model, loaded in a standard way, consumes significant memory and supports only a handful of users. Optimized correctly, that same model can handle an order of magnitude more users at significantly higher throughput.
Same model. Completely different outcome.
For growth-stage startups, this is the difference between a feature that works and a product that scales.
The Real Cost of Doing It the "Default" Way
One of the most important themes from Mehta's session is that the default path is almost never the production path.
Most developers load models the simplest way possible, using standard precision, standard libraries, and standard configurations. That approach is fine for experimentation, but it creates problems quickly when systems need to scale.
High memory usage limits concurrency. Slow throughput hurts user experience. Inefficient systems drive up infrastructure costs. For a growth-stage company, these are not minor issues. They directly affect margins, pricing, and the ability to expand AI-driven features across the product.
The key insight is that performance is not just about what the model can do. It's about how efficiently you run it.
Small Decisions, Big Impact
What makes this space fascinating is that the biggest gains don't come from changing the model. They come from changing how it's deployed.
Rajiv Mehta walked through a set of optimizations that, taken together, dramatically shift performance.
Quantization reduces memory footprint without meaningfully impacting output quality. Instead of consuming massive amounts of VRAM, models can run in a fraction of the space, unlocking far greater concurrency.
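The scale of that saving follows from simple arithmetic. The sketch below is a back-of-envelope calculation for a 4B-parameter model like the one in Mehta's comparison; it covers weights only, so a real deployment would also need memory for the KV cache, activations, and framework overhead.

```python
# Back-of-envelope: weight memory for a 4B-parameter model at common precisions.
# Weights only -- real deployments also need KV cache and activation memory.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30

PARAMS = 4e9  # 4B parameters

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

At 4-bit precision the weights fit in roughly a quarter of the fp16 footprint, which is where the extra concurrency on the same GPU comes from.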
Memory-management techniques like PagedAttention eliminate fragmentation and allow systems to use available resources far more efficiently. This becomes critical as workloads increase and systems move beyond simple use cases.
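The core idea can be shown with a toy allocator. This is an illustrative sketch only, not the vLLM implementation: instead of reserving a contiguous region sized for the maximum sequence length per request, a paged scheme hands out small fixed-size blocks as sequences grow.

```python
# Toy sketch of paged KV-cache allocation (the idea behind PagedAttention).
# All names and numbers are illustrative, not the vLLM implementation.

BLOCK_SIZE = 16      # tokens per block
MAX_SEQ_LEN = 2048   # a contiguous allocator must reserve the worst case

def contiguous_slots(num_requests: int) -> int:
    """Token slots reserved by a naive allocator: worst case per request."""
    return num_requests * MAX_SEQ_LEN

def paged_slots(request_lengths: list[int]) -> int:
    """Token slots reserved by a paged allocator: whole blocks actually used."""
    blocks = sum(-(-length // BLOCK_SIZE) for length in request_lengths)  # ceil div
    return blocks * BLOCK_SIZE

# Ten concurrent requests, mostly short.
lengths = [120, 80, 300, 45, 60, 210, 90, 150, 70, 33]
print("contiguous:", contiguous_slots(len(lengths)))  # 20480 slots reserved
print("paged:     ", paged_slots(lengths))            # 1232 slots reserved
```

For this mix of short requests, the paged scheme reserves under a tenth of the memory, which is capacity that can be spent on more concurrent users instead.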
Inference engines also matter more than most teams realize. Tools like vLLM, llama.cpp, and others are purpose-built for serving models at scale. Using general-purpose frameworks leaves performance on the table, not because teams are doing something wrong, but because the tools weren't designed for this use case.
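Adopting one of these engines is usually a small change. As a hedged example, a vLLM deployment can start from a single launch command like the one below; the model name is a placeholder and the flags vary by version, so check the vLLM documentation for your setup.

```shell
# Sketch: launching an OpenAI-compatible vLLM server.
# Model name is a placeholder; flags vary by vLLM version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```

Existing clients that speak the OpenAI API can then point at this server, which is part of why switching engines tends to be cheaper than teams expect.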
Even at the compute level, optimizations like FlashAttention fundamentally change performance by reducing how often data needs to move between memory layers. This directly impacts latency and throughput, especially in real-time applications.
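A rough calculation shows why this matters. A naive attention implementation materializes the full N×N score matrix in GPU memory per head and per layer, while a fused, tiled kernel in the FlashAttention style keeps those intermediates in fast on-chip memory. The numbers below are illustrative.

```python
# Rough arithmetic: size of the attention score matrix a naive implementation
# writes to (and re-reads from) GPU memory, per layer. A tiled kernel in the
# FlashAttention style avoids materializing it. Numbers are illustrative.

def naive_score_matrix_mb(seq_len: int, num_heads: int, bytes_per_el: int = 2) -> float:
    """Materialized attention score matrix for one layer, in MB (fp16 elements)."""
    return seq_len * seq_len * num_heads * bytes_per_el / 2**20

for n in (1024, 4096, 16384):
    print(f"seq_len={n:>6}: ~{naive_score_matrix_mb(n, num_heads=32):,.0f} MB per layer")
```

Because the matrix grows with the square of sequence length, the traffic that the fused kernel avoids becomes the dominant cost precisely in the long-context, real-time cases where latency hurts most.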
Individually, each of these decisions improves performance. Together, they completely change what is possible on the same hardware.
AI Is an Economics Problem as Much as a Technical One
One of the most important takeaways for growth-stage companies is that AI is not just a technical problem. It's an economic one.
Every token has a cost. Every millisecond of latency affects user experience. Every inefficiency compounds as usage grows.
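That compounding is easy to make concrete. The sketch below uses hypothetical placeholder prices, not any provider's actual rates, to show how a fraction of a cent per request becomes a real line item at scale.

```python
# Toy unit economics: per-token cost compounding with usage.
# Prices are hypothetical placeholders, not any provider's actual rates.

PRICE_PER_1K_INPUT = 0.0005    # dollars per 1K input tokens (hypothetical)
PRICE_PER_1K_OUTPUT = 0.0015   # dollars per 1K output tokens (hypothetical)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in dollars."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

per_request = request_cost(input_tokens=800, output_tokens=400)
print(f"per request:         ${per_request:.4f}")
print(f"per 1M requests/mo:  ${per_request * 1_000_000:,.0f}")
```

At these placeholder rates, a tenth of a cent per request is $1,000 per million requests per month, and every optimization that trims tokens or moves traffic to a cheaper model multiplies across that volume.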
Rajiv Mehta highlighted how dramatically costs and performance can shift based on architecture decisions alone. Systems that aren't optimized quickly become expensive to operate, limiting how broadly AI can be deployed across a product.
On the other hand, well-optimized systems unlock something far more valuable. They allow companies to scale AI capabilities without scaling cost at the same rate.
That's where real leverage comes from.
Avoiding Lock-In as You Scale
Another area Mehta emphasized is flexibility.
Most teams build directly against a single model provider's API. It's fast to get started, but it creates long-term constraints. Switching models or adding new ones requires reworking large parts of the system.
The alternative is to introduce a routing layer that abstracts the underlying models. This allows teams to direct different types of requests to different models based on cost, complexity, or sensitivity.
Simple queries can be handled by smaller, faster models. More complex reasoning tasks can be routed to larger models. Sensitive workloads can remain on-premise.
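A minimal sketch of such a routing layer might look like the following. The tier names, heuristics, and model identifiers are illustrative placeholders; a production router would use richer rules or a trained classifier and call real model backends.

```python
# Minimal sketch of a model-routing layer. Tier names, heuristics, and model
# identifiers are placeholders, not real backends.

from dataclasses import dataclass

@dataclass
class Route:
    model: str
    location: str  # "cloud" or "on_prem"

ROUTES = {
    "simple":    Route(model="small-fast-model", location="cloud"),
    "complex":   Route(model="large-reasoning-model", location="cloud"),
    "sensitive": Route(model="on-prem-model", location="on_prem"),
}

def classify(request: str, contains_pii: bool = False) -> str:
    """Crude heuristic: sensitivity first, then length as a complexity proxy."""
    if contains_pii:
        return "sensitive"
    return "complex" if len(request.split()) > 50 else "simple"

def route(request: str, contains_pii: bool = False) -> Route:
    return ROUTES[classify(request, contains_pii)]

print(route("What are your support hours?"))                   # small, fast model
print(route("Summarize this contract ...", contains_pii=True)) # stays on-prem
```

Because callers depend on `route()` rather than on any one provider's API, swapping a backend or adding a new tier becomes a one-line change to the routing table instead of a rework of the system.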
This approach does more than improve performance. It gives companies control.
For growth-stage startups, that flexibility becomes increasingly important as products evolve and usage patterns change.
Where Most Teams Get It Wrong
If there's one takeaway from Mehta's session, it's this.
Most teams over-index on the model and under-invest in everything around it.
As he put it, the model is roughly 20 percent of the solution. The inference engine, memory management, and routing architecture make up the other 80 percent.
That imbalance shows up everywhere. Teams spend time comparing models, experimenting with prompts, and testing outputs, but they don't invest enough in the systems required to run those models effectively.
For growth-stage companies, this is a critical mistake. Because the challenge is not getting AI to work once. It's getting it to work consistently, efficiently, and at scale.
The Bottom Line
The hardest part of AI is not building something that works.
It's building something that keeps working as usage grows.
Rajiv Mehta's session made that clear. The difference between a prototype and a production system is not the model. It's everything that surrounds it. Memory, inference, routing, and cost management all determine whether a system can scale.
For growth-stage companies, the opportunity is clear. The teams that invest early in how their systems run will be the ones that can deploy AI broadly and sustainably.
Because in the end, AI is not just about intelligence.
It's about execution.
To stay up to date on all upcoming York IE events, follow us on LinkedIn.

