Most ambitious builders are chasing the next big foundational model. They pore over benchmarks, dreaming of the performance leap a new LLM might bring to their AI agents. Yasser Elsaid, the founder behind the $10M ARR AI customer service company Chatbase, offers a bracing dose of reality: for customer service applications, 95% of an agent's limitations are not the model's fault. They're yours.
Elsaid's insight challenges the common wisdom that a better LLM is the silver bullet. Instead, he points to the "harness" – the intricate system built around the foundational model – as the true determinant of performance. This isn't about choosing OpenAI over Anthropic; it's about how you engineer the prompts, pre-process data, and build the guardrails that shape the agent's behavior.
Key Takeaways
- Yasser Elsaid, founder of Chatbase, states that for AI customer service agents, 95% of limitations come from the "harness" (the surrounding system), not the underlying LLM.
- This means effective prompt engineering, pre-processing, and post-processing within the harness are far more critical than simply picking a "better" foundational model.
- The widely held belief that switching between LLM providers (e.g., OpenAI, Anthropic, Gemini) has zero or low cost is wrong; switching often costs months of re-tuning the harness.
- While harnesses can share core logic, some model-specific adjustments are needed to account for different LLM temperaments, like varying verbosity levels or response styles.
The Real Bottleneck: Your Harness
Elsaid’s experience building Chatbase, which scaled from a bootstrapped side project to a $10M ARR company, puts a fine point on where engineering effort should go. He argues that when an AI customer service agent underperforms, it's rarely the raw intelligence of the LLM itself. “What I think is very interesting in a use case like customer service is that I would say 95% of the limitation is not from the model. It's from the harness,” Elsaid explains. “So it's my job to fix.”
This "harness" includes everything from how user queries are pre-processed and structured, the specific instructions (prompts) given to the model, to the post-processing of its outputs, and the critical guardrails that prevent undesirable behavior. It's the sophisticated layer of logic that translates raw user input into model-understandable commands and then shapes raw model output into a coherent, brand-appropriate response. A poorly designed harness will make even the most advanced LLM look incompetent.
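To make the layers concrete, here is a minimal sketch of such a harness. Everything is hypothetical: the function names, the ACME persona, the stubbed model call, and the refund guardrail are invented for illustration, not Chatbase's actual implementation.

```python
def preprocess(user_query: str) -> str:
    # Pre-processing: normalize raw user input before it reaches the model.
    return user_query.strip()

def build_prompt(query: str, context: str) -> str:
    # Prompt layer: structure instructions and retrieved context for the model.
    return (
        "You are a support agent for ACME. Answer only from the context.\n"
        f"Context: {context}\nCustomer: {query}\nAgent:"
    )

def call_model(prompt: str) -> str:
    # Stand-in for the real LLM API call.
    return "You can reset your password from the account settings page."

def passes_guardrails(reply: str) -> bool:
    # Post-processing guardrail: e.g., never let the bot promise refunds.
    return "refund" not in reply.lower()

def answer(user_query: str, context: str) -> str:
    query = preprocess(user_query)
    reply = call_model(build_prompt(query, context))
    if not passes_guardrails(reply):
        # Escalate instead of emitting an off-policy reply.
        return "Let me connect you with a human agent."
    return reply

print(answer("  How do I reset my password? ", "Password resets: account settings page."))
```

The point of the sketch is that most of the agent's observable behavior is decided in these surrounding functions, not inside `call_model`.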
Moreover, Elsaid acknowledges that models do have their quirks. When swyx asked whether he tunes his harness individually for different models, Elsaid responded, “It's a bit different, but it's still very close. Some models respond differently; the simplest example is instructions. Some models you have to tell, 'Don't talk too much,' because they just love to ramble.” This minor divergence, however, pales in comparison to the general challenges of building a robust harness.
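One common way to handle these quirks, sketched below, is to keep a small per-model configuration layer rather than forking the whole harness. The model names and instruction strings here are invented for illustration:

```python
# Per-model instruction tweaks layered on a shared base prompt.
# Model identifiers and quirk strings are hypothetical examples.
MODEL_QUIRKS = {
    "rambling-model": "Keep answers under three sentences.",
    "terse-model": "",  # no extra instruction needed
}

def system_prompt(model: str) -> str:
    base = "You are a helpful customer support agent."
    extra = MODEL_QUIRKS.get(model, "")
    return f"{base} {extra}".strip()

print(system_prompt("rambling-model"))
print(system_prompt("terse-model"))
```

With this shape, the core harness logic stays shared, and swapping or adding a model means editing one dictionary entry rather than re-tuning every prompt.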
The Hidden Cost of Switching LLMs
Another common misconception Elsaid and the Latent Space hosts debunk is the idea of zero switching costs between foundational models. The prevailing startup narrative suggests that if you don't like OpenAI, you can just plug in Anthropic or Gemini with minimal effort. This isn't true.
swyx observed, “People keep saying the cost of switching between models is zero. It's cheap, but it's not zero, because sometimes you spend three or four months fine-tuning exactly how the model, or how the product, should be, and then once you change the model, it's bad.” Alessio Fanelli, also on the podcast, echoed this, adding, “Especially in customer service, I'm guessing guardrails would break, and stuff that you want to hand off to a human is no longer consistent. So there's quite a switching cost there.”
Even small differences in how models interpret prompts, adhere to constraints, or structure their responses can ripple through a complex harness. That means months of painstaking prompt engineering, guardrail adjustments, and fine-tuning might need to be largely redone. What seems like a simple API swap actually entails a significant re-engineering effort, potentially delaying product improvements and increasing operational overhead.
What to Do With This
If you're building an AI agent, shift your engineering focus. Dedicate a significant portion of your initial development time to designing a modular and robust harness architecture. Before you even commit to a specific LLM provider, prototype your pre-processing, prompt structure, guardrails, and post-processing with simpler, smaller models or even open-source options. This will expose harness weaknesses early and make future LLM swaps less like a full rebuild and more like a targeted adjustment.
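One concrete way to keep a swap targeted, sketched below under assumed names, is to have the harness depend on a narrow interface rather than any vendor SDK. `LLMClient`, `StubModel`, and `run_harness` are all hypothetical; in practice each provider would get its own thin adapter implementing the same interface:

```python
from typing import Protocol

class LLMClient(Protocol):
    """The only surface the harness sees; each provider gets an adapter."""
    def complete(self, prompt: str) -> str: ...

class StubModel:
    """Stand-in for a real provider adapter (e.g., wrapping a vendor SDK)."""
    def __init__(self, canned_reply: str) -> None:
        self.canned_reply = canned_reply

    def complete(self, prompt: str) -> str:
        return self.canned_reply

def run_harness(client: LLMClient, query: str) -> str:
    # All harness logic (prompting, guardrails, formatting) lives here,
    # independent of which provider is plugged in.
    prompt = f"Customer: {query}\nAgent:"
    return client.complete(prompt)

print(run_harness(StubModel("Sure, I can help with that."), "Where is my order?"))
```

This doesn't eliminate the re-tuning cost swyx describes, since the new model's temperament still differs, but it confines the blast radius of a swap to the adapter and the per-model prompt tweaks.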