Better models or better validation systems: what matters more now?


Zeeshan

Updated 6 days ago in

Recent updates from OpenAI highlight a clear shift. Models are getting better at reasoning, reducing factual errors, and handling complex workflows.

But even with improvements:

hallucinations still exist
confidence doesn’t always equal correctness
production risk hasn’t disappeared

This creates a real challenge for teams building with LLMs:

response = llm.generate(query)

# If the response fails validation, route the query to a fallback path
if not validate(response):
    response = fallback_system(query)

Even with stronger models, validation layers, guardrails, and system design still play a critical role.
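
To make the pattern above concrete, here is a minimal, runnable sketch. Everything in it is an assumption for illustration: `answer_with_guardrail`, `fake_llm`, `mentions_paris`, and `canned_fallback` are hypothetical stand-ins, not part of any real SDK, and the retry count is arbitrary. A real deployment would plug in its own model client and validators.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    text: str
    source: str  # "model" if a generated answer passed validation, else "fallback"


def answer_with_guardrail(
    query: str,
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    fallback: Callable[[str], str],
    max_attempts: int = 2,
) -> Result:
    """Try the model up to max_attempts times; use the fallback if nothing validates."""
    for _ in range(max_attempts):
        candidate = generate(query)
        if validate(candidate):
            return Result(candidate, "model")
    return Result(fallback(query), "fallback")


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_llm(query: str) -> str:
    # Simulates a confident but wrong model answer.
    return "The capital of France is Lyon."


def mentions_paris(response: str) -> bool:
    # Toy validator; real checks might use schemas, retrieval, or rules.
    return "Paris" in response


def canned_fallback(query: str) -> str:
    return "I could not verify an answer; escalating to a human reviewer."


result = answer_with_guardrail(
    "What is the capital of France?", fake_llm, mentions_paris, canned_fallback
)
print(result.source)  # prints "fallback"
```

Recording where the answer came from (`source`) is one possible design choice: it lets a team measure how often the fallback fires, which is a rough signal of how much the validation layer is actually catching.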

So the real question becomes:
Are we over-relying on better models to solve reliability, or should more focus shift toward building stronger control systems around them?

How are you approaching this in real-world deployments?



OpenAI