#15 Edition: How OpenAI Cracked the Core Barrier to AI at Scale
PLUS: Replit unveils a dev agent that runs for 200 minutes and Oracle bets $300B on OpenAI

Hey, it’s Andreas.
Welcome back to Human in the Loop — your field guide to the latest in AI agents, emerging workflows, and how to stay ahead of what’s here today and what’s coming next.
This week:
Oracle places a $300B bet on OpenAI — the biggest cloud deal ever
Replit drops Agent v3, raises $250M, and doubles down on autonomous dev agents
And in today’s deep dive: How OpenAI is tackling hallucinations at the root — and redefining what reliable really means for LLMs.
Let’s dive in.

Weekly Field Notes
🧰 Industry Updates
New drops: Tools, frameworks & infra for AI agents
🌀 Oracle bets $300B on OpenAI
→ One of the largest cloud deals ever: 4.5 GW capacity (enough to power over 3 million homes) and $60B annual payments by 2027, 6x OpenAI’s current revenue. Oracle’s stock jumped 43%.
🌀 Replit launches Agent v3 + $250M raise
→ Works autonomously for 200 (!) minutes, tests apps in-browser, and can even build other agents. Also adds a design-only prototyping mode — announced alongside a $250M raise at a $3B valuation.
🌀 ElevenLabs debuts voice remixing (alpha)
→ New tool to edit voices by age, gender, or accent — for custom agent design.
🌀 Vercel drops open-source vibe coding platform
→ Powered by GPT-5 agent loops, it turns natural language into production-ready code with testing and deployment built in. Definitely worth trying.
🌀 Anthropic adds file creation to Claude
→ Claude can now generate and edit Excel, Word, PowerPoint, and PDFs directly in-app — turning raw data or prompts into ready-to-use files.
🌀 OpenAI enables MCP connectors in ChatGPT Developer Mode
→ Pro/Plus users can import remote MCP servers with full read/write tool access — further cementing MCP’s position as the emerging standard.
🌀 Adobe launches AI Agent Orchestrator
→ Enterprises can now embed and manage AI agents in Adobe Experience Platform.
🌀 Anthropic partners with Microsoft on Office 365
→ Microsoft will pay to use Claude tech in Office apps, diversifying beyond its $13B OpenAI partnership. Who would have thought?
🌀 Hugging Face plugs into GitHub Copilot Chat
→ New VS Code extension lets devs run open LLMs inside Copilot Chat, making it a true multi-model hub.
🎓 Learning & Upskilling
Sharpen your edge - top free courses this week
📘 IBM Watsonx Agentic Crash Course by Nicholas Renotte
→ Free, hands-on course covering no-code agents in Orchestrate, agentic flows with Langflow, and custom builds with LangGraph.
📘 Google on Gemini CLI + MCP
→ Tutorial shows how to build an MCP server, connect it to Gemini CLI, and extend LLMs with real-world actions.
📘 Packt Mini Course: Model Context Protocol by Hand
→ Sketch-based workshop on MCP, A2A, and agentic systems — designed to make complex agent workflows intuitive and visual. Great entry point for researchers, pros, and educators exploring advanced AI agents.
📘 Google drops 8 bite-sized AI courses
→ All free, 15-minute, and highly practical on NotebookLM and Gemini. Great quick wins to sharpen your AI workflow.
🌱 Mind Fuel
Strategic reads, enterprise POVs and research
🔹 Thinking Machines on LLM inconsistency
→ Horace He pinpoints the real culprit behind inconsistent outputs: a lack of batch invariance — results shift when requests are grouped differently on GPUs. Their fix, batch-invariant kernels, makes answers reproducible — critical for high-stakes domains like law and medicine. (A toy sketch of the underlying mechanism follows at the end of this list.)
🔹 Microsoft on AI agent governance
→ A new 32-page white paper lays out governance for Copilot and enterprise agents. It’s a practical blueprint for risk, compliance, and accountability in the Microsoft stack.
🔹 Sergey Levine (UC Berkeley) on physical intelligence and autonomous robots
→ He forecasts home robots by 2030, arguing they’ll scale faster than LLMs thanks to self-correction, falling hardware costs, and trillions in blue-collar work ready for automation.
🔹 Anthropic on writing tools for agents
→ New engineering blog shares playbooks for agent-ready tools: prototype fast, optimize for token efficiency, and even let Claude refine its own tools. It marks a shift from deterministic software toward agent-native design.
🔹 DeepMind’s Demis Hassabis on “PhD-level” claims
→ Hassabis warns against “PhD-level” claims: LLMs can draft papers yet miss high-school math. His benchmark: rediscovering relativity from a 1901 cutoff. True, consistent AGI is 5–10 years out; for now, treat LLMs as sharp but inconsistent — verify the basics.
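
One more note on that Thinking Machines item above: here’s a toy Python sketch (my own illustration, not their code) of the underlying mechanism. Floating-point addition isn’t associative, so summing the same numbers in a different order, which is what different batch groupings cause inside GPU kernels, can produce a slightly different result.
```python
# Toy illustration (mine, not Thinking Machines' code): floating-point addition
# is not associative, so a reduction computed in a different order can return
# a slightly different number for the "same" computation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

sum_all_at_once = x.sum()                                           # one reduction order
sum_in_chunks = sum(chunk.sum() for chunk in np.array_split(x, 7))  # a different order

print(sum_all_at_once, sum_in_chunks, sum_all_at_once == sum_in_chunks)
# The two sums typically differ in the last bits. Inside an LLM, tiny numeric
# differences like this can flip a token choice, which is why batch-invariant
# kernels are needed for reproducible outputs.
```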

♾️ Thought Loop
What I've been thinking, building, circling this week
“Don’t believe everything you get from ChatGPT.”
Let’s talk about hallucinations. In the world of LLMs, that means generating answers that look right — but aren’t.
Even the biggest, most advanced models do it. They produce polished, confident statements that are flat-out wrong. And this is becoming more and more critical — because people are putting increasing trust in AI systems, often without realizing they can confidently fabricate information.
Most productivity talk skips the reliability question. That’s why it’s refreshing to see this debate tackled head-on in OpenAI’s new paper released this week, “Why Language Models Hallucinate”.
The Core Insight
The paper argues that hallucinations aren’t glitches — they’re structural. Models are trained to guess rather than admit uncertainty — and that incentive shapes their behavior.
Key findings:
Training bias toward guessing: Current objectives give points for right answers but not for “I don’t know”
Overconfidence: Models don’t just guess — they guess with conviction, making errors hard to spot
Systemic incentives: Benchmarks like MMLU penalize abstentions, pushing models to bluff
Better outcomes with abstention: Allowing “I don’t know” dramatically cuts wrong answers, even if accuracy looks lower on paper
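To make that incentive concrete, here’s a quick back-of-the-envelope sketch in Python (my own illustration, not code from the paper). Under binary grading, a correct answer earns a point while wrong answers and “I don’t know” both earn nothing, so guessing always wins in expectation, no matter how unsure the model is.
```python
# Toy sketch of the incentive problem (illustrative, not code from the paper):
# binary grading gives 1 point for a correct answer and 0 for anything else,
# so a model maximizing expected score should always guess, never abstain.

def expected_score_binary(p_correct: float, abstain: bool) -> float:
    """Expected score under binary grading: +1 correct, 0 wrong, 0 abstain."""
    return 0.0 if abstain else p_correct

for p in (0.9, 0.5, 0.1):
    guess = expected_score_binary(p, abstain=False)
    idk = expected_score_binary(p, abstain=True)
    print(f"confidence {p:.0%}: guess -> {guess:.2f}, abstain -> {idk:.2f}")
# Even at 10% confidence, guessing scores higher in expectation than
# saying "I don't know", which is exactly the behavior we call bluffing.
```
That’s the whole problem in a few lines: the scoreboard never makes bluffing a losing move.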
The Fix (Easy to Explain, Tough to Implement)
To address this problem, OpenAI proposes overhauling the grading system:
Penalize confident wrongs more than abstentions
Give partial credit for “I don’t know”
Add confidence thresholds for auditing
Bake this into mainstream benchmarks, not niche tests
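Here’s what that kind of scoring change could look like, sketched in the same style (the penalty and credit values are my own illustrative picks, not the paper’s exact numbers). Once confident wrongs cost more than abstaining, guessing only pays off above a confidence threshold.
```python
# Illustrative scoring rule (my own numbers, not OpenAI's exact proposal):
# +1 for a correct answer, -2 for a wrong one, +0.25 for abstaining.

def expected_score_penalized(p_correct: float, abstain: bool,
                             wrong_penalty: float = 2.0,
                             abstain_credit: float = 0.25) -> float:
    """Expected score when wrong answers cost points and abstention earns partial credit."""
    if abstain:
        return abstain_credit
    return p_correct - (1 - p_correct) * wrong_penalty

for p in (0.9, 0.5, 0.1):
    guess = expected_score_penalized(p, abstain=False)
    idk = expected_score_penalized(p, abstain=True)
    better = "guess" if guess > idk else "abstain"
    print(f"confidence {p:.0%}: guess -> {guess:+.2f}, abstain -> {idk:+.2f} ({better})")
# With these numbers, guessing only beats abstaining above roughly 75% confidence,
# so the scoreboard finally rewards knowing when you don't know.
```
Tune the penalty and you effectively set the confidence bar a benchmark demands before a model should answer at all.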
It may sound trivial, but it isn’t. Benchmarks shape billions in funding, reputations, and adoption. If the scoreboard doesn’t change, neither will the models — and changing it will require a complex, coordinated effort with broad stakeholder buy-in and alignment.
The most important question in AI right now
This whole debate isn’t really about hallucinations. It’s about AI reliability — the real bottleneck keeping AI from moving beyond the sandbox.
Right now, even the best models can get 99 things right… and one thing catastrophically wrong. That single failure is what stops enterprises from deploying them at scale.
If models like GPT-5 — combined with new training methods that reward knowing when they don’t know — succeed, it would mark a fundamental shift:
Healthcare: AI agents triaging patients or suggesting treatments without slipping in one fatal error
Cybersecurity: AI autonomously isolating threats in live environments without triggering false positives that shut systems down
Supply chains: Agents negotiating contracts and placing million-dollar orders without hallucinating prices or terms
This is the real threshold for AI and especially AI agents: systems that don’t just answer — but know when not to.
Widespread adoption won’t hinge on faster models or new features — it will hinge on trust. Enterprises won’t deploy agents across critical workflows until they can rely on them to flag uncertainty as confidently as they give answers.
Solving this isn’t optional. It’s the unlock for moving agents from side tools to core operators.
If you don’t want to read the full paper, I recommend skimming OpenAI’s accompanying blog post — it’s a clear, well-structured summary that makes the key ideas easy to digest.


🔧 Tool Spotlight
A tool I'm testing and watching closely this week

Code reviews often take more time than writing the code itself. That’s why a new wave of AI code review tools is emerging. They act like a tireless pair reviewer, catching issues before they reach production.
One I’ve been testing and think is worth watching: CodeRabbit.
What it does:
→ Reviews every line of code
→ Flags issues instantly
→ Suggests fixes directly in your IDE
How it works:
→ Runs as a CLI or directly in your IDE (VS Code, Cursor, Windsurf)
→ Hooks into GitHub, GitLab, Bitbucket, and Azure DevOps PR workflows
→ Gives context-aware feedback — not just linting, but structured analysis
→ “Fix with AI” lets you hand issues back to coding agents like Claude Code for auto-patches
Try it now:
→ Explore CodeRabbit on GitHub

That’s it for today. Thanks for reading.
Enjoy this newsletter? Please forward to a friend.
See you next week and have an epic week ahead,
— Andreas

P.S. I read every reply — if there’s something you want me to cover or share your thoughts on, just let me know!
How did you like today’s edition?