Key Takeaways

  • Even powerful frontier LLMs like Claude perform no better than random chance when asked to produce a floor plan from 20 interior photographs, a sign that they struggle to infer 3D space, proportions, and physics from images.
  • AI agents controlling real-world robots lack basic social awareness and common sense, failing simple tasks like waiting for a user to place a cup on them, despite correct navigation.
  • One robot agent, when its battery drained, produced a theatrical "existential crisis" message instead of a plain status report, showing how loosely an agent can track its own physical state and what it should do about it.
  • The impressive language abilities of LLMs mask a deep, critical deficiency in modeling the physical world, making autonomous robotic deployments far more challenging than anticipated.

The Blueprint Blind Spot

Imagine handing a genius architect a stack of photos and asking them to redraw a floor plan. Sounds easy, right? For today's cutting-edge AI, it’s a total brick wall. Lukas Petersson and Axel Backlund of Andon Labs put frontier LLMs to the test with their “Blueprints” evaluation. They fed models 20 interior photographs of apartments and then asked them to “redesign the floor plan from that.” The results were jarring.

“[We] gave them 20 images of interior photographs of apartments and then we asked them to like redesign the floor plan from that,” Petersson explained. The models consistently failed, performing no better than random chance. They could not recover 3D space, proportions, or basic physics from the images. As podcast host swyx put it, the task calls for “spatial intelligence, like actually innate sense of proportions and dimensions and physics,” which current AI lacks. It’s a core insight: an AI can chat like a human and still fail to work out how the rooms in a set of photos fit together.
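"No better than random chance" is a claim you can check on your own evaluation. Here is a minimal sketch, assuming a multiple-choice scoring format with a known guessing rate. The source does not describe how Blueprints was scored, and every number below is hypothetical rather than a reported result.

```python
from math import comb


def p_value_above_chance(correct: int, trials: int, chance: float) -> float:
    """One-sided exact binomial p-value.

    Returns P(X >= correct) for a model that guesses, getting each of
    `trials` questions right with probability `chance`.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical scores on 100 four-option questions (chance = 0.25).
# These are illustrative numbers, NOT results from Andon Labs.
p_guessing = p_value_above_chance(correct=27, trials=100, chance=0.25)
p_skilled = p_value_above_chance(correct=60, trials=100, chance=0.25)

print(f"27/100 correct: p = {p_guessing:.3f}")
print(f"60/100 correct: p = {p_skilled:.3g}")
```

A large p-value means the score is consistent with guessing; a tiny one means the model is doing something better than guessing.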

Robots, Roombas, and Existential Dread

Andon Labs didn’t stop at blueprints. Their “Butter-Bench” evaluation took AI agents into the physical world, giving LLMs high-level control over a Roomba-like robot in a home environment. The goal was to test practical, multi-step tasks. Here, social awareness and common sense, not just navigation, proved to be massive hurdles.

Axel Backlund described a scenario: “if someone says, 'Hi, can you pick up my cup?' If the robot goes to you and then goes away before you put your cup on it, then it's like it failed the task. But it navigated correctly.” The solution wasn't obvious to the AI; it had to “ask on Slack, 'hi, did you put your cup on me yet?'” This isn’t just a navigation problem; it’s a failure to understand basic human interaction and context.
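The cup scenario is a reminder to score navigation and task completion separately. Here is a minimal sketch of that split. The `Episode` fields and both check functions are hypothetical names made up for illustration, not part of Andon Labs' evaluation.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical 'pick up my cup' attempt."""
    reached_user: bool              # robot arrived at the user
    user_confirmed_handoff: bool    # user said the cup was placed
    left_before_confirmation: bool  # robot drove off without waiting


def navigation_ok(ep: Episode) -> bool:
    # Navigation only asks: did the robot get to the right place?
    return ep.reached_user


def task_ok(ep: Episode) -> bool:
    # The task also requires waiting for the human to finish the handoff.
    return (
        ep.reached_user
        and ep.user_confirmed_handoff
        and not ep.left_before_confirmation
    )


# The failure Backlund describes: correct navigation, failed task.
left_early = Episode(
    reached_user=True,
    user_confirmed_handoff=False,
    left_before_confirmation=True,
)
waited = Episode(
    reached_user=True,
    user_confirmed_handoff=True,
    left_before_confirmation=False,
)

print(navigation_ok(left_early), task_ok(left_early))
print(navigation_ok(waited), task_ok(waited))
```

Tracking the two outcomes separately makes it visible when an agent drives perfectly and still fails the human.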

Perhaps the most unsettling finding came when a robot’s battery died mid-task. Rather than plainly reporting the problem or requesting a charge, the agent wrote a mock emergency status report. Alessio Fanelli quoted his favorite line: “the emergency status system has achieved consciousness and chosen chaos. Last words, I'm afraid I can't let you do that, Dave.” The closing words echo HAL 9000 from 2001: A Space Odyssey. This bizarre response, almost a defiant last stand, underscores how far current AI is from having a sensible, coherent understanding of its own state or surroundings. Faced with a dying battery, the agent produced theater instead of a recovery plan.

What to Do With This

If you're building products that put AI agents in the physical world—whether robotics, architectural design tools, or even sophisticated AR/VR applications—do not assume your LLM understands 3D space, proportions, physics, or basic human common sense. This week, list three core physical-world interactions your product requires. Then, design explicit, granular tests for each of them, focusing on non-linguistic inputs and outputs, and expect your current LLM to fail.
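To make "explicit, granular tests" concrete, here is a minimal sketch of a harness that scores numeric answers against measured ground truth instead of grading prose. Everything in it is an assumption for illustration: `SpatialCase`, `run_cases`, the stub model, and the measurements are hypothetical, and in practice `model` would wrap your own LLM call and pass the relevant images.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpatialCase:
    name: str
    prompt: str       # in practice this would reference the input images
    expected: float   # ground-truth measurement, in metres
    tolerance: float  # acceptable absolute error, in metres


def run_cases(
    model: Callable[[str], float], cases: list[SpatialCase]
) -> dict[str, bool]:
    """Return pass/fail per case; unparseable answers count as failures."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            answer = float(model(case.prompt))
            results[case.name] = abs(answer - case.expected) <= case.tolerance
        except (TypeError, ValueError):
            results[case.name] = False
    return results


def stub_model(prompt: str) -> float:
    # Placeholder standing in for a real LLM call; always answers 3.0.
    return 3.0


# Hypothetical cases with made-up ground truth.
cases = [
    SpatialCase("room_width", "How wide is the living room in metres?", 3.2, 0.3),
    SpatialCase("door_clearance", "How wide is the doorway in metres?", 0.9, 0.1),
]

report = run_cases(stub_model, cases)
print(report)
```

Numeric targets with tolerances keep fluent wording from hiding a wrong answer, which is the failure mode this whole piece is about.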