The famous O3 "GeoGuessr" prompt did not work

ingve 45 points 13 comments May 21, 2026
www.seangoedecke.com

Discussion Highlights (6 comments)

grebc

I wonder whether, in all the sampling, all location metadata was stripped.

mickeyp

This test would be a lot more useful if the author used images the models obviously hadn't seen before. Pulling images from Wikipedia? They'll have seen 'em before, and the metadata, and all the pages they were casually linked to. The premise that the long prompt only made the model think 'a second longer' may have more to do with the fact that it knows about the images. So why think harder if you know the answer? At no point does the author contemplate that.

vintermann

Interesting what he reports, that newer models are worse at geolocation. Sorry if I'm getting paranoid, but I wonder if that's a deliberately nerfed capability.

Gys

> I think this shows how easy it is to fool yourself about the quality of prompting. When the model is already pretty good at a task, you can give it a very elaborate prompt without impacting performance. It’ll still be pretty good, except this time it’s good because of what you did.

fontain

“It’s also interesting to me that nobody checked this at the time. It took me about six hours of fairly-distracted work and about $15 to construct and run this benchmark. Why didn’t anyone do this when they were writing articles about how good the o3 prompt was?” Because the meta around AI is not rigorous reporting on the nuance of capabilities but bold claims that are easy to retweet. There is no incentive to say “actually, AI is not good at this”. Nobody checked it because nobody cares. There are lots of tasks that AI can be useful for but almost all of the headline claims (including Mythos) are exaggerated at best and bunk at worst.

_fs

I still return to O3 often. I enjoy metal detecting and O3 has been excellent at identifying unknown finds. It will spend 5-10 minutes in Python adjusting the photo, zooming, cropping and manipulating it to get a better understanding of the object. And its guesses, though not perfect, are often spot on. Newer models will never manipulate the photo and usually give a "guess" within 30 seconds. The guesses coming from the newer models are rarely even in the ballpark of the item. It will be a sad day when O3 goes away.
