Where the goblins came from

ilreb 313 points 138 comments April 30, 2026

Discussion Highlights (20 comments)

maxdo

article : bla blah blah, marketing... we are fun people, bla blah, goblin, we will not destroy the world you live in.. RL rewards bug is a culprit. blah blah.

nomilk

> We unknowingly gave particularly high rewards for metaphors with creatures.

I recall a math instructor who would occasionally refer to variables (usually represented by intimidating Greek letters) as "this guy". Weirdly, the casual anthropomorphism made the math seem more approachable. Perhaps 'metaphors with creatures' has a similar effect, i.e. makes a problem seem more cute/approachable.

On another note, buzzwords spread through companies partly because they make the user of the buzzword sound smart relative to peers, thus increasing status (examples: "big data" circa 2013, "machine learning" circa 2016, "AI" circa 2023-present). The problem is the reputation boost is only temporary; as soon as the buzzword is overused (by others or by the same individual) it loses its value. Perhaps RLHF optimises for the best 'single answer', which may not sufficiently penalise use of buzzwords.

JoshTriplett

A plausible theory I've seen going around: https://x.com/QiaochuYuan/status/2049307867359162460

dakolli

Ahh I see. I guess when I turned off privacy settings and allowed training on my code, then generated 10 million .md files with random fantasy books, the poisoning worked. Keep using AI and you'll become a goblin too.

recursivedoubts

> Why it matters

I despise this title so much now.

tim-tday

So, you brain damaged your model with a system prompt.

canpan

I wonder how training data is balanced. If you put in too much Wikipedia, does your model sound like a walking encyclopedia? After doing the Karpathy tutorials I tried to train my AI on the tiny stories dataset. Soon I noticed that my AI was always using the same name for its story characters. The dataset itself uses that name consistently often.

themafia

> You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking.

Just, the mentality required to write something like that, and then base part of your "product" on it. Is this meant to be of any actual utility or is it meant to trap a particular user segment into your product's "character"?

ninjagoo

> the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.

> The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them

> Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.

Sounds awfully like the development of a culture or proto-culture. Anyone know if this is how human cultures form/propagate? Little rewards that cause quirks to spread?

Just reading through the post, what a time to be an AInthropologist. Anthropologists must be so jealous of the level of detailed data available for analysis.

Also, clearly even in AI land, Nerdz Rule :)

PS: if AInthropologist isn't an official title yet, chances are it will be one in the near future. Given the massive proliferation of AI, it's only a matter of time before AI/Data Scientist becomes a rather general term and develops a sub-specialization of AInthropologist...

ollin

For context, two days ago some users [1] discovered this sentence reiterated throughout the codex 5.5 system prompt [2]:

> Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.

[1] https://x.com/arb8020/status/2048958391637401718
[2] https://github.com/openai/codex/blob/main/codex-rs/models-ma...

postalcoder

Would love if OpenAI did more of these types of posts. Off the top of my head, I'd like to understand:

- The sepia tint on images from gpt-image-1
- The obsession with the word "seam" as it pertains to coding

Other LLM phraseology that I cannot unsee is Claude's "___ is the real unlock" (try googling it or searching Twitter!). There's no way that this phrase is overrepresented in the training data; I don't remember people saying that frequently.

hsuduebc2

I. Love. This.

jumploops

TIL gremlins weren’t just used to explain mysterious mechanical failures in airplanes; that's the origin story of the term ‘gremlin’ itself [0]. I had always assumed there was some previous use of the term, neat!

[0] https://en.wikipedia.org/wiki/Gremlin

acuozzo

Weird. I thought they came from Nilbog.

innis226

I suspect this was intentionally added. Just to give some personality and to fuel hype

iterateoften

This is funny because it’s a silly topic, but I think it shows something seriously wrong with LLMs. The goblins stand out because it’s obvious. Think of all the other crazy biases latent in every interaction that we don’t notice because they're not as obvious. Absolutely terrifying that OpenAI is so casual about the fact that such subtle training biases were hard enough to contain that a fix had to be added to the system prompt.

albert_e

If a tiny misconfiguration of the reward system can cause such a noticeable annoyance ... what dangers lurk beneath the surface? This is not funny.

x0x7

I suspected OpenAI was actively training their models to be cringy on the theory that it's charming. Turns out it's true. And they only see a problem when it narrows down to one predilection. But they should have seen it was bad long before that.

ComputerGuru

The explanation is very concerning. Lexical tidbits shouldn’t be learnt and reinforced across cross sections. Here, gremlin and goblin went from being selected for in the nerdy profile to being selected for in all profiles. The solution was easy: don’t mention goblins. But what about when the playful profile reinforces usage of emoji and their usage creeps up in all other profiles accordingly? Ban emoji everywhere? Now do the same thing for other words, concepts, approaches? It doesn’t scale! It seems like models can be permanently poisoned.

pants2

Nice, OpenAI mentioned my HackerNews post in their article :) I appreciate that they wrote a whole blog post to explain! https://news.ycombinator.com/item?id=47319285
