We found an undocumented bug in the Apollo 11 guidance computer code

henrygarner 407 points 192 comments April 07, 2026

Discussion Highlights (20 comments)

josephg

Super interesting. I wish this article wasn’t written by an LLM though. It feels soulless and plastic.

yodon

This is so insightfully and powerfully written I had literal chills running down my spine by the end. What a horrible world we live in where the author of great writing like this has to sit and be accused of "being AI slop" simply because they use grammar and rhetoric well.

jwpapi

Has someone verified this was an actual bug? One of AI’s strengths is definitely exploration, e.g. in finding bugs, but it still has a high false positive rate. Depending on context, that matters or it won’t. Also, one has to be aware that there are a lot of bugs that AI won’t find but humans would. I don’t have the expertise to verify this bug actually happened, but I’m curious.

wg0

Someone please amend the title and add "using claude code" because that's customary nowadays.

riverforest

Software that ran on 4KB of memory and got humans to the moon still has undiscovered bugs in it. That says something about the complexity hiding in even the smallest codebases.

MeteorMarc

Are there any consequences for the Artemis 2 mission (ironic)?

ChicagoBoy11

For anyone who liked this, I highly suggest you take a look at the CuriousMarc YouTube channel, where he chronicles lots of efforts to preserve and understand several parts of the Apollo AGC, with a team of really technically competent and passionate collaborators. One of the more interesting things they have been working on is a potential re-interpretation of the infamous 1202 alarm. It is, as of current writing, popularly described as something related to nonsensical sensor readings which could be (and were) safely ignored in the actual moon landing. However, if I remember correctly, some of their investigation revealed that there were many conditions under which that error would have been extremely critical and would likely have doomed the astronauts. It is super fascinating.

chrisjj

> The specs were derived from the code itself

Oh dear. I strongly suggest this author look specification up in a dictionary.

buredoranna

Still my all time favorite snippet of code.

    TC      BANKCALL        # TEMPORARY, I HOPE HOPE HOPE
    CADR    STOPRATE        # TEMPORARY, I HOPE HOPE HOPE
    TC      DOWNFLAG        # PERMIT X-AXIS OVERRIDE

https://github.com/chrislgarry/Apollo-11/blob/master/Luminar...

iJohnDoe

Fascinating read. Well done. Everyone involved in the Apollo program was amazing, and the program had many unsung heroes.

esafak

An application of their specification language: https://juxt.github.io/allium/

It seems the difference between this and conventional specification languages is that Allium's specs are in natural language, and enforcement is by LLM. This places it in a middle ground between unstructured plan files and formal specification languages. I can see this as a low-friction way to improve code quality.

kmeisthax

> Rust’s ownership system makes lock leaks a compile-time error.

Rust specifically does not forbid deadlocks, including deadlocks caused by resource leaks. There are many ways in safe Rust to deliberately leak memory, either by creating reference count cycles or through the explicit .leak() methods on various memory-allocating structures in std. It's also not entirely useless to do this: if you want an &'static from heap memory, Box::leak() does exactly that.

Now, that being said, actually writing code to hold a LockGuard forever is difficult, but that's mainly because the Rust type system is incomplete in ways that primarily inconvenience programmers but don't compromise the safety or meaning of programs. The borrow checker runs separately from type checking, so there's no way to represent a type that both owns and holds a lock at the same time. Only stacks and async types, both generated by compiler magic, can own a LockGuard. You would have to spawn a thread and have it hold the lock and loop indefinitely[0].

[0] Panicking in the thread does not deadlock the lock. Rust's std locks are designed to mark themselves as poisoned if a LockGuard is unwound by a panic, and any attempt to lock them will yield an error instead of deadlocking. You can, of course, clear the poison condition in safe Rust if you are willing to recover from potentially inconsistent data half-written by a panicked thread. Most people just unwrap the lock error, though.
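A minimal sketch of the two behaviours described in this comment, using only std. It shows that a lock guard can be leaked in safe Rust without any compile-time error, and that a panic while holding a guard poisons the mutex instead of leaving it locked. The use of `std::mem::forget` is this sketch's choice of leak mechanism (the comment itself mentions reference cycles and `.leak()` methods), and the function names `leak_guard` and `poisoned_after_panic` are made up for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Leak a guard in safe Rust: the compiler accepts this, and the mutex
/// stays locked because the guard's destructor never runs.
pub fn leak_guard(m: &Mutex<u32>) {
    let guard = m.lock().unwrap();
    std::mem::forget(guard); // safe function; running destructors is not guaranteed
}

/// Panic in a thread that holds the guard, then report whether the mutex
/// ended up poisoned (rather than permanently locked).
pub fn poisoned_after_panic() -> bool {
    let m = Arc::new(Mutex::new(0u32));
    let m2 = Arc::clone(&m);
    // join() returns Err because the thread panicked; that is expected here.
    let _ = thread::spawn(move || {
        let _guard = m2.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
    let poisoned = m.is_poisoned();
    poisoned
}

fn main() {
    let m = Mutex::new(0u32);
    leak_guard(&m);
    // try_lock() reports the lock as unavailable; a blocking lock() on this
    // mutex would never return.
    assert!(m.try_lock().is_err());
    assert!(poisoned_after_panic());
    println!("leaked guard keeps the mutex locked; a panicking holder poisons it");
}
```

The first case is the analogue of the leak under discussion: nothing in the type system stops a held lock from never being released. The second case is the footnote's point that an unwinding panic releases the lock and marks it poisoned.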

totalmarkdown

is this bug the reason why the toilet malfunctioned?

croemer

I've had a look at the (vibe coded) repro linked in the article to see if it holds up: https://github.com/juxt/agc-lgyro-lock-leak-bug/blob/c378438... The repro runs on my computer, that's positive. However, Phase 5 (deadlock demonstration) is entirely faked. The script just prints what it _thinks_ would happen. It doesn't actually use the emulator to prove that its thinking is right. Classic Claude being lazy (and the vibe coder not verifying). I've vibe coded a fix so that the demonstration is actually done properly on the emulator. And also added verification that the 2 line patch actually fixes the bug: https://github.com/juxt/agc-lgyro-lock-leak-bug/pull/1

djmips

I think it's interesting that they found what seems to be a real bug (should be independently verified by experts). However, I find their story-mode dramatization of how it could have happened to be poorly researched and fully in the realm of fiction. An elbow bumping a switch, the command module astronaut unable to handle the issue, with only a faux nod to the fact that a reset would have cleared up the problem and that it was part of their training. So it's really just building tension and storytelling to make the whole post more edgy. And yes, this is 100% AI-written prose, which makes it even more distasteful to me.

parliament32

Both the article and repo[1] are slop. [1] In the repo, the "reproduce" is just a bunch of print statements about what would happen, the bug isn't actually triggered: https://github.com/juxt/agc-lgyro-lock-leak-bug/blob/c378438...

bsoles

Another CTO "published" an AI slop to get attention to their vibe-coded company that will disappear in two years. Tell me something new...

callamdelaney

More likely the llm misinterpreted something and hallucinated an error. Just yesterday Claude code hallucinated itself an infinite loop.

garaetjjte

This article is garbage.

>The Apollo Guidance Computer (AGC) is one of the most scrutinised codebases in history.

What? AGC programs were developed by a relatively small team and pretty much left alone since then. The architecture is rather quirky when viewed with modern sensibilities. There aren't many people who are familiar with it. Compare it to widely used software like libcurl or sqlite. Or perhaps to Super Mario Bros, which was extensively analyzed for competitive speedrunning. Surely that dwarfs the amount of knowledge about Apollo code.

>2K of erasable RAM and a 1MHz clock. The AGC’s programs were stored in 74KB of core rope

How about picking a unit and staying with it? The AGC has 2K words of RAM, where each word has 15 bits of usable data (physically it's 16 bits, but one bit is used for parity). The maximum amount of ROM that could be installed is 36K words. (But they switch to KB, which is not only inconsistent with the previous sentence, but the number is also wrong! It's 72 KiB (73.728 KB) or 67.5 KiB (69.12 KB), depending on whether you include parity or not.) (A maximum of 64K ROM words could be addressed by the architecture's design, but that isn't available in any real hardware.)

And yes, there is a 1.024 MHz clock in the system, which is relevant for peripherals, but you probably want to know how fast it executed instructions. One memory cycle takes 11.71875 μs (85 1/3 kHz), and most instructions take 2 such cycles (one for the operation, the second for fetching the next instruction). (Each memory cycle is long enough for a read from ROM, or a read and write to RAM. ROM speed was the limiting factor; by the standard of core memories it wasn't particularly fast. The AGS backup computer used core for both RAM and ROM and had a memory cycle time of 5 μs.) (In case you are confused, "core memory" and "core rope memory" refer to quite different things!)

If you think I'm nitpicking, try writing an emulator and wondering why you have to sift through all that slop. You could give the correct numbers, you know?

>“My secret terror for the last six months has been leaving them on the Moon and returning to Earth alone”, Collins later wrote of the rendezvous. A dead gyro system behind the Moon, with Armstrong and Aldrin on the surface waiting for a rendezvous burn that depends on a platform he can no longer align, is exactly that scenario. A hard reset would have cleared it. But the 1202 alarms during the lunar descent had been stressful enough with Mission Control on the line and Steve Bales making a snap abort-or-continue call. Behind the Moon, alone, with a computer that was accepting commands and doing nothing, Collins would have had to make that call by himself.

You know what an orbit is? That it goes around? That you could just wait for a while and speak with Mission Control? What even is this scenario? That your guidance system failed, and you for some inexplicable reason are considering immediately leaving for Earth right now, leaving your pals behind? (With a manual burn, I guess, since guidance is dead?)

You just wait for contact with Houston and tell them what happened. They pore over the program listings and find the bug. They radio you back the appropriate VERB and NOUN commands for poking the right values into memory. The End.

And besides, a spacecraft can be tracked and its orbit determined from Earth, so even if the PGNCS did fail completely, the LM would just get the necessary orbit information from Mission Control. (Also, in case guidance fails in either the LM or the CM, either one can take the active role during rendezvous. And the LM has an extra backup system, the previously mentioned AGS.)

The whole thing of "we found a minor deadlock bug in AGC program, what a shock!" is bizarre. It's not a small program. If you have any experience with software, of course you know it has bugs! They iterated on the software, releasing new software for most missions, adding new features, and fixing bugs they found. What a concept!
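The unit arithmetic in this comment can be checked with a short sketch. It takes the comment's figures as given (36K words of rope, 16 physical or 15 usable bits per word, a 1.024 MHz clock, an 85 1/3 kHz memory cycle rate); the divide-by-12 ratio between clock and memory cycle is inferred from those two frequencies, and the function names are made up for illustration.

```rust
/// Size in bytes of `words` memory words of `bits_per_word` bits each.
pub fn size_bytes(words: u64, bits_per_word: u64) -> f64 {
    (words * bits_per_word) as f64 / 8.0
}

/// Memory cycle time in microseconds, assuming one memory cycle spans
/// 12 periods of the 1.024 MHz clock (1.024 MHz / 12 = 85 1/3 kHz).
pub fn memory_cycle_us() -> f64 {
    // 12 clock periods / 1.024 MHz, written so every factor is exact in f64.
    12.0 * 1000.0 / 1024.0
}

fn main() {
    let rope_words: u64 = 36 * 1024;
    for bits in [16u64, 15u64] {
        let bytes = size_bytes(rope_words, bits);
        println!(
            "{} bits/word: {} bytes = {} KiB = {} KB",
            bits,
            bytes,
            bytes / 1024.0,
            bytes / 1000.0
        );
    }
    let cycle = memory_cycle_us();
    println!("memory cycle: {} us", cycle);
    println!("typical two-cycle instruction: {} us", 2.0 * cycle);
}
```

Neither word width gives the article's 74KB, which is the comment's complaint.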

thewonderidiot

Mike Stewart here! I led the restoration of the AGC documented on CuriousMarc's channel and co-administrate VirtualAGC. There is a lot to unpack here.

First: this is indeed a real bug in the AGC software. However, it did not go unnoticed for the whole program. It was discovered during level 3 testing of SATANCHE, a late development branch of the Command Module software COMANCHE. It was assigned anomaly number L-1D-02, and was fixed between Apollo 14 and 15. There are two known surviving copies of the L-1D-02 anomaly report:

* https://www.ibiblio.org/apollo/Documents/contents_of_luminar...
* https://www.ibiblio.org/apollo/Documents/contents_of_luminar...

The fix described in the article is partially complete, but as noted in the anomaly report there's a little bit more to it. Rather than just adding the two instructions to zero LGYRO, they restructured the code a bit and also caused it to wake up pending jobs. You can compare the relevant sections of the Apollo 14 and Apollo 15 LM software here:

* Apollo 14: https://github.com/virtualagc/virtualagc/blob/master/Luminar...
* Apollo 15: https://github.com/virtualagc/virtualagc/blob/master/Luminar...

The bug would not manifest silently in the way described in the article. For starters, LGYRO is also zeroed in STARTSB2, which is executed via GOPROG2 on any major program change: https://github.com/virtualagc/virtualagc/blob/master/Luminar...

This means that changing from any program to any other program would immediately resolve the issue. This is almost certainly a large part of why it took them so long to notice. Hitting BADEND while actively pulse-torquing is quite rare, and avoided by normal procedure. The scenario presented in the article can't happen, since the act of starting P52 will zero LGYRO.

Moreover, in the very specific scenarios in which the bug can be triggered and remain, it results in multiple jobs stacking up attempting to torque the gyros. Eventually the computer runs out of space for new jobs (similar to what happened on 11) and a 31202 (the Apollo 12+ equivalent of 1202) is triggered. Since the issue was found before the flight of Apollo 14, a further description of how it might occur and what the recovery procedure should be was added to the Apollo 14 Program Notes: https://www.ibiblio.org/apollo/Documents/LUM159_text.pdf#pag...

Some other notes:

> Ken Shirriff has analysed it down to individual gates

I've done the bulk of the gate-level analysis. :)

> the Virtual AGC project runs the software in emulation, having confirmed the recovered source byte-for-byte against the original core rope dumps.

We've only been able to do that in very specific circumstances and only for subsections of assorted programs, but never for a full program. Most AGC software either comes from a program listing, from a core rope dump, or from reconstruction using changelogs and known memory bank checksums. We've disassembled all of the rope dumps into source files that assemble back into the same binary, but the comments and labels will be different from what was in the original listing. And to be extra clear: I've never had the opportunity to dump a module containing Apollo 11 software for either vehicle. Our sole source for both programs is a pair of printouts in the MIT Museum's collection.

> Margaret Hamilton (as “rope mother” for LUMINARY) approved the final flight programs before they were woven into core rope memory.

Jim Kernan was the rope mother for Luminary at least up through Apollo 11. Margaret was the rope mother for Comanche, the CM software, and was later promoted to lead the software division. Their positions at the time of 11 can be seen on this org chart: https://www.ibiblio.org/apollo/Documents/ApolloOrg-1969-02.p...

> Their priority scheduling saved the Apollo 11 landing when the 1202 alarms fired during descent, shedding low-priority tasks under load exactly as designed.

This is a huge topic on its own, but the AGC software was not designed to shed low-priority jobs. Ironically, the lowest priority job during the landing was the landing guidance itself, with high-priority jobs being reserved for things that needed quick response like antenna movements or display updates. If the computer were to shed the lowest-priority jobs, it would shed the landing guidance. This memo contains a list of all jobs active during the landing and their priorities: https://www.ibiblio.org/apollo/Documents/CherryApollo11Exege...

> For example, the ICD for the rendezvous radar specified that two 800 Hz power supplies would be frequency-locked but said nothing about phase synchronisation. The resulting phase drift made the antenna appear to dither, generating roughly 6,400 spurious interrupts per second per angle and consuming roughly 13% of the computer’s capacity during Apollo 11’s descent. This was the underlying cause of the 1202 alarms.

The frequency-lock prevents phase drift, so the phase is essentially fixed once the power supplies are up. Ironically, however, the bigger issue is that one reference was 28V while the other was 15V. Initial testing on actual Apollo hardware suggests that at least for Apollo 11, this voltage difference was the key contributor rather than the phase difference: https://www.youtube.com/watch?v=dT33c70EIYk
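The failure mode described in this comment (a lock word left set on an abnormal exit, requests stacking up behind it, and an alarm once job space runs out) can be illustrated with a toy model. This is a sketch under loud assumptions, not AGC code: only the name LGYRO comes from the comment, while `JOB_SLOTS`, the `Computer` type, and every method name are hypothetical, and the slot count of 8 is picked purely for illustration.

```rust
/// Hypothetical job capacity, chosen for illustration only.
pub const JOB_SLOTS: usize = 8;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Torquing, // the request acquired the gyro lock
    Sleeping, // the request is parked waiting for the lock
    Alarm,    // no free job slot is left for the request
}

#[derive(Default)]
pub struct Computer {
    lgyro: bool,     // stands in for the LGYRO lock word (true = held)
    sleeping: usize, // jobs parked waiting for the lock, one slot each
}

impl Computer {
    pub fn request_torque(&mut self) -> Outcome {
        if !self.lgyro {
            self.lgyro = true;
            return Outcome::Torquing;
        }
        if self.sleeping == JOB_SLOTS {
            return Outcome::Alarm;
        }
        self.sleeping += 1;
        Outcome::Sleeping
    }

    /// Release the lock, handing it to a waiting job if there is one.
    pub fn release(&mut self) {
        if self.sleeping > 0 {
            self.sleeping -= 1; // the woken job now holds the lock
        } else {
            self.lgyro = false;
        }
    }

    /// Buggy abnormal exit: returns without touching the lock word.
    pub fn abnormal_exit_buggy(&mut self) {}

    /// Fixed abnormal exit: clears the lock and wakes pending jobs.
    pub fn abnormal_exit_fixed(&mut self) {
        self.release();
    }

    /// Major program change: the lock word is zeroed. Dropping the parked
    /// jobs as well is an assumption of this model.
    pub fn program_change(&mut self) {
        self.lgyro = false;
        self.sleeping = 0;
    }
}

/// Number of later torque requests needed to raise the alarm, if any.
pub fn requests_until_alarm(fixed: bool) -> Option<usize> {
    let mut c = Computer::default();
    c.request_torque();
    if fixed {
        c.abnormal_exit_fixed();
    } else {
        c.abnormal_exit_buggy();
    }
    for n in 1..=100 {
        match c.request_torque() {
            Outcome::Alarm => return Some(n),
            Outcome::Torquing => c.release(),
            Outcome::Sleeping => {}
        }
    }
    None
}

fn main() {
    println!("buggy exit: alarm after {:?} requests", requests_until_alarm(false));
    println!("fixed exit: alarm after {:?} requests", requests_until_alarm(true));
}
```

In this model the leaked lock is not silent: every later request consumes a slot until the alarm fires, and a program change clears the condition, which matches the shape of the behaviour the comment describes.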
