XOR'ing a register with itself is the idiom for zeroing it out. Why not sub?
ingve
207 points
204 comments
April 22, 2026
Related Discussions
- Rust zero-cost abstractions vs. SIMD Sirupsen · 14 pts · March 03, 2026 · 42% similar
- C Bit-Field Pitfalls fanf2 · 26 pts · March 21, 2026 · 41% similar
- Taming LLMs: Using Executable Oracles to Prevent Bad Code mad44 · 32 pts · March 26, 2026 · 39% similar
- Show HN: Zeroboot – sub-millisecond VM sandboxes using CoW memory forking adammiribyan · 19 pts · March 17, 2026 · 38% similar
- Show HN: Sub-millisecond VM sandboxes using CoW memory forking adammiribyan · 106 pts · March 17, 2026 · 38% similar
Discussion Highlights (20 comments)
nopurpose
It amazes me how entertaining Raymond's writing on even the most mundane aspects of computing often is.
NewCzech
The obvious answer is that XOR is faster. To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit. With XOR you don't have to do that, because each output bit is independent of the adjacent bits. There are probably ALU pipeline designs where you don't pay an explicit penalty, but not all, and so XOR is faster. Surely someone as awesome as Raymond Chen knows that. The answer is so obvious and basic that I must be missing something myself.
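The carry-propagation point can be sketched with a toy bit-level model (illustrative only; real ALUs use carry-lookahead and other shortcuts, and the helper names here are made up):

```python
# Toy models of why XOR needs no carry/borrow chain but SUB does.

def xor_bits(a, b, width=8):
    # Each output bit depends only on the same bit position of the inputs.
    return [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]

def sub_bits(a, b, width=8):
    # Ripple-borrow subtraction: bit i depends on the borrow from bit i-1,
    # so the bits cannot all be computed independently.
    out, borrow = [], 0
    for i in range(width):
        ai, bi = a >> i & 1, b >> i & 1
        out.append(ai ^ bi ^ borrow)
        borrow = ((~ai & (bi | borrow)) | (bi & borrow)) & 1
    return out

assert xor_bits(0xA5, 0xA5) == [0] * 8   # bits computed independently
assert sub_bits(0xA5, 0xA5) == [0] * 8   # same result, via a borrow chain
```

Both idioms zero the value; the difference is only in how far a signal has to travel to produce each output bit.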
anematode
My favorite (admittedly not super useful) trick in this domain is that sbb eax, eax breaks the dependency on the previous value of eax (just like xor and sub) and only depends on the carry flag. arm64 is less obtuse and just gives you csetm (a special case of csinv) for this purpose.
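For readers unfamiliar with the trick: sbb eax, eax computes eax - eax - CF, which is 0 or all-ones depending only on the carry flag. A quick model (hypothetical helper name):

```python
def sbb_self(carry_flag, width=32):
    # reg - reg - CF == -CF, i.e. 0 or an all-ones mask,
    # regardless of the register's previous value.
    return (0 - 0 - carry_flag) & ((1 << width) - 1)

assert sbb_self(0) == 0
assert sbb_self(1) == 0xFFFFFFFF  # the mask csetm produces on arm64
```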
defrost
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side. And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars... and why left-handed sugar is perfect for diet sodas.
Sweepi
"Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination." Will remember for the next time I write asm for Itanium!
tliltocatl
It might be because XOR is rarely (in terms of static count, dynamically it surely appears a lot in some hot loops) used for anything else, so it is easier to spot and identify as "special" if you are writing manual assembly.
rasz
Looking at a random 1989 Zenith 386SX BIOS written in assembly, so purely programmer preference:

8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'
26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'

Edit: checked a 2010 BIOS and found not a single 'sub x, x'.
empiricus
The hw implementation of xor is simpler than sub, so it should consume slightly less energy. Wondering how much energy was saved in the whole world by using xor instead of sub.
jhoechtl
Back in the stone ages, XORing was just one byte of opcode. Habits stick. In effect, XORing hasn't actually been faster for a long time now.
drfuchs
Relatedly, there's a steganographic opportunity to hide info in machine code by using "XOR rax,rax" for a "zero" and "SUB rax,rax" for a "one" in your executable. Shouldn't be too hard to add a compiler feature to allow you to specify the string you want encoded into its output.
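A sketch of the idea (hypothetical helper names; the real x86 encodings are 31 C0 for xor eax, eax and 29 C0 for sub eax, eax):

```python
# Hide one bit per zeroing instruction: xor encodes 0, sub encodes 1.
# Both sequences have identical architectural effect on eax.
XOR_EAX = bytes([0x31, 0xC0])  # xor eax, eax
SUB_EAX = bytes([0x29, 0xC0])  # sub eax, eax

def encode(bits):
    return b"".join(SUB_EAX if b else XOR_EAX for b in bits)

def decode(code):
    # The opcode byte of each two-byte instruction carries the hidden bit.
    return [1 if code[i] == 0x29 else 0 for i in range(0, len(code), 2)]

msg = [0, 1, 1, 0]
assert decode(encode(msg)) == msg
```

A compiler pass would of course need enough zeroing sites in the output to hold the payload, but the decoding side needs nothing beyond a disassembler.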
enduku
I ran into this rabbithole while writing an x86-64 asm rewriter. xor was the default zeroing idiom; I only emitted sub reg, reg when I actually wanted its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. I had about 40 such idioms for the passes.
b1temy
Back when I was in university, one of the units touching Assembly[0] required students to use subtraction to zero out the register instead of using the move instruction (which also worked), as it used fewer cycles. I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you didn't already learn about it in other units), when the unit was about something else entirely, but everyone knows what subtraction is, and that subtracting a number from itself leads to zero. [0] Not x86, I do not recall the exact architecture.
adrian_b
It should be noted that XOR is just (bitwise) subtraction modulo 2. There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8. To produce a null result, any kind of subtraction can be used, and XOR is just a particular case of subtraction, not a different kind of operation. Unlike with bigger moduli, when operations are done modulo 2, addition and subtraction are the same, so XOR can be used for either addition or subtraction modulo 2.
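The equivalence is easy to check bit by bit:

```python
# Per bit, subtraction mod 2, addition mod 2, and XOR all coincide.
for a in (0, 1):
    for b in (0, 1):
        assert (a - b) % 2 == (a + b) % 2 == a ^ b

# Applied bitwise: a ^ a == 0, just as a - a == 0 modulo 2^64.
a = 0xDEADBEEF
assert a ^ a == 0 == (a - a) % (1 << 64)
```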
Suzuran
On some of IBM's smaller processors, such as channel controllers and the CSP used in the midrange line prior to the System/38, the xor instruction had a special feature when used with identical source and destination - It would inhibit parity and/or ECC error checking on the read cycle, which meant that xor could be used to clear a register or memory location that had been stored with bad parity without taking a machine check or processor check.
zahlman
> but xor took a slight lead due to some fluke, perhaps because it felt more “clever”.

Absolutely. But I can also imagine that it feels more like something that should be more efficient, because it's "a bit hack" rather than arithmetic. After all, it avoids all the "data dependencies" (carries; never mind that the ALU is clocked to allow time for that regardless)! I imagine a similar feeling is behind XOR swap.

> Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.

Network effects are much older than social media, then...
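The XOR swap mentioned above, for reference (a sketch; on modern CPUs a plain three-mov swap is usually faster, since the three XORs form a serial dependency chain):

```python
def xor_swap(a, b):
    # Exchanges two values without a temporary.
    a ^= b   # a holds a ^ b
    b ^= a   # b becomes the original a
    a ^= b   # a becomes the original b
    return a, b

assert xor_swap(3, 5) == (5, 3)
```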
dreamcompiler
I vaguely remember we used the XOR trick on processors other than Intel, so it may not be Intel-specific. In principle, sub requires 4 steps: 1. Move both operands to the ALU 2. Invert the second operand (two's complement conversion) 3. Add (which internally is just XOR plus carry propagation) 4. Move the result to the proper result register. This is absolutely not how modern processors do it in practice; there are many shortcuts, but at least with pure XOR you don't need two's complement conversion or carry propagation. Source: Wrote microcode at work a million years ago when designing a GPU.
RiverCrochet
XOR is a simple logic-gate operation. SUB would have to be an ALU operation. A one-bit adder (which is subtraction in reverse) makes signals pass through two gates. See https://en.wikipedia.org/wiki/Adder_(electronics) You need the 2 gates for adding/subtracting because you care about carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're connecting multiples of these together, and that carry has to ripple through all the rest of the gates one by one. It can't be parallelized without extra circuitry, which increases your costs in other ways. Without the AND gate needed for carry, all the XORs can fire off at the same time. If you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would still consume less power.
matja
SUB has higher latency than XOR on some Intel CPUs: latency (L) and throughput (T) measurements from the InstLatx64 project ( https://github.com/InstLatx64/InstLatx64 ):

| GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns = 1.00c | T: 0.03ns = 0.135c |
| GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns = 0.13c | T: 0.03ns = 0.133c |
| GenuineIntel | GoldmontPlus | SUB r64, r64 | L: 0.67ns = 1.0c | T: 0.22ns = 0.33c |
| GenuineIntel | GoldmontPlus | XOR r64, r64 | L: 0.22ns = 0.3c | T: 0.22ns = 0.33c |
| GenuineIntel | Denverton | SUB r64, r64 | L: 0.50ns = 1.0c | T: 0.17ns = 0.33c |
| GenuineIntel | Denverton | XOR r64, r64 | L: 0.17ns = 0.3c | T: 0.17ns = 0.33c |

I couldn't find any AMD chips where the same is true.
butterisgood
I recall thinking about these things quite a bit when reading Michael Abrash back in the 90s. How much of that advice applies to anything these days is questionable. Back then we used to squeeze as much as possible from every clock cycle. And cache misses weren’t great but the “front side bus” vs CPU clock difference wasn’t so insane either. RAM is “far away” now. So the stuff you optimize for has changed a bit. Always measure!
NanoWar
XORing just feels more like xxxxing out the register. SUB feels like a calculation or mistaken use of a register.