The Alder Lake SHLX Anomaly

(tavianator.com)

230 points | by panic2 days ago

14 comments

tavianator2 days ago
One fun thing about this is that spilling+restoring the register will fix it, so if any kind of context switch happens (thread switch, page fault, interrupt, etc.), the register will get pushed to the stack and popped back from it, and the code suddenly gets 3x faster. Makes it a bit tricky to reproduce reliably, and led me down a few dead ends as I was writing this up.
- dataflow1 day ago
  Wow. Any chance you could share either an explanation or your benchmark code to show how you measured this reliably? I've never had to constrain benchmarks to within a time slice so I'm kind of fascinated how you got stable measurements out of that. Did you force a context switch prior to starting and then loop for a fixed number of times that wouldn't trigger another context switch? Or did you disable preemption on your thread, or something else?
  tavianator1 day ago
  Benchmark code is in the post! The trick is to limit the number of SHLX instructions before you re-initialize RCX. Doing it every 10,000 keeps it fresh enough. I didn't have to disable preemption or anything, perf shows 0 context switches and only ~50 page faults (probably all at startup).
  valleyer1 day ago
  Yeah, even an interrupt (that doesn't result in a thread switch) would cause a spill to the stack, right? That seems hard to control.
  toast01 day ago
  You should be able to prevent or at least strictly limit interrupts if you run a modern kernel, have a single threaded test process and cpu pin your test process to core X, while you've pinned everything else to not core X (and not core X's cpu threads if enabled), and you have no hardware interrupts routed to that core, either.
  You might have to look into tickless kernels too, but I think that's pretty mainstream now. Tldr, if there's no contention for the core and there's no hardware interrupts, there's no need for a timer/scheduler tick on that core.
  dataflow1 day ago
  Yeah. I don't know how often interrupts happen to know how easy this would be to control for, but it certainly sounds like it could be tough. I suppose one way to mitigate these though is e.g. taking a min() of the execution times rather than the average?
  Bulat_Ziganshin1 day ago
  except that we need max() here :)
  dataflow1 day ago
  Huh, why max()? Generally microbenchmarks want min() (or close to it).
  Bulat_Ziganshin1 day ago
  because THIS code becomes faster once RCX is saved and restored during the interrupt call
  dataflow1 day ago
  Oh right, my bad, thanks. Totally escaped me when I wrote that :-)
- amluto1 day ago
  Wait a sec, isn’t this backwards? The OS will restore thread state by loading all of RCX, which sounds like it should trigger the slow path.
  I wonder whether IRET fixes it. Can you try MOV RCX, 1; push an IRET frame; IRET; SHLX to see if IRET fixes it?
  tavianator1 day ago
  > Wait a sec, isn’t this backwards?
  No, if you push and pop RCX it's fast again. I mentioned this in the blog post, `mov rcx, 1` is slow but `mov rcx, 1; push rcx; pop rcx` is fast.
  amluto1 day ago
  I think I missed that one in the pile of options. That’s bizarre. I wonder if MOV from memory is like POP.
  1 day ago
  undefined
  BeeOnRope1 day ago
  The slow state is when rcx has a special "hidden immediate" attached it, from the `mov rcx, 1` or `add rcx, 1`. If the last option on the register doesn't fit that narrow template, it is fast, so loading from memory in any way, like pop, xrestor, plan load, etc, will be in fast mode.
- eigenform2 days ago
  I wonder if this is a thing where the machine is trying to predict the actual value of the 'count' operand ...
  tavianator2 days ago
  If it were a prediction/speculation thing, I would expect it to settle down long before 10,000 `SHLX`s are retired.
  2 days ago
  undefined
userbinator2 days ago
How about trying "xor ecx, ecx; inc ecx"? Or the even shorter "mov cl, 1"?
It is very strange to me that the instruction used to set the shift count register can make the SHLX instruction 3× slower.
I suspect this is a width restriction in the bypass/forwarding network.
The 32-bit vs. 64-bit operand size distinction is especially surprising to me as SHLX only looks at the bottom 6 bits of the shift count.
Unfortunately the dependency analysis circuitry seems not Intel-ligent enough to make that distinction.
- tavianator2 days ago
  > How about trying "xor ecx, ecx; inc ecx"?
  Fast. But inc rcx is slow.
  > Or the even shorter "mov cl, 1"?
  Fast.
  > I suspect this is a width restriction in the bypass/forwarding network...
  I think we just found the explanation on Twitter: https://x.com/corsix/status/1874965887108976858
  Alder Lake adds support for mov/add/sub with small immediates to the register renamer. So `add rcx, 1` gets handled by the renamer, potentially with zero latency.
  Unfortunately shifts are slow when the shift count is renamed in this way. This leads to fun things like `mov rcx, 1024` being fast while `mov rcx, 1023` is slow. I'll update the blog post in a bit.
  phire2 days ago
  Ok... That does make some sense.
  Essentially, Intel have allocated a bunch of fake renaming registers to hold all small immediates. Like how MIPS had a zero register, Intel now have over 1000 constant registers. These registers don't really exist, the immediate is just reconstituted at the execution unit.
  And I'm guessing these small immediates predate Golden Cove; It's probably how they have always represented shift by constant and any instructions with 8bit immediate (presumably at least as far back as Sandybridge). The only change is that now the renamer can now execute simple movs/adds.
  So essentially these variable shifts are being converted to constant shifts on the fly. The problem is that there is no constant shift version of shlx, so it wasn't in the test suite. There is probably some kind of stupid implementation bug in the scheduler that simply wasn't exposed until the renamer changes.
  tavianator2 days ago
  I don't think there are a thousand constant registers. I think the renamer just represents (reg, 10-bit offset) pairs rather than just registers.
  Also the problem affects SHL as well as SHLX, I didn't realize until just now.
  phire2 days ago
  The speculation of a "reg + 10-bit offset" representation feels wrong to me.
  That requires a whole bunch of extra 64-bit full-adders everywhere one of these pairs might be consumed (so realistically, on every single read port of the register file). 64-bit adders take quite a bit of latency, so you don't want extra adders on all your critical paths.
  In the case where it appears to be holding a reg + offset pair, what I think has actually happened is that renamer (and/or uop fusion) has rewritten the uop to a 3-input add, with the offset as the third input.
  > Also the problem affects SHL as well as SHLX, I didn't realize until just now.
  And presumably SHR/SHRX/SAR/SARX too?
  dzaima1 day ago
  You don't quite need a full 64-bit adder to materialize the proper value, as one argument is only a 10-bit int. So a 10-bit adder, a 54-bit increment, and muxing it with the original depending on the add carry.
  And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
  tavianator1 day ago
  Right. Actually it turns out it's 11 bits, since [-1024, 1023] are all supported by the immediate add renamer.
  In general I think people are overstating the delay of an additional 64-bit add on register file reads (though I'm not a hardware guy so maybe someone can correct me). There are carry-lookahead adders with log_2(n) == 6 gate delays. Carry-save adders might also be relevant to how they can do multiple dependent adds with 0c latency.
  > And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
  No, the perf counters show 1 uop dispatched/retired in both the slow and fast cases.
  dzaima1 day ago
  Ah, good to know on the uop count. Still could be (or, well, has to be to some extent) the same concept, just pipelined within one uop.
  eigenform2 days ago
  I dunno, you could imagine it happens [speculatively?] in parallel, at the cost of a read port and adder for each op that can be renamed in a single cycle:
  1. Start PRF reads at dispatch/rename
  2. In the next cycle, you have the result and compute the 64-bit result. At the same time, the scheduler is sending other operands [not subject to the optimization] to read ports.
  3. In the next cycle, the results from both sets of operands are available
  phire2 days ago
  Seems rather pointless to implement it like that. You would save space in the scheduler because the uops have been fused, but execution time and execution unit usage is the same as just executing the original ops before this optimisation was implemented.
  It also adds an extra cycle of scheduling latency, which may or may not be an issue (I really have no idea how far ahead these schedulers can schedule).
  If you think about the "1024 constant registers" approach: It allows you to have a small 10bit adder in the renamer which handles long chains of mov/add/sub/inc/dec ops, as long as the chain stays in the range of 1024. This frees the main adders for bigger sums, or maybe you can power gate them.
  And in the case where one of the inputs is larger than 1024, or it's value is unknown during renaming, the renamer can still merge two two-input adds into a single three input add uop.
  Bulat_Ziganshin1 day ago
  add3 operation will have 3 inputs, though. do we have other integer operations with 3 64-bit inputs?
  1 day ago
  undefined
  BeeOnRope2 days ago
  Yeah I am pretty sure that the renamer adds the tracked immediate(s) to the emitted uop, but that's not inconsistent with tracking "reg offset pairs" is it?
  userbinator2 days ago
  So `add rcx, 1` gets handled by the renamer, potentially with zero latency.
  Unfortunately shifts are slow when the shift count is renamed in this way.
  That's because the value still has to make it to the shifter... and the renamer's adder output clearly takes a different and higher-latency path than the one which doesn't go through it, confirming my original suspicion that this is a limitation of the bypass/forwarding network.
  Ultimately it seems they are just playing around with where things happen; the addition can be done here but the value ultimately needs to go there for the shift, or the addition and shift can both happen there. One could envision a future change that puts the shifter in the renamer too. They're really constrained by the laws of physics.
  Bulat_Ziganshin1 day ago
  we set CX only once and then use it 10000 times. the problem is not the slow calculation of CX per se, but the slow shift once we got CX from the renamer
  eigenform2 days ago
  Oooh, I forgot about this. Only thing I can think of is something like:
  1. Assume that some values can be treated internally as "physical register plus a small immediate"
  2. Assume that, in some cases, a physical register is known to be zero and the value can be represented as "zero plus a small immediate" (without reference to a physical register?)
  3. Since 'count' is always expected to be <64, you technically don't need to use a physical register for it (since the immediate bits can always just be carried around with the name).
  4. The 3-cycle case occurs when the physical register read cannot be optimized away??
  (Or maybe it's just that the whole shift operation can occur at rename in the fast case??)
  phire2 days ago
  > 4. The 3-cycle case occurs when the physical register read cannot be optimized away??
  No... the 3-cycle case seems to be when the physical register read is optimized away. I think it's some kind of stupid implementation bug.
  eigenform2 days ago
  Like, the scheduler is just waiting unnecessarily for both operands, despite the fact that 'count' has already been resolved at rename?
  phire2 days ago
  The 3 cycles latency casts massive suspicion on the bypass network. But I don't see how the bypass network could be bugged without causing the incorrect result. So the scheduler doesn't know how to bypass this "shift with small immediate" micro op.
  Or maybe the bypass network is bugged, and what we are seeing is a chicken bit set by the microcode that disables the bypass network for this one micro op that does shift.
  eigenform2 days ago
  > But I don't see how the bypass network could be bugged without causing the incorrect result.
  Maybe if they really rely on this kind of forwarding in many cases, it's not unreasonable to expect that latency can be generated by having to recover from "incorrect PRF read" (like I imagine there's also a case for recovery from "incorrect forwarding")
  phire2 days ago
  Yeah, "incorrect PRF read" is something that might exist.
  I know modern CPUs will sometimes schedule uops that consume the result of load instruction, with the assumption the load will hit L1 cache. If the load actually missed L1, it's not going to find out until that uop tries to read the value coming in from L1 over the bypass network. So that uop needs to be aborted and rescheduled later. And I assume this is generic enough to catch any "incorrect forwarding", because there are other variable length instructions (like division) that would benefit from this optimistic scheduling.
  But my gut is to only have these checks on the bypass network, and only ever schedule PRF reads after you know the correct value has been stored.
  Bulat_Ziganshin1 day ago
  maybe, the bypass network doesn't include these "constant registers"? a bit like zen5 where some 1-cycle SIMD ops are executed in 2 cycles, probably for shortcomings of the same network
  tptacek2 days ago
  I got nothing here other than: this is very cool.
- monocasa2 days ago
  I would have thought that the movs would have been dumped right into a rob entry without even going through the bypass network, much like a zeroing idiom does.
crest1 day ago
So the issue is that the frontend which "eliminates" those operations doesn't make the resulting values available to the backend if fits Intel's ad-hoc definition of a short immediate (which shift values do) and since there is no SHLX with immediate uop the small value has to be be expanded to a full 32 or 64 bits and requested from the frontend with adds two cycles of latency?
Has anyone run tests with ≥3 interleaved SHLX dependency chains in a loop? Does it "just" have 3 cycle latency or also less than 1 operation/cycle sustained throughput? Because if the pipeline stalls that would be even more annoying for existing optimised code.
Is the regression limited to only Alder Lake P-cores or also present it later (refreshed) cores?
BeeOnRope2 days ago
Worth noting that whether intentional or not, this would be easy to miss and unlikely to move benchmark numbers since compilers won't generate instructions like this: they would use the eax form which is 1 byte shorter and functionally equivalent.
Even some assemblers will optimize this for you.
- tavianator2 days ago
  It's possible to get gcc to generate sequences that would trigger this: https://godbolt.org/z/rYKqPxn7b
  loeg2 days ago
  The godbolt paste I'm looking at shows GCC generating a SAL with 8 bit shift (CL), not a SHLX with 64-bit shift (RCX).
  tjalfi2 days ago
  GCC will generate the shlx instruction if you add the -mbmi2 flag to your build options. For example, you can see this in action here: https://godbolt.org/z/asb1fxos5.
  BeeOnRope2 days ago
  It's true, I was considering only the original mov rax, 1 case, which I'm pretty sure compilers don't generate: it's just a useless encoding of mov eax, 1.
  Given that 64-bit immediate math also causes this though, compilers might generate it (I think this is a minor missed optimization by gcc though: it could have used add eax, 1 instead.
  1 day ago
  undefined
  tavianator2 days ago
  SHL suffers from the same latency issue as SHLX it turns out
  loeg2 days ago
  SAL as well? (And anyway this example seems to be shifting by the low 8 bits of RCX (CL), not the full width register.)
  tavianator2 days ago
  SAL and SHL are synonyms, they have the same encoding. SHL only accepts CL as the count register, there's no other form that takes a variable shift count
  loeg2 days ago
  Ahh, thanks, I didn't make that connection.
  Bulat_Ziganshin1 day ago
  Intel shifts anyway mask out higher bits of CL, this hurts sometimes e.g. when you need to shift by 1..64 bits
- eigenform2 days ago
  Funny thing to double-check: are these encodings correctly specifying a 64-bit operand? Maybe everyone's compilers are subtly wrong D:
  edit: It looks like VEX.W is set in the encoding from the uops.info tests ¯\_(ツ)_/¯
bhouston2 days ago
Shifts were not always fast. These old hacker news comments contain the details: https://news.ycombinator.com/item?id=2962770
- petermcneeley2 days ago
  Indeed. Dynamic shifting was microcoded (not uop!) on the power pc for gen3. However shifting with immediate values was not.
  This leads to all sorts of strange performance workarounds. Mike Acton refers to it here: https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKD...
aftbit2 days ago
Woah that's weird. Left shifting either takes 3 cycles or 1 cycle, depending on how you initialize the cycle count register?
This patch from the article makes it take 1 cycle instead of 3:
- MOV RCX, 1 + MOV ECX, 1
>It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate).
Practically speaking, is this sort of µop-dependent optimization implemented by compilers? How do they do so?
- tavianator2 days ago
  > Practically speaking, is this sort of µop-dependent optimization implemented by compilers?
  I suspect this specific thing is not; I don't think many people were aware of this anomaly until this week. I doubt the compilers are modeling it.
  But in general, compilers will have a machine model that predicts the cost of different instructions. The backend will try to generate instruction sequences with low cost, so if you include this anomaly in the model it will probably work around it automatically.
  This old example might interest you: a false dependency on the destination register for `popcnt` on some uarches led to compiler patches: https://stackoverflow.com/q/25078285/502399
- RossBencina2 days ago
  user mshockwave posted this the other day in a comment, might be of interest/tangential relevance:
  Scheduling Model in LLVM - Part I https://myhsu.xyz/llvm-sched-model-1
  loeg2 days ago
  https://news.ycombinator.com/item?id=42555110
juancn1 day ago
Code alignment? I mean, the different instructions change the alignment of the rest of the code.
- 64 bit register
0: 48 c7 c1 01 00 00 00 mov rcx,0x1 - 32 bit register 0: b9 01 00 00 00 mov ecx,0x1
It should be easy to test by adding a couple of NOPs to the fast version:
0: b9 01 00 00 00 mov ecx,0x1 5: 90 nop 6: 90 nop
and see if it regresses again.
I don't have an Alder Lake to test on.
- tavianator1 day ago
  In general I'd be very surprised if code alignment had much effect on a 10,000 instruction block, especially since the instruction is 5 bytes long so it will quickly unalign itself
- BeeOnRope1 day ago
  https://news.ycombinator.com/item?id=42582623
tavianator14 hours ago
Wrote up the presumed explanation here: https://tavianator.com/2025/shlxplained.html
1 day ago
undefined
orlp1 day ago
Seems like LLVM knows about this quirk (note how it suddenly uses eax instead of rax for the multiply): https://rust.godbolt.org/z/8jh7YPhz4.
- stassats1 day ago
  Using EAX is one byte shorter, so it might be doing that inadvertently.
rincebrain1 day ago
Since this seems like an optimization going awry somewhere, I wonder if there's a chicken bit that disables it, and if so, how broad the impact of disabling it is...
eqvinox1 day ago
Hmm… does this have any impact on time-constant cryptographic algorithm implementations? In particular the wider "addition in register renamer" story?
- LegionMammal9781 day ago
  I don't think it should, if the performance change only depends on immediates hardcoded into the binary, and not the variable data being processed. Maybe if the algorithm branched on the variable data and had two separate ways of loading something, but branching on variable data is the #1 thing you're supposed to avoid in the first place.
  atq21191 day ago
  To expand on this a bit, the hypothesis stated elsewhere is that the register renamer stores "physical register + 11-bit signed immediate".
  It should be possible to construct scenarios where a logical register is mapped to "physical register + immediate" where the immediate depends on the path taken through the program so far. The easiest example would be an if-statement that optionally adds 1023 to a register, and then an add of 1 after the if-statement. The add of 1 overflows the immediate stored in the register renamer if and only if the if-condition was true.
  As you said, that requires taking different control flow paths through the program, which is something to be avoided by constant-time algorithms anyway.
  One potential twist: This shows that it may be important that the interface between the constant-time algorithm and the rest of the program goes entirely through memory and not through registers. If the interface does go through registers, it's possible that variable register-renamer-immediates leak from the outside world into the constant-time algorithm. (This admittedly feels like a bit of a stretch, but who knows.)
  Bulat_Ziganshin1 day ago
  just to make you even more paranoid - zen2/4 (and probably e-cores and zen5) can rename memory operands too (you start to do strange things when Intel limits you to 16 registers)!
  so today it needs to go through SSD or maybe the network LOL. but for real conspiracy, we should use paper and pencil
- dzaima1 day ago
  Perhaps not - as it happens at the rename stage and thus only with immediate values, it's all fixed behavior for a given instruction stream; with branching you can of course get dynamically different renamed immediates, but, as that's branchy code, it's not time-constant anyway.
BeeOnRope2 days ago
To rule out alignment you should adding padding to one so the two variations have the same alignment in their long run of SHLX (I don't actually think it's alignment related though).
- tavianator2 days ago
  I did try aligning the loop, it didn't matter
BeeOnRope2 days ago
Does this also occur with other 3-argument instructions like ANDN?
- tavianator2 days ago
  ANDN is fast. The issue looks specific to shift counts. SHL is also slow.
  Bulat_Ziganshin1 day ago
  what about 8-bit ANDN? SHL essentially uses CL, so it may be because 8-bit subregister of "constant register" isn't present on the bypass network