TL;DR: Some of the P-cores in Alder Lake CPUs can exhibit highly unstable performance, resulting in large noise for any benchmark running on them.
UPDATE: A colleague of mine reported that the behavior can be observed on his i9-9980HK as well, where he saw ~25% end-performance fluctuations on short-running benchmarks. So it seems this behavior has been around for quite a while, dating back to at least 9th-gen Intel CPUs.
As a performance engineer, I routinely evaluate performance before and after a code commit. This is why I’ve had a faint feeling that something is unusual about my new Intel Alder Lake i7-12700H laptop CPU.
Today I dug into the problem. As I discovered, this CPU indeed exhibits some highly unusual and surprising performance characteristics, which can easily become pitfalls for benchmarks.
For background, Alder Lake features a hybrid architecture of powerful P-cores and weaker E-cores. The i7-12700H has 6 P-cores and 8 E-cores. Of course, we want the P-cores to run our time-sensitive tasks, such as our benchmarks. This can be done easily by using `taskset` to pin the process to only the P-cores.
This is where the story begins. I noticed two problems with the P-cores:
- Sometimes it cannot turbo-boost to 4.7GHz, the Intel-specified max turbo boost frequency (for the one-active-core case) for i7-12700H.
- Sometimes it cannot stay at the highest CPU frequency it can boost to.
The first problem implies that we cannot enjoy the full performance advertised by Intel. The second problem implies that the core cannot deliver consistent performance, which is problematic for performance engineering, as the noise makes two benchmark runs less comparable.
To expose the problem, I wrote a dumb program that increments a variable in an infinite loop, so that the frequency of the CPU running the program is maxed out. Then I use `taskset` to pin the program to one CPU, have it run for 60 seconds, and run `cpufreq` every second to record the frequency of that CPU over the duration.
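The once-per-second sampling described above could be sketched in Python as follows. This is not the script used for the post (that one is a bash script); it is a minimal sketch that assumes the Linux cpufreq sysfs file `scaling_cur_freq`, which reports the frequency in kHz. The helper names `read_freq_khz` and `sample_freq_ghz` are made up for illustration.

```python
import time
from pathlib import Path


def read_freq_khz(cpu):
    """Read the current frequency of one logical CPU in kHz.

    Assumption: the Linux cpufreq sysfs interface is available.
    """
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq")
    return int(path.read_text().strip())


def sample_freq_ghz(cpu, duration_s=60, interval_s=1.0,
                    read=read_freq_khz, sleep=time.sleep):
    """Sample the frequency of `cpu` every `interval_s` seconds.

    Returns a list of frequencies in GHz. `read` and `sleep` are
    injectable so the loop can be exercised without real hardware.
    """
    samples = []
    for _ in range(int(duration_s / interval_s)):
        samples.append(read(cpu) / 1e6)  # kHz -> GHz
        sleep(interval_s)
    return samples
```

While the busy-loop program runs pinned to, say, logical CPU 4, calling `sample_freq_ghz(4)` from another process would collect the 60 data points for that core.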
I took the following precautions to ensure nothing outside the CPU chip is limiting the CPU from boosting to its max frequency:
- Use the `isolcpus` Linux kernel boot parameter to exclusively dedicate the tested CPU core to our test program. This removes any noise caused by the OS.
- Confirm the CPU is not throttled by the power limit: with only one active core (running our test program), the CPU package power consumption is less than 25W, far less than the 45W base TDP of the i7-12700H.
- Confirm the CPU is not temperature-throttled (by monitoring sensors). To be paranoid, I also set a 20s gap between each test so the temperature goes back to its idle state.
- Confirm the machine is in idle state, and stop unnecessary background services.
- The CPU frequency governor is set to `performance`, and I confirmed that the governor is not limiting the turbo boost frequency.
- Everything is at stock setting: nothing is overclocked or undervolted, etc.
- All tests are repeated 3 times, and consistent behavior is observed for every core.
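Two of the precautions above (core isolation and the governor setting) can be checked programmatically. The following is a rough sketch, assuming the Linux sysfs files `/sys/devices/system/cpu/isolated` and `cpufreq/scaling_governor`; the function names `parse_cpu_list` and `check_cpu_ready` are hypothetical and not part of the original test script.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-1,4-5,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs in the list
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_cpu_ready(cpu):
    """Return a list of problems for `cpu`; an empty list means it looks OK.

    Assumption: the sysfs paths below exist on the running kernel.
    """
    problems = []
    isolated = parse_cpu_list((CPU_SYSFS / "isolated").read_text())
    if cpu not in isolated:
        problems.append(f"cpu{cpu} is not isolated via isolcpus")
    governor_file = CPU_SYSFS / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    governor = governor_file.read_text().strip()
    if governor != "performance":
        problems.append(f"cpu{cpu} governor is {governor!r}, not 'performance'")
    return problems
```

Power and temperature throttling still need to be checked separately, for example by watching the sensors during a run as described above.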
Not All P-cores Are Born Equal
The test confirmed my hypothesis that the 6 P-cores in my i7-12700H are not of uniform quality. Specifically, my 6 P-cores exhibit three different performance characteristics!
I dubbed them “gold core”, “B-grade core”, and “wild core”:
- Gold core: the core can boost to and stay at 4.7GHz, just as Intel claimed.
- B-grade core: the core can boost to and stay at a frequency lower than 4.7GHz.
- Wild core: the core cannot boost to 4.7GHz and cannot stay at any stable frequency: it fluctuates wildly within a range of frequencies, and the degree of turbulence also varies per core.
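The three categories above could be told apart automatically from a recorded frequency trace. The sketch below is only one possible way to do it: the thresholds `spec_tol_ghz` and `stable_stdev_ghz` are arbitrary values chosen for illustration, not numbers from the measurements, and `classify_core` is a hypothetical helper.

```python
import statistics


def classify_core(samples_ghz, spec_ghz=4.7,
                  spec_tol_ghz=0.05, stable_stdev_ghz=0.02):
    """Label a core as "gold", "B-grade" or "wild" from its GHz samples.

    Assumptions: a core counts as stable if the standard deviation of
    its samples is at most `stable_stdev_ghz`, and as reaching the spec
    if its mean is within `spec_tol_ghz` of `spec_ghz`.
    """
    mean = statistics.fmean(samples_ghz)
    spread = statistics.pstdev(samples_ghz)
    if spread > stable_stdev_ghz:
        return "wild"      # cannot hold any stable frequency
    if mean >= spec_ghz - spec_tol_ghz:
        return "gold"      # stable at (roughly) the advertised boost
    return "B-grade"       # stable, but below the advertised boost
```

With these thresholds, a flat trace at 4.68GHz is labeled gold, a flat trace at 4.5GHz is labeled B-grade, and a trace bouncing between 4.05GHz and 4.55GHz is labeled wild.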
We will explain their performance characteristics below.
The “Wild Cores”
Let’s start with the most bizarre cores: the wild ones. As it turns out, 3 out of my 6 P-cores are wild (a whopping 50%!), and among those three cores, one of them is particularly wild, as shown in the plot below:
As you can see, the CPU frequency fluctuates violently between 4.05GHz and 4.55GHz, and each run exhibits a completely different pattern. Clearly, if any benchmark were run on this core, such large noise would be a headache to deal with.
The other two wild cores I got were less turbulent. Even so, the noise introduced by the frequency instability still makes them less than ideal for benchmark comparison:
The “B-grade Cores”
The B-grade cores (as I dubbed them) are better: while they cannot boost to 4.7GHz as advertised by Intel, at least they can operate at a consistent frequency, so benchmark results are comparable as long as they are run on the same core.
It turns out that my i7-12700H has two B-grade cores, both capable of operating at 4.5GHz:
As one can see, the core in the second graph has slightly higher frequency variations. Nevertheless, both are much more stable than the three wild cores.
The “Gold Core”
Only 1 out of the 6 P-cores of my i7-12700H matches Intel’s marketing:
As one can see, it operates stably at about 4.68GHz, just as Intel claimed.
The Behavior of the E-cores
Unlike the P-cores, it turns out that the E-cores have extremely stable behavior. All eight E-cores can boost to and stay at 3.5GHz, just as the Intel specification says. There is not even a single outlier point: as you can see in the figure, it’s a completely straight line.
Given Intel’s tight testing and binning quality-control process, it seems very unlikely that I’m seeing all of this only because I got a defective chip. So I conjecture the “wild core” behavior can likely be observed on many i7-12700H CPUs.
Additionally, since i7-12700H is just the same i9-12900 chip with two below-quality P-cores disabled, it would also be interesting to know whether the behavior shows up on higher-end Alder Lake models, like the i9-12900K, as those presumably come from better silicon, but I don’t have the means to validate it.
Nevertheless, from a practical perspective, the action to take is clear: run the benchmark to identify the best cores and the performance-unstable cores on your chip, avoid running benchmarks on the performance-unstable cores, and use the best cores for the most latency-sensitive applications.
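Once a frequency trace has been recorded for every core, picking the best ones can be reduced to a sort. The sketch below assumes that stability matters most (lowest standard deviation first) and that a higher mean frequency breaks ties; both the ordering rule and the name `rank_cores` are illustrative choices, not something prescribed by the measurements.

```python
import statistics


def rank_cores(traces):
    """Order cores from most to least suitable for benchmarking.

    `traces` maps a core number to its list of GHz samples. Cores are
    sorted by standard deviation (rounded to 1 MHz so tiny differences
    do not dominate), then by higher mean frequency.
    """
    def sort_key(item):
        _core, samples = item
        stdev = round(statistics.pstdev(samples), 3)
        return (stdev, -statistics.fmean(samples))

    return [core for core, _ in sorted(traces.items(), key=sort_key)]
```

The first entries of the returned list are the candidates to pin benchmarks to; the last entries are the cores to avoid.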
For example, for my particular chip, physical core 2 (logical cores 4-5) turns out to be the only “gold core”, so `taskset -c 4` for single-threaded benchmarks is a good idea. Similarly, for latency-sensitive multi-threaded applications (like the QtCreator IDE, where UX is heavily affected by auto-completion latency), it is reasonable to modify the startup command in the desktop link to pin it to the good cores (logical cores `0,1,4,5,8,9` on my particular chip).
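For a program you control, the same pinning can also be done from inside the process instead of through `taskset`. Here is a minimal sketch using Python's `os.sched_setaffinity`, which is Linux-only; the `GOOD_CORES` set is the one from my chip and will differ on yours, and the helper names are made up for illustration.

```python
import os

# Logical CPUs of the gold and B-grade cores on the chip discussed in
# this post. This is specific to one chip: measure your own first.
GOOD_CORES = {0, 1, 4, 5, 8, 9}


def usable_cores(preferred, available):
    """Return the preferred cores that are actually available."""
    return set(preferred) & set(available)


def pin_current_process(preferred=GOOD_CORES):
    """Restrict the current process to the preferred cores (Linux only).

    Returns the set of cores the process was pinned to.
    """
    available = os.sched_getaffinity(0)  # cores we may currently run on
    target = usable_cores(preferred, available)
    if not target:
        raise ValueError("none of the preferred cores are available")
    os.sched_setaffinity(0, target)
    return target
```

Calling `pin_current_process()` at startup has roughly the same effect as launching the program under `taskset -c 0,1,4,5,8,9`.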
I’m not an expert at all, but my conjecture is that the increase in clock frequency and number of cores in recent CPUs might be the cause: due to the silicon lottery, the max stable clock frequency is inherently different for each core. So as a chip gets more cores, it becomes exponentially harder to find chips where all cores meet the spec frequency criteria, so maybe that’s why Intel loosened their criteria?
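To make the "exponentially harder" intuition concrete, here is a toy calculation. It assumes, purely for illustration, that each core independently meets the spec frequency with the same probability; real cores on one die are certainly correlated, and the probability value used below is invented, not measured.

```python
def prob_all_cores_meet_spec(p_single, n_cores):
    """Probability that all `n_cores` meet the spec frequency.

    Assumption: every core meets the spec independently with the same
    probability `p_single` (a simplification, not a measured fact).
    """
    return p_single ** n_cores


if __name__ == "__main__":
    # Hypothetical per-core probability of 0.9, for 6 and 8 P-cores.
    for n in (6, 8):
        print(n, prob_all_cores_meet_spec(0.9, n))
```

Under this toy model, with a made-up per-core probability of 0.9, about 53% of 6-core chips would have all cores at spec, dropping to about 43% for 8-core chips.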
On the other hand, boost frequency is designed to go down as more cores become active. So in theory, having one golden core is actually enough, as long as the OS is aware of which core is golden and assigns performance-demanding tasks to that core. However, that doesn’t seem to be the case yet, at least for my Ubuntu running Linux kernel 5.15.
On the other hand, my 7th-generation i7-7700HQ CPU does not have the problem described in this post. ↩︎
The full bash script for the test can be found here. For the least noise, you should use the `isolcpus` boot parameter to isolate a subset of CPUs, reboot, modify the script to only test the isolated subset, then change `isolcpus` to isolate the opposite set of CPUs, reboot, and modify the script to test the opposite set. ↩︎
The two logical CPUs of the physical core exhibit the same behavior, so I only show one of them. Same for other figures in this post. ↩︎
Though if you take a closer look at their specification, you’ll see what Intel claimed is “up to 4.7GHz”, so technically they did not lie, as they never claimed all cores can meet their specification – though, I guess, two cores 0.2GHz slower, two cores 0.35GHz slower and turbulent, one core 0.5GHz slower and highly turbulent is still, hmm. ↩︎