• Do not let Power Management mess up your performance analysis!

    Modern Processors' Power Management

    Modern processors, be they of CPU, GPU, APU or SoC flavour, employ very sophisticated power management techniques in order to better leverage the full potential of their functional units in combination with the applied thermal solution. What sounds quite basic is in fact a multi-dimensional topic and can easily distort any performance analysis that does not take the specifics of each processor into account. In this article I outline what contributes to power management in modern processors.

    A while ago - in IT time quite an eternity - processors had a fixed clock speed which they maintained during all operations. They had a fixed TDP that defined the required capabilities of the cooling solution to the fraction of a watt. Of course, power was quite low compared to more recent devices in the desktop space. Mobile devices, being much more dependent on low power consumption, employed simple techniques like a reduced state for clock speed and core voltage as early as the turn of the millennium. With more recent devices, power has become a much more important issue in more than one respect - even in the desktop space. There are a number of reasons for this.
    1. The devices themselves have become immensely more complex - by more than an order of magnitude. From tens of millions of transistors in the early Pentium II and Athlon days, the semiconductor industry has moved to processing devices with just short of a billion transistors, multiple megabytes of cache and multiple processing cores working in concert to speed up the calculation. Other processors such as GPUs are a different kind of beast but plagued by the same problem - even more so, since they do not emphasize single-thread performance as much but are designed as throughput machines that utilize more of their functional units at any given time. Additionally, they do not have as many cache blocks but instead a large number of very fast and very power-intensive physical register files (PRFs) to keep the machine state readily available for a multitude of threads. Mobile devices, on the other hand, draw even less power today than a decade ago, but at the same time they have to combine all-day battery life in a small form factor with fast multimedia processing capabilities like HD movie decode, which of course requires real-time operation.
    2. The workloads have shifted. Despite Windows being marketed as a multitasking operating system from as early as the mid-nineties, basically every task was processed serially, albeit with time-sliced interleaving creating the illusion of multitasking. In today's information technology, real multitasking is a concept that is more and more employed and exploited, even at the operating system level. Nevertheless, most of the time there still is not enough pending work to fully (or even partially) load more than a few processor cores significantly.
    3. The variance of power required per transistor to switch has grown tremendously. Whereas the thermal design power for a certain type of SKU could be given to the tenth of a watt in the time of the Athlon XP and the first generations of the Pentium 4, the bins have since had to be widened by quite a margin. The TDP classes typically employed by processor manufacturers are an upper boundary, which leaves quite a bit of wiggle room for most ASICs falling into each bin in order to maximize the yield for a given spec.
    4. With modern process technology, the power problem has extended from dynamic power, which only hit hard when the device was under load, to static power as well, creating massive amounts of leakage even when the processor's transistors are not switching. Combined with the enormous number of transistors that most processing devices incorporate, this creates high power draw even in no-load or partial-load situations, which by far outnumber full-load scenarios in daily usage. The semiconductor manufacturers are battling this effect with a number of techniques, most of which can only alleviate it to a certain extent but do not really solve the problem.
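
    To make the split between dynamic and static power from point 4 a bit more tangible, here is a minimal Python sketch. It uses the common textbook approximation for CMOS logic (switching power roughly equal to activity x capacitance x voltage squared x frequency, leakage power roughly equal to voltage x leakage current). The function names and all numbers are assumptions chosen purely for illustration; they do not describe any real processor.

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Textbook switching power in watts: alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def static_power(voltage_v, leakage_a):
    """Simplified leakage power in watts: V * I_leak."""
    return voltage_v * leakage_a


def leakage_share(activity, capacitance_f, voltage_v, freq_hz, leakage_a):
    """Fraction of total power that is leakage at a given activity level."""
    dyn = dynamic_power(activity, capacitance_f, voltage_v, freq_hz)
    stat = static_power(voltage_v, leakage_a)
    return stat / (dyn + stat)


# Hypothetical chip parameters (illustrative assumptions only).
CAP_F = 40e-9     # effective switched capacitance in farads
VOLTAGE = 1.2     # core voltage in volts
FREQ_HZ = 3.0e9   # clock frequency in hertz
LEAK_A = 15.0     # total leakage current in amperes

full_load = leakage_share(0.5, CAP_F, VOLTAGE, FREQ_HZ, LEAK_A)
light_load = leakage_share(0.05, CAP_F, VOLTAGE, FREQ_HZ, LEAK_A)

print(f"leakage share under full load:  {full_load:.0%}")
print(f"leakage share under light load: {light_load:.0%}")
```

    With these made-up values, leakage is a minor part of the budget under full load but the dominant part under light load, which is the situation described above.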

    These and probably even more factors create a situation where many processors are sold with a large headroom left until they reach their TDP's upper limit, and the manufacturers have been quite creative in employing ever more complex techniques to exploit that headroom. From a simple up-clocking when only one processor core is loaded and the others are in an energy-saving state up to calculations of the currently used power budget across the whole device, including main processor cores, graphics core, system interconnect and more, there is a variety of mechanisms that try to maximize the user experience in any given situation. Additionally, some processors have an integrated graphics part (IGP) of some sort which their power management units factor into the equation, giving the main cores further wiggle room when the IGP is not used because an external solution has been put into place. The exploitation of thermal wiggle room is a good thing. But there are darker places down the road that you have to be aware of when doing a performance analysis.

    One of those dark places is the exploitation of thermal inertia in order to even exceed the thermal limit for a short amount of time. The idea behind it is that when a processor comes from a low power state, and thus has a low temperature as well, it does not reach its specified temperature even under full load until the thermal solution is saturated. During that window it can draw more than its TDP; only once the temperature has risen to the specified value will the processing device burn the expected wattage. This can, for example, lead to the undesirable effect of measuring only peak throughput instead of sustained performance, especially when a very potent cooling solution is used. The shorter a given benchmark's runtime is, the more it will profit from exploiting thermal inertia, and the less it paints a clean picture of the processor's performance. An obvious solution to this would be to turn off power management features for performance testing altogether, but this way you would thwart the beneficial part of power management even in the short term. A more elaborate approach would carefully monitor to what degree every benchmark is running beyond the specified clock frequency and decide whether that is a representative measure of expected workloads across the processor's active duty life. The popular performance ratings based on average or even normalized benchmark results will be difficult to defend if this problem is not taken into account.
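
    The monitoring idea could be sketched roughly like this: given a log of clock-frequency samples taken at a fixed interval during a benchmark, compute what fraction of the run was spent above the specified clock. How the samples are collected is platform-specific and not shown here; the function name, the sample logs and the 3400 MHz base clock are all hypothetical.

```python
def boost_residency(samples_mhz, base_mhz):
    """Return (fraction of samples above the base clock, mean clock in MHz)."""
    if not samples_mhz:
        raise ValueError("need at least one clock sample")
    boosted = sum(1 for freq in samples_mhz if freq > base_mhz)
    mean_mhz = sum(samples_mhz) / len(samples_mhz)
    return boosted / len(samples_mhz), mean_mhz


BASE_MHZ = 3400  # hypothetical specified clock

# Hypothetical logs, one sample per second.
short_run = [3800, 3800, 3800, 3700, 3700]   # ends before the boost wears off
long_run = short_run + [3400] * 15           # continues at the base clock

for name, run in (("short", short_run), ("long", long_run)):
    fraction, mean_mhz = boost_residency(run, BASE_MHZ)
    print(f"{name} run: {fraction:.0%} of samples boosted, mean {mean_mhz:.0f} MHz")
```

    In this made-up example the short run is boosted for its entire duration while the long run is boosted for only a quarter of it, so the two results would not be comparable without that context.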

    Obviously, this is an extension of the problem that normal turbo operation has created for several years now - you had to carefully judge whether or not the benchmark you chose to run was representative of its class of operations. A popular game like Starcraft 2 would profit from turbo-boosted clock speeds, because it only utilized two cores of a host processor, but its performance would not be indicative of something like Battlefield Bad Company 2 or one of Codemasters' racing games like F1 2010 or Dirt 3, which all use four or more cores. Also, is a 20-second SiSoft Sandra run enough to make the CPU settle at a sustainable speed, or would you just record boosted performance?
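
    For a rough feeling of whether a 20-second run is long enough, one can treat processor plus cooler as a first-order thermal system whose temperature approaches its steady state exponentially with a single time constant. This is a strong simplification (real coolers have several time constants), and the 30-second time constant below is an assumption for illustration, not a measured value. The function names are hypothetical.

```python
import math


def warmup_fraction(run_seconds, tau_seconds):
    """Fraction of the steady-state temperature rise reached after run_seconds."""
    return 1.0 - math.exp(-run_seconds / tau_seconds)


def seconds_to_reach(fraction, tau_seconds):
    """Time needed to reach the given fraction of the steady-state rise."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -tau_seconds * math.log(1.0 - fraction)


TAU_S = 30.0  # assumed thermal time constant of chip plus cooler, in seconds

print(f"after 20 s: {warmup_fraction(20.0, TAU_S):.0%} of the final temperature rise")
print(f"95% of the rise needs about {seconds_to_reach(0.95, TAU_S):.0f} s")
```

    Under these assumptions a 20-second run ends while the cooler is barely half way to saturation. The more potent the cooling solution, the larger the time constant and the longer a benchmark has to run before it reflects sustained performance.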

    There is a similar pitfall with graphics processors (GPUs) nowadays, but they are approaching it from the other end of the road. They are designed to run at a maximum speed, throttling if power management decides that the current load is too high - no matter whether this is done in software (GeForce 400/500 series) or with a hybrid solution (Radeon HD 6000/7000 series). There are certain workloads that will make the graphics card draw more power than the power circuitry and, to a lesser degree, the cooling solution were designed to handle. While the Cayman-based Radeon cards handle those overload situations more gracefully by reducing their clock speeds and core utilization in fine steps, GeForce cards fall off a hard performance cliff, basically halving performance. On the other hand, the throttling mechanism seems to kick in far more often with the Radeon. That makes it difficult to measure actual performance when you do not know whether your card is just at the top end of the variance curve or whether the benchmark you are using is putting too much load on the whole series of SKUs, and thus hard to decide whether or not to deem the result indicative of performance for its class of games.
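
    The difference between the two throttling behaviours can be illustrated with a toy model. It is not a description of the actual vendor mechanisms: it merely assumes that power scales linearly with clock speed at a fixed voltage, and compares stepping the clock down in small increments with simply halving it. Function names, the 10 MHz step and all numbers are hypothetical.

```python
def throttle_fine(clock_mhz, power_w, cap_w, step_mhz=10):
    """Lower the clock in small steps until the estimated power fits the cap."""
    watts_per_mhz = power_w / clock_mhz  # assumption: power is linear in clock
    while clock_mhz > step_mhz and clock_mhz * watts_per_mhz > cap_w:
        clock_mhz -= step_mhz
    return clock_mhz


def throttle_cliff(clock_mhz, power_w, cap_w):
    """Halve the clock as soon as the power cap is exceeded."""
    return clock_mhz // 2 if power_w > cap_w else clock_mhz


# Hypothetical card: 880 MHz, drawing 270 W in a heavy workload, capped at 250 W.
fine = throttle_fine(880, 270.0, 250.0)
cliff = throttle_cliff(880, 270.0, 250.0)

print(f"fine-grained throttling: {fine} MHz ({fine / 880:.0%} of nominal)")
print(f"cliff throttling:        {cliff} MHz ({cliff / 880:.0%} of nominal)")
```

    In this toy case the fine-grained policy gives up less than a tenth of the clock speed while the cliff policy gives up half of it. The flip side is that a small, fine-grained reduction is much easier to overlook in a benchmark result.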

    Unfortunately, I don't see a solution that satisfies every aspect of power management techniques for reviews working within a limited set of resources. I think you can either depict real-world performance in the chosen benchmarks without them being an indicator of general performance, or you can - somewhat artificially - limit the functions of the processors by turning off power management for performance testing, thus only showing peak performance without knowing if and for how long the SKU you're putting to the test can sustain that performance under real-world conditions.

    The key takeaway from bearing with me over the course of this article is twofold.
    1. If you're a customer, be careful about passing judgement on modern processing devices based on just a very limited number of scenarios.
    2. If you're a reviewer, make sure that you call your readers' attention to this trade-off, which you will probably have to make unless you come up with a solution to the dilemma outlined in this article.