• Is GF104/106s superscalar design really an improvement?

    GF104 seems to be generally regarded as the better Fermi for gaming, since Nvidia has integrated some improvements into the oddly shaped graphics processor of its Geforce GTX 460 cards. For one, the chip itself is aimed at the performance segment and was thus never as large and as power hungry as the original GF100. It also has a much lower ALU:TEX ratio, which is not desirable for the high performance computing (HPC) or workstation cards GF100 was targeted at; 56 out of 64 texture mapping units (TMUs) are enabled in currently shipping products. On the downside, it has fewer shader units in general, having to make do with 336 instead of GF100's 480, 448 and 352 (in descending order for Geforce GTX 480, 470 and 465). They are organized differently, too: instead of grouping 32 ALUs together, GF104 gangs 48 of them up into one Shader Multiprocessor (SM), along with various other units such as eight TMUs, 16 load/store units (L/S), eight units handling the more esoteric transcendental functions (SFUs) and one PolyMorph Engine (PME). But that is not all.
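Just to put the unit counts above side by side, here is a minimal sketch tallying total shader ALUs per card; the SM counts per shipping product (7 of 8 SMs on a GTX 460, 15 of 16 on a GTX 480) are assumptions from public specs, not part of the text above:

```python
# ALUs per Shader Multiprocessor as described in the text.
GF100_SM_ALUS = 32
GF104_SM_ALUS = 48

def chip_alus(alus_per_sm, enabled_sms):
    """Total shader ALUs for a chip with the given number of active SMs."""
    return alus_per_sm * enabled_sms

# GTX 460 ships with 7 of GF104's 8 SMs enabled:
print(chip_alus(GF104_SM_ALUS, 7))   # 336
# GTX 480 uses 15 of GF100's 16 SMs:
print(chip_alus(GF100_SM_ALUS, 15))  # 480
```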

    Superscalar Execution
    As noticed by only very few launch articles (1 (en), 2 (fr), 3 (ger)) for Geforce GTX 460, Nvidia also changed the way the SM schedules work onto its groups of units. In GF100, there were six groups (2x 16 ALUs, 1x 16 L/S, 4 TMUs, 4 SFUs and 16 texture interpolators) and a dual warp scheduler with two dispatch units. A warp is Nvidia-speak for a group of 32 threads, ideally each executing the same instruction at the same time in order to keep the functional units busy. The schedulers could task two out of the six functional groups in any given clock cycle with instructions from one warp each - except for double precision calculations, which would block the second scheduler.
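GF100's issue scheme can be sketched as a toy model; the two-slot limit and the double-precision blocking rule come from the paragraph above, while the opcode and group names are purely illustrative:

```python
# Toy model of GF100's dual warp scheduler: per clock, up to two ready
# warps each issue one instruction to one of the six functional groups.
def issue_cycle(ready_warps):
    """ready_warps: list of (opcode, target_group) front-of-warp
    instructions. Returns what issues this clock; a double-precision op
    (here: any opcode starting with 'D') blocks the second issue slot."""
    issued = []
    for opcode, group in ready_warps:
        issued.append((opcode, group))
        if opcode.startswith("D") or len(issued) == 2:
            break
    return issued

print(issue_cycle([("FMA", "ALU0"), ("TEX", "TMU")]))
# both slots used: [('FMA', 'ALU0'), ('TEX', 'TMU')]
print(issue_cycle([("DMUL", "ALU0"), ("TEX", "TMU")]))
# DP blocks the second slot: [('DMUL', 'ALU0')]
```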

    In GF104, there are now seven functional groups, since the ALUs are divided into three groups of 16, but still only two warp schedulers. To avoid leaving five out of seven groups without anything to do, Nvidia decided to try and go superscalar. Now, that is something AMD has done for years [update: at least, that's what they've marketed for their VLIW execution units], and it is essentially a method of not only relying on enough threads and warps in flight to fully utilize all execution units, but of extracting additional parallelism out of every instruction sequence.

    The warp scheduler and dispatch units had to be reworked: each of the former is now connected to a dual-issue dispatcher and is responsible for checking dependencies between instructions - in other words, making sure they're safe for extracting Instruction Level Parallelism (ILP). If an instruction relies on the outcome of an earlier calculation, it is not considered ILP-safe, and you need to make sure all prerequisite calculations have finished before issuing it.
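The dependency check described above boils down to spotting read-after-write hazards between neighboring instructions. A minimal sketch, with a made-up instruction encoding of (opcode, destination, sources):

```python
def ilp_safe_pairs(instructions):
    """Return consecutive instruction pairs that could dual-issue:
    the second must not read a register the first one writes
    (no read-after-write hazard)."""
    pairs = []
    for a, b in zip(instructions, instructions[1:]):
        _, dest_a, _ = a
        _, _, srcs_b = b
        if dest_a not in srcs_b:
            pairs.append((a, b))
    return pairs

prog = [
    ("MUL", "r0", ("r1", "r2")),
    ("ADD", "r3", ("r0", "r4")),   # reads r0 -> depends on the MUL
    ("ADD", "r5", ("r6", "r7")),   # independent -> can pair with the line above
]
print(ilp_safe_pairs(prog))  # only the last two instructions pair up
```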

    In the best case, you can utilize more of your units this way, achieving a higher throughput with fewer additional transistors. In the worst case... well, you'll see in just a second.

    GPU-Bench 1.21
    This little application was designed back in the days of the last DirectX 9 cards in order to measure their ability to be abused for general calculations. Since DirectX 9 wasn't well suited for those tasks and is generally not an option for researchers who want to run their programs under various flavors of Linux, GPU-Bench consequently uses OpenGL. A variety of tests can be run with it, out of which we'll be focusing on the instruction issue rate for the GPU, which is documented here.

    In short, GPU-Bench uses a stream of simple ARB 1.0 instructions such as ADD, MUL or MAD, as well as transcendental functions, operating on two sets of registers in order to remove dependencies. I have configured the test to run shaders on scalars as well as at vector4 width, with 40 instructions each. In order to rule out any bottlenecks here, I've cross-checked the results with longer sequences of up to several thousand of the same instructions - the results moved a little closer to the respective peak rates, but the general outcome did not change [update: except for short instruction sequences - see below].
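The two-register-set trick can be sketched as follows; GPU-Bench's actual shader generator differs, this just illustrates how ping-ponging between register sets keeps consecutive instructions independent:

```python
def make_stream(op, length):
    """Emit `length` ARB-style instructions that alternate between two
    register sets, so no instruction depends on its predecessor."""
    lines = []
    for i in range(length):
        a, b = ("R0", "R1") if i % 2 == 0 else ("R2", "R3")
        lines.append(f"{op} {a}, {a}, {b};")
    return lines

stream = make_stream("MAD", 40)
print(stream[0])  # MAD R0, R0, R1;
print(stream[1])  # MAD R2, R2, R3;  (touches neither R0 nor R1)
```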

    To make this a little more interesting, I took the liberty of clocking all cards the same:

    • 675 MHz for the engine and fixed function units
    • 1350 MHz for the shader ALUs and
    • 900 MHz for the memory (you might want to factor in a multiplier of 2 for GDDR5-based models).
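With the clocks normalized like this, the theoretical peaks the following charts should be measured against are simple arithmetic (one scalar instruction per ALU per shader clock; the ALU counts are the ones given earlier in the text):

```python
def peak_ginstr(alus, shader_mhz=1350):
    """Peak scalar instruction rate in billions of instructions per
    second, assuming one instruction per ALU per shader clock."""
    return alus * shader_mhz / 1000.0

print(peak_ginstr(336))  # 453.6 GInstr/s for a 336-ALU GTX 460
print(peak_ginstr(480))  # 648.0 GInstr/s for a 480-ALU GTX 480
```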

    Now, here are the results for scalar instruction issue - something the more recent Nvidia GPUs should excel in:

    Scalar Instruction Issue Rate on GF104, GF106, GF100 and other GPUs

    While the older chips really are getting close to one instruction per ALU per clock, and in some cases like G9x and GT200 are even able to expose the infamous "Missing MUL", the GF100 chip is already lagging behind, even though its dual warp schedulers/dispatchers should be able to adequately feed the two 16-wide ALU blocks inside each Shader Multiprocessor.

    With GF104 and also its architectural sibling, the GF106, things look a little different. In pure scalar workloads, something Nvidia has gone out of its way to make a central point of its GPU architectures over the last four years, they notably fall short of good utilization: their issue rate is limited to two thirds (61-63% in this real-world test) of their maximum throughput.
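That two-thirds ceiling has a plausible back-of-the-envelope explanation (my reading of the results, not an official Nvidia figure): with no ILP to extract from a serial scalar stream, each of the two warp schedulers can feed only one 16-wide ALU block per clock, leaving the third block idle.

```python
# Two schedulers, three 16-wide ALU blocks per GF104 SM: without ILP,
# at most 2 of the 3 blocks can be kept busy in any clock cycle.
alu_blocks, block_width, schedulers = 3, 16, 2
ceiling = (schedulers * block_width) / (alu_blocks * block_width)
print(ceiling)  # 0.666... - the measured 61-63% sits just under this cap
```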

    So there you have it: put a scalar workload onto those chips and their transistor savings start to show compared to GF100, and even more so compared to the purely scalar designs of the past years.

    Here's the same test but with an instruction width of 4:

    Vector4 Instruction Issue Rate on GF104, GF106, GF100 and other GPUs

    Even though the GPUs should be able to fully use their potential here, as the older, pre-Fermi chips, GF100 and also my Radeon HD 5870 do (the latter not shown here and only counted as having Vec4 units, since the test does not run with 5-wide instructions), GF104 and GF106 apparently suffer from lower utilization in this case as well, managing only a notch above 80% compared to the 95% and up of the other chips.

    [Update: Here's what I've missed - thanks to trinibwoy over at Beyond3D's forums for making me realize my mistake! In fact, the situation changes only very little for most cards when upping the shader length to 64 or more instructions; they go up from 97 to 98 percent or so. Not so with GF104 and GF106, the only superscalar designs in this test. They actually improve by almost one quarter, going up from their mediocre results at a shader length of only 40 instructions to the level of the competition when the shaders are longer - 64 instructions or more. This also changes the conclusion!]

    Vector4 Instruction Issue Rate with shader length of 64 instructions on GF104, GF106, GF100 and other GPUs

    Either this test is not adequate for modern GPUs and only by accident does what it's supposed to on all but the GF104 and GF106 GPUs, or there are more pitfalls in the reworked Shader Multiprocessors of the "Gamer-Fermi", which has earned Nvidia a better reputation than their original GF100 chip.
    [Update: As the new results indicate, the reworked instruction management only seems to reach full efficiency with somewhat longer shaders. I don't know if this is a limitation of the issue hardware itself or if the results are due to other factors, such as the depth of the ALUs' internal pipelines.

    Anyway, Nvidia probably felt that this was a tradeoff worth making, since shaders are becoming much longer these days and Fermi chips were never designed to achieve peak throughput rates on simple workloads in the first place, as is evident in the known limitation of two pixels per clock per SM.]