• GF100 vs. Archmark - 3.51 Triangles per clock

    When Nvidia presented its new Fermi architecture back in September 2009, Jen-Hsun Huang and his employees were particularly proud of one thing: geometry. Rightfully, they told everyone who would listen that, compared to the improvements in pixel and shading throughput, the triangle side of things had been almost completely ignored by the IHVs since the introduction of the first DirectX 9 cards. That's not because the pixel fanboys outnumber the triangle fanboys in AMD's and Nvidia's engineering teams, but because geometry is actually a very hard thing to parallelize. For various reasons (all of which would lead too far to explain here, plus I don't fully understand them all myself), you need to keep the rendering order of triangles consistent if you actually want any reasonable and predictable output on the screen. Preserving this order seems to be very costly to implement, and before the advent of DirectX 11 and its mandatory tessellation feature, a geometry throughput on the order of a few hundred million triangles per second seemed fine.

    With DirectX 11 and tessellation, however, a more or less simple shader program can create thousands of vertices out of only one primitive directly on the GPU - no external busses which might bottleneck data transfer are involved any more. If you've got a scene with, say, 1 million polygons and start to tessellate it, that number explodes and would suffocate the triangle setup on any standard-issue GPU. That was especially the case with Nvidia's chips up to and including GT200: those could only set up one triangle per clock cycle with culling activated, and if you actually wanted the triangle drawn, the rate dropped to half. AMD's Radeons did better and could set up and draw one triangle per clock cycle, in addition to their significantly higher clocks.
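
    To put a rough number on that explosion, here's a back-of-the-envelope sketch in C. The scene size, the amplification factor of 16, the 60 fps target and the ~600 MHz GT200 clock are illustrative assumptions of mine, not measured values:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not measurements: a 1-million-triangle
       scene, a modest tessellation amplification of 16 output triangles
       per input triangle, a 60 fps target and a GT200-class setup rate
       of one culled triangle per clock at roughly 600 MHz. */
    const double base_triangles   = 1e6;
    const double amplification    = 16.0;
    const double frames_per_sec   = 60.0;
    const double setup_rate_gt200 = 600e6;   /* tris/s, 1 per clock at 600 MHz */

    double tris_per_frame  = base_triangles * amplification;
    double tris_per_second = tris_per_frame * frames_per_sec;

    printf("Triangles per frame : %.0f\n", tris_per_frame);
    printf("Triangles per second: %.2f billion\n", tris_per_second / 1e9);
    printf("GT200 setup budget  : %.2f billion/s -> %.1fx over budget\n",
           setup_rate_gt200 / 1e9, tris_per_second / setup_rate_gt200);
    return 0;
}
```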

    Geometry Pipeline - Implementation

    AMD was first to incorporate tessellation hardware into its chips. Every modern GPU from them, starting with the Xbox 360's chip, had a dedicated tessellator. Yet those did not quite meet the requirements of DirectX 11, so it was only from September 2009, when AMD actually launched the HD 5000 series, that there was usable tessellation in the PC space. From the tests I've seen so far, it seems as if using this tessellation hardware incurs a penalty of about 3 clock cycles. In other words, the triangle rate with tessellation activated drops to about a third of the maximum throughput the cards can reach otherwise. This is also loosely illustrated in one of my earlier articles, where I used Ozone3D's Tessmark to show the tessellation throughput of the Geforce GTS 450.
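
    Expressed as a quick calculation (a sketch only: the one-triangle-per-clock base rate is the Radeon figure mentioned above, the 850 MHz clock is the HD 5800-class stock clock, and the three clocks per tessellated triangle is the observation just described):

```c
#include <stdio.h>

int main(void)
{
    /* HD 5800-class example: 1 triangle per clock at 850 MHz,
       and roughly 3 clocks per triangle once the tessellator is used. */
    const double clock_hz        = 850e6;
    const double clocks_per_tri  = 1.0;   /* plain triangles            */
    const double clocks_per_tess = 3.0;   /* observed with tessellation */

    printf("plain      : %.0f Mtris/s\n", clock_hz / clocks_per_tri  / 1e6);
    printf("tessellated: %.0f Mtris/s\n", clock_hz / clocks_per_tess / 1e6);
    return 0;
}
```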

    Nvidia did go the painful route, and just as AMD had touted its entry into the DirectX 10 era with the mighty R600 as "DirectX 10 done right", Nvidia boasts that its Fermi architecture is "DirectX 11 done right". The reason being that they parallelized geometry in the Fermi architecture, though not as finely grained as pixels. Each Fermi chip features something Nvidia calls GPCs, Graphics (or Geometry?) Processing Clusters. GF100, the first incarnation, has four of these conceptual entities, which some people believe are just that: a marketing concept with no real hardware behind it. Be that as it may, the bottom line is that each part of a Fermi GPU organized into a GPC is able to process geometry all by itself, so the key takeaway is: the more GPCs there are, the better the geometry performance (for an example see the link above).
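
    As a quick sanity check of that scaling claim, here is a trivial sketch of the theoretical setup rates. The GPC counts and clocks are the stock specs as I understand them, so treat them as assumptions:

```c
#include <stdio.h>

/* Theoretical setup rate = GPCs x 1 triangle/clock x core clock.
   GPC counts and clocks are assumed from the cards' stock specs. */
static void peak(const char *name, int gpcs, double clock_mhz)
{
    printf("%-8s %d GPC(s) @ %4.0f MHz -> %.2f Gtris/s peak setup\n",
           name, gpcs, clock_mhz, gpcs * clock_mhz / 1000.0);
}

int main(void)
{
    peak("GTX 480", 4, 700.0);   /* GF100, all four GPCs enabled */
    peak("GTS 450", 1, 783.0);   /* GF106, a single GPC          */
    return 0;
}
```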

    But enough of the small talk now, let's get into the thick of things.

    Enter the Archmark
    Nvidia's engineers understandably were quite frank about the theoretical triangle rate of GF100, which amounts to four triangles per clock, set up and also drawn. They even admitted freely that this theoretical rate is not reached in real-world tests; instead, I heard somewhere the number 3.2 quoted as the observed triangle rate in directed tests. While that rate is only reachable with very small triangles and tessellation enabled when you actually want to draw something, the pure setup rate (before culling) is interesting as well, because it can alleviate bottlenecks in other applications too, without having to rely on the DirectX 11 API being used.

    For today's piece of information, I dug up an older favourite of mine: an OpenGL-based benchmark called Archmark (short for architecture mark). You can download it and read more about it on the Archmark homepage. Sadly, for a good part of its tests it uses buffer and/or texture formats no longer supported by the OpenGL ICD of AMD's Catalyst drivers, so it's no longer quite as usable for architecture comparisons as it once was. On the Geforce side it still works like a charm though, and so I decided to use it for this article.

    Inside Archmark there's a geometry test where the triangles are set up as a fan, so that each additional vertex creates another triangle. It doesn't draw those triangles though, so obviously this test is ideal for getting close to peak throughput on most cards out there (a rough sketch of such a test follows below). And that's what I did with a Geforce GTX 480 running at stock clocks (700 MHz engine, 1400 MHz shader and 924 MHz memory clock) under Windows XP, which I've found to produce much better results for low-level tests than the current king of Windows, WinNT 6.1.
    The parallelized geometry architecture of GF100 shows its strength - 3.51 triangles per clock are set up, without drawing them.
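
    To give an idea of how such a setup-limited fan test can be built, here's a minimal sketch - emphatically not Archmark's actual code. It issues long GL_TRIANGLE_FAN batches whose vertices are all identical, so every additional vertex adds one zero-area triangle that gets set up but culled instead of drawn; the vertex count, repeat count and the use of freeglut are my own choices:

```c
/* Build e.g. with: gcc fan.c -o fan -lglut -lGL */
#include <GL/glut.h>
#include <stdio.h>

#define VERTS   100000               /* ~VERTS-2 triangles per fan       */
#define REPEATS 2000                 /* fans drawn per measurement       */

static float verts[VERTS][2];        /* all zero -> zero-area triangles  */

static void display(void)
{
    glFinish();                      /* drain previous work first        */
    int start = glutGet(GLUT_ELAPSED_TIME);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, verts);
    for (int i = 0; i < REPEATS; ++i)
        glDrawArrays(GL_TRIANGLE_FAN, 0, VERTS);
    glFinish();                      /* wait until the GPU is done       */

    double secs = (glutGet(GLUT_ELAPSED_TIME) - start) / 1000.0;
    double tris = (double)REPEATS * (VERTS - 2);
    printf("%.0f Mtris/s\n", tris / secs / 1e6);

    glutSwapBuffers();
    glutPostRedisplay();             /* measure continuously             */
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutCreateWindow("triangle fan setup test");
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
```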

    The Archmark actually reaches 2,462,000,000 set-up triangles per second (yes, that's 2,462 million or 2.462 billion triangles per second), peaking at 3.5171428571428571428571428571429 (says calc.exe) triangles per clock and getting quite close to the theoretical peak of GF100. Funny side remark: my friend Damien already got exactly the same number in his test of the Geforce GTX 480 over at www.hardware.fr.

    Running the same test with the one-GPC GTS 450, I am getting 733.65 million triangles per second, which amounts to an even higher utilization of 93.7% of the theoretical peak in this test. For comparison: GF100's utilization is 87.9% of its theoretical peak - not bad considering that there probably is some amount of overhead involved when four GPCs have to agree on the triangle sorting order. Just for kicks, a GTX 280 comes in at 95.6% efficiency at setup, but loses badly when actually trying to draw the triangles.
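
    The utilization figures are simply the measured setup rate divided by the theoretical peak from GPC count and core clock. Here is that calculation spelled out with the numbers quoted in this article (clocks and GPC counts again assumed from the stock specs):

```c
#include <stdio.h>

/* Utilization = measured setup rate / (GPCs x core clock).
   Measured rates are the Archmark results quoted above;
   clocks and GPC counts are assumed stock specs. */
static void util(const char *name, double mtris, int gpcs, double clock_mhz)
{
    double peak = gpcs * clock_mhz;              /* Mtris/s */
    printf("%-8s %7.2f Mtris/s of %4.0f peak -> %4.1f%% (%.2f tris/clock)\n",
           name, mtris, peak, 100.0 * mtris / peak, mtris / clock_mhz);
}

int main(void)
{
    util("GTX 480", 2462.00, 4, 700.0);
    util("GTS 450",  733.65, 1, 783.0);
    return 0;
}
```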

    The highest non-hardware-tessellated triangle rates I've seen so far are from the Xvox demo by Jan Vlietinck. At the link, there's also extensive technical documentation about how he achieves a tessellation-like effect with only basic vertex shaders and vertex streams. The Xvox demo puts a GTX 480 at about 1.6 GTris/s, a GTX 470 hovers around 1.4 GTris/s, a GTX 460 gets a clean and round 1 GTris/s and a GTS 450 finally is at 684 million triangles per second. HD 5000 series cards with an 850 MHz core clock get about 740 MTris/s - the rest behind the front end does not seem to matter: an HD 5870, a 5770 and a 5670 overclocked to 850 MHz all reach about the same result. The Cedar chip in the HD 5450 though seems to have only a third of the triangle rate, coming in at only 185 MTris/s. Unfortunately, I suspect those rates include culled and drawn triangles alike, since G8x/G9x-based Nvidia cards reach way more than their nominal half rate for drawn polygons.

    [Screenshots: Xvox demo triangle rates for Geforce GTX 480, GTX 470, GTX 460 and GTS 450]

    Now, compared to the professional Tesla and Quadro line-ups, the Geforce products based on the Fermi architecture are severely limited in real-world applications, and not only in double precision throughput (which is cut in half two times, i.e. down to 1/8th rate): the triangle performance for drawn triangles is also somewhat limited. That's obviously for reasons of product differentiation, mind you.