The 7.1-billion-transistor chip was introduced at last year's GPU Technology Conference on Nvidia's home turf. It was touted as the best thing since sliced bread and first productized as the Tesla K20 and Tesla K20X accelerator cards at SC12 in November 2012. Conveniently, it was by then already being put to use in the world's fastest supercomputer, ORNL's Titan.
Speaking of Titan, the rumor mill has it that the card to be launched in the next few weeks (if the rumors are correct) also carries this name: Geforce Titan. Whether it will come with a number (I'd like "Titan 1") or not, or whether it's true at all, is still shrouded in the crystal balls of the fortune tellers. What is equally unknown outside Nvidia is the exact configuration Titan 1 will carry, and in this article I am going to explore a few variables and give reasons as to why I think Geforce Titan will (or should) be configured one way or another.
Remember that inside the 7.1 billion transistors that make up a GK110, there's a plethora of functional units, some of which can be disabled at Nvidia's discretion at fusing time. The 2,880 shader ALUs (or CUDA cores in Nvidian) are organized in 15 clusters (Nv.: SMX), which in turn are grouped into five GPCs or Graphics Processing Clusters, the highest level of hierarchy within the chip. Alongside, there are six memory channels, each 64 bits wide and sporting 256 KiB of L2 cache (read/write). Those are tightly integrated with the six ROP partitions, which handle eight pixels per clock each and feature an 8x mode for Z/stencil-only operations. Apart from that, there's some display circuitry, some I/O (PCI Express 3.0) and the command processor that's responsible for feeding the GPCs and SMXs with 32-wide work items called warps.
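As a quick sanity check, the chip-wide totals follow directly from the per-unit figures above (the 192 CUDA cores per SMX are Kepler's published per-cluster count):

```python
# Back-of-the-envelope totals for a fully enabled GK110,
# derived from the per-unit figures given in the text.
SMX_COUNT = 15
ALUS_PER_SMX = 192            # CUDA cores per SMX (Kepler)
MEM_CHANNELS = 6
CHANNEL_WIDTH_BITS = 64
L2_PER_CHANNEL_KIB = 256
ROP_PARTITIONS = 6
PIXELS_PER_ROP_PARTITION = 8

print(SMX_COUNT * ALUS_PER_SMX)                   # 2880 shader ALUs
print(MEM_CHANNELS * CHANNEL_WIDTH_BITS)          # 384-bit memory interface
print(MEM_CHANNELS * L2_PER_CHANNEL_KIB)          # 1536 KiB of L2 cache
print(ROP_PARTITIONS * PIXELS_PER_ROP_PARTITION)  # 48 pixels per clock
```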
Modern processors are immensely complex, and even the most recent process technology cannot guarantee completely error-free manufacturing. Those errors occur as microscopically tiny imperfections in the silicon wafer from which the ASICs are cut. Since each and every one of those imperfections can render useless a fully specced processor that has the misfortune of being lasered from an impure area, chipmakers resort to harvesting. This means that if there's an error in one of the memory controllers, they simply blow a fuse at testing time, and instead of a 384-bit memory interface you have 320 bits left, but can still use and sell the chip. The same is true for other largely self-contained functional units like the aforementioned SMX. For instance, the Tesla K20 comes with 13 of 15 SMX and 5 of 6 memory-controller/ROP partitions enabled. The higher-end model K20X has the full memory configuration enabled, but still one of the SMX is deactivated, leaving only 2,688 shader ALUs running.
What the Geforce Titan 1 will look like is unknown, as I've mentioned before, but there are certain possibilities and indications as to where Nvidia could head with its new single-GPU flagship. In the following I'll go through some bullet points explaining why I think certain specs are more likely than others.
- The GK110 GPU is more than 70 percent larger than its gaming sibling GK104, which powers the current Nvidia single-GPU flagship, the Geforce GTX 680.
- The possible memory configuration will be either 2,560 MiB (320 bit) or 3,072 MiB (384 bit), or double those amounts, respectively. That is 25 to 50 percent higher memory cost, at least at the board level, in addition to the costlier GPU.
- Given that pricing has been quite firm in the high end over the last months, chances are high that Nvidia will not introduce Geforce Titan 1 at the 499 (or 599) US dollar price point and lower the prices of the lesser models accordingly.
- Geforce Titan 1 will be a high-end model only (I am deliberately avoiding the term enthusiast). As such, it won't sell in very high numbers, and Nvidia won't need to set aside hundreds of thousands of GPUs for the high-end segment of the market. Instead, maybe even a single production run of about ten to twenty thousand GPUs would suffice to satisfy demand if the card is priced at 699 or even 799 US dollars.
- If need be and market dynamics change, Nvidia could still create a slightly less potent variant by cutting down functional units, omitting memory or backpedalling on the clock-speed front.
- The only public die shot of GK110 silicon, from heise.de, indicates a chip in A1 stepping. In other words, not much needed fixing, and no further revision of the chip was required to achieve the desired productization options.
- Geforce Titan will be the high end of Nvidia's line-up; the Geforce GTX 690 has already been discontinued, and only remnants of its production run are being sold right now.
- It therefore has to sit above the Geforce GTX 680, and by a comfortable margin as well, to justify the expected price premium.
- Titan 1 will be a prestige object. It will fill the space that dual-GPU cards, with all their shortcomings, used to occupy. That probably leaves only very little room for a second SKU, which would need to be priced lower, but could never reach the prices Nvidia commands for the same chip in the professional market, branded as Tesla (which are already harvested parts) or Quadro.
- People interested in buying the fastest card out there for an immense amount of money probably don't want a cut-down, i.e. not fully enabled, product.
- Going upwards from GK104, Nvidia has a thermal budget of about 55 watts left for a 250-watt target TDP, or 105 watts if they go straight for the 300-watt barrier, which I think is more likely, since it's already inherently possible with one 6-pin and one 8-pin PCIe power connector and fits the high-end target audience.
- With the Kepler architecture, Nvidia has demonstrated that they're able to control the power consumption of their SKUs very accurately, and are thus able to contain even a potentially higher-power SKU precisely to a target TDP of 250 or 300 watts.
- The GK110-based Tesla K20 cards differ by only 10 watts in nominal TDP, despite the K20X sporting 20% more GDDR5 memory, 7.7% more computing resources and a 3.7% higher clock speed. The active thermal solution does not seem to make much of a difference either, because its rated 10.3 watts (21.6 watts max) do not alter the board's TDP, which remains at the same 225 watts as with the passive solution. This suggests there's either quite a bit of TDP headroom for the K20 Teslas or quite a bit of variation among the large GK110 chips – possibly both. That in turn makes a careful selection of the lowest-power chips all the more feasible and beneficial for a high-performance SKU.
- Higher frequencies need higher voltages. This is easily observed with GK104, which boosts its clocks by about 10 percent on light/average workloads, but needs more than 10 percent additional core voltage to do so. The problem is that power consumption rises not linearly but roughly with the square of the voltage. Core clocks can therefore only be pushed a certain, small amount before power runs out of bounds.
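The last point can be made concrete with the usual first-order model of dynamic power, P ∝ f · V² (a simplification that ignores leakage): a 10 percent clock bump that needs 10 percent more core voltage costs roughly a third more power.

```python
# First-order dynamic power model: P scales with f * V^2 (leakage ignored).
def relative_power(clock_factor: float, voltage_factor: float) -> float:
    return clock_factor * voltage_factor ** 2

# GK104-style boost: ~10% more clock at ~10% more core voltage.
increase = relative_power(1.10, 1.10) - 1.0
print(f"{increase:.1%}")  # 33.1% more power for ~10% more performance
```

That asymmetry is exactly why a wide, modestly clocked chip is the power-efficient way to the top.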
Reasoning behind productization of Geforce Titan
From the prerequisites outlined above, it seems most likely to me that Nvidia's best option would be to release a fully enabled, modestly clocked GK110 chip to recapture the performance crown.
In order not to cannibalize sales of their low-cost/high-margin GK104-based SKUs like the GTX 680, Nvidia needs a high price. A high price is a twofold benefit: you can easily build a quality "halo" card with a lot of memory that generates positive reviews, which benefit its smaller siblings, while at the same time not being forced to supply a high volume of this product. People also tend to see a high price as justified on the flagship product. Additionally, Nvidia has already probed its customers' price tolerance and the market demand with its dual-GPU solutions, the 999 US dollar Geforce GTX 690 being the last installment of those to date. For the power and performance reasons outlined above, it is best to have as much parallelism as possible while keeping clocks in a reasonable range (that is, not going for the last 10 percent that might be possible with higher voltage) – assuming your architecture scales pretty well with the number of functional units, which has not been a real problem for Nvidia as of late.
Compared to the Tesla products, the GTX 680 has at least 37% higher clocks (not counting the Boost state), offsetting some of GK110's advantage in unit count, so that even a fully enabled GK110 at K20X clocks is only 36% faster on paper than a GTX 680 at its base clock of 1006 MHz. While Nvidia has power to spare coming from GK104 and its 195-watt TDP in the Geforce GTX 680, it's not as if the 7.1 billion transistors in GK110 come for free. Some of them will even burn a small amount of power without ever being useful in games, such as the ECC circuits for register files and caches or the 960 FP64 ALUs, so it's perfectly reasonable to assume that GK110 will not hit GK104's frequencies on a regular basis. For marketing reasons, Nvidia might try to bin the chips for the lowest possible power and aim for a 1 GHz core clock, but I'm highly doubtful of that. I find it rather likely that they're shooting for a target core clock of 850 to 900 MHz and will enable a little bit of Boost headroom in order to exploit the power-characteristic variation of the dies.
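The 37 and 36 percent figures follow from the published base clocks (1006 MHz for the GTX 680, 732 MHz for the Tesla K20X) and the respective shader counts:

```python
# Paper-spec comparison: GTX 680 (full GK104) vs. a hypothetical full GK110
# running at Tesla K20X clocks.
GTX680_CLOCK_MHZ, GTX680_ALUS = 1006, 1536
K20X_CLOCK_MHZ = 732
GK110_FULL_ALUS = 2880

clock_advantage = GTX680_CLOCK_MHZ / K20X_CLOCK_MHZ - 1
print(f"{clock_advantage:.0%}")  # 37% higher clocks for the GTX 680

# Theoretical shader throughput scales with ALU count * clock.
throughput_ratio = (GK110_FULL_ALUS * K20X_CLOCK_MHZ) / (GTX680_ALUS * GTX680_CLOCK_MHZ)
print(f"{throughput_ratio - 1:.0%}")  # full GK110 only 36% faster on paper
```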
Nvidia has been producing GK110 chips since at least the end of August 2012 and has delivered at least 30,000 Tesla K20(X) accelerator boards to major supercomputing installations, going by their launch claim of 30 PetaFLOPS in 30 days. A rough estimate based on a die size of 23 × 23 mm tells me that they must have produced at the very least 300 wafers by then. Even if Nvidia, by some mysterious cosmic accident, had stopped having TSMC produce GK110 chips, that would be around 2,700 good dies from the center of the wafers, where the silicon is least tainted. Assuming that only 50 percent of those dies are actually error-free, we'd have 1,350 fully capable GK110 dies – easily enough for a worldwide launch in the ultra-expensive segment, while the rest of the GPUs can be binned and sold as Tesla accelerators, as Nvidia does today.
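My estimate can be reproduced roughly like this; the dies-per-wafer and center-die counts are my own round assumptions, not anything Nvidia has published:

```python
import math

# Assumptions for the wafer estimate (my guesses, not official figures).
DIE_EDGE_MM = 23
WAFER_DIAMETER_MM = 300
TESLA_BOARDS_SHIPPED = 30_000   # from the "30 PetaFLOPS in 30 days" claim

die_area = DIE_EDGE_MM ** 2                           # 529 mm^2
wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2   # ~70,686 mm^2

# Assume ~100 whole candidate dies per wafer after edge loss (rough guess).
dies_per_wafer = 100
wafers_needed = TESLA_BOARDS_SHIPPED / dies_per_wafer
print(wafers_needed)  # 300 wafers at the very least

# Say ~9 dies per wafer come from the sweet spot near the center,
# and only half of those turn out fully error-free:
full_dies = 300 * 9 * 0.5
print(full_dies)  # 1350 fully enabled GK110 candidates
```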
Geforce Titan 1 will launch initially as a high-end SKU with 3 gigabytes of GDDR5 memory for 800 to 850 US dollars. It will feature a fully enabled GK110 with 5 GPCs/15 SMX/2,880 ALUs at a core clock of about 850 to 900 MHz and a small Boost, for the reasons given above. For power reasons, the memory will probably run at only 2.8 GHz (5.6 GT/s effective) – depending on the effectiveness of the per-channel doubling of L2 cache compared to GK104 – allowing for 268.8 GB/s of bandwidth. It will require a 6- and an 8-pin PCIe power connector and thus be able to sustain a TDP of 300 watts (which might be stated a bit lower for marketing reasons).
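The bandwidth figure is simply bus width times effective data rate (GDDR5 transfers data at twice its 2.8 GHz write clock, i.e. 5.6 GT/s per pin):

```python
# GDDR5 bandwidth for the speculated Titan 1 memory configuration.
BUS_WIDTH_BITS = 384
EFFECTIVE_RATE_GTPS = 5.6   # per-pin data rate at a 2.8 GHz write clock

bandwidth_gb_s = BUS_WIDTH_BITS / 8 * EFFECTIVE_RATE_GTPS
print(f"{bandwidth_gb_s:.1f} GB/s")  # 268.8 GB/s
```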
This of course is all pure speculation on my part and may prove to be far off the mark; none of it is confirmed or endorsed by Nvidia in any way.
Update Feb. 21st, 2013, evening:
Now that the Titan is officially launched and I've been proven wrong (and released from all NDAs), let me add that when I wrote the above, it was in good faith, with high hopes, and just before any briefings took place for my region. All in all, the Geforce GTX Titan is a marvellous piece of hardware and by far the fastest single GPU in the world. But there are two things really bothering me – no wait, actually it's three:
First, NOT the price, but the fact that you don't get a fully enabled product in Nvidia's flagship range.
Second, the temperature-dependent GPU Boost 2.0 seems to be targeted primarily at H2O- and LN2-cooling addicts and friends of short benchmark runs, not at real gamers who spend hours and hours in the worlds of their favourite games. They are the least likely to profit from the "benchmark boost".
Third, the price. Yep, only in third place, because buying high-end cards is not really a matter of reason anyway; value seekers had better look at the 100 to 120 dollar range.
In case you need to read a couple more reviews, I would suggest these two, even if you'll probably need an online translator:
• Geforce GTX Titan launch review #1 (PCGH.de, German)
• Geforce GTX Titan launch review #2 (Hardware.fr, French)
And while you're at it, watch this video:
• First time ever 2 TFLOPS DP from a single Processor (Youtube)