OT: New Nvidia Cards to come in RTX and GTX versions?! RTX Titan first whispers.

Comments

  • kyoto kid Posts: 40,627
    jonascs said:
    ebergerly said:
    jonascs said:

     

    Cool. So did you get the expensive NVLink or the RTX version?

    According to the videos I saw on YouTube the NVlink for RTX is smaller and is a single link, whilst the Quadro Links are dual and are wider and won't fit. So I ordered the RTX link directly from Nvidia. ;)

    ..again the dual bridges are only compatible with the Quadro GP100, GV100 and Tesla V100.  Linking RTX Quadros will use a new single bridge that apparently is not available yet.

  • kyoto kid Posts: 40,627
    7thOmen said:

    Maybe the higher bandwidth is only being used to get the frame data to the primary display card faster, thus an attempt to reduce the dreaded micro stutter that SLI historically had. Imagine the data load at resolutions of 4k and beyond. I suppose we could do the math...

    I find it curious that, since multi-card NVLink reviews have appeared, no one really seems to be bothering with memory pooling. I wonder why? Maybe it's because gaming's needs haven't really eclipsed 8-11 GB of VRAM and nVidia knows this?

     

    Omen

    ...again, we are but a small segment of the consumer market who would have the need for going beyond the "11 GB barrier".  I still wonder why, for the price, they just didn't add one more VRAM chip to the 2080Ti to make it 12 GB (and maybe two to the 2080 to make it 10 GB) given the price increase. The Titan-X had 12 GB and it was still branded as a GTX card.  I have yet to see any news of an "RTX Titan" being released, and if there were one, it would likely cost as much as or more than the 16 GB RTX5000, just as the Titan-V does.

    I also wonder who will be the customer base for the 2070 given it also has a much higher price tag (about that of the original 1080) and is not NVLink compatible.

  • outrider42 Posts: 3,679
    ebergerly said:

    I'm merely trying to determine the facts, to address the belief that the RTX cards will allow VRAM stacking even though their NVLink bridges are not the full implementation. I'm certainly not trying to knock the RTX cards or anything like that, just get the facts.

    Linus says no, there will be no resource pooling. Tom Peterson of NVIDIA says something like "yeah, but only if developers do the work to implement it", and it's not automatic like with the Quadro cards. Octane and V-Ray developers seem to be claiming that they have implemented VRAM stacking. And based on the major hardware differences between the Quadro and RTX NVLink interconnections, it doesn't quite compute (IMO) that VRAM stacking is doable. Arguably the RTX version is just a high-speed SLI bridge, and since VRAM stacking hasn't happened over all these years on other SLI connections, I wonder why (or if) it's really doable now?

    There's no question that the RTX cards have the capacity to be amazing. And even without all the software pieces in place yet they still have more than what might be expected for a new generation of GPUs. 

    The answer to your question comes down to this: Which one of the men works for Nvidia? Another thing, Linus did not even test the RTX gaming cards in his video. He used Quadro on gaming. Linus tends to focus a lot more on gaming. He might touch on CAD and Quadro once in a while, but when he speaks about performance, he almost always speaks exclusively to gaming.

    So really, both Tom and Linus can be correct here. Linus could be correct because he is probably thinking of gaming, and Tom did say that stacking is NOT enabled on the cards by default. They must be programmed to use the feature, and odds are that most games will not do this. There are still bottlenecks in place that make this not ideal for gaming at all. The extra bandwidth can help a lot, but having each card read VRAM across NVLink is still too slow with what the gaming NVLink uses. The fastest the 2080ti NVLink gets is 100 GB/s. VRAM is running several times faster than that, and that is why stacking is not a good idea. If you want to see just how much bandwidth can screw over a GPU, I found a very interesting video by GamersNexus that reviews the 1030 GPU that shipped with DDR4 memory. You read that correctly, DDR4, the same kind of memory in your system, NOT GDDR5 or even GDDR4. The 1030 has a GDDR5 version as well, so this allows us to make a great comparison of just how much slow VRAM can cripple performance. Spoiler alert: It makes a HUGE difference.

    Also, as he says in the video, do NOT buy that card for any reason!

    So this is why you cannot pool memory for gaming with NVLink. Or at least why you shouldn't. The faster NVLinks reach 300 GB/s, and that is pretty solid for VRAM, as that is basically the same speed the GDDR5X VRAM ran at in the 1080. GDDR6 is of course faster still, but 300 GB/s would be totally usable.
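    As a rough sanity check on those numbers: peak memory bandwidth is roughly the per-pin data rate times the bus width divided by eight. Here is a minimal host-side sketch; the data-rate and bus-width figures below are commonly quoted values for these cards, so treat them as assumptions rather than official spec:

```cpp
// Rough bandwidth sanity check: peak GB/s ~= data rate per pin (Gbps) * bus width (bits) / 8.
// The figures below are commonly quoted values, used here as assumptions, not official spec.
#include <cstdio>

static double peakGBs(double gbpsPerPin, int busBits) {
    return gbpsPerPin * busBits / 8.0;
}

int main() {
    printf("2080 Ti GDDR6   (14 Gbps, 352-bit): %.0f GB/s\n", peakGBs(14.0, 352)); // ~616 GB/s
    printf("GTX 1080 GDDR5X (10 Gbps, 256-bit): %.0f GB/s\n", peakGBs(10.0, 256)); // ~320 GB/s
    printf("2080 Ti NVLink  (two x8 links):     %.0f GB/s\n", 100.0); // per Nvidia's Turing whitepaper
    return 0;
}
```

    That puts the two-link NVLink at roughly a sixth of the 2080ti's local VRAM bandwidth, and in the same ballpark as a GTX 1080's GDDR5X.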

    At any rate, programs like Octane and Iray are not affected so much by memory speed because of how they work. So memory stacking, even at 100 GB/s, is very doable.

    I think the reason that Linus said stacking was not possible was to avoid confusing gamers. There is already enough confusion going on with RT and Tensor and everything else, so it makes sense to give a simple answer of "no" when asked about this subject. But ultimately it is possible.

    The 2080ti has dual channel NVLink. As I said earlier, it very closely matches the lower Quadro RTX 6000 in NVLink spec. The 2080 on the other hand seems to have truly gimped NVLink performance with single channel, where it is not possible to stack VRAM at all. So only the 2080ti can stack VRAM at this point.

     

    kyoto kid said:
    7thOmen said:

    Maybe the higher bandwidth is only being used to get the frame data to the primary display card faster, thus an attempt to reduce the dreaded micro stutter that SLI historically had. Imagine the data load at resolutions of 4k and beyond. I suppose we could do the math...

    I find it curious that, since multi-card NVLink reviews have appeared, no one really seems to be bothering with memory pooling. I wonder why? Maybe it's because gaming's needs haven't really eclipsed 8-11 GB of VRAM and nVidia knows this?

     

    Omen

    ...again, we are but a small segment of the consumer market who would have the need for going beyond the "11 GB barrier".  I still wonder why, for the price, they just didn't add one more VRAM chip to the 2080Ti to make it 12 GB (and maybe two to the 2080 to make it 10 GB) given the price increase. The Titan-X had 12 GB and it was still branded as a GTX card.  I have yet to see any news of an "RTX Titan" being released, and if there were one, it would likely cost as much as or more than the 16 GB RTX5000, just as the Titan-V does.

    I also wonder who will be the customer base for the 2070 given it also has a much higher price tag (about that of the original 1080) and is not NVLink compatible.

    The 2080ti is not the full Turing chip, so if Nvidia wants to do an RTX Titan, they have that option. However, the full chip is not that much bigger than the 2080ti, so performance would only be slightly faster. So what would make the RTX Titan more compelling? Why, more VRAM, of course. The price of this Titan is totally up in the air, and I'll tell you why. The reason Nvidia left the Titan out is so they have an option to combat whatever AMD releases next year. It all depends on AMD. If AMD somehow pulls a Ryzen-style comeback out of their hat, then the RTX Titan may have more VRAM and a lower price tag. But if AMD still cannot compete, then the RTX Titan will cost A LOT more. Remember this: the 1080ti was expected to cost a lot more than it did. Plus, the previous releases, the 1080 and 1070, had expensive Founder's Editions. There was no Founder's Edition nonsense for the 1080ti, and it launched at an amazing price of $700. This was right before Vega launched, and right AFTER the first Ryzen launched. I personally believe (and I am not alone in this) that Ryzen scared Nvidia into thinking Vega might be a really strong release. So they pushed out the 1080ti at $700. And they launched the 1070ti to tackle Vega 56, which further proves how Nvidia can and will retaliate.

    You mentioned the 970ti several times before, and the reason I believe the 970ti never came out was simply because they did not need it. The 970 sold great, and was on top of its competitor. The infamous 3.5 and 0.5 GB VRAM thing never actually impacted sales much. And it is probably this fact that ultimately led them to kill the 970ti. The 970ti probably did exist, at least in concept. The production cycle of many things can change a lot. I still think that Nvidia probably entertained the notion of building more VRAM into these cards, but in the end they must have felt they had no reason. But we know they CAN do 16 GB if they want to.

    I very much believe that the RTX 2070 will be the worst-received card of the bunch. The x70 has always been the more mainstream version of the series. Like I said above, the 970 was a huge seller, and the 1070 was a big seller, too. The x60 and x70 generally take the top spots. But this time the 2070 is a bit of a lame duck. The price is just too high. Early adopters and tech heads will buy the 2080 and 2080ti. But the 2070 costs too much and offers too little. The card will not even come in at $500, as its Founder's Edition is a striking $600. The 2070 has a fraction of the ray tracing and Tensor power, making those features almost useless. So the card will have to work on CUDA alone. The limited Tensor might help some with DLSS mode, and might be the ONLY saving grace for the card. However, the last-gen 1080ti will destroy the 2070 and be a much better buy. The 1080ti is matching the 2080 in benchmarks, so the 2070 would have to be far under those marks. So why would anybody buy the 2070 over the 1080ti? In the last generation, the 1070 was a killer, beating the previous Titan and 980ti. But the new 2070 will not match the previous x80ti in performance while costing so much more. Yikes.

    The GTX 2060 will be the one mainstream card released. Who knows what its price will be, but I think it will be much fairer. Without the heavy RT cores to weigh it down, the 2060 will be much cheaper to produce. I do think the 2060 will have some Tensor cores, though, maybe. DLSS looks to be a good feature, so keeping this for the 2060 would really push for DLSS adoption. I don't think anything below the 2060 will have any Tensor, though.

    It's going to be weird, because there is this strange gap between the 2060 and 2070, with price being the biggest part of it.

    The other thing with mainstream cards is that this is where AMD can actually compete. So prices in this segment will be more competitive. Actually, the door is wide open for AMD to attack right now if they are capable. Nvidia's GPUs are heavy, bloated and expensive. AMD could sweep in and take the performance crown right out of Nvidia's hands, though this is likely why Nvidia held back on releasing the full Turing chip.

    How could AMD do this? If they can make the jump to 7nm, they can make a much smaller chip. They can dedicate the full die to streaming processors, and not do any wacky tensor or RT type thing. Tensor and RT take up a ton of space on Turing. The actual CUDA portion of the die probably is not that much bigger than Pascal. So Nvidia has this huge chip but only about half of it is "old fashioned" CUDA, if even that. AMD could make a chip that is smaller than Turing, yet still offer a much larger streaming processor count, enough to make up the difference and then some. That is how AMD can get back into this. But can they really do this??? I am not sure. Vega was pretty sad, to be frank. A year late and only barely competitive with Pascal. They didn't even touch the 1080ti. I honestly do hope AMD can pull something off, because the market needs competition. Maybe Intel can make some noise when they release a GPU. But grrr, I so loathe Intel. Not that any company is innocent, mind you. But the market must have competition.

  • Has anyone tried to actually render with one of these new cards yet? I bought a 2080 Founders the day of release and tried to render out some of the ready-to-render content Daz has, and it won't render at all. I just get a black image. Where can I find more info on this? So far this post is the only thing I've been able to find.

  • outrider42 Posts: 3,679
    branonig said:

    Has anyone tried to actually render with one of these new cards yet? I bought a 2080 Founders the day of release and tried to render out some of the ready-to-render content Daz has, and it won't render at all. I just get a black image. Where can I find more info on this? So far this post is the only thing I've been able to find.

    Yes, there are posts in this very thread. You need to use the Daz BETA. To install the beta, you must use DIM, and you must click the box in DIM that shows hidden items. For a small guide, look here.

    https://www.daz3d.com/daz-studio-beta

  • ebergerly Posts: 3,255
    edited September 2018

     

    The answer to your question comes down to this: Which one of the men works for Nvidia? Another thing, Linus did not even test the RTX gaming cards in his video. He used Quadro on gaming. Linus tends to focus a lot more on gaming. He might touch on CAD and Quadro once in a while, but when he speaks about performance, he almost always speaks exclusively to gaming.

    Thanks, but my question was asking if anyone knows the technical, hardware/software level details about how the RTX NVLink/SLI bridge connector can possibly do VRAM stacking for a rendering environment, considering raytracing presumably would need a bi-directional, high speed, mesh-type connection, not a one-way master-slave configuration. Like I said, if half of my scene elements are in one GPU's VRAM, and the other half in the other GPU, they're certainly gonna need a two-way high speed link. For example, if a scene light is residing on one GPU, its rays will be bouncing to scene elements on both GPU's and require high speed, two-way communication.  

    Also, I think you might be a bit mistaken in your understanding of some of the technical issues, like "dual channel" and "single channel" links and their impacts on all of this, but I'll leave that for another discussion. 

     

    Post edited by ebergerly on
  • From the nVidia whitepapers on Turing:

    Turing TU102 and TU104 GPUs use NVLink instead of the MIO and PCIe interfaces for SLI GPU-to-GPU data transfers. The Turing TU102 GPU includes two x8 second-generation NVLink links, and Turing TU104 includes one x8 second-generation NVLink link. Each link provides 25 GB/sec peak bandwidth per direction between two GPUs (50 GB/sec bidirectional bandwidth). Two links in TU102 provides 50 GB/sec in each direction, or 100 GB/sec bidirectionally. Two-way SLI is supported with Turing GPUs that have NVLink, but 3-way and 4-way SLI configurations are not supported.

    (WP-09183-001_v01 | 23)

    Turing TU102 and TU104 GPUs incorporate NVIDIA’s NVLink™ high-speed interconnect to provide dependable, high bandwidth and low latency connectivity between pairs of Turing GPUs. With up to 100GB/sec of bidirectional bandwidth, NVLink makes it possible for customized workloads to efficiently split across two GPUs and share memory capacity. For gaming workloads, NVLink’s increased bandwidth and dedicated inter-GPU channel enables new possibilities for SLI, such as new modes or higher resolution display configurations. For large memory workloads, including professional ray tracing applications, scene data can be split across the frame buffer of both GPUs, offering up to 96GB of shared frame buffer memory (two 48GB Quadro RTX 8000 GPUs), and memory requests are automatically routed by hardware to the correct GPU based on the location of the memory allocation.

    (WP-09183-001_v01 | 6)

    What's interesting to note here is that nVidia uses "Professional" and "Quadro" in the same sentence... no mention of 20x0.

    Based on the market the 20x0 resides in, it doesn't make sense for nVidia to enable memory pooling, regardless of what some salesman might have said. There just isn't any need for that tech in gaming. It definitely would encroach on the Quadro market.

    As @outrider42 said...

    Didn't you say the same thing about Nvidia giving gaming cards ray tracing and tensor? I also recall you saying that Tensor cores were strictly for scientists and AI research and that non researchers had no use for them. You were unable to think forward about how Tensor could impact gaming, when it should have been clear that denoising could be applied to gaming just like I said it could.

    If Nvidia has any aspirations of keeping Iray relevant then they have to do this to compete with the renderers that do. If Octane and vray do this and Iray does not, Iray will truly die. And considering what they've already done so far, I tend to believe they likely will. This does not in any way encroach on Quadro. Quadro does not enhance Iray in any way whatsoever, the one and ONLY reason anyone would buy Quadro for Iray is purely VRAM. And besides, its not like the 2080ti is exactly cheap. You are still talking about a $2500 investment. And I'll say it again, if Nvidia really wanted to prevent this, all they had to do was lock VRAM stacking out at the hardware level to begin with. They did not.

     

    If you could possibly stack on 20x0 cards, wouldn't that hurt sales in the Quadro market (specifically the 5000 and 6000)? Double up the core capabilities, and you have surpassed the 5000/6000. Where would the incentive be to spend $6k on a Quadro RTX 6000 when you only lose 2 GB at $2500 (2080ti w/bridge) and have significantly better multicore performance?

    Omen

  • drzap Posts: 795
    edited September 2018
    7thOmen said:

    From the nVidia whitepapers on Turing:

    Turing TU102 and TU104 GPUs use NVLink instead of the MIO and PCIe interfaces for SLI GPU-to-GPU data transfers. The Turing TU102 GPU includes two x8 second-generation NVLink links, and Turing TU104 includes one x8 second-generation NVLink link. Each link provides 25 GB/sec peak bandwidth per direction between two GPUs (50 GB/sec bidirectional bandwidth). Two links in TU102 provides 50 GB/sec in each direction, or 100 GB/sec bidirectionally. Two-way SLI is supported with Turing GPUs that have NVLink, but 3-way and 4-way SLI configurations are not supported.

    (WP-09183-001_v01 | 23)

    Turing TU102 and TU104 GPUs incorporate NVIDIA’s NVLink™ high-speed interconnect to provide dependable, high bandwidth and low latency connectivity between pairs of Turing GPUs. With up to 100GB/sec of bidirectional bandwidth, NVLink makes it possible for customized workloads to efficiently split across two GPUs and share memory capacity. For gaming workloads, NVLink’s increased bandwidth and dedicated inter-GPU channel enables new possibilities for SLI, such as new modes or higher resolution display configurations. For large memory workloads, including professional ray tracing applications, scene data can be split across the frame buffer of both GPUs, offering up to 96GB of shared frame buffer memory (two 48GB Quadro RTX 8000 GPUs), and memory requests are automatically routed by hardware to the correct GPU based on the location of the memory allocation.

    (WP-09183-001_v01 | 6)

    What's interesting to note here is that nVidia uses "Professional" and "Quadro" in the same sentence... no mention of 20x0.

    Based on the market the 20x0 resides in, it doesn't make sense for nVidia to enable memory pooling, regardless of what some salesman might have said. There just isn't any need for that tech in gaming. It definitely would encroach on the Quadro market.

    As @outrider42 said...

    Didn't you say the same thing about Nvidia giving gaming cards ray tracing and tensor? I also recall you saying that Tensor cores were strictly for scientists and AI research and that non researchers had no use for them. You were unable to think forward about how Tensor could impact gaming, when it should have been clear that denoising could be applied to gaming just like I said it could.

    If Nvidia has any aspirations of keeping Iray relevant then they have to do this to compete with the renderers that do. If Octane and vray do this and Iray does not, Iray will truly die. And considering what they've already done so far, I tend to believe they likely will. This does not in any way encroach on Quadro. Quadro does not enhance Iray in any way whatsoever, the one and ONLY reason anyone would buy Quadro for Iray is purely VRAM. And besides, its not like the 2080ti is exactly cheap. You are still talking about a $2500 investment. And I'll say it again, if Nvidia really wanted to prevent this, all they had to do was lock VRAM stacking out at the hardware level to begin with. They did not.

     

    If you could possibly stack on 20x0 cards, wouldn't that hurt sales in the Quadro market (specifically the 5000 and 6000)? Double up the core capabilities, and you have surpassed the 5000/6000. Where would the incentive be to spend $6k on a Quadro RTX 6000 when you only lose 2 GB at $2500 (2080ti w/bridge) and have significantly better multicore performance?

    Omen

    This makes perfectly good sense, and I doubt you'll see any consumer application of memory stacking from Nvidia for the very reason you stated.  So that probably means no stacking for iRay.  However, it seems evident that it can be jury-rigged to work, and I am anxious to see what the folks at Vray come up with.

    Post edited by drzap on
  • ebergerly Posts: 3,255
    edited September 2018

    I JUST WROTE A VRAM STACKING APPLICATION !!!

    Yes, you heard it right. I wrote some very simple code (around 20 lines or so) that technically stacks VRAM. Well, depending on what you mean by VRAM stacking...

    At the most basic level, you can define VRAM stacking as using two or more GPU's to accomplish multiple parts of a single task. So say I have a huge, monster array that takes 22GB of VRAM. If I can break that up into two 11GB matrices that are totally independent, I can send one of the sub-matrices to one GPU and the other to the second GPU, and have them work on those matrices totally independently. In other words they don't need to communicate between the GPU's, but each GPU receives its portion of the work from the CPU/RAM, and when it's done sends the results back to the CPU/RAM.

    Which means you don't really need a fancy NVLink to technically stack VRAM, you just need a task that can be broken down into two independent parts. In fact you don't even need SLI. And you could do this type of VRAM stacking many years ago in CUDA. And in fact it's very simple. Just select a GPU, send it your sub-array (or whatever), then select the other GPU and send it the other sub-array, and you've just stacked VRAM. Not a whole lot different from doing "alternate frame rendering" with SLI for video games. 

    But just don't expect the tasks to collaborate very much, because that takes forever over the PCI bus. 

    So as with anything dealing with computers, the devil is in the details. The difficulty comes when you need to have two or more GPU's work on a single task that requires internal communication within that task. Like rendering a big 'ol scene spread across two GPU's.     
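    For anyone curious what that looks like in code, here is a minimal CUDA C++ sketch of the same idea (not ebergerly's actual C# program): split one large array across two GPUs, let each work on its half independently, and copy the results back over PCIe. The kernel, sizes, and two-device assumption are illustrative only:

```cpp
// Minimal CUDA C++ sketch of splitting one large array across two GPUs that work
// independently: each card gets half the data, runs its own kernel, and copies its
// half back. No GPU-to-GPU communication, no SLI/NVLink required.
// The kernel, sizes, and two-device assumption are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scaleKernel(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // each GPU only ever touches its own half
}

int main() {
    const size_t n = 1 << 26;       // total elements (~256 MB of floats), split in half
    const size_t half = n / 2;
    std::vector<float> host(n, 1.0f);
    float *dev[2] = {nullptr, nullptr};

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);                                     // select GPU 0 or GPU 1
        cudaMalloc(&dev[g], half * sizeof(float));            // half the "scene" per card
        cudaMemcpy(dev[g], host.data() + g * half,
                   half * sizeof(float), cudaMemcpyHostToDevice);
        scaleKernel<<<(unsigned)((half + 255) / 256), 256>>>(dev[g], half, 2.0f);
    }
    for (int g = 0; g < 2; ++g) {                             // gather results back to RAM
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaMemcpy(host.data() + g * half, dev[g],
                   half * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev[g]);
    }
    printf("host[0] = %.1f, host[n-1] = %.1f\n", host[0], host[n - 1]);  // expect 2.0, 2.0
    return 0;
}
```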

    Post edited by ebergerly on
  • drzap Posts: 795
    edited September 2018
    ebergerly said:

    I JUST WROTE A VRAM STACKING APPLICATION !!!

    Yes, you heard it right. I wrote some very simple code (around 20 lines or so) that technically stacks VRAM. Well, depending on what you mean by VRAM stacking...

    At the most basic level, you can define VRAM stacking as using two or more GPU's to accomplish multiple parts of a single task. So say I have a huge, monster array that takes 22GB of VRAM. If I can break that up into two 11GB matrices that are totally independent, I can send one of the sub-matrices to one GPU and the other to the second GPU, and have them work on those matrices totally independently. In other words they don't need to communicate between the GPU's, but each GPU receives its portion of the work from the CPU/RAM, and when it's done sends the results back to the CPU/RAM.

    Which means you don't really need a fancy NVLink to technically stack VRAM, you just need a task that can be broken down into two independent parts. In fact you don't even need SLI. And you could do this type of VRAM stacking many years ago in CUDA. And in fact it's very simple. Just select a GPU, send it your sub-array (or whatever), then select the other GPU and send it the other sub-array, and you've just stacked VRAM. Not a whole lot different from doing "alternate frame rendering" with SLI for video games. 

    But just don't expect the tasks to collaborate very much, because that takes forever over the PCI bus. 

    So as with anything dealing with computers, the devil is in the details. The difficulty comes when you need to have two or more GPU's work on a single task that requires internal communication within that task.     

    Congratulations.  You just invented renderfarm software.  I use Deadline.  It's a lifesaver.

    Post edited by drzap on
  • ebergerly Posts: 3,255
    drzap said:

    Congratulations.  You just invented renderfarm software.  I use Deadline.  It's a lifesaver.

    While I certainly appreciate the somewhat dismissive sarcasm, I hope this can provide a little insight for those who are trying to figure out what's really involved with technologies like VRAM stacking. This stuff really is incredibly complex, and if all we see are some 10-minute videos and vendor headlines, it's tough to really understand what the technology is all about. And in many cases, unless you get down to the detailed code level, all this new hardware is nothing more than some guy standing on stage touting tech-speak marketing terms that don't really mean much.

      

     

  • 7thOmen Posts: 47
    edited September 2018
    ebergerly said:

    While I certainly appreciate the somewhat dismissive sarcasm, I hope this can provide a little insight for those who are trying to figure out what's really involved with technologies like VRAM stacking. This stuff really is incredibly complex, and if all we see are some 10-minute videos and vendor headlines, it's tough to really understand what the technology is all about. And in many cases, unless you get down to the detailed code level, all this new hardware is nothing more than some guy standing on stage touting tech-speak marketing terms that don't really mean much.

    That seems to be echoing all around the interwebs these days...

     

    Although I haven't seen the actual code, it appears what you are describing in your experiment is more like parallel processing rather than memory stacking. Truth be told, I applaud and admire your endeavors. It could be quite interesting if the answer to the million-dollar memory pooling question was discovered in this little niche of the web, especially since nVidia doesn't seem to care to weigh in on the matter.

    Keep up the good work!

    Omen

    *edit for structure clarity

    Post edited by 7thOmen on
  • ebergerly Posts: 3,255
    edited September 2018
    7thOmen said:

    Although I haven't seen the actual code, it appears what you are describing in your experiment is more like parallel processing rather than memory stacking. Truth be told, I applaud and admire your endeavors.

    What I did was certainly not an accomplishment by any stretch, nor was that the point of my post. Like I say, you just assign a matrix to one GPU, and another matrix to a second GPU. A few lines of code and you're done. I'd be happy to post my code if there are any C# programmers out there who are interested, although I'm sure you can also check the CUDA toolkit online and find some C/C++ samples that do something similar.

    And my point was that it's very simple to do something that's been done for many years, and that is to get multiple GPU's to break a single big task into separate tasks. But it highlights what I think is an important point. While it can technically be considered memory stacking (since it effectively utilizes the sum of the two GPUs' VRAM to accomplish what can be viewed as a single task), folks need to understand that true VRAM stacking for our purposes involves high speed, bi-directional communication between GPU's since there is a lot of interaction between GPU's needed. I've seen all over the web where that very important point gets lost in the discussion, and leads some to believe that you can do it with just a uni-directional SLI link. 

    Now that's not to say that some GPU experts can't figure out how to do true VRAM stacking using only the RTX version (ie, high speed, one-way SLI connector), but it seems clear that if you consider the low level details like I mentioned, it's hard to understand how they can manage the necessary collaboration that's required for rendering a single scene across GPU's. Maybe they can, but you'd think that at least you'd take a significant speed hit due to the "neutered" connector.    

    Post edited by ebergerly on
  • ebergerly said:

    I JUST WROTE A VRAM STACKING APPLICATION !!!

    Yes, you heard it right. I wrote some very simple code (around 20 lines or so) that technically stacks VRAM. Well, depending on what you mean by VRAM stacking...

    At the most basic level, you can define VRAM stacking as using two or more GPU's to accomplish multiple parts of a single task. So say I have a huge, monster array that takes 22GB of VRAM. If I can break that up into two 11GB matrices that are totally independent, I can send one of the sub-matrices to one GPU and the other to the second GPU, and have them work on those matrices totally independently. In other words they don't need to communicate between the GPU's, but each GPU receives its portion of the work from the CPU/RAM, and when it's done sends the results back to the CPU/RAM.

    Which means you don't really need a fancy NVLink to technically stack VRAM, you just need a task that can be broken down into two independent parts. In fact you don't even need SLI. And you could do this type of VRAM stacking many years ago in CUDA. And in fact it's very simple. Just select a GPU, send it your sub-array (or whatever), then select the other GPU and send it the other sub-array, and you've just stacked VRAM. Not a whole lot different from doing "alternate frame rendering" with SLI for video games. 

    But just don't expect the tasks to collaborate very much, because that takes forever over the PCI bus. 

    So as with anything dealing with computers, the devil is in the details. The difficulty comes when you need to have two or more GPU's work on a single task that requires internal communication within that task. Like rendering a big 'ol scene spread across two GPU's.     

    Congrats, but sorry, I don't think you did it. You just divided a task requiring lots of memory into two independent tasks running in parallel that have nothing to do with each other. If the two don't communicate, that's useless. Remember, you said it yourself: what if you have a photon that needs to bounce to a mesh that is in the other card's memory?

    If you don't have true memory stacking, at least you need to make the GPUs communicate. If I were you, I'd have a look at CUDA-aware MPI https://devblogs.nvidia.com/introduction-cuda-aware-mpi/ if you still want to play.

    And over PCIe that must be slow. It's a pity you don't have two cards in NVLink to test, but if you send jonascs a little app that tests P2P communication, he can check whether P2P is enabled by default on RTX cards (I doubt it, but who knows).
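    For anyone with two NVLinked RTX cards who wants to run that test, here is a minimal CUDA C++ sketch of what such a check might look like, using only standard CUDA runtime calls (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, cudaMemcpyPeer). The buffer size and the assumption of exactly two GPUs are arbitrary:

```cpp
// Minimal CUDA C++ sketch of the P2P test suggested above: ask the runtime whether
// GPU 0 can directly access GPU 1's memory, enable peer access, and time a
// device-to-device copy. The 256 MB buffer and the two-GPU assumption are arbitrary.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 map GPU 1's memory?
    printf("P2P 0 -> 1 supported: %s\n", canAccess ? "yes" : "no");
    if (!canAccess) return 0;

    const size_t bytes = 256u << 20;             // 256 MB test buffer
    float *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);            // enable 0 -> 1 (flags argument must be 0)
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaDeviceEnablePeerAccess(0, 0);            // and 1 -> 0

    cudaSetDevice(0);
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);         // direct GPU-to-GPU copy
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();
    // Without peer access the copy should still work, but it gets staged through
    // system RAM, so the measured bandwidth tells you whether the link is used.
    printf("256 MB peer copy: %.3f s (~%.1f GB/s)\n", s, bytes / 1e9 / s);

    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    return 0;
}
```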

     

    While searching for info about Octane and V-Ray succeeding at VRAM pooling, I found something interesting with lots of details: https://www.reddit.com/r/nvidia/comments/9jfjen/10_gigarays_translate_to_32_gigarays_in_real/

    I'll quote a few extracts about Otoy's Octane:

    The important point, it is linked under Vulkan API. This is when OTOY will leverage the cores fully in release 2019.1 after they finished porting the whole render engine over to Vulkan as OptiX does not give the same granularity of control like Vulkan and DXR does.

    This is the reason we see "only" an 80% improvement for OctaneBench 5 in the Forbes article: Preliminary use of OptiX for the release 2018.1 and the lack of granular control of Vulkan or DXR.

    Seems like Otoy is going full Vulkan, and the question would be whether they achieved the VRAM pooling with Vulkan instead of Optix. Optix/Cuda seems to be a limiting factor.

    @Outrider: Nvidia never needed to limit anything at the hardware level. They've done it at the software/driver level. As long as you need to use Cuda, they can lock you down. Now with DXR and Vulkan, that may be another story.

    And about AMD, their cards have very little interest outside of gaming. Their software ecosystem is miles behind Nvidia's in HPC. We can hope for a price drop, but AMD must be very good this time as there is an opening.

     

    BTW, I may have found an explanation as to why Win 10 reserves VRAM on GFX cards that don't have a monitor plugged in. I'll post that in the other thread.

  • kyoto kid Posts: 40,627

    ...but isn't Vulkan supported by AMD not Nvidia?

  • ebergerly Posts: 3,255
    edited September 2018

    Takeo.Kensei, I think you missed the point of my post on the VRAM pooling, though I thought I made it very clear. And since I don't have any 20-series cards with NVLink, it's not even possible to develop and test actual VRAM stacking code with my 1080ti + 1070, so I'm not sure what point you were making.  

    Anyway....on the other hand, thanks for the info on the reddit post. If  the source is reliable, then it may actually have some interesting and pertinent info on VRAM stacking (I've highlighted what I think are the important parts):

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------

    "OTOY confirmed in a forum the RTX 20-series also can pool its memory. This means 2 x 11 GB = 22 GB VRAM with NVLink, on consumer cards.

    The VRAM limitation is the major advantage of CPU rendering which can access much larger amounts of RAM. For ray tracing you need the entire scene accessible in memory. Unlike rasterizing where you can load one batch of polygons after the other, when ray tracing the secondary rays can hit any part of the scene and if this geometry is not in the VRAM it needs to be loaded over the PCIe bus with low bandwidth and high latency, killing the speed advantage GPU rendering has over CPU rendering. On the professional end with Quadro cards, being able to pool now 2 x 48 GB = 96 GB VRAM is enough to hold geometry and textures of really large cinematic productions.

    ​There is a speed hit to be expected. A single x8 NVLink channel provides 25 GB/s peak bandwidth in and 25 GB/s out. There are two x8 links on the TU102 GPU and a single x8 link on the Turing TU104 GPU.

    For the 2080 Ti this equals 1/6th of GDDR6 Memory Bandwidth of 616 GB/s or half the NVLink Bandwidth of the GV100 or one third of the V100.

    ​OTOY is testing how big the impact on speed will be. VRAM memory pooling will be at least better performing compared to loading "out-of-core" geometry and textures from low bandwidth connected RAM, and it will perform better on a 2080 Ti than on a 2080 due to two NVLink's on the TU102."

    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    So apparently they found a way to stack VRAM on the 20-series cards, but as I expected there's a speed hit associated with it. So yeah, you can use the sum of the VRAM's of the two GPU's, but your render time will suffer. It will be interesting to see the test results from Otoy, but my gut tells me it will be a substantial hit. So then users will have to decide whether the cost of two 20-series cards is worth the benefits of stacked VRAM but with increased render times compared to the entire scene residing on each GPU.   

    Post edited by ebergerly on
  • kyoto kid said:

    ...but isn't Vulkan supported by AMD not Nvidia?

     

    Vulkan is what replaces OpenGL, and is still the same open source Khronos Group "entity". The ray tracing support is available in Vulkan, DirectX, and Nvidia's API.

    Based on what I've read, Nvidia is focusing on getting as many developers as possible to use RTX cores.  In fact they stopped development of Mental Ray to get more people on AI and GPU acceleration.  So instead of offering Mental Ray, they work with renderers like Octane, V-Ray, Arnold, etc. to get RTX support.

    https://twitter.com/mentalray?lang=en

    Pixar is looking to shift into more GPU because of what RTX brings.

    https://renderman.pixar.com/news/renderman-xpu-development-update

     

     

  • kyoto kid Posts: 40,627
    edited September 2018

    ...so basically, that means if I want to perform GPU based rendering under Vulkan, I need an AMD GPU card as CUDA is locked out.

    So much for Iray.

    Post edited by kyoto kid on
  • outrider42 Posts: 3,679
    edited September 2018
    Who is buying Quadro to render with Iray to begin with? You need to stop and think about who the customer base is here before saying this would hurt Quadro. If you are buying $6000 Quadros for Iray, then you are probably not going to step down to a 2080ti anyway. Instead of two 2080tis for 22 GB, you would just buy two Quadros to get 48 GB. Plus the NVLink works right out of the box, without needing your favorite software to figure it out. Not every software package is going to support this.

    Somebody has to buy these $1200 cards. Gamers are not that excited about them solely because of the price. It makes sense for Nvidia to push these for more prosumer markets to sell these things. Quadro is not under attack. And just think, Nvidia just totally blew away the $3000 Titan V, which is not even 10 months old yet. Plus you still cannot stack memory in games.

    And yeah, nvlink is 1/6 the speed of the VRAM in Turing, but that's comparing it to Turing. I seem to recall the 1080 and 1070 doing just fine with VRAM that was HALF that speed. The 1080ti can beat the 2080 in some games and benchmarks. So it's not exactly getting hurt by its "slow" VRAM speed. I already linked a good video on how VRAM speed can affect performance. But keep in mind that in that video, the VRAM of the terrible 1030 card was slower than the nvlink. In fact, the nvlink is almost twice as fast as the DDR4 memory the bad 1030 used. That would make quite a bit of difference, don't you agree? Nvlink is almost as fast as the memory that the GTX 1050 uses. That may not be the fastest VRAM out there, but it performs its function well enough.

    And that was for gaming, which is FAR more reliant on VRAM speed than any rendering engine like Iray is. So perhaps stacking VRAM might cause the render to run a bit slower than the standard method...but I do not believe it will be all that drastic. And it will certainly be faster than trying to render a scene that would exceed the memory of a single card.

    Nvlink has SLI mode for compatibility with existing SLI games. No game at this time actually uses the nvlink to its full advantage, just like no game uses ray tracing at this time. And speaking of, I just had a thought. Using SLI slows down Iray, and I think the amount of slowdown you get trying to run Iray with SLI will be somewhat similar to whatever slowdown you may see with a stacked VRAM nvlink mode.

    The whole thing is a bit of a mess. Stop to think about this. So Nvidia reinvented how SLI works, but at a time when SLI use has dropped off greatly. Moreover, because Nvidia has made nvlink exclusive to its top tier cards, this will greatly limit the adoption rate of nvlink. How many gamers are going to spend $2500 for only a few games that do support SLI/nvlink??? Many games right now do not support SLI at all, and are actually negatively impacted when you try SLI. I'm serious here, most games do not use SLI, only a handful do. So who is really buying this stuff?

    So again, that leads back to the prosumer market. The "would love to buy a Quadro but can only dream about it" market. Because once again, how many of you reading this own a Quadro? And I'm not talking about old ebayed Quadros from years ago, I'm talking about recent ones in the $6000 and up range. I know a couple of you do, but rest assured that those are an extreme minority. Quadro is a completely different market.

    Vulkan is not exclusive to AMD or anybody. I can play Doom (2016) on Vulkan. It's just another API. It's more like the "Linux of APIs".
    Post edited by outrider42 on
  •  

     

    ebergerly said:

    Takeo.Kensei, I think you missed the point of my post on the VRAM pooling, though I thought I made it very clear. And since I don't have any 20-series cards with NVLink, it's not even possible to develop and test actual VRAM stacking code with my 1080ti + 1070, so I'm not sure what point you were making. 

    Don't think I missed anything. And to sum up what I'm saying:

    - It takes more than dividing a task into two independent ones to make RAM stacking. What you've done isn't VRAM stacking.

    - If you still want to dabble with code, even if you can't test it yourself, somebody else with appropriate hardware could test it for you. A good test would be to find out whether P2P transfer is enabled by default when you have NVLink.

    - Seeing Otoy go full Vulkan raises some questions about Optix. I'm further inclined to think there won't be RAM pooling with Iray.

     

    kyoto kid said:

    ...so basically, that means if I want to perform GPU based rendering under Vulkan, I need an AMD GPU card as CUDA is locked out.

    So much for Iray.

    No, you just need other software that is not Cuda/Optix based. Nvidia hardware can use DX12 coupled with DXR, or Vulkan, or Cuda, or OpenCL for rendering.

    I'm happy to see Octane going Vulkan, because that means there is still hope for open standards that are not hardware locked. OpenCL was pretty much dead because of Nvidia staying at v1.2.

    Vulkan can use any CPU and any GPU.

    However, I can't see AMD really competing with Nvidia in the VRAM pooling area. There are rumors of something similar to NVLink, but that would be reserved for Pro applications: https://www.notebookcheck.net/AMD-Radeon-Vega-20-GPUs-will-get-XGMI-PCIe-4-0-to-compete-with-Nvidia-s-NVLink.329034.0.html

    And adding such a thing to consumer cards would make them more expensive, and I don't think that would be a good idea. But I'd really like a good surprise from their side.

  • Hi.

    I ran the OptiX team at Nvidia for five years. I've since moved to Samsung to design a new GPU, but still keep tabs on Nvidia's ray tracing work. Let me mention a few points that might help this conversation.

    Nvidia's "RTX" ray tracing hardware in the Turing architecture is the real deal. It is real hardware acceleration of ray - acceleration structure intersection and ray - triangle intersection. The performance numbers quoted by Jensen of 10 billion rays per second are obviously marketing numbers, but in OptiX (which was written on top of CUDA at the time) 300 million rays per second was a good marketing number. So the speedup is very real.

    RTX is only being exposed through ray tracing APIs. It is not accessible through Compute APIs like CUDA, OpenCL, or Vulkan Compute. It is/will be exposed only through Microsoft DXR, Nvidia OptiX, and a Vulkan ray tracing extension. Nvidia can make their own Vulkan extension or the Vulkan Ray Tracing Technical Subgroup (of which I am a member) may define a cross-platform ray tracing extension.

    Thus, all existing renderers that want hardware acceleration will have to use one of these three APIs. Most renderers are / will be using OptiX - Pixar, Arnold, Vray, Clarisse, Red Shift, and others.

    Another thing to note is that OptiX Prime will not be accelerated on RTX; only OptiX will. This is significant because Iray uses OptiX Prime, not OptiX. Thus, Iray does not get RTX ray tracing acceleration on my RTX 2080, and something big will have to change before it does get accelerated. I don't know whether Nvidia will port Iray to OptiX, which would be a big effort, or whether that will be done by MI Genius or Lightwork Design, or it will just not happen. Another possibility is for some party to implement OptiX Prime on top of OptiX to get access to RTX hardware acceleration.

    If I were DAZ I would be privately freaking out about my renderer being left in the dust because it's not hardware accelerated, even though the same company made the renderer and the hardware acceleration.

    -Dave

  • bluejaunte Posts: 1,861
    davemc0 said:

    Hi.

    I ran the OptiX team at Nvidia for five years. I've since moved to Samsung to design a new GPU, but still keep tabs on Nvidia's ray tracing work. Let me mention a few points that might help this conversation.

    Nvidia's "RTX" ray tracing hardware in the Turing architecture is the real deal. It is real hardware acceleration of ray - acceleration structure intersection and ray - triangle intersection. The performance numbers quoted by Jensen of 10 billion rays per second are obviously marketing numbers, but in OptiX (which was written on top of CUDA at the time) 300 million rays per second was a good marketing number. So the speedup is very real.

    RTX is only being exposed through ray tracing APIs. It is not accessible through Compute APIs like CUDA, OpenCL, or Vulkan Compute. It is/will be exposed only through Microsoft DXR, Nvidia OptiX, and a Vulkan ray tracing extension. Nvidia can make their own Vulkan extension or the Vulkan Ray Tracing Technical Subgroup (of which I am a member) may define a cross-platform ray tracing extension.

    Thus, all existing renderers that want hardware acceleration will have to use one of these three APIs. Most renderers are / will be using OptiX - Pixar, Arnold, Vray, Clarisse, Red Shift, and others.

    Another thing to note is that OptiX Prime will not be accelerated on RTX; only OptiX will. This is significant because Iray uses OptiX Prime, not OptiX. Thus, Iray does not get RTX ray tracing acceleration on my RTX 2080, and something big will have to change before it does get accelerated. I don't know whether Nvidia will port Iray to OptiX, which would be a big effort, or whether that will be done by MI Genius or Lightwork Design, or it will just not happen. Another possibility is for some party to implement OptiX Prime on top of OptiX to get access to RTX hardware acceleration.

    If I were DAZ I would be privately freaking out about my renderer being left in the dust because it's not hardware accelerated, even though the same company made the renderer and the hardware acceleration.

    -Dave

    Thanks so much for commenting. That is somewhat sobering. What is the difference between OptiX Prime and OptiX?

  • ebergerly Posts: 3,255
    edited September 2018

    Wow, thanks very very much for the input davemc0. I never realized there was a difference between OptiX and Optix Prime. I just assumed that Iray and Optix were kind of synonymous, and Optix was just integrated into Iray along with CUDA on a lower level since it's all NVIDIA. 

    It's mind blowing to think that RTX is not accessible via CUDA, and that existing RTX 20-series owners will have to be content with the present (although significant) speedup, which seems to match dual 1080ti's in Iray? I guess that's why NVIDIA was never forthcoming with Iray benchmarks.

    My brain just exploded.

    Thanks again for the input.   

    Post edited by ebergerly on
  • nicstt Posts: 11,714

    Indeed.

    I am pleased to hear my caution is warranted; sure, there is a very decent-seeming increase in speed. But to purchase based on both the current increase and an expected future increase, when the actuality might be different, would be disappointing. They are over-priced imo.

  • ebergerly Posts: 3,255
    nicstt said:

    I am pleased to hear my caution is warranted;

    Yeah, sometimes being negative is a positive.

    Anyway, I guess I just assumed that in the world of NVIDIA GPU's, basically EVERYTHING was written using CUDA. Guess I have some homework to do...figure out exactly what Optix is. 

    Wow. 

  • ZarconDeeGrissom Posts: 5,412
    edited September 2018

    lol, I jumped to the last page, saw the 'GN' GT 1030 disgrace-edition vid, and knew this is a cool place.

    [All statements are the opinion of ZDG, who does not work for or represent Daz Studio or Gamers Nexus]

    My thoughts on Iray via CUDA, vs Iray on RT and Tensor cores, lol.

    I feel that Iray via CUDA is going to happen; Iray with RT and Tensor performance boosts I'm not too sure of. RT cores can be very handy, the code just needs to be written to use them. Tensor, from what I've read elsewhere, tends to be more fuzzy-math based (not great for calculating exact numbers). As I had written elsewhere, it 'only' took over three months for GTX 10 cards to get Iray support going when they launched. Conversely, it 'ONLY' took the Adobe Premiere dev team over ten years to add Intel iGPU acceleration to video encoding (for an iGPU that Intel CPUs have had since the Core Duo days, lol). Judging by that, RT and Tensor acceleration for Iray may take a few months to work out, or it may take a corporate-level market share threat to Nvidia to convince them it would be good to do. Don't get me wrong, I would love Iray to be better, I just have doubts AMD will ever pose that kind of market share threat to Nvidia to make them want to do it.

    The Turing CUDA/SM units do have a nifty trick up their sleeves that alone may explain some performance boosts. Older CUDA cores could only do an FP or INT op per cycle; the new Turing CUDA cores can do FP and INT at the same time, sort of like Hyperthreading. I do not know yet whether that simultaneous CUDA execution will need new code to take advantage of it, or whether it will work fine with old code. My opinion on price aside, it may be a good GPU, if the code is worked out for it.

    I still feel the loss of memory at the same price bracket is a major hit to RTX card value. It's easy to fill 4 GB with Iray, and 8 GB is not all that great either. I feel in that way Nvidia has been a major letdown for what Iray offers. It's great if a 32 GB Tesla V100 is a business write-off, not so much if you have to pay for it out of pocket. And more affordable cards often lack memory, like anything below the GTX 1050 Ti for that matter (including the 3 GB GTX 1060, lol). If you're on a budget and need Iray-capable cards with a minimum of 4 GB, the pickings are slim. There is the 4 GB GK208 GT 730 (the 384 CUDA core card, not the variant with 96 CUDA cores), which is kind of slow, then there is the 4 GB GTX 1050 Ti, or used 4 GB GTX 960s; the rest gets pricey fast.

    Post edited by ZarconDeeGrissom on
  • kyoto kid Posts: 40,627

     

     

    ebergerly said:

    Takeo.Kensei, I think you missed the point of my post on the VRAM pooling, though I thought I made it very clear. And since I don't have any 20-series cards with NVLink, it's not even possible to develop and test actual VRAM stacking code with my 1080ti + 1070, so I'm not sure what point you were making. 

    Don't think I missed anything. And to sum up what I'm saying:

    - It takes more than dividing a task into two independent ones to make RAM stacking. What you've done isn't VRAM stacking.

    - If you still want to dabble with code, even if you can't test it yourself, somebody else with appropriate hardware could test it for you. A good test would be to find out whether P2P transfer is enabled by default when you have NVLink.

    - Seeing Otoy go full Vulkan raises some questions about Optix. I'm further inclined to think there won't be RAM pooling with Iray.

     

    kyoto kid said:

    ...so basically, that means if I want to perform GPU based rendering under Vulkan, I need an AMD GPU card as CUDA is locked out.

    So much for Iray.

    No, you just need other software that is not Cuda/Optix based. Nvidia hardware can use DX12 coupled with DXR, or Vulkan, or Cuda, or OpenCL for rendering.

    I'm happy to see Octane going Vulkan, because that means there is still hope for open standards that are not hardware locked. OpenCL was pretty much dead because of Nvidia staying at v1.2.

    Vulkan can use any CPU and any GPU.

    However, I can't see AMD really competing with Nvidia in the VRAM pooling area. There are rumors of something similar to NVLink, but that would be reserved for Pro applications: https://www.notebookcheck.net/AMD-Radeon-Vega-20-GPUs-will-get-XGMI-PCIe-4-0-to-compete-with-Nvidia-s-NVLink.329034.0.html

    And adding such a thing to consumer cards would make them more expensive, and I don't think that would be a good idea. But I'd really like a good surprise from their side.

    ...so Vulkan isn't hardware-specific then?  Interesting.  That would make AMD Vega cards usable in Octane as well.

    For $500 less than a P5000 ($800 less than an RTX 5000) and $300 more than a 2080 Ti, there's the WX9100, offering 16 GB of HBM2 and 4096 stream processors. That could be an optimal choice for someone using Octane.

  • kyoto kid Posts: 40,627
    edited September 2018
    davemc0 said:

    Hi.

    I ran the OptiX team at Nvidia for five years. I've since moved to Samsung to design a new GPU, but still keep tabs on Nvidia's ray tracing work. Let me mention a few points that might help this conversation.

    Nvidia's "RTX" ray tracing hardware in the Turing architecture is the real deal. It is real hardware acceleration of ray - acceleration structure intersection and ray - triangle intersection. The performance numbers quoted by Jensen of 10 billion rays per second are obviously marketing numbers, but in OptiX (which was written on top of CUDA at the time) 300 million rays per second was a good marketing number. So the speedup is very real.

    RTX is only being exposed through ray tracing APIs. It is not accessible through Compute APIs like CUDA, OpenCL, or Vulkan Compute. It is/will be exposed only through Microsoft DXR, Nvidia OptiX, and a Vulkan ray tracing extension. Nvidia can make their own Vulkan extension or the Vulkan Ray Tracing Technical Subgroup (of which I am a member) may define a cross-platform ray tracing extension.

    Thus, all existing renderers that want hardware acceleration will have to use one of these three APIs. Most renderers are / will be using OptiX - Pixar, Arnold, Vray, Clarisse, Red Shift, and others.

    Another thing to note is that OptiX Prime will not be accelerated on RTX; only OptiX will. This is significant because Iray uses OptiX Prime, not OptiX. Thus, Iray does not get RTX ray tracing acceleration on my RTX 2080, and something big will have to change before it does get accelerated. I don't know whether Nvidia will port Iray to OptiX, which would be a big effort, or whether that will be done by MI Genius or Lightwork Design, or it will just not happen. Another possibility is for some party to implement OptiX Prime on top of OptiX to get access to RTX hardware acceleration.

    If I were DAZ I would be privately freaking out about my renderer being left in the dust because it's not hardware accelerated, even though the same company made the renderer and the hardware acceleration.

    -Dave

    ...interesting insight.

    That pretty much made my decision.

    Post edited by kyoto kid on
  • Kevin Sanderson Posts: 1,643
    edited September 2018

    iClone just got Iray this week for Character Creator 3, which now more easily imports DAZ characters. You have to buy the Iray plugin, though. Bet they are not happy, unless they know something we don't. Interestingly, DAZ started running animation-geared ads on YouTube a couple of weeks ago. One (maybe both) looked like an Iray-rendered animation, which is hard to do. Does DAZ know something we don't?

    Post edited by Kevin Sanderson on
  • drzap Posts: 795

    Nvidia's business strategy has been shifting away from supporting their own renderers so they can invest more in the compute and AI space.  This has been shown by their dropping support for Mental Ray and handing a lot of the support for iRay to third parties.  And let's face it, iRay is not among the top or popular renderers in the market, not even close.  Mental Ray was far more successful.  There are iRay plugins for only about 3 or 4 content creation applications, then there's Daz Studio, iClone, and Substance Painter.   That's about it.  Not a great loss from Nvidia's standpoint.  But there is always Octane for DS users.  They decided to drop Optix and use the Vulkan API for RTX access, putting them in position to offer services for both Nvidia and AMD gpu's.  Octane has more than 25 plugins with more planned to come, A big update in the near future, and I doubt users would have reason to worry about them dropping support any time soon.  I'm not a big fan of Octane, but if I were a Daz Studio user, I would be knocking down their door right now.
