General GPU/testing discussion from benchmark thread


Comments

  • DripDrip Posts: 1,237

    The only thing we're really missing from this data is the convergence ratio people reached. Though I expect the numbers to be fairly similar anyway, they shouldn't be identical.

    Another thing I'd find interesting is data about when certain convergence ratios are reached, especially the first 50 percent (so, I'd like to know, for example, at which iteration 10%, 20%, 30%, 40% and 50% convergence was reached). Of course it's just guessing, but I think some more differences between OptiX and non-OptiX might be found there.

  • timon630timon630 Posts: 37

    Can someone test this scene in DS 4.11 and DS 4.12, please?

    I want to see how it goes on other cards, especially non-RTX cards.

    It's basically the SY scene modified with a strand hair asset that is configured to be cached in RAM for rendering (I don't know if that may be a problem).

    Thanks

    https://www.sendspace.com/file/b78062

     

    GTX 1070
    4.11 Total Rendering Time: 4 minutes 21.22 seconds (2400 iterations, 6.852s init, 252.496s render)
    4.11 Total Rendering Time: 4 minutes 28.70 seconds (2385 iterations, 7.715s init, 259.114s render)


    4.12 Total Rendering Time: 4 minutes 20.3 seconds (2434 iterations, 6.446s init, 251.043s render)
    4.12 Total Rendering Time: 4 minutes 15.88 seconds (2363 iterations, 6.927s init, 246.595s render)

  • RayDAntRayDAnt Posts: 1,154
    edited July 2019

    Even in the hypothetical case where Iray and Driver version were frozen there's still variance,

    Of course there's always going to be a certain base level of underlying variance in the rendering process when you're dealing with an unbiased rendering engine (which is what Iray is). But is that variance going to be enough to throw things off statistically using a given testing methodology? Let's find out.

    Here are the 4.12.0.033 Beta render times (in seconds) for 5 completed runs of the benchmarking scene I created rendered on my Titan RTX system, all completed under completely identical testing conditions (1800 iterations, 72f ambient room temp, custom watercooled components, identical daz/iray/driver versions, etc):

    Run #1: 231.281s render
    Run #2: 231.370s render
    Run #3: 231.397s render
    Run #4: 231.389s render
    Run #5: 231.368s render

    Which gives us the following descriptive statistics for rendering time.
    Min: 231.281s (Run #1)
    Mean: 231.361s (all Runs averaged together)
    Max: 231.397s (Run #3)

    And the following descriptive statistics for variance between rendering times.
    Min: 0.002s (Run #2 - Run #5)
    Max: 0.116s (Run #3 - Run #1)

    The two most popular thresholds for statistical significance are 5% and 1% of the total. What percentage of 231.281s (the minimum render time) is 0.116s (the largest variance observed here)? Let's do the math:
    0.116 / 231.281 = 0.000501554, i.e. roughly 0.05%

    The variance you are talking about here is literally insignificant. At least, it is if you are adhering to the testing/reporting methodology I set up.
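
    For anyone who wants to reproduce the arithmetic, here is a small stand-alone C++ sketch using the run times listed above:

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        // Render times (seconds) for the five Titan RTX runs reported above.
        std::vector<double> runs = {231.281, 231.370, 231.397, 231.389, 231.368};

        const double mn   = *std::min_element(runs.begin(), runs.end());
        const double mx   = *std::max_element(runs.begin(), runs.end());
        const double mean = std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
        const double spread = mx - mn;  // largest run-to-run difference

        std::printf("min %.3fs  mean %.3fs  max %.3fs\n", mn, mean, mx);
        std::printf("max spread %.3fs = %.4f%% of the minimum run\n",
                    spread, spread / mn * 100.0);
        return 0;
    }

    It reports a spread of roughly 0.05%, matching the figure worked out above.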

     

    The convergence ratio is actually more reliable.

    Limit by convergence threshold is technically the least accurate method Iray has for automatically determining when a scene is "finished" rendering. This is because Iray only periodically updates the Rendering Quality stat that drives the underlying mechanism (computing it takes additional processing time that Iray's developers wanted to go toward rendering itself as much as possible - hence the "Rendering Quality Enable" toggle under Progressive Render options; turning that to Off actually improves your overall rendering performance across the board). As a result, the algorithm only has an extremely coarse-grained set of opportunities (compared to iteration-limited configurations) to end the render once the desired convergence threshold is reached. Hence why completion by convergence always results in an overshoot of whatever you set it to.

    And the same is true (although significantly less so) with limits by time. Iray always overshoots whatever time limit you set because it renders natively in terms of iterations - not time spans. Set it to render for 10 seconds and it will always overshoot, because what it is actually doing is launching iterations up to the time you set PLUS however much more time it takes to get all the still-active iterations back, regardless of how far past the limit that runs. The iteration limit is the only consistent method Iray has for ending a render.
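
    To make the difference concrete, here is a purely illustrative toy model (not Iray's actual scheduling code - the batch size, quality-refresh interval, and convergence curve are invented) of why an iteration limit stops almost exactly where you ask while a convergence threshold overshoots:

    #include <cstdio>

    // Toy model of a progressive renderer. All numbers here are invented for
    // illustration; this is NOT Iray's actual scheduling code.
    static double fake_convergence(int iterations) {
        // Monotonically approaches 1.0 but never reaches it.
        return 1.0 - 1.0 / (1.0 + iterations / 400.0);
    }

    int main() {
        const int batch = 32;              // iterations dispatched per update (assumption)
        const int quality_interval = 500;  // iterations between quality refreshes (assumption)

        // Stop by iteration limit: checked after every batch, so the overshoot
        // is bounded by a single batch.
        int it = 0;
        const int iteration_limit = 1800;
        while (it < iteration_limit) it += batch;
        std::printf("iteration limit %d -> stopped at %d\n", iteration_limit, it);

        // Stop by convergence threshold: only evaluated when the quality stat is
        // refreshed, so the render runs on well past the point the target was met.
        const double target = 0.95;
        int first_met = 0, stopped = 0;
        for (it = batch; ; it += batch) {
            if (first_met == 0 && fake_convergence(it) >= target) first_met = it;
            bool refreshed = (it / quality_interval) != ((it - batch) / quality_interval);
            if (refreshed && fake_convergence(it) >= target) { stopped = it; break; }
        }
        std::printf("95%% target first met at %d, render stopped at %d\n", first_met, stopped);
        return 0;
    }

    With these made-up numbers it reports the 95% target being met around iteration 7616 but the render not stopping until iteration 8000 - the same kind of overshoot described above, just exaggerated.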

    you can try to change the max time in iray settings to reduce the overcalculation

    You can try. And as you will discover (like I did when I first started investigating this in earnest back in January/February) you will fail.

     

    if you want a zero variance the easy solution is to choose 100% convergence.

    Iray is an unbiased renderer, meaning it will take an INFINITE amount of time to render ANY scene to a true 100% level of convergence - making this an impractical solution.

    Post edited by RayDAnt on
  • CinusCinus Posts: 118
    edited July 2019

    @RayDAnt

    Using the number of iterations to terminate the test is certainly more accurate, but the fraction of a % that it will run over when using convergence is negligible in this kind of test. In a lab environment it might make sense (as long as every machine is using the same version of Iray).

    From the link in my earlier post : 

    As we changed some internal algorithms over the last year (regarding our sampling schemes), the 2016.X releases will behave a bit differently than 2015.X. So if you experience “slower” iteration times, don’t be scared, as in the end you will get a cleaner image faster

    Let's say the Iray developers change the next version of Iray in such a way that it does more work per iteration, so now Iray can reach 95% convergence in 1600 iterations (for the given test scene). With said new version of Iray, your test will keep on churning till it hits 1800 iterations and run much longer so it will appear that things had gotten worse (in terms of iterations / sec), but in reality your test just kept on running for 200 iterations longer than it needed to. What this boils down to is that you cannot compare different versions of Iray against each other when using # of iterations to terminate the test.

    Again, what matters to most people is not how many Iray iterations per second a GPU can do, what matters is how fast it can generate a "clean" image.

    Post edited by Cinus on
  • RayDAntRayDAnt Posts: 1,154
    edited July 2019
    Cinus said:

    @RayDAnt

    Using the number of iterations to terminate the test is certainly more accurate, but the fraction of a % that it will run over when using convergence is negligible in this kind of test. In a lab environment it might make sense (as long as every machine is using the same version of Iray).

    From the link in my earlier post : 

    As we changed some internal algorithms over the last year (regarding our sampling schemes), the 2016.X releases will behave a bit differently than 2015.X. So if you experience “slower” iteration times, don’t be scared, as in the end you will get a cleaner image faster

    Let's say the Iray developers change the next version of Iray in such a way that it does more work per iteration, so now Iray can reach 95% convergence in 1600 iterations (for the given test scene). With said new version of Iray, your test will keep on churning till it hits 1800 iterations and run much longer so it will appear that things had gotten worse (in terms of iterations / sec), but in reality your test just kept on running for 200 iterations longer than it needed to. What this boils down to is that you cannot compare different versions of Iray against each other when using # of iterations to terminate the test.

    Different versions of Iray can't be directly compared to each other at all (neither in terms of iterations completed, time spent rendering, nor even in terms of percentage converged, since the underlying algorithm used to estimate this last one also changes from release to release) when it comes to rendering completion patterns. Which is why my benchmark and the analysis thread associated with it specifically DON'T do that. They only aim to compare rendering performance across different rendering hardware using the same given version of Iray. Hence why all reported results are split into separate graphs based on the Iray version tested, and the resultant data points are not related in any direct way to each other. This was done on purpose to avoid making the exact faux pas you are describing. You are barking up the wrong tree.

    The benchmarking scene I created and the reporting mechanism/analysis thread for it are by no means a flawless, catch all solution for everything benchmark-related people could want to know about Iray rendering in Daz Studio. It is a hardware benchmark specifically. Not a software benchmark. For that, you will have to look some place else.

    Post edited by RayDAnt on
  • CinusCinus Posts: 118
    edited July 2019

    @RayDAnt said

    "The benchmarking scene I created and the reporting mechanism/analysis thread for it are by no means a flawless, catch all solution for everything benchmark-related people could want to know about Iray rendering in Daz Studio. It is a hardware benchmark specifically. Not a software benchmark. For that, you will have to look some place else"

    I'm afraid I don't quite understand what you mean by "It is a hardware benchmark specifically. Not a software benchmark". What are people benchmarking when they post performance data for V4.11 vs. v4.12 for the same hardware?

    I think it would be very useful to know whether a new version of Iray / Daz3D is performing better or worse than a previous version, but using number of iterations for the benchmarks will not work for that.

    It's your benchmark thread, so you can do it however you feel is best. I've said my piece on this. Thanks.

    Post edited by Cinus on
  • RayDAntRayDAnt Posts: 1,154
    Cinus said:

    What are people benchmarking when they post performance data for V4.11 vs. v4.12 for the same hardware?

    Reference data for people running either 4.11 or the current 4.12 Beta. Not everyone - understandably - is a fan of relying on beta software in their day-to-day workloads. I've found Daz beta releases to be extremely reliable myself, but I can totally understand people not wanting to go that route. So I'm supplying data (wherever possible) for the same hardware on both software distribution channels (public beta and final release) to cover as many bases as possible.

     

    Cinus said:

    I think it would be very useful to know whether a new version of Iray / Daz3D is performing better or worse than a previous version,

    So do I. If you get any ideas for how it could be done, please speak up! (I've been actively searching for ideas to that end myself for quite some time - so far, no dice.)

  • ebergerlyebergerly Posts: 3,255
    edited July 2019

    And to be clear, it is a somewhat limited hardware benchmark, in that it doesn't consider stuff like:

    1. Concurrently running GPU processes which may affect hardware performance, but the user may be totally unaware of 
    2. Thermal throttling (transient or otherwise) which may be occurring but the user may be totally unaware of
    3. Overclocking (factory, transient or otherwise) which may be occurring but the user may be totally unaware of
    4. etc.

    If nothing else, these issues certainly bring into question the posting of performance times with 0.001 sec (or even 1 sec) accuracy, unless the user has absolutely ruled these out in their system. Which can be a sizeable task.   

    Post edited by ebergerly on
  • ebergerlyebergerly Posts: 3,255
    edited July 2019
    Cinus said:

     

    I'm afraid I don't quite understand what you mean by "It is a hardware benchmark specifically. Not a software benchmark". What are people benchmarking when they post performance data for V4.11 vs. v4.12 for the same hardware?

    I think it has already been clearly determined that Iray (software) performance with the RTX technology can result in a staggering range of performance improvements, based solely on the complexity of your particular scene. That range is around 5% to 300%. Therefore there is no Iray software benchmark (or group of benchmarks) that can tell you how it will perform on your particular scene(s), and in fact there may be VAST differences between what an Iray/RTX benchmark shows and what you actually experience with your scenes.  

    Post edited by ebergerly on
  • fred9803fred9803 Posts: 1,565
    edited July 2019

    I had assumed that GPU memory would make a difference, but the amount of VRAM only affects how large a scene can be loaded, not the actual render times, unless the render drops to CPU.

    If system RAM or VRAM does have an effect on render times, perhaps we should be restarting the computer between test renders rather than just closing and re-opening DS, because data recently read from the hard disk is cached in RAM. Closing and re-opening a program does not clear that cache from RAM.

    Post edited by fred9803 on
  • bluejauntebluejaunte Posts: 1,990
    ebergerly said:

    If nothing else, these issues certainly bring into question the posting of performance times with 0.001 sec (or even 1 sec) accuracy, unless the user has absolutely ruled these out in their system. Which can be a sizeable task.   

    Sure. But if Iray logs these times with that precision, what do you stand to gain by rounding them? Just report the numbers as they are. Doesn't mean we need to read anything into a fraction of a second.

  • fred9803fred9803 Posts: 1,565
    edited July 2019

    Yeah, forget about the seconds, because they're neither here nor there given the inaccuracy introduced by each system's specs. What day of the week did the ice age end? When the margin of error exceeds the confidence level, given the variability of system specs and other unknown parameters, attempting to time things to the second becomes a meaningless task.

    Post edited by fred9803 on
  • RayDAntRayDAnt Posts: 1,154
    edited July 2019
    fred9803 said:

    If system RAM or VRAM does have an effect on render times, perhaps we should be restarting the computer between test renders rather than just closing and re-opening DS, because data recently read from the hard disk is cached in RAM. Closing and re-opening a program does not clear that cache from RAM.

    For what it's worth, I did a LOT of render engine behavior pre-testing as part of the process behind developing my benchmarking scene, and I was able to conclusively determine that current implementations of RAM caching in Windows 10 have no effect on Iray render times on a render-to-render basis, much less a Daz application launch-to-launch or even OS bootup-to-bootup one. Here is the testing methodology I used to determine this:

    1. Move all default Daz assets to an external drive (in my case a high speed USB-C SSD)
    2. Open a scene (eg. one of the benchmarking scenes) known to use assets now located on that SSD in Daz Studio
    3. Press Render and watch the scene render to completion
    4. With Daz Studio still open and the completed render's preview window still open, unplug the external drive
    5. Press Render again and watch the scene render to completion
    6. With all of the previous renders' completed preview windows still open, plug the external drive back in and wait for windows to recognize it
    7. Press Render another time and watch the scene render to completion
    8. Repeat steps 4-7 as many times as you want

    What I found is that, without fail, every one of the renders above finishes without a single interruption or error message from Iray or Daz Studio. However, the render completed in step #5 above (also steps #9, #13, etc.) is always visibly missing all of its objects' textures, whereas the renders completed in steps #3 and #7 always look exactly as they should. This means that every time you press the big Render button in Daz Studio, Iray is being fed (or at least attempting to be fed) a fresh copy of its asset data directly from disk. If OS-level disk caching were in play here (or even Daz Studio-level caching, for that matter), then logic dictates that the render in step #5 would always render visually correctly regardless of whether the external drive is currently plugged in or not.

    Post edited by RayDAnt on
  • nicsttnicstt Posts: 11,715
    ebergerly said:
    RayD'Ant, I like your wood measuring analogy. But in the RTX case I think we're talking about measuring wood whose length changes depending on who is measuring the wood.

    +1

    I'm seeing an improvement between 4.11 and 4.12 on a 980 Ti; that improvement varies in percentage depending on what is in the scene, and how much. It appears to be about 30%. I don't see the value in the thread RayDAnt is doing, but it is his time, and it may have value for some. It may also mislead, which is worth considering.

  • RayDAntRayDAnt Posts: 1,154
    nicstt said:

    I'm seeing an improvement between 4.11 and 4.12 on a 980 Ti; that improvement varies in percentage depending on what is in the scene, and how much. It appears to be about 30%. I don't see the value in the thread RayDAnt is doing

    It's a relative hardware performance benchmark (between-graphics-cards testing.) It has no bearing on what you are currently investigating, which is relative software performance (between-software-versions testing.)

  • I got a decent improvement going from 4.11 to 4.12. I am thinking about the discussion around the asset drive. Looks like I could save maybe 4-5 seconds moving to my M.2 SSD. That's not much for this scene, but I am curious if that becomes drastic on more complex scenes.

  • RayDAntRayDAnt Posts: 1,154

    I got a decent improvement going from 4.11 to 4.12. I am thinking about the discussion around the asset drive. Looks like I could save maybe 4-5 seconds moving to my M.2 SSD. That's not much for this scene, but I am curious if that becomes drastic on more complex scenes.

    Unfortunately because of the way Iray manages memory (loads EVERYTHING it's ever gonna need at the very start, and then just sits there iterating on it until the very end) having a faster asset drive actually ends up making LESS of a difference for more complex scenes that take more time to render. Once you get past the first few seconds or so of any render, Iray no longer needs data from your storage device at all. You can verify this by sticking your content library on an external drive, starting a scene render using assets from it, and then maliciously unplugging the drive from the computer once you start seeing iteration updates onscreen. Iray will go on to finish rendering the scene perfectly, completely oblivious to the fact that its data source is no longer there.

  • RayDAnt said:

    I got a decent improvement going from 4.11 to 4.12. I am thinking about the discussion around the asset drive. Looks like I could save maybe 4-5 seconds moving to my M.2 SSD. That's not much for this scene, but I am curious if that becomes drastic on more complex scenes.

    Unfortunately because of the way Iray manages memory (loads EVERYTHING it's ever gonna need at the very start, and then just sits there iterating on it until the very end) having a faster asset drive actually ends up making LESS of a difference for more complex scenes that take more time to render. Once you get past the first few seconds or so of any render, Iray no longer needs data from your storage device at all. You can verify this by sticking your content library on an external drive, starting a scene render using assets from it, and then maliciously unplugging the drive from the computer once you start seeing iteration updates onscreen. Iray will go on to finish rendering the scene perfectly, completely oblivious to the fact that its data source is no longer there.

    Ah! Nice to know. Didn't really want to go through the hassle anyway :)

  • Say... How would I test if NVLINK is working in 4.12 and combining memory? Many textures or just a really large texture?

  • RayDAntRayDAnt Posts: 1,154
    edited July 2019

    Say... How would I test if NVLINK is working in 4.12 and combining memory? Many textures or just a really large texture?

    Logic suggests that the single texture would be much more conclusive - If you can find a texture big enough for the task...

    To the broader issue of combining memory and NVLink, you may find this recent exchange between myself and @Jack Tomalin useful (reposted from here):

     

    RayDAnt said:
    RayDAnt said:

    Jack Tomalin said:

    I don't want to muddy the waters, so consider this anecdotal.. but wanted to just test the new render server - so threw the benchmark on it and..

    Rendered snapshot in 69.6524s

    ETA that's 4x 2080ti's in there now, sent via DS 4.12 on Iray Server 2.53 (so Iray 317500.2436)

    By all means, muddy the waters (post as much benchmarking data/hardware details as you can stand.) 'Tis the reason I created this benchmark/thread in the first place. I can always just not include whatever other data you have if it proves too convoluted to get into the graphs/tables above.

    Btw do you run Iray Server on Windows or Linux? Because in case you don't already know, running it under Linux will give you some MAJOR surprise perks.

    Currently Windows, since clustering only works between the same OS's.. and I wanted to try and get that working with my work rig.  Initial attempts saw some bugs, but I didn't know if it was due to the fact I was mixing the cards.  Now everything is 2080ti's between the two machines, I'll try again.

    But yea, it would be nice to try a Linux distro at some point.

    There's been LOTS of hay made about whether or not Iray is currently capable of implementing VRAM pooling across multiple GPUs. The full truth of the matter (and this is information I was only able to finally tease out properly as recently as two weeks ago after months of exhaustive research) is that it DOES (and has been capable of doing so for years) - but only in the case of Iray Server being run on a Unix-based system with the addition of NVLink bridges with capable Turing cards (like your 2080Ti's or my Titan RTX.)

    The key sticking point for compatibility under other operating systems is apparently in the way graphics cards and their individual hardware components are enumerated as devices by the operating system. Both modern versions of Windows and Mac OS handle I/O operations with onboard graphics memory directly, whereas Unix systems farm the task off to the manufacturer's driver (assuming one exists, of course). This is key because Nvidia's implementation of memory pooling (what they loosely refer to as enhanced Unified Memory) is enacted at the basic I/O level, making it currently a no-go on OS X or Windows unless Apple/Microsoft start tailoring their own low-level graphics card support systems (e.g. WDDM) to fit Nvidia's own proprietary implementation needs...

     

    Which isn't to say that you shouldn't test away - I'd love to be proven wrong about the above...

     

    Post edited by RayDAnt on
  • RayDAnt said:

    Say... How would I test if NVLINK is working in 4.12 and combining memory? Many textures or just a really large texture?

    Logic suggests that the single texture would be much more conclusive - If you can find a texture big enough for the task...

    To the broader issue of combining memory and NVLink, you may find this recent exchange between myself and @Jack Tomalin useful (reposted from here):

     

    RayDAnt said:
    RayDAnt said:

    Jack Tomalin said:

    I don't want to muddy the waters, so consider this anecdotal.. but wanted to just test the new render server - so threw the benchmark on it and..

    Rendered snapshot in 69.6524s

    ETA that's 4x 2080ti's in there now, sent via DS 4.12 on Iray Server 2.53 (so Iray 317500.2436)

    By all means, muddy the waters (post as much benchmarking data/hardware details as you can stand.) 'Tis the reason I created this benchmark/thread in the first place. I can always just not include whatever other data you have if it proves too convoluted to get into the graphs/tables above.

    Btw do you run Iray Server on Windows or Linux? Because in case you don't already know, running it under Linux will give you some MAJOR surprise perks.

    Currently Windows, since clustering only works between the same OS's.. and I wanted to try and get that working with my work rig.  Initial attempts saw some bugs, but I didn't know if it was due to the fact I was mixing the cards.  Now everything is 2080ti's between the two machines, I'll try again.

    But yea, it would be nice to try a Linux distro at some point.

    There's been LOTS of hay made about whether or not Iray is currently capable of implementing VRAM pooling across multiple GPUs. The full truth of the matter (and this is information I was only able to finally tease out properly as recently as two weeks ago after months of exhaustive research) is that it DOES (and has been capable of doing so for years) - but only in the case of Iray Server being run on a Unix-based system with the addition of NVLink bridges with capable Turing cards (like your 2080Ti's or my Titan RTX.)

    The key sticking point for compatibility under other operating systems is apparently in the way graphics cards and their individual hardware components are enumerated as devices by the operating system. Both modern versions of Windows and Mac OS handle I/O operations with onboard graphics memory directly, whereas Unix systems farm the task off to the manufacturer's driver (assuming one exists, of course). This is key because Nvidia's implementation of memory pooling (what they loosely refer to as enhanced Unified Memory) is enacted at the basic I/O level, making it currently a no-go on OS X or Windows unless Apple/Microsoft start tailoring their own low-level graphics card support systems (e.g. WDDM) to fit Nvidia's own proprietary implementation needs...

     

    Which isn't to say that you shouldn't test away - I'd love to be proven wrong about the above...

     

    I'm still pretty noob at this. Would it be as simple as taking a mat file and blowing it up to 4mb x 4mb (16gb)?

  • ebergerlyebergerly Posts: 3,255
    edited July 2019

    Regarding memory pooling, I thought it was shown to be doable in Windows (using SLI, not TCC mode) last year, but it requires software support. And in fact Chaos Group has already implemented it in its V-Ray rendering engine. So for me, the question remaining (and the question I've had since last year) is to what extent (if at all) Iray's developers have done the necessary work to implement it in Iray.

    BTW, note that Puget even compiled some code and made a UI so you can check if NVLINK is enabled, since the "simpleP2P.exe" (P2P = "peer to peer") that NVIDIA supplies doesn't show it (see link below). I haven't tried that code since I don't have NVLINK, but I have used the "simpleP2P.exe", and also written CUDA code last year that uses the Unified Memory feature to make an easy, global memory allocation (using "cudaMallocManaged") that includes system RAM and GPU VRAMs all as one unit, easily accessible. 
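
    For anyone who wants to poke at this themselves, that kind of check boils down to a few CUDA runtime calls. The following is only a generic sketch (not Puget's utility and not NVIDIA's simpleP2P sample), compiled with nvcc:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal peer-to-peer capability check between every pair of CUDA devices.
    // Build with: nvcc p2p_check.cu -o p2p_check
    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        std::printf("%d CUDA device(s) found\n", n);

        for (int a = 0; a < n; ++a) {
            for (int b = 0; b < n; ++b) {
                if (a == b) continue;
                int ok = 0;
                cudaDeviceCanAccessPeer(&ok, a, b);   // can device a access device b's memory?
                std::printf("GPU %d -> GPU %d peer access: %s\n", a, b, ok ? "yes" : "no");
                if (ok) {
                    cudaSetDevice(a);
                    // Enabling access is what a renderer would do before sharing buffers.
                    cudaDeviceEnablePeerAccess(b, 0);
                }
            }
        }
        return 0;
    }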

    Here are some quotes from the Puget Systems article on NVLink:

    "How To Configure NVLink on GeForce RTX 2080 and 2080 Ti in Windows 10

    Instead of using TCC mode, and needing to have a third graphics card to handle video output, setting up NVLink on the new GeForce RTX cards is much simpler. All you need to do is mount a compatible NVLink bridge, install the latest drivers, and enable SLI mode in the NVIDIA Control Panel.

    It is not obvious that the steps above enable NVLink, as that is not mentioned anywhere in the NVIDIA Control Panel that we could see. The 'simpleP2P.exe' test we ran before also didn't detect it, likely because TCC mode is not being enabled in this process. However, another P2P bandwidth test from CUDA 10 did show the NVLink connection working properly and with the bandwidth expected for a pair of RTX 2080 cards (~25GB/s each direction):

    There isn't an easy way to tell whether NVLink is working in the NVIDIA Control Panel, but NVIDIA does supply some sample CUDA code that can check for peer-to-peer communication. We have compiled the sample test we used above, and created a simple GUI for running it and viewing the result. You can download those utilities here.

    Do GeForce RTX Cards Support Memory Pooling in Windows?

    Not directly. While NVLink can be enabled and peer-to-peer communication is functional, accessing memory across video cards depends on software support. If an application is written to be aware of NVLink and take advantage of that feature, then two GeForce RTX cards (or any others that support NVLink) could work together on a larger data set than they could individually.

    What Benefits Does NVLink on GeForce RTX Cards Provide?

    While memory pooling may not 'just work' automatically, it can be utilized if software developers choose to do so. Support is not widespread currently, but Chaos Group has it functioning in their V-Ray rendering engine. Just like the new RT and Tensor cores in the RTX cards, we will have to wait and see how developers utilize NVLink.

    What About SLI Over NVLink on GeForce RTX Cards?

    While memory pooling may require special software support, the single NVLink on the RTX 2080 and dual links on the 2080 Ti are still far faster than the old SLI interconnect. That seems to be a main focus on these gaming-oriented cards: implementing SLI over a faster NVLink connection. That goal is already accomplished, as shown in benchmarks elsewhere."

    Also, the Chaos Group said:

    "Note that the available memory for GPU rendering is not exactly doubled with NVLink; V-Ray GPU needs to duplicate some data on each GPU for performance reasons, and it needs to reserve some memory on each GPU as a scratchpad for calculations during rendering. Still, using NVLink allows us to render much larger scenes than would fit on each GPU alone."

    And regarding memory reporting:

    "It seems like the regular GPU memory reporting API provided by NVIDIA currently (at the time of this writing) does not work correctly in SLI mode. This means that programs like GPUz, MSI Afterburner, nvidia-smi, etc. might not show accurate memory usage for each GPU. "

    Now this was from late last year, so things may have changed. But presumably they have changed for the better, if at all. 

     

    Post edited by ebergerly on
  • outrider42outrider42 Posts: 3,679
    RayDAnt said:

    Say... How would I test if NVLINK is working in 4.12 and combining memory? Many textures or just a really large texture?

    Logic suggests that the single texture would be much more conclusive - If you can find a texture big enough for the task...

    To the broader issue of combining memory and NVLink, you may find this recent exchange between myself and @Jack Tomalin useful (reposted from here):

     

    RayDAnt said:
    RayDAnt said:

    Jack Tomalin said:

    I don't want to muddy the waters, so consider this anecdotal.. but wanted to just test the new render server - so threw the benchmark on it and..

    Rendered snapshot in 69.6524s

    ETA that's 4x 2080ti's in there now, sent via DS 4.12 on Iray Server 2.53 (so Iray 317500.2436)

    By all means, muddy the waters (post as much benchmarking data/hardware details as you can stand.) 'Tis the reason I created this benchmark/thread in the first place. I can always just not include whatever other data you have if it proves too convoluted to get into the graphs/tables above.

    Btw do you run Iray Server on Windows or Linux? Because in case you don't already know, running it under Linux will give you some MAJOR surprise perks.

    Currently Windows, since clustering only works between the same OS's.. and I wanted to try and get that working with my work rig.  Initial attempts saw some bugs, but I didn't know if it was due to the fact I was mixing the cards.  Now everything is 2080ti's between the two machines, I'll try again.

    But yea, it would be nice to try a Linux distro at some point.

    There's been LOTS of hay made about whether or not Iray is currently capable of implementing VRAM pooling across multiple GPUs. The full truth of the matter (and this is information I was only able to finally tease out properly as recently as two weeks ago after months of exhaustive research) is that it DOES (and has been capable of doing so for years) - but only in the case of Iray Server being run on a Unix-based system with the addition of NVLink bridges with capable Turing cards (like your 2080Ti's or my Titan RTX.)

    The key sticking point for compatibility under other operating systems is apparently in the way graphics cards and their individual hardware components are enumerated as devices by the operating system. Both modern versions of Windows and Mac OS handle I/O operations with onboard graphics memory directly, whereas Unix systems farm the task off to the manufacturer's driver (assuming one exists, of course). This is key because Nvidia's implementation of memory pooling (what they loosely refer to as enhanced Unified Memory) is enacted at the basic I/O level, making it currently a no-go on OS X or Windows unless Apple/Microsoft start tailoring their own low-level graphics card support systems (e.g. WDDM) to fit Nvidia's own proprietary implementation needs...

     

    Which isn't to say that you shouldn't test away - I'd love to be proven wrong about the above...

     

    I'm still pretty noob at this. Would it be as simple as taking a mat file and blowing it up to 4mb x 4mb (16gb)?

    I would just concentrate on building a scene that is too large for a single card to render. The easiest way to do this is to try rendering very high resolution images. The higher the resolution, the more VRAM is needed. Go for above 4K, like 8K pixels. You can add a number of items into the scene as well to take up memory. Just adding a bunch of different high quality Genesis 8 models can accomplish this. Some Genesis 8 characters have very large textures these days. Some characters have downloads that are half a gig now. While not all of those textures are used, that's probably a sign they have some large textures. Maybe even TIF which can be near 100 MB each. So if you have several of these characters that all have different and large file sizes all in one scene, the VRAM should shoot way up.
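
    For a rough sense of scale, here is the back-of-the-envelope arithmetic (generic image-size math only - the character and texture counts are hypothetical, and Iray's real buffers and overheads will differ):

    #include <cstdio>

    // Back-of-the-envelope VRAM arithmetic. These are generic image-size
    // calculations, not figures taken from Iray itself.
    int main() {
        auto mib = [](double bytes) { return bytes / (1024.0 * 1024.0); };

        // One uncompressed 8192x8192 RGBA texture at 8 bits per channel:
        double tex8k = 8192.0 * 8192.0 * 4 /* channels */ * 1 /* byte */;
        std::printf("8K RGBA8 texture: %.0f MiB\n", mib(tex8k));             // ~256 MiB

        // An 8K render canvas stored as 32-bit float RGBA:
        double canvas = 7680.0 * 4320.0 * 4 /* channels */ * 4 /* bytes */;
        std::printf("8K float framebuffer: %.0f MiB\n", mib(canvas));        // ~506 MiB

        // Ten characters, each carrying ~15 uncompressed 4K RGBA8 maps (hypothetical counts):
        double char_maps = 10 * 15 * (4096.0 * 4096.0 * 4);
        std::printf("10 characters of 4K maps: %.0f MiB\n", mib(char_maps)); // ~9600 MiB
        return 0;
    }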

    Now what you do is make sure only 1 GPU is checked for Iray. Simply keep adding stuff until the scene drops to CPU when you try to render. Save that scene. Restart Daz and with the same settings and scene try again. We want to make sure that this scene will drop to CPU every time. Once you can confirm the scene drops to CPU every time, you can test Nvlink. But copy your Daz Help Log File first, we can use that to find what Iray is reporting. Once you have a saved copy, clear the log file, it will make it easier to locate the info next time.

    Close Daz again. Take the proper steps to enable Nvlink and make sure both GPUs are checked to render.

    If the image now renders on GPU, you quite possibly have proven that VRAM pooling works. Once again, copy and check the Help Log File to see what Iray is reporting.

    You can repeat it a few times to be sure.

    If VRAM pooling is working, that opens up a whole new load of possibilities. The new 2070 Super has Nvlink support, something the original 2070 lacked. At $500, the 2070 Super would be a very interesting option if it can handle around 16GB. And two 2070 Supers should match or possibly even beat a 2080ti, which would be quite an interesting thing. Do keep in mind that VRAM pooling is not a pure doubling of VRAM. Assuming it works, there will be some data that is duplicated across both GPUs. We do not know how much. I am guessing that there may be about 14GB of usable data with two 2070 Supers, but that is purely a guess.

  • AelfinMaegikAelfinMaegik Posts: 47
    edited July 2019

    I give up. I have Bend of the River, Mighty Oak, 10 G8 characters with hair and clothes and I just can't get past 38% memory on one card. Will try again tomorrow.

    Post edited by AelfinMaegik on
  • ebergerlyebergerly Posts: 3,255
    edited July 2019

    I give up. I have Bend of the River, Mighty Oak, 10 G8 characters with hair and clothes and I just can't get past 38% memory on one card. Will try again tomorrow.

    First of all, are you sure you set up NVLINK correctly, as outlined in my post above?

    Also, how are you measuring VRAM usage on your GPU's? As I posted above, at least at the end of last year, "It seems like the regular GPU memory reporting API provided by NVIDIA currently (at the time of this writing) does not work correctly in SLI mode. This means that programs like GPUz, MSI Afterburner, nvidia-smi, etc. might not show accurate memory usage for each GPU."

    Also, what cards are you using?

    Post edited by ebergerly on
  • ebergerlyebergerly Posts: 3,255
    edited July 2019

    FWIW, I think “VRAM pooling” is another one of those highly complex tech issues which, unfortunately, can get vastly oversimplified in the broader internet tech community, and that leads to lots of confusion and misunderstanding. There are many components and levels to VRAM pooling, and in fact the term “VRAM pooling” can mean many things.

    As I posted previously last year, in its simplest form VRAM pooling is very easy to implement, and in fact I wrote some CUDA code last year to pool the VRAM in my two GPU’s. NVIDIA released CUDA 6 (?), many years ago, and since then it’s become much easier. We now have “unified memory”, which provides a simple way to allocate all system RAM and GPU VRAM's into one, easily accessible chunk of memory. You merely ask the OS for a “handle” to your GPU’s, then ask the OS to allocate (using “cudaMallocManaged”) the needed “managed” memory, and that’s about it. Well, not really, but that's the basic idea.
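
    For anyone curious what that looks like, here is a minimal unified-memory sketch along those lines - a generic CUDA example for illustration, not ebergerly's actual code:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Scale every element of a managed array on the GPU.
    __global__ void scale(float* data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;          // 1M floats
        float* data = nullptr;

        // One allocation visible to both the CPU and every GPU; the driver migrates
        // pages on demand between system RAM and VRAM.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f);
        cudaDeviceSynchronize();        // wait for the kernel before touching data on the CPU

        std::printf("data[0] = %.1f\n", data[0]);   // 3.0
        cudaFree(data);
        return 0;
    }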

    However, since your GPU's, CPU and system RAM are connected via a PCI bus, that link is likely much slower than the memory itself. So if the problem you're solving in that big chunk of memory requires "peer to peer" communication between those physically separate memory components then, while it may still work as pooled memory, it might be extremely slow.

    For example, let's say you have a simple matrix/array of 2 million elements, and all you want to do is multiply each element by 3. In this case, the peer-to-peer hardware speed limitations are somewhat irrelevant. That's because each multiplication doesn't rely at all on the other multiplications going on with the remaining 1,999,999 elements, so there doesn't need to be any communication between GPU's. It's what's called "embarrassingly parallel", and it's where GPU's really shine. You take half the array (1 million elements) and store it in GPU1's VRAM, and the other half in the other VRAM, and do all the multiplications simultaneously, with no need for peer-to-peer hardware communication between GPU's. One problem split between the VRAM of 2 GPU's, aka "VRAM pooling".
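
    A sketch of that multiply-by-3 example split across two GPUs could look like the following (it assumes two CUDA devices are present; purely illustrative). Because no element depends on any other, neither GPU ever needs to touch the other's half, so no NVLink or peer-to-peer traffic is involved:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void times3(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 3.0f;
    }

    int main() {
        const int total = 2000000;
        const int half  = total / 2;
        std::vector<float> host(total, 1.0f);
        float* dev[2] = {nullptr, nullptr};

        // Give each GPU its own half of the array - "pooling" in the loose sense.
        for (int g = 0; g < 2; ++g) {
            cudaSetDevice(g);
            cudaMalloc(&dev[g], half * sizeof(float));
            cudaMemcpy(dev[g], host.data() + g * half, half * sizeof(float),
                       cudaMemcpyHostToDevice);
            times3<<<(half + 255) / 256, 256>>>(dev[g], half);  // both GPUs work concurrently
        }

        // Collect the results; neither GPU ever touched the other GPU's memory.
        for (int g = 0; g < 2; ++g) {
            cudaSetDevice(g);
            cudaDeviceSynchronize();
            cudaMemcpy(host.data() + g * half, dev[g], half * sizeof(float),
                       cudaMemcpyDeviceToHost);
            cudaFree(dev[g]);
        }
        std::printf("host[0] = %.1f, host[%d] = %.1f\n", host[0], total - 1, host[total - 1]);
        return 0;
    }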

    But raytracing is different.

    With raytracing you also want to break your scene in half, and store half the scene in one GPU and the other half in the other GPU. The problem comes when you send a ray into the scene on GPU 1, it hits an object, then bounces into the part of the scene that’s in the other GPU’s VRAM. Suddenly you need some extremely high speed, bi-directional communication between GPU’s, and some software to manage all of that. “Hey, GPU2, there’s a ray coming into your half of the scene, so you need to calculate if it hits something in your scene and send it back into the other GPU’s scene to figure out the next bounce”. And this has to be done with a high speed that matches the VRAM speed so you don’t slow anything down, and it has to be done in both directions. Hence the need for something like a fast, bi-directional communications link like NVLINK.

    In contrast, what we have today with multi-GPU’s and no high speed peer-to-peer is that the entire scene resides on both GPU’s. And especially with path tracers like Iray (which are computationally intense and take a long time to render), the relatively slow speed of the PCI bus may be somewhat irrelevant. You share calculations across the PCI bus, but compared to overall render time the slow PCI transfer rates are relatively minor and may not affect render times significantly. Of course this depends upon the renderer…

    So clearly, NVLINK does NOT equal VRAM pooling. It’s merely the hardware mechanism to allow you to do software stuff like VRAM pooling with high speed IF the problem you’re trying to solve requires that. And in some cases, for some problems, that high speed is not even necessary. You can pool VRAM and never need NVLINK.

    But most importantly the software and API’s, etc., must be built to manage all of those inter-GPU communications needed for raytracing. Especially since you also need to have some “master” information that keeps track of what objects are in what part of the scene in which GPU and so on. (BTW, that's partly why you don't necessarily just add all of the GPU VRAM gigabytes to get double or triple or whatever the amount of VRAM with VRAM pooling. As with anything to do with computers, there's overhead to consider). So having NVLINK enabled is pretty much irrelevant without all that software implementation.

    I suspect for most of us the real question is whether, and to what extent, Iray developers have gone the extra mile to fully integrate high speed, bi-directional, NVLINK-based VRAM pooling, and can it be accessed/implemented with the latest Studio/Iray and the RTX cards.

    Has anyone actually tried that yet, and verified that NVLINK/SLI is enabled, and verified the actual memory usage (using reliable data) across GPU’s?  
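
    On the question of verifying actual memory usage: from inside a CUDA process you can at least query what each device reports for itself (with the same caveat quoted earlier that memory reporting may be unreliable in SLI mode). A minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print free/total memory as reported by each CUDA device.
    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int g = 0; g < n; ++g) {
            cudaSetDevice(g);
            size_t free_b = 0, total_b = 0;
            cudaMemGetInfo(&free_b, &total_b);   // per-device figures from the driver
            std::printf("GPU %d: %.1f GiB free of %.1f GiB\n",
                        g, free_b / 1073741824.0, total_b / 1073741824.0);
        }
        return 0;
    }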

    Post edited by ebergerly on
  • Takeo.KenseiTakeo.Kensei Posts: 1,303
    edited July 2019
    timon630 said:

    Can someone test this scene in DS 4.11 and DS 4.12, please?

    I want to see how it goes on other cards, especially non-RTX cards.

    It's basically the SY scene modified with a strand hair asset that is configured to be cached in RAM for rendering (I don't know if that may be a problem).

    Thanks

    https://www.sendspace.com/file/b78062

     

    GTX 1070
    4.11 Total Rendering Time: 4 minutes 21.22 seconds (2400 iterations, 6.852s init, 252.496s render)
    4.11 Total Rendering Time: 4 minutes 28.70 seconds (2385 iterations, 7.715s init, 259.114s render)


    4.12 Total Rendering Time: 4 minutes 20.3 seconds (2434 iterations, 6.446s init, 251.043s render)
    4.12 Total Rendering Time: 4 minutes 15.88 seconds (2363 iterations, 6.927s init, 246.595s render)

    Thank you very, very much. That is a very interesting result.

    That shows how much the RT cores can improve render time when lots of geometry is used in the scene, and that pre-RTX cards get no improvement in that particular area from 4.12.

     

    Post edited by Takeo.Kensei on
  • Takeo.KenseiTakeo.Kensei Posts: 1,303
    RayDAnt said:

    Even in the hypothetical case where Iray and Driver version were frozen there's still variance,

    Of course there's always going to be a certain base level of underlying variance in the rendering process when you're dealing with an unbiased rendering engine (which is what Iray is). But is that variance going to be enough to throw things off statistically using a given testing methodology? Let's find out.

    Here are the 4.12.0.033 Beta render times (in seconds) for 5 completed runs of the benchmarking scene I created rendered on my Titan RTX system, all completed under completely identical testing conditions (1800 iterations, 72f ambient room temp, custom watercooled components, identical daz/iray/driver versions, etc):

    Run #1: 231.281s render
    Run #2: 231.370s render
    Run #3: 231.397s render
    Run #4: 231.389s render
    Run #5: 231.368s render

    Which gives us the following descriptive statistics for rendering time.
    Min: 231.281s (Run #1)
    Mean: 231.361s (all Runs averaged together)
    Max: 231.397s (Run #3)

    And the following descriptive statistics for variance between rendering times.
    Min: 0.002s (Run #2 - Run #5)
    Max: 0.116s (Run #3 - Run #1)

    The two most popular thresholds for statistical significance are 5% and 1% of the total. What percentage of 231.281s (the minimum render time) is 0.116s (the largest variance observed here)? Let's do the math:
    0.116 / 231.281 = 0.000501554, i.e. roughly 0.05%

    The variance you are talking about here is literally insignificant. At least, it is if you are adhering to the testing/reporting methodology I set up.

    No, you still don't get it. Prove to me that every iteration has the same number of samples, or whatever measure you think stays the most fixed.

    Here, if you take the total render time as the most stable control value, then that is what you should use as the control value for the benchmark.

    You chose iteration count; I want to see the same stability demonstrated for iterations.

     

    RayDAnt said:

    The convergence ratio is actually more reliable.

    Limit by convergence threshold is technically the least accurate method Iray has for automatically determining when a scene is "finished" rendering. This is because Iray only periodically updates the Rendering Quality stat that drives the underlying mechanism (computing it takes additional processing time that Iray's developers wanted to go toward rendering itself as much as possible - hence the "Rendering Quality Enable" toggle under Progressive Render options; turning that to Off actually improves your overall rendering performance across the board). As a result, the algorithm only has an extremely coarse-grained set of opportunities (compared to iteration-limited configurations) to end the render once the desired convergence threshold is reached. Hence why completion by convergence always results in an overshoot of whatever you set it to.

    And the same is true (although significantly less so) with limits by time. Iray always overshoots whatever time limit you set because it renders natively in terms of iterations - not time spans. Set it to render for 10 seconds and it will always overshoot, because what it is actually doing is launching iterations up to the time you set PLUS however much more time it takes to get all the still-active iterations back, regardless of how far past the limit that runs. The iteration limit is the only consistent method Iray has for ending a render.

    you can try to change the max time in iray settings to reduce the overcalculation

    You can try. And as you will discover (like I did when I first started investigating this in earnest back in January/February) you will fail.

     

    if you want a zero variance the easy solution is to choose 100% convergence.

    Iray is an unbiased renderer, meaning it will take an INFINITE amount of time to render ANY scene to a true 100% level of convergence - making this an impractical solution.

    You don't understand what an unbiased renderer means. It certainly doesn't mean that 100% convergence = infinite render time.

    Here is my log on the scene with hairs I posted with 100% convergence. You can have a try

    2019-07-28 16:33:30.886 Iray VERBOSE - module:category(IRAY:RENDER):   1.0   IRAY   rend progr: 100.00% of image converged
    2019-07-28 16:33:30.901 Iray INFO - module:category(IRAY:RENDER):   1.0   IRAY   rend info : Received update to 03851 iterations after 112.974s.
    2019-07-28 16:33:30.930 Iray INFO - module:category(IRAY:RENDER):   1.0   IRAY   rend info : Convergence threshold reached.

    And here is the log for 95% convergence

    2019-07-26 06:07:48.810 Iray VERBOSE - module:category(IRAY:RENDER):   1.0   IRAY   rend progr: 95.17% of image converged
    2019-07-26 06:07:48.832 Iray INFO - module:category(IRAY:RENDER):   1.0   IRAY   rend info : Received update to 02379 iterations after 107.877s.
    2019-07-26 06:07:48.851 Iray INFO - module:category(IRAY:RENDER):   1.0   IRAY   rend info : Convergence threshold reached.

    It added 6s of render time to get to 100%.

     

    ebergerly said:
    Cinus said:

     

    I'm afraid I don't quite understand what you mean by "It is a hardware benchmark specifically. Not a software benchmark". What are people benchmarking when they post performance data for V4.11 vs. v4.12 for the same hardware?

    I think it has already been clearly determined that Iray (software) performance with the RTX technology can result in a staggering range of performance improvements, based solely on the complexity of your particular scene. That range is around 5% to 300%. Therefore there is no Iray software benchmark (or group of benchmarks) that can tell you how it will perform on your particular scene(s), and in fact there may be VAST differences between what an Iray/RTX benchmark shows and what you actually experience with your scenes.  

    The 300% was what misgenus achieved. With the little scene I made, it seems you can get way more than that. See LenioTG's post:

     

    LenioTG said:

    Can someone test this scene in DS 4.11 and DS 4.12 please ?

    I get 20x speed improvement between the two versions and would like to see how it goes on other cards especially with non RTX cards

    It's basically the SY scene modified with a strand hair asset which is configured to be cached on RAM for rendering (don't know if that may be a problem)

    thanks

    https://www.sendspace.com/file/b78062

    Done!

    OMG: not only did it render much faster, it also opened the scene in a fraction of the time!

    • 4.11 - 9 minutes 54.12 seconds
    • 4.12 - 1 minutes 9.44 seconds
    • Speed Improvement: 8.6x

    I have a RTX 2060.

     

  • Takeo.KenseiTakeo.Kensei Posts: 1,303
    edited July 2019
    RayDAnt said:

    Say... How would I test if NVLINK is working in 4.12 and combining memory? Many textures or just a really large texture?

    Logic suggests that the single texture would be much more conclusive - If you can find a texture big enough for the task...

    To the broader issue of combining memory and NVLink, you may find this recent exchange between myself and @Jack Tomalin useful (reposted from here):

     

    RayDAnt said:
    RayDAnt said:

    Jack Tomalin said:

    I don't want to muddy the waters, so consider this anecdotal.. but wanted to just test the new render server - so threw the benchmark on it and..

    Rendered snapshot in 69.6524s

    ETA that's 4x 2080ti's in there now, sent via DS 4.12 on Iray Server 2.53 (so Iray 317500.2436)

    By all means, muddy the waters (post as much benchmarking data/hardware details as you can stand.) 'Tis the reason I created this benchmark/thread in the first place. I can always just not include whatever other data you have if it proves too convoluted to get into the graphs/tables above.

    Btw do you run Iray Server on Windows or Linux? Because in case you don't already know, running it under Linux will give you some MAJOR surprise perks.

    Currently Windows, since clustering only works between the same OS's.. and I wanted to try and get that working with my work rig.  Initial attempts saw some bugs, but I didn't know if it was due to the fact I was mixing the cards.  Now everything is 2080ti's between the two machines, I'll try again.

    But yea, it would be nice to try a Linux distro at some point.

    There's been LOTS of hay made about whether or not Iray is currently capable of implementing VRAM pooling across multiple GPUs. The full truth of the matter (and this is information I was only able to finally tease out properly as recently as two weeks ago after months of exhaustive research) is that it DOES (and has been capable of doing so for years) - but only in the case of Iray Server being run on a Unix-based system with the addition of NVLink bridges with capable Turing cards (like your 2080Ti's or my Titan RTX.)

    The key sticking point for compatibility under other operating systems is apparently in the way graphics cards and their individual hardware components are enumerated as devices by the operating system. Both modern versions of Windows and Mac OS handle I/O operations with onboard graphics memory directly, whereas Unix systems farm the task off to the manufacturer's driver (assuming one exists, of course). This is key because Nvidia's implementation of memory pooling (what they loosely refer to as enhanced Unified Memory) is enacted at the basic I/O level, making it currently a no-go on OS X or Windows unless Apple/Microsoft start tailoring their own low-level graphics card support systems (e.g. WDDM) to fit Nvidia's own proprietary implementation needs...

     

    Which isn't to say that you shouldn't test away - I'd love to be proven wrong about the above...

     

    I'm still pretty noob at this. Would it be as simple as taking a mat file and blowing it up to 4mb x 4mb (16gb)?

    I would just concentrate on building a scene that is too large for a single card to render. The easiest way to do this is to try rendering very high resolution images. The higher the resolution, the more VRAM is needed. Go for above 4K, like 8K pixels. You can add a number of items into the scene as well to take up memory. Just adding a bunch of different high quality Genesis 8 models can accomplish this. Some Genesis 8 characters have very large textures these days. Some characters have downloads that are half a gig now. While not all of those textures are used, that's probably a sign they have some large textures. Maybe even TIF which can be near 100 MB each. So if you have several of these characters that all have different and large file sizes all in one scene, the VRAM should shoot way up.

    Now what you do is make sure only 1 GPU is checked for Iray. Simply keep adding stuff until the scene drops to CPU when you try to render. Save that scene. Restart Daz and with the same settings and scene try again. We want to make sure that this scene will drop to CPU every time. Once you can confirm the scene drops to CPU every time, you can test Nvlink. But copy your Daz Help Log File first, we can use that to find what Iray is reporting. Once you have a saved copy, clear the log file, it will make it easier to locate the info next time.

    Close Daz again. Take the proper steps to enable Nvlink and make sure both GPUs are checked to render.

    If the image now renders on GPU, you quite possibly have proven that VRAM pooling works. Once again, copy and check the Help Log File to see what Iray is reporting.

    You can repeat it a few times to be sure.

    If VRAM pooling is working, that opens up a whole new load of possibilities. The new 2070 Super has Nvlink support, something the original 2070 lacked. At $500, the 2070 Super would be a very interesting option if it can handle around 16GB. And two 2070 Supers should match or possibly even beat a 2080ti, which would be quite an interesting thing. Do keep in mind that VRAM pooling is not a pure doubling of VRAM. Assuming it works, there will be some data that is duplicated across both GPUs. We do not know how much. I am guessing that there may be about 14GB of usable data with two 2070 Supers, but that is purely a guess.

    I agree with the process, with some modifications:

    1°/ Load some big HDR like https://hdrmaps.com/freebies/egg-mountain-at-afternoon

    And set the Environment resolution to something big to be sure it doesn't get compressed

    Optionally, upscale the HDR to some monster size

    2°/ Artificially upscale all textures on the models you load, and raise the texture compression thresholds in the Iray advanced settings to some high values

    Post edited by Takeo.Kensei on
  • outrider42outrider42 Posts: 3,679
    I suppose another good question would be: how much VRAM do you currently have? If you are talking about pooling 2080 Tis, each has 11GB, and I can build a scene that will use all of that up. I have Bend on the River, too; perhaps I can upload a scene file that uses it. Also, make sure you have the foliage turned on.

    Iray has compression settings in the Advanced tab. Turning up the minimum and maximum threshold values will make the engine keep textures at their larger, uncompressed size. So you don't need to try and resize all the textures; just choose models that have large textures to start with.
This discussion has been closed.