Rendering on more than one GPU, or CPU + GPU, how does IRAY assemble the final image?

This question has probably been answered before, but I have not found it.  So, when IRAY is rendering an image on two or more different graphics cards, I get the impression that it loads up both cards, and they start rendering at least semi-independently.  Obviously at some point the render products have to be composited to form a single final image.   Does this happen periodically during the process by exchanging image data between the GPUs, or does it happen at the very end, when both GPUs are "done"?

The reason for the question, other than curiosity, is that I am trying to think of some ways to improve noise/firefly reduction and reduce render times.

Thanks!

Comments

  • namffuak Posts: 4,406

    I don't know that the question has been asked, as such. But if you watch cpu usage while running a gpu-only render you'll see a fair bit of usage going on. I haven't played with the setting, but I think the 'Update Interval' in the general render settings controls how frequently the gpu reports the rendered image back to the Iray resident code, and the Iray resident code is in charge of building the composite render image that we watch as the process runs.

    So maybe changing this from the default 5 seconds to some longer time would speed things up a bit. Maybe. This would depend on PCI-E lanes and overall available bandwidth for starters.
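
    For a rough sense of scale (these are my own assumptions, not anything from the Iray docs: a full-resolution RGBA float framebuffer shipped back once per update), the transfer itself looks tiny next to PCI-E bandwidth:

        # Back-of-envelope: cost of shipping the framebuffer back each update.
        # Assumptions (not confirmed): 1920x1080 RGBA float32 buffer, 5 s interval.
        width, height = 1920, 1080
        bytes_per_pixel = 4 * 4              # RGBA, 32-bit float per channel
        update_interval_s = 5.0

        buffer_mb = width * height * bytes_per_pixel / 1e6
        rate_mb_per_s = buffer_mb / update_interval_s
        pcie3_x16_mb_per_s = 16_000          # roughly 16 GB/s theoretical

        print(f"framebuffer ~ {buffer_mb:.0f} MB")                       # ~33 MB
        print(f"transfer ~ {rate_mb_per_s:.1f} MB/s per device")         # ~6.6 MB/s
        print(f"~ {100 * rate_mb_per_s / pcie3_x16_mb_per_s:.2f}% of PCI-E 3.0 x16")

    If those assumptions are anywhere near right, a longer interval would save less in bus traffic than in how often the GPUs get interrupted to report in.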

  • Greymom Posts: 1,139
    namffuak said:

    I don't know that the question has been asked, as such. But if you watch cpu usage while running a gpu-only render you'll see a fair bit of usage going on. I haven't played with the setting, but I think the 'Update Interval' in the general render settings controls how frequently the gpu reports the rendered image back to the Iray resident code, and the Iray resident code is in charge of building the composite render image that we watch as the process runs.

    So maybe changing this from the default 5 seconds to some longer time would speed things up a bit. Maybe. This would depend on PCI-E lanes and overall available bandwidth for starters.

    Very good points!

    If noise and fireflies in an image are at least partially random, then if you have two devices rendering the same image and their results are composited, I would think that this would be a kind of built-in antialiasing, since you are averaging together two images with noise in different locations.  Kind of the same idea as rendering at 4x the target resolution and then reducing the resolution in postwork.  If this theory is correct, then rendering on two devices should result in more rapid noise/firefly reduction than a single device at the same sampling rate.  If I ever have the bucks to get two NVIDIA GPUs, I might be able to test this.

    Also, the frequency at which the images are combined/composited should affect the noise reduction (I think).  So, your suggestion concerning the update frequency is testable!

    Carrying that thought a bit further, if you were to render an image for say 100 samples, then render the same image 4 times to 25 samples and composite them in Photoshop, what differences would you see in apparent quality?  It is possible that IRAY already does something like this internally as part of the noise reduction process.
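
    For the purely random part of the noise, averaging four independent 25-sample renders should come out statistically about the same as one 100-sample render, since either way you end up averaging 100 independent samples.  Here is a toy check using plain Monte Carlo on a single made-up pixel value (nothing to do with Iray's real sampler):

        import random

        TRUE_VALUE = 0.5      # the "correct" brightness of our imaginary pixel
        NOISE = 0.3           # spread of one random sample around that value

        def noisy_estimate(n_samples):
            """One render's estimate of the pixel: average of n noisy samples."""
            return sum(random.gauss(TRUE_VALUE, NOISE) for _ in range(n_samples)) / n_samples

        random.seed(1)
        trials = 5000

        # One render taken to 100 samples.
        err_100 = sum((noisy_estimate(100) - TRUE_VALUE) ** 2 for _ in range(trials)) / trials

        # Four independent 25-sample renders, composited by averaging.
        err_4x25 = sum(
            ((sum(noisy_estimate(25) for _ in range(4)) / 4) - TRUE_VALUE) ** 2
            for _ in range(trials)
        ) / trials

        print(f"mean squared error, 1 x 100 samples:  {err_100:.6f}")
        print(f"mean squared error, 4 x 25 averaged:  {err_4x25:.6f}")   # about the same

    That equivalence only holds if each render is unbiased and independently seeded; anything a sampler carries over between iterations would break it.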

    The upcoming "AI noise reduction" feature (if it is not already implemented) may make all this moot.

    Anyway, thanks for your thoughts!  My goal here is to get higher quality, faster renders without spending even more money on hardware, or on IRAY licenses.

  • Render detail isn't so much about the hardware as it is about the settings in the software and the lighting in the scene. A grainy image will render twice as fast on two 1080 Tis as it will on one, but it will still be grainy.

  • adacey Posts: 186

    My understanding is that you're essentially having to calculate a whole pile of possible light paths. If you have multiple devices involved, it's not as if you're saying, "okay, both cards render this scene"; you're saying, "okay, I have to calculate 1000 paths, you take 500 and I'll take 500." That's a bit of an oversimplification, but as I understand it, neither card should be duplicating the work unless the rendering engine itself recalculates the value for the same path multiple times, and even then that work would still be split between the two cards.

    Essentially, the way it works is that you have your light source; for simplicity's sake, let's assume it is a more or less point light source. That light is emitting light over a certain angle of spread (plus fun things like falloff, etc.). Starting from the light, the renderer follows one of the possible paths that light will take. When the path hits an object, the renderer looks at the surface properties of the spot it hit: how much light is reflected or absorbed, what colour the surface is, how much the reflected light spreads and at what angle, whether there is any refraction, and so on. It figures out what the surface is going to look like from the light that hit it, then keeps following the path. Eventually the light falls off, hits some bounce limit, or otherwise stops contributing to the scene; then the renderer goes back and picks the next path, and so on.
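
    Stripped of all the geometry, the loop for one path looks something like this toy sketch (all the numbers and the "toward the camera" test are made up, and it is definitely not Iray's internals):

        import random

        def trace_one_path(light_power=10.0, max_bounces=8, min_throughput=1e-3):
            """Toy sketch of following one light path from the light source:
            bounce from surface to surface, losing energy at each hit, until
            the path stops contributing or hits the bounce limit."""
            throughput = 1.0          # fraction of the light's energy still being carried
            seen_by_camera = 0.0
            for bounce in range(max_bounces):
                # Stand-ins for "hit an object and look at its surface properties":
                reflectance = random.uniform(0.1, 0.9)   # how much this surface bounces onward
                scattered_toward_camera = random.random() < 0.3
                if scattered_toward_camera:
                    seen_by_camera += light_power * throughput
                throughput *= reflectance                # the rest is absorbed, tinted, refracted...
                if throughput < min_throughput:
                    break                                # effectively "fallen off"
            return seen_by_camera

        random.seed(0)
        print(trace_one_path())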

    As I understand it, with unbiased renderers (like Iray) the renderer keeps performing these path traces over and over again and, in Iray's case, looks at how much variance there is between the latest sample of the scene and the previous ones. At some point you start hitting diminishing returns, where the latest sample stops varying much from the previous samples, and the renderer deems the scene "done". Or you hit the maximum number of samples and the render finishes because you've said not to collect more than 5000 samples.
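
    Reduced to a toy, the stopping logic is something like the loop below (the threshold and the test are made up; Iray's real criterion is its convergence/quality setting, whose details I don't know):

        import random

        def sample_pixel_until_converged(max_samples=5000, threshold=1e-4, min_samples=100):
            """Toy 'diminishing returns' loop: keep adding samples until the
            running estimate stops moving much, or the sample cap is hit."""
            total = 0.0
            estimate = 0.0
            for n in range(1, max_samples + 1):
                total += random.gauss(0.5, 0.3)       # one more noisy sample of this pixel
                new_estimate = total / n
                if n >= min_samples and abs(new_estimate - estimate) < threshold:
                    return new_estimate, n            # "done": barely changed this iteration
                estimate = new_estimate
            return estimate, max_samples              # hit the sample cap instead

        random.seed(2)
        value, used = sample_pixel_until_converged()
        print(f"stopped at {used} samples with estimate {value:.3f}")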

    If you think of it this way, the GPU isn't exactly "assembling" the scene; the GPU in this case just happens to be a very specialized piece of hardware that's very good at the types of calculations needed for tracing these paths. You load the scene into the GPU's memory so that it has the information it needs, and it starts calculating paths at a much faster speed than your more general-purpose CPU can, but it has to periodically report the results back to the CPU. It's the CPU that is actually assembling the image, taking the results from the various cards that have contributed and periodically coordinating what work the GPUs have to do.
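
    If that picture is right, the assembly step on the CPU side could be as simple as a sample-count-weighted average of each device's accumulated sums. This is a guess at the general shape, not Iray's actual code:

        def combine_device_results(device_buffers):
            """Merge per-device results into one image.

            device_buffers: list of (pixel_sums, sample_count) pairs, where
            pixel_sums holds each device's running sum of its own samples for
            every pixel. Guesswork at the general shape, not Iray's code."""
            total_samples = sum(count for _, count in device_buffers)
            n_pixels = len(device_buffers[0][0])
            combined = [0.0] * n_pixels
            for pixel_sums, _count in device_buffers:
                for i, s in enumerate(pixel_sums):
                    combined[i] += s                      # raw sums simply add across devices
            return [s / total_samples for s in combined]  # one sample-weighted image

        # Two "devices" that sampled the same 4-pixel image different numbers of times:
        fast_gpu = ([30.0, 15.0, 60.0, 3.0], 60)   # sums over 60 samples per pixel
        slow_gpu = ([20.0, 10.0, 40.0, 2.0], 40)   # sums over 40 samples per pixel
        print(combine_device_results([fast_gpu, slow_gpu]))   # [0.5, 0.25, 1.0, 0.05]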

    This is quite a bit of a simplification, but that's my understanding of how things work. If somebody does find something obviously wrong please do let me know. I've also just gone for the simple path tracing technique and not covered things like bidirectional path tracing, or other more complicated techniques, but ultimately it does all still fall back to tracing the paths of the light one way or another.

  • DrNewcenstein Posts: 816
    edited March 2018

    Rendering to 25% completion 4 times will not yield 100% completion. You'll have 4 images that are 25% complete. Compositing them in Photoshop will not improve the outcome.

    The reason rendering to a larger image and then shrinking it works is that those pixels are locked to the resolution. At 1920x1080, a single firefly is 1x1 pixel. Rendering at 4x that resolution and then shrinking it back to 1920x1080 reduces that firefly to a quarter of a pixel in each direction. It's still there, it's just blending in better.
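
    Put as numbers: downscaling is just averaging blocks of pixels, so the firefly gets diluted by its neighbours rather than removed (made-up values below):

        def box_downsample(block):
            """Average one NxN block of high-resolution pixels into one output pixel."""
            values = [v for row in block for v in row]
            return sum(values) / len(values)

        # A 4x4 patch from the 4x-resolution render: mid-grey with one firefly in it.
        patch = [
            [0.2, 0.2, 0.2, 0.2],
            [0.2, 8.0, 0.2, 0.2],   # the firefly
            [0.2, 0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2, 0.2],
        ]
        print(box_downsample(patch))   # 0.6875 - still brighter than 0.2, but far less glaring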

    As for how the workload is distributed, if you expand the History button in the little rendering dialog box that opens, it will show all CUDA devices and the interactions. It will load meshes, textures, lights, and other scene info into all devices. It doesn't say "card 0 gets the meshes, card 1 gets the textures, card 2 does the lights, card 3 does the calculations of light bounces" and on like that. It dumps the scene into all available cards so they all work on it together. However, I don't know if that's saying "I have 100MB of data, I want to split it up 50/50 between 2 cards", or if it's saying "I have 100MB of data for card 0 and 100MB of data for card 1."

    Given that it dumps to the CPU if you exceed the VRAM of a given GPU in the chain, I'm guessing it gives every card a full copy of the scene rather than splitting the data across the cards.

  • adacey Posts: 186

    Right, it's not splitting up the data, each card needs all the data. What I meant by splitting up the work is that it splits up the workload that needs to be done on that data. So essentially let's say that there are 1000 paths that need to be calculated, each card will theoretically calculate 500 of them.

    Or put another way, if you have 1 card you have a certain number of CUDA cores to do the workload (or in my case, I wish I had CUDA cores to do the workload). Adding another card just means you now have more cores to do the workload. Each card needs a full copy of the scene to work on it. As I understand things, the cards aren't assembling the scene, they're simply performing calculations. The results of those calculations still go back to the CPU for assembly as far as I know.

  • Greymom Posts: 1,139

    DrNewcenstein said:

    Rendering to 25% completion 4 times will not yield 100% completion. You'll have 4 images that are 25% complete. Compositing them in Photoshop will not improve the outcome.

    The reason rendering to a larger image and then shrinking it works is that those pixels are locked to the resolution. At 1920x1080, a single firefly is 1x1 pixel. Rendering at 4x that resolution and then shrinking it back to 1920x1080 reduces that firefly to a quarter of a pixel in each direction. It's still there, it's just blending in better.

    As for how the workload is distributed, if you expand the History button in the little rendering dialog box that opens, it will show all CUDA devices and the interactions. It will load meshes, textures, lights, and other scene info into all devices. It doesn't say "card 0 gets the meshes, card 1 gets the textures, card 2 does the lights, card 3 does the calculations of light bounces" and on like that. It dumps the scene into all available cards so they all work on it together. However, I don't know if that's saying "I have 100MB of data, I want to split it up 50/50 between 2 cards", or if it's saying "I have 100MB of data for card 0 and 100MB of data for card 1."

    Given that it dumps to the CPU if you exceed the VRAM of a given GPU in the chain, I'm guessing it gives every card a full copy of the scene rather than splitting the data across the cards.

    Right, rendering to 25% completion four times will be inferior to rendering to 100% completion, but that's not the question I asked.  I was wondering about four renders of 25 samples vs. one render of 100 samples.  Depending on how IRAY handles image sampling, multiple renders of the same image may result in differences in noise and fireflies.  If so, compositing these images would tend to average out the noise and fireflies.  Depending on how noise and fireflies are generated in IRAY, and how the data from different rendering devices is sampled, some noise/firefly averaging may be inherent with multiple devices, and may be affected by how often the data from the different devices is combined.

    For example, Luxrender 1.6 has two different image samplers you can select, Sobol and Metropolis.  As I understand it, both start with a Monte Carlo random-number sampling technique.  The older Sobol sampler keeps sampling randomly.  The Metropolis Light Transport system saves and reuses light paths and calculations to improve efficiency, and this results in more rapid denoising and firefly reduction.  This is particularly interesting because IRAY supposedly has an MLT option.  A more advanced light path system is part of Disney's Hyperion renderer, which uses "light path bundling" and other esoteric optimization techniques (it is entirely CPU-based because the images are huge, designed for IMAX/4K; they used 55,000 physical cores to render Big Hero 6).

    I did a few simple experiments a while back with the Sobol sampler, running some short renders.  It appeared that while the noise and fireflies were in similar areas of the image, they were not exactly the same, or in the same locations, because each render started with a new initialization of the sampler.  In my very brief look, it appeared that the overlay of these images really reduced the fireflies by averaging, compared to a single render with the same number of samples.
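
    For anyone wanting to repeat that overlay test, averaging a stack of saved renders is only a few lines.  The filenames below are placeholders, it needs Pillow and NumPy, and strictly this should be done on linear/HDR output before tone mapping, but it shows the idea:

        # Average several independently-seeded renders of the same frame.
        # Requires: pip install pillow numpy. Filenames are placeholders.
        import numpy as np
        from PIL import Image

        filenames = ["render_seed1.png", "render_seed2.png",
                     "render_seed3.png", "render_seed4.png"]

        stack = np.stack([np.asarray(Image.open(f), dtype=np.float64) for f in filenames])

        mean = stack.mean(axis=0)                 # straight per-pixel average
        Image.fromarray(mean.astype(np.uint8)).save("render_mean.png")

        # A per-pixel median is often even better at killing isolated fireflies:
        median = np.median(stack, axis=0)
        Image.fromarray(median.astype(np.uint8)).save("render_median.png")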

    I would expect that this would have the opposite effect with the Metropolis sampler, as this would also restart the optimization cycle.

    Oh, the concept of rendering at a higher resolution and then reducing the image size is a good general technique.   I have been using this for years, starting with Poser 7 Firefly, then with Luxrender.

    Anyway, thanks to everyone for the info and comments!  I have noted down some experiments to try whenever I get some time.

    "Dammit Jim, I'm a retired polymer chemist, not a software engineer!"

  • Even so, I don't see 25 samples x4 being equal to 100 samples, or 1000 samples x5 being equal to 5000 samples (the default setting). I see a lot of post-work, and the same amount of time spent rendering the image 5 times to 1000 samples as it would take to render once to 5000 samples.
