Multiple-GPU rig freezes when Iray rendering

RobertDy Posts: 236

Hi everyone, I really need your help with a problem I'm having, and the worst kind of problem is one whose cause you don't know. I never had this problem before; it started out of nowhere a little over a week ago. I don't think my specs are the cause, but just in case, here they are:

AMD Ryzen Threadripper 3960X
ASROCK TRX40 CREATOR
GPU 1: RTX 3090 24GB
GPU 2: RTX 2080 Ti
4 X 16GB KLEVV BOLT X DDR4 3200 RAM
2 TB Samsung 980 Pro SSD
EDIT: 2000W PSU (FSP Cannon)

I've also attached my Iray advanced settings: both GPUs for photoreal, only one for interactive, CPU fallback disabled. The scenes I render are usually just 2 characters, mostly with an HDR as the background, so nothing complex (and in the past I rendered even more complex scenes with no freezing whatsoever). I've tried rendering in DAZ 4.16 and the latest 4.20; both froze, so I don't think it's a version issue. I've attached a log file from when the latest freeze happened. I'm a beginner and can't find anything in it: the log stops at "rendering convergence at X%" and then jumps straight to 'DAZ 3D starting up'. I'm really hoping I've missed something in the log file and someone can point it out.

In terms of Iray rendering, everything went fine for the first 4 months after this PC was purchased. An old SSD died just last month and was replaced by this Samsung SSD; I don't think that's the cause, but if it could be, please flag it. The freeze can happen at any point during the render, and when it does, the whole system locks up. No BSOD or black screen, just a complete freeze, and I have to force power off the PC and start it up again.

I've run FurMark and Unigine Heaven tests with no problems at all. Just to be sure, I updated the NVIDIA drivers, but the problem persists. (I just update them through NVIDIA's software, without using Display Driver Uninstaller.) I can hear some noise (rather mild, but audible) from the PC, most likely from the case fans, but googling 'do malfunctioning PC fans cause PC crashes' didn't turn up anything matching.

I find that the only way to recover is to do a render using only one GPU; that seems to 'reset' things, and then I can render with both GPUs until the next freeze, when I have to render with one GPU again, and the chain goes on.

Checking the reliability history didn't turn up anything either, except 'Windows shutdown was unexpected.' However, on the day before the 4th freeze occurred, reliability showed this error:

Problem Event Name: LiveKernelEvent
Code: 144
Parameter 1: 3003
Parameter 2: ffff82051985b538
Parameter 3: 40010000
Parameter 4: 0
OS version: 10_0_22000
Service Pack: 0_0
Product: 768_1
OS Version: 10.0.22000.2.0.0.768.101
Locale ID: 18441

And another one like this:

Problem Event Name: LiveKernelEvent
Code: 141
Parameter 1: ffffce8a7dbdc010
Parameter 2: fffff8008f7c0f90
Parameter 3: 0
Parameter 4: 40ac
OS version: 10_0_22000
Service Pack: 0_0
Product: 768_1
OS Version: 10.0.22000.2.0.0.768.101
Locale ID: 18441

I'll be asking Microsoft about these errors. From googling, I found that the first LiveKernelEvent is usually related to the graphics driver, and the standard advice is just to update the drivers, while googling the second LiveKernelEvent turned up nothing at all.

I'm begging any kind, experienced pros to guide me here. I'm completely lost and don't know where the problem lies. Any help is greatly appreciated.

Attachments: iray settings.png (831 x 1823, 182K), log file.txt (78K)
Post edited by RobertDy on

Comments

  • Richard Haseltine Posts: 97,041

    My suspicion would be that the PSU is failing to meet the (probably fairly heavy) demands on the system when both cards are working. Is it possible that the new Samsung drive draws more power than the drive it replaced? If the load was nudging the upper limit before that might just tip it over the edge, although if it was that marginal I'd have thought it likely to be a more intermittent effect.

  • Dim Reaper Posts: 687

    I am running the same two GPUs in my system, and thought that some figures might help you check your power use and test Richard's theory.

    CPU: Intel i7 5960X 
    System Memory: 32GB KINGSTON HYPER-X PREDATOR QUAD-DDR4
    GPU 1: RTX 3090
    GPU 2: RTX 2080Ti
    1x M.2 drive, 5x HDD, 3x SSD
    PSU: Corsair HX1200

    With a scene loaded, but not rendering, the system draws around 235W from the power socket.
    During rendering with GPU-only, the system draws around 748W.
    Rendering with GPU and CPU draws around 790W.

    I'm sure that your system (with the more powerful CPU) will draw more power when your CPU is in use.

  • RobertDy Posts: 236
    edited May 2022

     

    Richard Haseltine said:

    My suspicion would be that the PSU is failing to meet the (probably fairly heavy) demands on the system when both cards are working. Is it possible that the new Samsung drive draws more power than the drive it replaced? If the load was nudging the upper limit before that might just tip it over the edge, although if it was that marginal I'd have thought it likely to be a more intermittent effect.

    Dim Reaper said:

    I am running the same two GPUs on my system, and thought that some info might help to check your power use in order to check Richard's theory.

    I'm sure that your system (with the more powerful CPU) will draw more power when your CPU is in use.

    Thanks Richard and Dim Reaper for your replies; I never considered that angle. My PSU is a 2000W FSP Cannon. From some googling, it should theoretically be more than sufficient. But just to be sure, how do I monitor power consumption during rendering?
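    For the 'theoretically more than sufficient' part, here's the back-of-envelope math. The TDP figures are spec-sheet values; the 2x transient allowance for the GPUs is an assumption, since high-end cards can spike well above TDP for milliseconds:

    ```python
    # Rough power-budget sanity check against a 2000 W PSU.
    # TDPs below are published specs; the spike factor is an assumed worst case.
    PSU_WATTS = 2000

    gpu_tdp = {"RTX 3090": 350, "RTX 2080 Ti": 250}
    cpu_tdp = 280    # Threadripper 3960X
    overhead = 150   # motherboard, RAM, drives, fans: rough guess

    sustained_watts = sum(gpu_tdp.values()) + cpu_tdp + overhead
    spike_watts = sum(w * 2 for w in gpu_tdp.values()) + cpu_tdp + overhead

    print(f"Sustained estimate: {sustained_watts} W of {PSU_WATTS} W")
    print(f"Worst-case transient estimate: {spike_watts} W of {PSU_WATTS} W")
    ```

    Even the worst-case transient estimate stays well under 2000 W, so a correctly working PSU of that size should cope; a degrading unit is a different matter.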

    Post edited by Richard Haseltine on
  • Dim Reaper Posts: 687
    edited May 2022

    I would imagine that a 2000W PSU is more than sufficient, and I'd be very surprised if your PC were drawing anywhere near that.

    I monitor power consumption with a meter that plugs directly into the wall socket, with the PC then plugged into the meter. Search for 'power meter electricity usage monitor', though with the PSU you have it's probably not necessary.

     

    As your next step, I would suggest downloading GPU-Z and keeping its monitoring window visible while rendering to watch the temperatures. Unfortunately, I don't have any better suggestions.
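    If you'd rather log to a file than watch GPU-Z, `nvidia-smi` can sample power and temperature on a loop, e.g. `nvidia-smi --query-gpu=timestamp,name,power.draw,temperature.gpu --format=csv -l 5 > gpu.log`, and a small script can pull out the peaks afterwards. A sketch, with sample lines in the shape that CSV mode emits (check the header of your own log; the exact field layout here is an assumption):

    ```python
    import csv, io

    # Sample lines in the shape nvidia-smi's CSV mode emits for
    #   --query-gpu=timestamp,name,power.draw,temperature.gpu
    sample_log = """timestamp, name, power.draw [W], temperature.gpu
    2022/05/20 12:00:00.000, NVIDIA GeForce RTX 3090, 102.50 W, 45
    2022/05/20 12:00:05.000, NVIDIA GeForce RTX 3090, 347.10 W, 83
    2022/05/20 12:00:00.000, NVIDIA GeForce RTX 2080 Ti, 95.00 W, 41
    2022/05/20 12:00:05.000, NVIDIA GeForce RTX 2080 Ti, 248.30 W, 73
    """

    peaks = {}  # card name -> (max power in W, max temp in C)
    reader = csv.reader(io.StringIO(sample_log))
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue
        _, name, power, temp = [field.strip() for field in row]
        watts = float(power.split()[0])  # "347.10 W" -> 347.10
        celsius = int(temp)
        best = peaks.get(name, (0.0, 0))
        peaks[name] = (max(best[0], watts), max(best[1], celsius))

    for name, (watts, celsius) in peaks.items():
        print(f"{name}: peak {watts:.1f} W, peak {celsius} C")
    ```

    A peak that lines up with the freeze times would point at one card; peaks that look normal would push suspicion back toward the PSU or motherboard.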

     

    Post edited by Dim Reaper on
  • RobertDy Posts: 236

    Dim Reaper said:

    I would imagine that a 2000W PSU is more than sufficient and very surprised if your pc is using anywhere near that.

    I monitor the power consumption with a power meter that plugs directly into the socket, and then the device (pc) plugs into that. Search for Power Meter Electricity Usage Monitor, but with the PSU you have it's probably not necessary.

    As your next step, I would suggest downloading GPU-Z and have it showing the monitoring whilst rendering to take a look at temperatures. Unfortunately, I don't have any better suggestions.

     

    I did try that for a render (one that didn't freeze). The RTX 3090 hit 83C/181F, while the 2080 Ti peaked around 73C/163F or slightly lower. Are these critically high temps? (One freeze occurred even at temps well below this, though.)
  • Dim Reaper Posts: 687

    Both of those temperatures are normal.

  • RobertDy Posts: 236

    Someone? Anyone?

  • PerttiA Posts: 9,524

    What happens if you remove the check mark for the 2080 Ti in the rendering devices?

    Unexplained and sudden crashes could also point to defective RAM; have you stress tested your RAM and GPUs?

  • RobertDy Posts: 236
    edited May 2022

    PerttiA said:

    What happens if you remove the check mark for the 2080TI in rendering devices?

    Unexplained and sudden crashes could also point to defective RAM, have you stress tested your RAM and GPU's?

    When I render with either card alone, everything is fine; it's only when both are used that the occasional freeze occurs. I've run FurMark (completely smooth) and Unigine Heaven (stuttering only a few times, less than a second per stutter) on the GPUs. I also just ran Windows Memory Diagnostic on the RAM: no issues.

    One thing though: I got a crash when using DAZ 4.20 with an outdated driver. Now that I've updated to the latest driver, 4.20 seems fine (so far), but freezes still occur with 4.16. So I'm wondering whether older versions of DAZ might be incompatible with the latest NVIDIA drivers?

    Post edited by RobertDy on
  • Richard Haseltine Posts: 97,041

    If both cards together can freeze when other combinations are OK, that sounds very much like a power issue.

  • PerttiA Posts: 9,524

    OCCT has been used successfully to find problems that other stress tests or games fail to catch.

    https://www.ocbase.com/

     

  • RobertDy Posts: 236
    edited May 2022

    PerttiA said:

    OCCT has been successfully used to find problems other stress tests or games have failed to find.

    https://www.ocbase.com/

    (EDIT) Please see post below.

    Post edited by RobertDy on
  • RobertDy Posts: 236
    edited May 2022

    PerttiA said:

    OCCT has been successfully used to find problems other stress tests or games have failed to find.

    https://www.ocbase.com/

    Sorry to double post - I ran OCCT's 3D test for 10 minutes at shader complexity 8. In the 3rd minute it reported a WHEA error, and around the 8th minute the system froze. I think you may be on to something. What is this WHEA error? I looked it up and there seem to be all kinds of causes: vcore, hardware, BIOS (EDIT: even the PSU, so Richard might be right), but I don't know how to pin it down. I also recently noticed that my fans, and possibly a GPU fan, sometimes make rattling noises; might that be the cause?

    Post edited by RobertDy on
  • PerttiA Posts: 9,524

    Taking into consideration what you have told us, if you are not overclocking any components in your system, the most likely culprit would be the PSU.

    Noise usually comes from the fans (or from spinning hard drives, if you still have any), but if several fans start making noise at once, it is unlikely that they are all failing at the same time. A failing fan can, however, cause overheating.

  • RobertDy Posts: 236

    PerttiA said:

    Taking into consideration what you have told us, if you are not overclocking the components in your system, the most likely culprit would be the PSU.

    Noise comes usually from the fans (or spinning harddrives if you still have them), but if several fans start making noise, it is unlikely that they are all failing at the same time. Failing fan can cause overheating.

    Thanks for all the feedback. I took my PC to the shop where I bought it, and they said the GPUs and PSU were fine; the culprit is the motherboard. I don't entirely trust their word, since after digging deeper I found many customers with experiences like mine and the company may not be that honest, but this is all I have to rely on for now. I'll see how it goes with a new motherboard and update this thread again.

    Once again, thanks everyone who helped. You guys make an amazing community.
