Render crashes to CPU
areg5
Posts: 617
This one is kind of weird: I am rendering a scene in the finished basement of this product: https://www.daz3d.com/collective3d-long-island-mansion. I should preface this by saying that to use the product at all, in any room, I had to use the Scene Optimizer to reduce the texture resolutions or it won't render at all. For most of the rooms I cut the resolution to a half or a quarter. I'm working on a scene in the finished basement. The render takes at least an hour to do the empty room. Not only that, but whereas other rooms in the house are finished after 10k iterations, the basement doesn't finish in less than 30k. So I set the iterations to 50k just to be on the safe side.
But here's the real issue: after about 30 minutes the render often crashes to CPU. If I turn off the 980 Ti it's more likely to finish. What I'm thinking is that this might be a hardware issue. Namely, I think the motherboard is having issues with the lane traffic when the render is too long. But I don't know. I tried adjusting the BIOS so that each PCIe slot will only run at 4X rather than auto (this was recommended to me for a different issue at the Tom's Hardware site). I tried just using one card, which takes twice as long and still crashes to CPU. I tried turning OptiX off.
If anyone has any thoughts on what this issue might be I would appreciate it.

Comments
I'd recommend trying to get as much data as you can first to help you sort it out. Otherwise you can spend hours/days running down rabbit holes based on hunches.
First I'd try looking into the iray log file and see if you can decipher what happened (Help/Troubleshooting/View Log File).
I'd also monitor the Task Manager GPU and CPU information to see what's happening to the system RAM, CPU utilization, GPU VRAM and utilization, etc., as the scene loads and rendering progresses.
I'm not sure I understand concerns about PCI lane traffic and speeds. I'm sure there are many of us who have successfully done GPU renders for many hours without issues with PCI traffic and speeds.
Also, it might help if you post some more info on your system, like how much RAM you have, how much VRAM your GPU has and how much is utilized, etc.
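The monitoring suggested above can also be scripted, so the numbers are still there after a crash. This is only a minimal sketch, assuming NVIDIA's nvidia-smi tool is installed and on the PATH. The function names, the sample count and the 5-second interval are made up for illustration, not part of any Daz Studio or Iray tooling.

```python
import subprocess
import time

# Fields requested from nvidia-smi's --query-gpu option.
QUERY = "index,name,memory.used,memory.total,utilization.gpu,temperature.gpu"


def _to_int(value):
    """Convert a CSV field to int; return None for fields like '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 6:
            continue  # skip anything that does not look like a data row
        idx, name, used, total, util, temp = parts
        rows.append({
            "index": _to_int(idx),
            "name": name,
            "vram_used_mb": _to_int(used),
            "vram_total_mb": _to_int(total),
            "util_pct": _to_int(util),
            "temp_c": _to_int(temp),
        })
    return rows


def poll_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def log_gpus(samples=10, interval_s=5.0):
    """Print a timestamped reading for every GPU, `samples` times."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for gpu in poll_gpus():
            print(stamp, gpu)
        time.sleep(interval_s)
```

Calling `log_gpus()` in a separate console while the render runs would leave a trail of VRAM, utilization and temperature readings to compare against the time of the crash in the Daz Studio log.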
I'll look at the log files. All of my system information is listed in my signature at the bottom. I have 32 gig RAM, the 1080 Ti's have 11 gig VRAM, the 980 Ti has 6. I'm trying to render a scene at the moment that I'm having trouble with, and the VRAM utilization is 7358 MB. As for the lanes, when using 3 PCIe slots on my board, each slot can't run faster than 4X.
So you have 3 cards (2x 1080ti and 1x 980ti), but if you try with only one card it still goes to CPU?
As with anything, I'd recommend you first simplify and see if the basics work. With only one 1080ti, what happens? What are the numbers for CPU, GPU, RAM and VRAM utilizations? If the scene isn't filling up your RAM and VRAM but it's crashing to CPU then there's something strange going on. Also, have you received any driver updates recently? Is this something that only started recently?
Also, when you say "crashing to CPU" do you mean the GPU utilization goes to 0 (as measured by Task Manager or something else) and the CPU utilization goes to 100%?
Yes, I mean the GPU utilization goes to zero. I'll try each card individually. Maybe I just have a bad card. I noticed it started when I began doing these particular scenes in the finished basement of the collective 3d Long Island Mansion.
As for the driver: I run Windows 10. My current driver is the one from March. I tried to update the driver 3 times in May, and again with the latest one from June 5. There seems to be an issue with those newer drivers. I mean, they work ...but at some point while I'm working I get that Windows 10 "has to restart" blue screen. The only driver that doesn't make my computer restart is the one from March.
Just a thought. Check your log file after a render. I often get carried away with loading my scene up with lots of stuff. After loading up geometries and textures, if you run out of VRAM, you will default back to CPU rendering. I had a 1080, 1070, and a 1050 Ti running, and the 1050 Ti naturally would bounce out if using over 4GB of VRAM. It's pretty easy to go over the 8GB mark on your 1080 Ti. I do it all the time......
In my experience, if your scene is too large it never uses the GPU to begin with. In my case, it'll render for 5-10 minutes then go to CPU. It does seem scene related.
Here is my log file from the point of failure:
2018-06-18 20:36:26.580 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.6 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): Kernel [1] failed after 0.002s
2018-06-18 20:36:26.580 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.5 IRAY rend error: CUDA device 0 (GeForce GTX 1080 Ti): Kernel [10] failed after 0.005s
2018-06-18 20:36:26.596 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.6 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): an illegal memory access was encountered (while launching CUDA renderer in core_renderer_wf.cpp:807)
2018-06-18 20:36:26.596 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.5 IRAY rend error: CUDA device 0 (GeForce GTX 1080 Ti): an illegal memory access was encountered (while launching CUDA renderer in core_renderer_wf.cpp:807)
2018-06-18 20:36:26.596 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.6 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): Failed to launch renderer
2018-06-18 20:36:26.596 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.5 IRAY rend error: CUDA device 0 (GeForce GTX 1080 Ti): Failed to launch renderer
2018-06-18 20:36:26.597 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.8 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): Device failed while rendering
2018-06-18 20:36:26.597 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.9 IRAY rend error: CUDA device 0 (GeForce GTX 1080 Ti): Device failed while rendering
2018-06-18 20:36:26.597 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.8 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): an illegal memory access was encountered (while initializing memory buffer)
2018-06-18 20:36:26.597 WARNING: dzneuraymgr.cpp(307): Iray ERROR - module:category(IRAY:RENDER): 1.8 IRAY rend error: CUDA device 1 (GeForce GTX 1080 Ti): an illegal memory access was encountered (while de-allocating memory)
2018-06-18 20:36:26.597 WARNING: dzneuraymgr.cpp(307): Iray WARNING - module:category(IRAY:RENDER): 1.9 IRAY rend warn : All available GPUs failed.
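A log like the one above is easier to read when the messages are grouped by card. Here is a minimal sketch that does that, assuming the lines keep the "IRAY rend error: CUDA device N (name): message" shape shown in the excerpt. The function name is hypothetical and the pattern has only been matched against the lines quoted in this thread.

```python
import re
from collections import defaultdict

# Matches the tail of an Iray error line, e.g.
# "CUDA device 1 (GeForce GTX 1080 Ti): Kernel [1] failed after 0.002s"
DEVICE_ERROR = re.compile(r"CUDA device (\d+) \(([^)]+)\): (.+)$")


def summarize_iray_errors(lines):
    """Group Iray CUDA error messages by (device index, device name)."""
    by_device = defaultdict(list)
    for line in lines:
        if "IRAY rend error" not in line:
            continue  # ignore warnings and unrelated log lines
        match = DEVICE_ERROR.search(line.strip())
        if match:
            index, name, message = match.groups()
            by_device[(int(index), name)].append(message)
    return dict(by_device)
```

Fed the excerpt above, this would list errors only for devices 0 and 1, both 1080 Ti's. The 980 Ti does not appear in the quoted lines at all, which is worth checking against the rest of the log.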
Ok ...looks like this issue in my case was scene dependent. With all of the characters at base resolution (including the hair), with the house resolution reduced further using Scene Optimizer, and with any objects not in the shot deleted, all of my cards run. Throughout the crashes, GPU-Z was showing that I was well within the 11 gig VRAM of my 1080 Ti's. So it never should have crashed. So what's up with that? The cards were running around 77 C so they were not overheating.
The answer (I think) is that the lane traffic through my PCIe slots overwhelmed my mobo and i7. My system could not sustain it. When I switched my BIOS to auto, allowing the first PCIe slot to run at 16X, the card plugged into that slot finished the render if used alone. With 3 cards on auto, the slots throttle down, reducing the lanes available to each. I discussed this with some pretty knowledgeable people at Tom's Hardware. I asked the question about a build I have planned for an Iray rendering server using 3-4 cards. I was told that my current build, which they referred to as consumer grade, won't be able to handle the lane traffic reliably for 3 cards. I was told to build something server grade, using an LGA 2066 Intel X299 board with an i7-7800X Skylake chip. That should be able to handle 3 cards (and hopefully not crash to CPU), because it allows the slots to run faster without reducing the available lanes. I haven't started the build yet, but I plan to by the end of the year. I guess if I go to all of that time and expense and it still quits on me, then that isn't the answer. We'll see.
I think the way to think about it is that although the video cards could handle the load, my i7-4790/Z97X combo could not. They talk about "available lanes" when they talk about PCIe slot traffic. So the explanation of the difference is that in my current build, I have 16 available lanes. When 3 slots are used, the traffic goes 8x/4x/4x. The 2066 board has 44 lanes, and would run at 16x/12x/8x/8x. I figure that the rendering runs down those lanes, so on my current build there is a bottleneck at the PCIe lane CPU interface. The support for this is that when I run 1 card in the first slot, it'll run at 16x and it finishes. If I run cards in the first 2 slots it should run at 8x/4x, which usually is sufficient but in complex scenes it is not. The lane traffic uses the chip for processing even if the chip itself isn't rendering. So I'm hoping if I allow for more lane traffic this problem will just go away on bigger complex scenes and allow my cards to run at full capacity. That's my theory.
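To put rough numbers on the lane widths mentioned above, here is a small sketch of the theoretical per-slot bandwidth. It assumes every slot is running as a PCIe 3.0 link (8 GT/s per lane with 128b/130b encoding), which is an assumption about this board and not something confirmed in the thread. It says nothing about how much of that bandwidth Iray actually uses during a render.

```python
GT_PER_S = 8.0        # PCIe 3.0 raw rate per lane, in gigatransfers per second
ENCODING = 128 / 130  # 128b/130b line encoding: 128 payload bits per 130 sent


def pcie3_bandwidth_gb_s(lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    # Each transfer carries one bit per lane; divide by 8 to go from bits to bytes.
    return lanes * GT_PER_S * ENCODING / 8


# Link widths quoted in the post above (assumed to be PCIe 3.0).
for width in (16, 8, 4):
    print(f"x{width}: {pcie3_bandwidth_gb_s(width):.2f} GB/s")
```

Under that assumption the links work out to roughly 15.75, 7.88 and 3.94 GB/s for x16, x8 and x4. Those are ceilings for moving data to and from the card, so the figures alone cannot confirm or rule out the lane theory.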
That I don't know, but when you think about it you have 3 cards rendering the same scene essentially. Those 3 elements have to be merged by the processor, and it could be that the speed issue (meaning the difference in speed of the lanes) can cause faults for longer renders. Again, this is just my theory.
If they are not faulty, the MB and CPU should be fine.
Try with the simplest of scenes: a cube with Iray shaders. Then try with just a figure; see if they render on either 1080 Ti and/or the 980 Ti.
I don't think they're faulty. I just think the MB has limits. Simple scenes aren't an issue regardless of which card I use, in any combination.
I certainly agree that it's strange. But another thing that's strange is that it seems to be very scene related. I've done renders well in excess of an hour with no problems. Certain products are a killer, like this one, the Collective3d Long Island Mansion. I don't know what it is about it but I think the textures are just too big. Like I said, if I don't scene optimize it, it won't run at all. I've done renders of, say, just characters in a different scene in excess of 10 gig by GPU-Z, and they go off without a hitch.
I wasn't planning a new build because of the occasional crash to CPU. I just want a dedicated rendering server so I can continue to make scenes on my current build while rendering on the other. I was just curious whether I was going to have a problem like this on the new build. The guy at Tom's Hardware was a bit vague, only saying that a 2066 build would perform better. If the cards are the same in both builds, I think "perform better" may mean not having issues like the one I'm having now.
Another observation: OptiX increases the VRAM utilization. Now, we are told that the entire scene is used by all GPUs, so I would expect the VRAM utilization to be exactly the same on all cards, but it isn't. I'm currently doing a scene where my first card is using 6990 MB of VRAM, and the second is using 6490 MB. I've also noticed that although my 980 Ti has a max VRAM of 6 gig, as a third card in my system, sometimes in a scene like I'm doing now it will see under 6 gig and run, and sometimes it won't. Sometimes it runs for a bit and then the whole render crashes to CPU. So clearly all cards are not seeing the same VRAM utilization from the same scene. The first card in slot 1 ALWAYS sees more VRAM used than the other cards, and if you are close to the limit of the card then turning OptiX off allows the render to finish. There seems to be this unstable zone around where the max VRAM is, say in the 980 Ti, like 5400 MB (just as an example). It may run below that number or above it, but there always seems to be this narrow zone of instability that can crash it.
As do mine, except that I would have thought that the usage would be identical, and it isn't. Also unexplained is: if my 1080 Ti's see a scene as 7 gig and 6.7 gig respectively, then the scene should be too large to be used by my 980 Ti at all, and yet GPU-Z tells me that it's running full bore and using 4.9 gig. So it appears that the usage is not simple or cut and dried.
I've been using a 3 card build for at least a year. My prior build was 980 Ti x2 and 780 Ti, and I saw the same sort of thing. If a scene was around 4 gig, my first 980 Ti would run at 4, my second at 3.7 and my 780 Ti at 2.8 or so. So I've seen this sort of thing. I just don't know why. The 780 Ti, with 3 gig of VRAM, shouldn't be able to run at all on a 4 gig scene.
Based on what I've observed, it seems like the process is something like this. If somebody knows better please enlighten me:
I'm especially unclear on the first couple of steps above, since when you load a big scene, before you hit Render, the scene is visible, but doesn't load up system RAM very much. Not sure how that happens...
It seems that the figures reported by the preparation stage are for uncompressed textures, so the resemblance between their total and your GPU RAM is coincidental. DS prepares the scene by making a version that is pure geometry and materials, including textures (Iray doesn't care about rigging or morphs), and sends that to the GPU.
Ahhh...thanks...That's what I was trying to figure out, what scene stuff was unnecessary to transfer when rendering.
BTW, I'm trying to figure out what "uncompressed" means when it comes to textures. Presumably most are JPGs, which by definition are compressed. And I'd think the renderer needs high-res textures to calculate the image. So if you compress the already-compressed JPG textures you're losing definition in your image.
Hmmm.....
Iray has its own compression routines; the thresholds at which they are applied are set in the Advanced tab of Render Settings.
Interesting theory. I have 32 gig of RAM, and like you say, when I'm doing a large scene, my RAM loads to around 20 gig, or 2 gig less than the combined VRAM of my 1080 Ti's. However, when I run 3 cards, it still only loads to 20 gig, and by your theory it should load to 25 gig. On this particular scene, the 980 Ti ran for a while then crashed. On a smaller scene, one that allows the 980 Ti to run to completion, the RAM only filled to 11.9 gig. The cards are seeing 5.5 gig, 5.1 gig and 4.1 gig respectively. So 14.7 gig all told. Minus 3 gig, 1 for each card, that's 11.7 gig, close to the 11.9 gig of RAM ... so it seems that there are other calculations that the system is doing. In the first case, something is telling the system that the scene is too large for the 980 Ti, and so the RAM loads to accommodate the 1080 Ti's. Why the 980 Ti runs at all for a while is still a mystery. It should have been precluded from running at all, and maybe it would have been if the scene was a gig larger. When it's allowed to run, sometimes it finishes...but when it runs for a bit and then fails, all 3 cards crash to CPU. When it's not allowed to run at all, the 2 bigger cards finish the render. That may answer my crashing question. The scenes were in that in-between place that almost allows all 3 cards to run, but not quite.
It is noteworthy that my RAM stays at whatever the load size was until the render is done. I would have thought it would flush itself but it doesn't do that until the scene is completed. I batch render, and when the scene finishes, the RAM flushes and then loads up again for the next scene.
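The RAM arithmetic in the post above can be written out as a quick check. This is only a sketch of that back-of-the-envelope estimate, using the figures quoted in the post. The 1 gig per card deduction is the poster's own guess, not a documented Iray or Daz Studio behaviour, and the function name is made up for illustration.

```python
def estimated_ram_gb(vram_per_card_gb, overhead_per_card_gb=1.0):
    """Sum per-card VRAM use after subtracting a guessed fixed overhead per card."""
    return sum(v - overhead_per_card_gb for v in vram_per_card_gb)


cards_gb = [5.5, 5.1, 4.1]   # VRAM use reported for the three cards
observed_ram_gb = 11.9       # system RAM use reported for the same scene

estimate = estimated_ram_gb(cards_gb)
print(f"estimate {estimate:.1f} gig vs observed {observed_ram_gb} gig")
```

The estimate comes to 11.7 gig against the 11.9 gig observed. One matching data point is thin evidence, so it would take readings from a few more scenes to tell whether the relationship holds or is a coincidence.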
That's not what I'm saying. I *think* that the scene loads into RAM based on the scene size and available RAM and/or virtual RAM space, not the GPU VRAM size. I doubt it checks VRAM first before loading. If there's enough system RAM, then it loads. After that, Iray kicks in and checks the VRAM availability, lumps all the cards together into one block, and decides along with W10 VidMm and VidSch how it will allocate the Iray/compressed version of the scene to the VRAM. Doesn't mean it has to do it equally or anything like that. Maybe if you have 2 x 1080 Ti it uses them first, and ignores the 3rd card if it doesn't need the extra VRAM. It might also depend on which card is running the monitor or other apps using some VRAM, etc. I doubt it's a simple scheduling chore.
And as far as your RAM usage staying at what it was when it loaded, yeah, that's how I think it works. It takes the scene from disk, then fills the system RAM as needed, and once it's loaded from disk into RAM there's no need to change it, unless there's some optimization that goes on in RAM or something.
It sounds like there are still a bunch of questions that need to be answered about why you're having the problem. Is there a memory problem with one of the GPU's like the iray log implies? Is there a messed up driver? Have you done any narrowing-down like has been suggested? Otherwise you can guess and run on hunches forever and not find a solution.
Actually, DS can't tell if a scene is going to fit - it just offers it up. That's why we can't get a reliable estimate before rendering.
The issue is so sporadic that I don't think it's a hardware problem, and so far it really is limited to this one particular product I've been using. Each card runs fine when individually tested. It could be a 3 card issue. As far as the driver goes, I'm running 391.35 which came out in March. I've tried the last 3 or 4 updates, and each one would eventually cause my system (Windows 10) to shut down, with that "we're restarting your system" message. When I looked up the error code it said it could be driver related. So each time an update comes along, I try it, and if my system does the restarting thing I roll it back. Do you think there's any possibility that it could be related to the anti-virus? I run Avast and Malwarebytes Premium.
Again, start simple. Completely remove the video drivers using DDU. Remove unnecessary components.
Don't assume.
It can be sporadic AND a hardware problem. Maybe it's a certain memory location that's only accessed with certain sized scenes. You have no way of knowing. UNLESS you do as I suggested and test everything, including your GPU VRAM, system RAM, hard drive, etc.
Make sure the basics are okay, testing along the way. Let the results guide you; don't be guided by assumptions and hunches. If you're worried about some software getting in the way, uninstall it for now. Can it be related to Avast or Malwarebytes? Of course, it can be related to anything. Heck, not long ago Malwarebytes took over users' computers and grabbed ALL their RAM like a virus. Computers and software are very very complicated. I know we love to make everything a simple answer, but often it's not. So the only way to deal with complexity is to simplify.
BTW, have you checked the W10 Security & Maintenance log to see what's happened lately to your computer? Data like that can help you figure out what's going on a lot better than guesses.