cojoMan

SLI for openCL


Yes, sorry I should have described my CPU.  It's just my regular (2 year old?) dev machine, dual Intel Xeon X5650.  So total 12 hardware CPU cores.

 

Also, I should probably redo this test, since we have further improved GasProjectNonDivergentVariational's handling of empty, non-fluid voxels in the fluid domain.  That means there's less "other stuff" in the pressure solve and more time spent solving the linear system itself, which is the part that is OpenCL-accelerated.


thanks for sharing your results, and the detailed analysis.

would you possibly like to share your scene with us so we could have similar starting points for our tests  ?...

 

also, how did you measure the pressure linear time solve, and all the other separate components ?


thanks for sharing your results, and the detailed analysis.

would you possibly like to share your scene with us so we could have similar starting points for our tests  ?...

 

Sure, for some reason I can't attach files to messages in this forum, but I've uploaded it here.  Note that the fluid has highly varying viscosity, which makes the linear system harder to solve (takes more iterations to converge = better acceleration from OpenCL on the GPU).  With a fluid of lower, uniform viscosity the speedup would be less pronounced.
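
The point about harder systems benefiting more can be made concrete with Amdahl's law: only the "Solving Linear System" portion is accelerated, so the overall gain depends on how much of the step that portion occupies. A minimal sketch, using the H14 viscosity timings quoted later in this thread (82 total, 72 in the linear solve, 12.3 for the same solve under OpenCL); the function name is my own and not anything from Houdini:

```python
def overall_speedup(total_time, accelerated_time, accelerated_time_after):
    """Amdahl-style estimate: only part of the step gets faster.

    total_time             -- time for the whole step before acceleration
    accelerated_time       -- portion of total_time spent in the accelerated part
    accelerated_time_after -- time that same part takes once accelerated
    """
    untouched = total_time - accelerated_time      # "other stuff" that is not accelerated
    new_total = untouched + accelerated_time_after
    return total_time / new_total


# Viscosity step: 82 total, 72 of it in the linear solve, 12.3 with OpenCL.
viscosity = overall_speedup(82, 72, 12.3)

# Hypothetical easier system where the solve is only half of the step
# and gets the same 72 / 12.3 per-solve speedup.
easier = overall_speedup(82, 41, 41 / (72 / 12.3))

print(f"solve-dominated step:   {viscosity:.2f}x")
print(f"solve is half the step: {easier:.2f}x")
```

With these inputs the estimate for the whole viscosity step comes out at roughly 3.7x, in line with the reported figure, while the hypothetical easier system gains far less even though the solve itself speeds up by the same factor.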

 

also, how did you measure the pressure linear time solve, and all the other separate components ?

 

 

The Performance Monitor has sub-tasks within several of the FLIP microsolvers (DOPs) that give more detailed timing info.  For the pressure and viscosity solve there are entries for "Solving Linear System" which are the OpenCL accelerated parts.

 

Also I just remembered that for a viscous fluid we solve for pressure twice: once before and once after solving for viscosity.  This is another reason viscous fluid sims show a bit more OpenCL speedup than non-viscous.

 

Edit:  Oh, and it's also possible to run the viscosity solve in 32-bit or 64-bit now (under the Viscosity tab).  It defaults to 64-bit since it's more accurate, but you can get away with 32-bit in a surprisingly large number of situations (this test file for example).  That can save you another 20% or so on the viscosity solve on the GPU, possibly more on a consumer-level GPU that doesn't have as good 64-bit float support.
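
To give a rough feel for the 32-bit versus 64-bit trade-off, the sketch below rounds a value to single precision and compares storage sizes. It uses only Python's `struct` module and says nothing about how the viscosity solver itself is implemented; the grid size is an arbitrary assumption.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
rel_error = abs(to_float32(value) - value) / value

# Storage for one scalar per voxel on a hypothetical 256^3 grid.
voxels = 256 ** 3
mib_32 = voxels * struct.calcsize("f") / 2 ** 20
mib_64 = voxels * struct.calcsize("d") / 2 ** 20

print(f"relative rounding error in 32-bit: {rel_error:.1e}")
print(f"256^3 scalar field: {mib_32:.0f} MiB (32-bit) vs {mib_64:.0f} MiB (64-bit)")
```

Single precision carries about 7 significant decimal digits and halves the memory traffic; whether that is enough depends on the sim, as noted above.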

Edited by johner
Guest tar

As a side note: testing OpenCL in Houdini 14 on the Nvidia GTX 980 on OS X is lacklustre. It appears Apple have optimised their AMD drivers in OS X Yosemite and left the Nvidia ones to rot, i.e. the lesser card, the AMD 7950, is currently far superior for OpenCL.

 

Edit: Bug logged #67161. OpenCL + Nvidia + OS X is currently twice as slow as Linux Ubuntu. More investigation needed - wrong card selected in OpenCL environment preferences.

Edited by tar

My config for testing was an i7-4930K @ 3.4GHz, and the graphics card was a Sapphire Radeon R9 280X (3GB GDDR5 @ 6GHz, 2048 stream processors).

Other than that, 64GB RAM, and an SSD to write to.

I thought at first that writing to the SSD would bias the results, but since I am not necessarily comparing my numbers to yours, but rather comparing H13, H14 CPU and H14 GPU and looking at percentage increases, I guess that's fine.

 

So, in the same format as yours, and keeping your scores here for reference:

 

H13: 183 (yours 232) 

H14 CPU: 143 (yours 146) - speedup vs H13 = 1.25x (yours 1.6x)

H14 OpenCL: 71 (yours 74) - speedup vs H13 = 2.5x (yours 3.1x), speedup vs H14 CPU = 2.0x (yours 2x)

 

GasViscosity DOP: 

H13: 120 (yours 165)

H14 CPU: 90 (yours 82) - speedup vs H13 = 1.33(yours 2x) 

H14 OpenCL: 27 (yours 22) - speedup vs H13 = 4.4(yours 7.5x), speedup vs H14 CPU = 3.3(yours 3.7x)

 

For just the viscosity linear system solve itself: 

H13: 92(yours 132)

H14 CPU: 82(yours 72) - speedup vs H13 = 1.15(yours 1.8x) 

H14 OpenCL: 18(yours 12.3) - speedup vs H13 = 5.11(yours 10.7x), speedup vs H14 CPU = 4.5(yours 5.9x) 

 

GasProjectNonDivergentVariational times: 

H13: 11(yours 30) 

H14 CPU: 7(yours 28) 

H14 OpenCL: 5 (yours 14)

 

Just the pressure linear system solve: 

H13: 5(yours 16 )

H14 CPU: 4(yours 16 )

H14 OpenCL: 1(yours 3.25 )
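
The speedup figures above are simply ratios of the reported times. A small sketch that recomputes them from the first three blocks of this post; since the times are rounded, the ratios can differ slightly from the hand-computed figures:

```python
# Times reported in this post, in the same units as the list above.
# The first block of numbers carries no label in the post.
timings = {
    "first block":            {"H13": 183, "H14 CPU": 143, "H14 OpenCL": 71},
    "GasViscosity DOP":       {"H13": 120, "H14 CPU": 90, "H14 OpenCL": 27},
    "viscosity linear solve": {"H13": 92, "H14 CPU": 82, "H14 OpenCL": 18},
}


def speedup(baseline, new):
    """How many times faster `new` is than `baseline` (both are durations)."""
    return baseline / new


for name, t in timings.items():
    cpu_vs_h13 = speedup(t["H13"], t["H14 CPU"])
    ocl_vs_h13 = speedup(t["H13"], t["H14 OpenCL"])
    ocl_vs_cpu = speedup(t["H14 CPU"], t["H14 OpenCL"])
    print(f"{name}: CPU vs H13 {cpu_vs_h13:.2f}x, "
          f"OpenCL vs H13 {ocl_vs_h13:.2f}x, OpenCL vs CPU {ocl_vs_cpu:.2f}x")
```

Note that a speedup factor of 2.5x corresponds to a 150% increase in speed, which is how the two ways of quoting results in this thread relate.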

 

 

So trying to make sense of these scores - my CPU is a hexacore, multithreaded – does that mean 12 cores to take into account for Houdini, or 6 ?...

 

So trying to make sense of these scores - my CPU is a hexacore, multithreaded – does that mean 12 cores to take into account for Houdini, or 6 ?...

 

 

A six core processor is a six core processor. Hyper-Threading from Intel doesn't double the number of cores. It slightly improves the performance of applications that scale well to many cores, typically 5-10% difference. Thanks for posting the results for your setup. If I have time I'll run the same tests on my machine here.
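
For checking what a given machine reports, Python's `os.cpu_count()` returns the number of logical CPUs, which on a Hyper-Threaded machine is typically twice the physical core count. A minimal sketch; the two-threads-per-core default is an assumption that matches Intel Hyper-Threading but not every CPU:

```python
import os


def physical_cores(logical_cpus, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    threads_per_core=2 is an assumption (Intel Hyper-Threading enabled).
    """
    return max(1, logical_cpus // threads_per_core)


logical = os.cpu_count() or 1   # cpu_count() can return None
print(f"logical CPUs reported by the OS: {logical}")
print(f"estimated physical cores:        {physical_cores(logical)}")
```

On a Hyper-Threaded hexacore the OS would report 12 logical CPUs, and the estimate comes back as 6.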


I figured that much but just wanted to check (as I saw some pretty close numbers between my machine setup and johner's).

 

And yes, if a few more people ran the test on their configurations, we could gather a wider range of results. For the moment, though, the speed increase from H13 to H14 is obvious, at least for FLIP: a 25%-50% increase on the CPU and, depending on the GPU, up to a 150% increase with OpenCL, as I see it.

Guest tar
Test results:
 
GTX 980 4GB EVGA SC @ ~1367MHz, 2x 6-core X5680 @ 3.33GHz, 32GB RAM, Ubuntu 14.04, Nvidia driver 346.35.00, H14.0.249 | OS X 10.10.2 results are marked OsX

Other results will come in when there is time. All times in minutes.
 
H13:
H14 CPU: (OsX 170)
H14 OpenCL: (Gtx980 65) (OsX Gtx980 81) (R9280x 71) (K6000 74)

GasViscosity DOP:
H13:
H14 CPU: (OsX 120)
H14 OpenCL: (Gtx980 29) (OsX Gtx980 36) (R9280x 27) (K6000 22)

For just the viscosity linear system solve itself:
H13:
H14 CPU: (OsX 110)
H14 OpenCL: (Gtx980 20) (OsX Gtx980 26) (R9280x 18) (K6000 12.3)

GasProjectNonDivergentVariational times:
H13:
H14 CPU: (OsX 11.5)
H14 OpenCL: (Gtx980 5) (OsX Gtx980 9.75) (R9280x 5) (K6000 14)

Just the pressure linear system solve:
H13:
H14 CPU: (OsX 8.15)
H14 OpenCL: (Gtx980 1.7) (OsX Gtx980 5.9) (R9280x 1) (K6000 3.25)
 
Link to Linux Gtx OpenCL hperf file:
 
OsX Gtx OpenCL hperf file:
 
OsX CPU hperf file:
Edited by tar


Thanks for all the tests. But frankly, I find off-the-shelf GPU simulations absolutely useless for any kind of production work, unless you are ILM and have your own GPU farm with your own developers. I can never seem to make a production-quality FLIP or pyro sim without using at least 20GB of RAM. Let me know your thoughts on this too. I also occasionally hear from companies that GPU-accelerated simulations do make a difference on big sims, but then you still have the memory limitations.

Guest tar

But frankly I find off the shelf GPU simulations absolutely useless for any kind of production work, 

 

Every technology we are all using today was 'absolutely useless' at some point. I recently enjoyed reading a decade-plus-old paper/report, written by a PRMan person, that was dissing the new kid on the block, Arnold :)

 

Today's GPUs are limited by RAM, but they offer iteration capabilities that can add to the creative undercoat of a sim.  Start the sim on GPU, then up it on the OpenCL CPU or pure CPU path.


But frankly, I find off-the-shelf GPU simulations absolutely useless for any kind of production work, unless you are ILM and have your own GPU farm with your own developers. I can never seem to make a production-quality FLIP or pyro sim without using at least 20GB of RAM. Let me know your thoughts on this too. I also occasionally hear from companies that GPU-accelerated simulations do make a difference on big sims, but then you still have the memory limitations.

 

The memory limitations on GPU have definitely persisted longer than we expected.  And unfortunately even if you can get a 12GB NVIDIA card, their OpenCL driver is still 32-bit at the moment so you're still limited to 4GB per process.
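
The 4GB figure follows from 32-bit addressing: 2^32 bytes. As a back-of-the-envelope sketch of what fits, the function below estimates the largest cubic grid for a given number of scalar fields. The field count and value size are assumptions for illustration, and a real solver needs extra working memory on top:

```python
ADDRESS_LIMIT_32BIT = 2 ** 32   # bytes addressable with 32-bit pointers (4 GiB)


def max_cubic_resolution(byte_limit, num_fields, bytes_per_value=4):
    """Largest N such that num_fields scalar fields of N^3 voxels fit in byte_limit."""
    max_voxels = byte_limit // (num_fields * bytes_per_value)
    n = round(max_voxels ** (1 / 3))
    # Correct any floating-point error from the cube root.
    while n ** 3 > max_voxels:
        n -= 1
    while (n + 1) ** 3 <= max_voxels:
        n += 1
    return n


# Hypothetical sim storing 10 float32 fields per voxel.
resolution = max_cubic_resolution(ADDRESS_LIMIT_32BIT, num_fields=10)
print(f"largest cubic grid under the 32-bit limit: {resolution}^3")
```

Doubling the resolution in each dimension multiplies memory by eight, which is why the limit is reached quickly on dense grids.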

 

The silver lining here is that there are some production-level sims that can fit in 4GB, and we still get a very nice speedup for Pyro / Smoke using OpenCL on the CPU without the memory limitations (particularly with some of the more accurate advection schemes introduced in H14).  And the newer uses of OpenCL in H14 for the grain solver and FLIP solver only accelerate smaller-but-expensive iterative parts of the sim and are less memory hungry.   For example I think production-scale sims are absolutely possible on the GPU with the grain solver.

 

If you're in a big studio where almost all sims are done on the farm, the lack of GPUs on most render farms is obviously an issue.  The OpenCL CPU driver can help there, but there's a bit of a chicken-and-egg issue in getting more GPUs on the farm.  But these days (especially with Indie) a lot of production/commercial quality work is being done by small studios or individuals; for them running a big grain sim overnight on a GTX 980 is a really nice option.


Hi all, does anybody know the reason for allowing only one OpenCL device and not having heterogeneous device resource pooling?

 

 

Start the sim on GPU, then up it on the OpenCL CPU or pure CPU path.

 

 

Can you elaborate on this please, marty? I have time this week to do some tests, for the first time in months, on a dual E5-2640v2 with 2 GTX 780s (on Windows now, but I can install Linux too if it helps).

 

Thanks,

 

Vincent

Guest tar

The general idea is to use OpenCL on the GPU first, where it is faster, then when you exceed the GPU RAM limitations switch over to CPU OpenCL; this allows you to use the full system RAM for OpenCL. The link below from the documentation has the details on setting it up.  Linux will work better, I believe, but Windows was running very well too last time I checked.

 

Experimental OpenCL driver
http://www.sidefx.com/docs/houdini14.0/news/13/opencl
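
The GPU-first, CPU-fallback workflow can be sketched as a small helper that picks a device type from a memory estimate. This assumes the `HOUDINI_OCL_DEVICETYPE` environment variable behaves as described in the documentation linked above (values `GPU` and `CPU`); the helper name and the memory figures are hypothetical:

```python
import os


def opencl_environment(estimated_sim_bytes, gpu_memory_bytes, base_env=None):
    """Return a copy of the environment with an OpenCL device type chosen.

    Picks the GPU while the sim is expected to fit in GPU memory and
    falls back to the CPU OpenCL driver (full system RAM) otherwise.
    HOUDINI_OCL_DEVICETYPE is assumed to accept "GPU" and "CPU".
    """
    env = dict(os.environ if base_env is None else base_env)
    fits_on_gpu = estimated_sim_bytes <= gpu_memory_bytes
    env["HOUDINI_OCL_DEVICETYPE"] = "GPU" if fits_on_gpu else "CPU"
    return env


GIB = 2 ** 30
low_res = opencl_environment(2 * GIB, 4 * GIB, base_env={})
final = opencl_environment(20 * GIB, 4 * GIB, base_env={})
print(low_res["HOUDINI_OCL_DEVICETYPE"], final["HOUDINI_OCL_DEVICETYPE"])
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching a session, so that low-resolution iterations run on the GPU and the final sim runs through the CPU driver.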

 

Below is the original OpenCL smoke example file that shows the advantages and caveats. Not too sure how much has changed since this was first introduced a few years ago.

 

OpenCL smoke example

http://www.sidefx.com/docs/houdini14.0/examples/nodes/dop/smokeobject/OpenCL


This example is noticeably faster (close to realtime) on the GPU (GTX 780, running the example without modification) on a machine with 16GB RAM and a single i5. On the CPU, with the only change being OpenCL turned off, I get maybe 5FPS. I will test on a machine with dual Xeons later. But it's nice to know that you can simulate on machines with modest CPUs. Maybe I should also test with OpenCL using the CPU as the device.


Yes, it's just that having realtime sims running on a single 2-year-old GPU when top-of-the-line dual-CPU Xeons are /not/ giving me realtime makes me wonder what the bottom line really is for SESI in not supporting multiple-GPU. Do you know?

Also, when the freeware DAZ Studio supports NVIDIA's Iray and its _extremely_ interesting MDL paradigm, why haven't we heard of GPU support in Mantra yet? There must be very good reasons; I would just like to have a 2015 update on the matter. Anyone?

Thanks!

Guest tar

I honestly don't know the technical reasons why, but from experience, technology demonstrated in other people's examples does not always translate into what the Houdini paradigm wants to achieve.

 

As a recent analogy: from my initial testing of Redshift, the GPU renderer I have always heard described as the best, I find it does not render NURBS surfaces. Huh? So if something similar were implemented in Houdini we would lose non-polygonal rendering. That might be okay, might not be, but it's more than just having a checkbox on Mantra. Then the other GPU renderer of note, Octane, doesn't appear to bake photon maps! So we would have to invest in lots of GPU cards to bring path tracing up to the current standard. Now these tests may be wrong, as I have done very, very limited testing, but they remind me that you have to test other people's technology in situ to know that it really does work within your framework.


Kind of getting off topic but I'll chime in. Most GPU renderers are still very limited in functionality from what I've seen. For example most GPU renderers don't do hair, displacements, motion blur, instancing, volumes, the list goes on. Redshift looks promising and supports more features that real-world productions actually need but it supports two applications (one of which is EOL so really only one moving forward) on one platform and supports one GPU vendor. So basically if you run Maya on Windows with Nvidia hardware then you're set but for everyone else it's useless. They're just too specific and/or limited to be useful in my opinion.

 

Having said that if tomorrow Side Effects released a new version of Mantra with OpenCL acceleration, wasn't limited to GPU memory, could still do all of the things Mantra can do today, and supported all the platforms it already does I'd start using it right away. I'm not going to hold my breath though.
