marty

Pascal's 16 bit compute


Pascal's 16-bit compute looks super nice @ 22TF. Wondering if this could be used in H15+ as a faster OpenCL path? Presumably the tradeoff is some stability & accuracy, but the error might be mitigated by using 32- or 64-bit buffers.


Possibly, if you don't mind added jitter :)

 

It'd have to be restricted to values with [0,1] or [-1,1] ranges, like colors or normalized direction vectors. Problem is you're back to 32b as soon as you do a matrix multiply.
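To illustrate the range restriction, here is a minimal sketch using only Python's standard library. The `to_half` helper and the 1000-unit translation are hypothetical, chosen just to show what happens once a normalized value is moved out of its small range, as a transform with a translation component might do.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# A normalized value survives fp16 storage with a small rounding error.
small = 0.123
small_err = abs(to_half(small) - small)

# Hypothetical transform: translate the same value by 1000 units.
moved = 1000.0 + small
moved_err = abs(to_half(moved) - moved)

print(f"fp16({small}) = {to_half(small)!r}, error {small_err:.2e}")
print(f"fp16({moved}) = {to_half(moved)!r}, error {moved_err:.2e}")
```

Between 512 and 1024, fp16 values are spaced 0.5 apart, so the fractional part of the translated value is lost entirely. That is the kind of jitter mentioned above.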

 

I think fp16 is mostly for image processing and machine learning (which Nvidia is pumping hype for right now).


Haha - jitter wouldn't be too much fun but it does sound good for Nuke / Cops acceleration!


Yeah, only $129K for a fast compositing station :)

 

Where fp16 really shines is in data movement. It effectively doubles your bandwidth when fetching that data compared to fp32, and often that data transfer is the bottleneck in these massive compute engines. That's probably more advantageous than the double-rate compute - and it's been possible to pack/unpack fp16 data since almost the dawn of shaders :)
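As a rough sketch of the pack/unpack idea (not how any particular shader or Houdini does it), Python's `struct` module can store floats as fp16 with the `e` format code. The payload below is made up for illustration.

```python
import struct

# Hypothetical payload: 4096 samples in [0, 1], e.g. color channels.
values = [i / 4095 for i in range(4096)]

as_fp32 = struct.pack(f"<{len(values)}f", *values)
as_fp16 = struct.pack(f"<{len(values)}e", *values)

print(len(as_fp32), "bytes as fp32")
print(len(as_fp16), "bytes as fp16")

# Unpacking restores ordinary floats, with fp16 rounding error.
restored = struct.unpack(f"<{len(values)}e", as_fp16)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(f"max round-trip error: {max_err:.2e}")
```

The fp16 buffer is half the size, so the same transfer moves twice as many values, at the cost of a bounded rounding error for data in the [0,1] range.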

 

Reading a bit more about Pascal, it achieves this rate by processing 2 fp16 ops with one fp32 ALU, which probably means that you need vec2 or vec4 fp16 data to truly take advantage of it (SIMD). Scalar data likely wouldn't see a boost, and vec3 data only a moderate one.
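The arithmetic behind that guess can be sketched as follows. It assumes an idealized model where unpacked data needs one ALU issue per component and packed fp16 data fits two components per issue. The function name is hypothetical and real hardware scheduling will differ.

```python
import math

def packed_speedup(components: int, lanes: int = 2) -> float:
    """Ideal speedup when `lanes` fp16 values share one fp32 ALU issue."""
    issues = math.ceil(components / lanes)  # ALU issues needed when packed
    return components / issues

for name, n in [("scalar", 1), ("vec2", 2), ("vec3", 3), ("vec4", 4)]:
    print(f"{name}: {packed_speedup(n):.2f}x")
```

Under this model scalar data gets 1.00x, vec2 and vec4 get 2.00x, and vec3 gets 1.50x because its third component occupies a half-empty issue.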


Nice investigations!  

 

Reading around, most people seem disappointed with Pascal; only 2x the performance instead of the fabled 10x. Moving forward, hopefully it will be possible to process sims across OpenCL devices soon, especially if we NVLink the cards. Foundry now allow two equal cards to do compute in NukeX 10.

 

[Attached image: sumitgraphic1.jpg]

Edited by marty


I believe Nvidia stated in their earlier roadmap slides that it was a 10x improvement in performance per watt, not just performance. And I believe that was against either Kepler or Fermi, not Maxwell.

 

The best thing about Pascal, IMO, is true virtual memory with page faulting. That means it should no longer run into "Out Of Memory!" issues with large datasets - it should be able to page bits of them out to main memory and pull other pages in. As anyone who's tried serious OpenCL sims probably knows, not being able to finish the compute is really bad :) This should allow huge sims to finish, though somewhat slower due to the paging transfers over PCIe.


That's a killer feature! HBM2 size appears more limited currently, at 16GB on the Tesla; hopefully we get more! Pascal is good in my book.

 

Cheers!


Consumer versions out at the end of May!

  • GeForce 1080, 8GB GDDR5X, May 27 ($600)
  • GeForce 1070, 8GB GDDR5, Jun 10 ($380)

Significantly boosted clock speeds compared to the 900 series (nearly double, up to 2.1GHz) but with similar power draw. Dropping to a 16nm process (from 28nm) really helps with power consumption. Performance claims by Nvidia put the 1080 just above Titan X levels.

Ars Technica article


Looks nice! It's a good upgrade from the 980/70 - 8GB is also much better for 4K monitor/s, sims and 4K comping. Hoping the Titan and Ti versions with lots more RAM will be out in the following months too.

As GPUs should be hitting the same manufacturing limitations as CPUs over the next few years, future work that could help Houdini for GPU compute would be using fewer particles for deep water FLIP (narrow band), adaptive grids where more detail is required (Space X), and heterogeneous compute (Foundry HPC). That should be a good 5 years of R&D!

Along these lines, I just put in an RFE to have OpenCL GPU sims that fill up GPU memory automatically overflow to CPU RAM instead of failing. Ref #75317.

Edited by marty
Rfe

Just a quick update on the non-Tesla Pascal cards: the GeForce 1080, 1070 and 1060, the new Pascal-based Titan X (no "GTX", to differentiate it from the Maxwell-based GTX Titan X), and the new Quadro P series.

FP16 compute on these cards is horribly crippled, even worse than FP64. For every shader module of 128 FP32 units, there are 4 FP64 units (1/32 rate) and a single vec2 FP16 unit (1/64 rate) on the GP102 and GP104 that power the GeForces, Quadros, and Titan X. Basically they are only there for debugging programs intended to run on the Pascal-based Teslas.

:(
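For reference, the quoted rates follow directly from the unit counts in the post. This is just the arithmetic, assuming each unit issues one op per clock and the vec2 FP16 unit handles two values per issue.

```python
from fractions import Fraction

FP32_UNITS = 128      # FP32 units per shader module (from the post)
FP64_UNITS = 4        # FP64 units per shader module
FP16_VEC2_UNITS = 1   # vec2 FP16 units per shader module

fp64_rate = Fraction(FP64_UNITS, FP32_UNITS)
fp16_rate = Fraction(FP16_VEC2_UNITS * 2, FP32_UNITS)  # vec2 = 2 values per issue

print("FP64 rate relative to FP32:", fp64_rate)
print("FP16 rate relative to FP32:", fp16_rate)
```

This prints 1/32 for FP64 and 1/64 for FP16, matching the figures above.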


...that's pretty terrible, pity they didn't exploit it more.  Thanks for the update!


So the new GP102 and GP104 "Pascal" cards are worse than their Kepler predecessors for OpenCL sims?

 

So what about the (new) Radeon Pro WX and Radeon Pro Duo cards?

http://wccftech.com/amd-radeon-pro-wx-7100-workstation-card/

This could be better price/performance than Nvidia.

As Tesla P100 prices haven't been revealed yet, even the Radeon Pro SSG looks interesting.

 

 


Maxwell and Kepler didn't have fp16 compute; the new Pascal-based Tesla card (GP100) was the first to introduce it. The reason cited for fp16 compute was "deep learning" applications. It basically allowed for twice the throughput of FP32 operations.

The Pascal-based Quadros and GeForces have the same 1/32 FP64 rate as their Maxwell predecessors, so at least in that aspect they're the same. As for fp16 compute, you wouldn't be able to run it on Maxwell or Kepler cards - or at least it'd be FP32-emulated. It would have been nice to have, since HDR color is a good candidate for fp16 compute, but fp32 will do I guess (just not at the double rate that full fp16 support would have had).

The Radeon Pros are based off the same Polaris 10 architecture as the Radeon 480, putting them right about at the GeForce 970-980 mark in terms of CL performance.
