Cheers,
M.
Threading
There has been a lot more emphasis on multithreading now that multiprocessor systems have become more accessible over the past few years. Faced with a "GHz wall" where chips simply can't run faster due to heat buildup and pesky electrical things like current leakage and capacitance, chip manufacturers have made a 90-degree turn to improve parallelism instead. Basically, the idea is to do more things at once rather than trying to do each thing faster, so that overall things get done quicker.
Early parallelism attempts can be seen in things like MMX, SSE and Hyperthreading, which gave very specific applications a bit of a boost (from about 10% to 800%). MMX, SSE and Altivec are vector-processing instruction sets which operate on several variables at once (usually 4 or 8). This is very useful in 3D processing and general array processing, but the vast majority of a program (even Houdini) doesn't fit into this category. For one, the instruction sets are fairly limited in what they can do, and two, even 3D processing has a lot of non-3D overhead (loops, conditional branches, setup, etc). So even though they can accelerate things by 4-8x, they tend to only be active for a small portion of the time (say 5-10%), which only accelerates the program on the whole by about 1.2-1.8x. Useful, just not groundbreaking.
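The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough model, not a measurement. It assumes the 5-10% refers to the share of the already-accelerated run time spent in vector code, which is one reading that roughly reproduces the 1.2-1.8x range. The function name is made up for illustration.

```python
def overall_speedup(vector_share, vector_gain):
    """Whole-program speedup when `vector_share` of the accelerated run time
    is spent in code that runs `vector_gain` times faster than it used to.

    Assumption: vector_share is measured on the accelerated program,
    not on the original one.
    """
    accelerated_time = 1.0
    # Undo the acceleration: the vector portion used to take vector_gain times longer.
    original_time = (1.0 - vector_share) + vector_share * vector_gain
    return original_time / accelerated_time


low = overall_speedup(0.05, 4)   # small share, 4-wide vectors
high = overall_speedup(0.10, 8)  # larger share, 8-wide vectors
print(f"{low:.2f}x to {high:.2f}x")
```

Even an 8x gain on the vector code barely moves the total when the rest of the program is untouched.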
"Hyperthreading" is an Intel marketing term for something that's been around in PowerPC and MIPS processors for some time. It allows two programs to be running on a single CPU at the same time, and they share the limited functional units. So, if a CPU has 3 floating point units, 2 integer units and 1 memory unit, hopefully between the two programs they can be kept busy most of the time (usually a single program can't accomplish this). If all the units are working 100% of the time, their work gets finished a little bit faster than if the programs were executed one at a time. Intel's hyperthreading on the P4 usually saw anywhere from a -10% to 40% improvement, averaging around 10% (so your hour-long render might instead take around 55 mins -- woooo hoo).
Now, with multiple CPUs on a die, there is a lot more opportunity for parallelism. There are two ways to go about using more than one CPU - multi-processing and multi-threading. Multi-processing allows you to run more than one program at the same time - such as running Houdini and mantra concurrently - which generally increases productivity. If you have the memory, it can improve performance as well, like running 2 renders on a frame range - one on odd frames and the other on evens. Multi-threading is the term given to a program that runs several threads at a time (a "thread" being a separately running task inside a program). Multi-threading doesn't re-allocate all the resources of the program when it splits, so it's a lot leaner from a memory point-of-view than running the same process twice.
Single CPU systems appear to run many applications at a time, but in fact very rapidly switch between them (time-slicing), so they aren't actually running at the same time. Multi-CPU systems still use time-slicing, since there are always more programs than CPUs, but it is now possible to run 2 or more programs at exactly the same time. So, a multi-CPU system can provide a speedup of up to the number of processors available (2x, 4x or 8x currently). The trick is keeping the processors busy, since idle processors, even for mere fractions of a second, eat into that nice 8x speedup that you paid big bucks for.
So, when it comes down to actually completing a task, it's tempting to think that with 4 processors it can be completed 4x faster. This is where most programmers wish the CPU manufacturers had been able to just keep running up the clock speeds... because this just generally isn't the case. There are a number of factors standing in the way.
First, you need to figure out how to split the work up. As any TD can attest, this isn't trivial. Ideally, you give the first half to one processor and the second half to another, and sometimes that actually works (which is great; you can go home by 5). Applying gamma to an image can be done this way pretty easily by farming out scanlines, and normalizing normals on geometry can also be split up nicely. Problem is that the simple stuff generally doesn't really take that long anyway (you always want to optimize the code that's taking the longest to run).
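Here's a minimal sketch of the scanline idea in Python, using the standard library's thread pool. The image layout (a list of rows of 0.0-1.0 values) and the function names are assumptions for illustration. Also note that in standard CPython the global interpreter lock means threads like these won't actually run the arithmetic in parallel, so read this as a picture of how the work divides, not as a speedup recipe.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_gamma_to_row(row, gamma):
    """Gamma-correct one scanline of values in the range 0.0-1.0."""
    return [value ** (1.0 / gamma) for value in row]


def apply_gamma(image, gamma, workers=4):
    """Farm the scanlines out to a pool of worker threads.

    Each row is independent of every other row, so no locking is needed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the rows in the same order they were submitted.
        return list(pool.map(lambda row: apply_gamma_to_row(row, gamma), image))


image = [[0.0, 0.25, 1.0], [0.5, 0.5, 0.5]]  # tiny made-up greyscale image
corrected = apply_gamma(image, gamma=2.0)
```

The split works because no scanline ever needs to look at another one.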
Second, you need to make sure that the work you're doing actually takes longer than the time to start up and stop a thread (or threads). Unfortunately, threads aren't free and take a pretty significant chunk of time to start up (since the OS often has to switch from user to system mode to create and destroy a thread - if that made no sense to you, don't worry - just think 'slow'). An image has to be pretty large before it'll benefit from threading when just applying gamma, and in fact, most small images will see a performance slowdown instead.
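A back-of-the-envelope model shows where the break-even point sits. Everything here is hypothetical: the startup cost, the image timings and the function names are invented for the sketch, and real work rarely splits this evenly.

```python
def threaded_time(work_seconds, threads, startup_seconds):
    """Idealised run time: pay the thread startup cost once, then split the work evenly."""
    return startup_seconds + work_seconds / threads


def worth_threading(work_seconds, threads, startup_seconds):
    """True when the threaded estimate beats doing all the work on one thread."""
    return threaded_time(work_seconds, threads, startup_seconds) < work_seconds


STARTUP = 0.001  # hypothetical cost of creating and joining the threads, in seconds

small_image = 0.0005  # made-up single-threaded gamma time for a small image
large_image = 0.5     # made-up single-threaded gamma time for a large image

print(worth_threading(small_image, 4, STARTUP))
print(worth_threading(large_image, 4, STARTUP))
```

With these numbers the small image loses, because the startup cost alone is longer than the whole single-threaded job.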
Third, data dependencies can really be a killer. For example, a simulation can't be threaded by computing frames 1 and 2 simultaneously; frame 2 depends on the results from frame 1. Similarly, a subdivide, fluid or cloth algorithm depends on a lot of neighbouring points and this can make divvying up the work difficult. Algorithms that work in multiple passes also have similar problems, as do ones that modify global data as they work.
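To make the frame dependency concrete, here's a toy simulation loop in Python. The falling-point physics, the `step` and `simulate` names and the 24 fps timestep are all made up for illustration. The point is only that each frame's input is the previous frame's output, so the loop can't be handed out a frame per thread.

```python
def step(state, dt=1.0 / 24.0):
    """Advance a falling point by one frame (toy physics, not a real solver)."""
    position, velocity = state
    velocity = velocity - 9.8 * dt
    position = position + velocity * dt
    return (position, velocity)


def simulate(initial_state, frame_count):
    """Return the state at every frame, starting with the initial state."""
    states = [initial_state]
    for _ in range(frame_count - 1):
        # Frame N needs frame N-1 to be finished first: a strict chain.
        states.append(step(states[-1]))
    return states


frames = simulate((10.0, 0.0), frame_count=5)
```

Any threading here would have to happen inside `step` (across points, say), not across frames.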
Fourth, even though you have multiple CPUs, you still only have one memory controller and one I/O system (or disk target). So tasks doing heavy memory accesses or disk I/O in parallel will generally slow things down as they're bottlenecked by the single resource. Multiple graphics cards are also bottlenecked by the single driver that runs them (some drivers are threaded, but have exactly the same issues listed here).
Fifth, threads can't just do whatever they want. Imagine two workers at a desk, grabbing papers and writing on them whenever they need to. No useful work is going to get done if Bob keeps grabbing a paper he needs while Joe's writing or reading it. If someone kept grabbing my stuff while I was working on it, I'd likely end up hitting them, and that's probably the real-world equivalent of a multithreading crash. So if Joe's working with a certain paper, Bob needs to wait for him to finish. This does two things - One, it eats into that nice 2x-8x speedup because Bob's sitting there doing nothing (Bob isn't smart enough to do something else - no offense, Bob). Two, the manager (programmer) has to first determine all the places where the two workers can't work at the same time, which gets more difficult as the threads get longer. As more areas of the program are restricted to one worker, this situation occurs more frequently. This gets worse when Sally and Jorge show up to help. In reality, most tasks that are highly dual-threaded see about a 1.6-1.8x speedup, and quad-threaded tasks see a 3.2-3.6x speedup. Sound at all familiar?
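In code, the "wait until Joe is done with the paper" rule is usually a lock. Here's a small Python sketch using the standard `threading` module. The `SharedPaper` class and the worker names are invented for the example.

```python
import threading


class SharedPaper:
    """A piece of shared data that only one worker may touch at a time."""

    def __init__(self):
        self.tally = 0
        self._lock = threading.Lock()

    def add_mark(self):
        # Whoever holds the lock has the paper; the other worker waits here.
        with self._lock:
            current = self.tally
            self.tally = current + 1


def worker(paper, marks):
    for _ in range(marks):
        paper.add_mark()


paper = SharedPaper()
bob = threading.Thread(target=worker, args=(paper, 10000))
joe = threading.Thread(target=worker, args=(paper, 10000))
bob.start()
joe.start()
bob.join()
joe.join()
print(paper.tally)
```

With the lock in place the tally always ends at 20000. The price is that every call to `add_mark` is a spot where one worker may sit idle waiting for the other, which is exactly the lost speedup described above.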
Finally, threading increases the complexity of the underlying code *a lot* (like "maybe I can help figure out a way to solve the GHz wall problem"-lot). Researchers have suggested that it's about 10x harder to write multithreaded code than single-threaded code. This translates into longer development time, more bugs and more maintenance in the long term. Adding additional features to the task can be difficult, as can passing the code on to another developer. Put another way, this means that fewer features are added each cycle with the same development crew. So you want to make sure that the performance increase is worth the trouble. With dual-CPU systems this was sometimes a bit difficult to justify. At least now with quad-core and 8-core systems available, that part's become easier.
After running through this list of threading issues with a bunch of potential candidate tasks, you'll find some of them can't be done (1x speedup) and others can't reach their full potential (which is generally 80-90% of expected speedup). Sometimes it's better to spend the time figuring out a better way to implement the algorithm - SIGGRAPH papers can show approximations and structures that accelerate things by orders of magnitude, not just by a paltry 2-4x. This is not to say that threading isn't useful - it is, tremendously - just that there are a lot of cases for which it isn't well suited.
There are other ways to use threading as well, besides doubling up performance. One is to separate slow tasks from quick ones, like spawning a thread to write out a file or do network I/O (which works nicely even on single-processor systems). Another is to separate the user interface from the underlying processing, giving the program a feeling of responsiveness that the user perceives as "fast", which never hurts.
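The file-writing case looks something like this in Python. It's a minimal sketch: the temporary directory, the file name and the dummy data are placeholders, and a real program would also want error handling around the write.

```python
import os
import tempfile
import threading


def write_file(path, data):
    """The slow task: push the data out to disk."""
    with open(path, "wb") as handle:
        handle.write(data)


# Hypothetical output location, just for the sketch.
path = os.path.join(tempfile.mkdtemp(), "frame_0001.bin")
data = bytes(1024)

writer = threading.Thread(target=write_file, args=(path, data))
writer.start()

# The main thread is free to carry on while the write is in flight.
checksum = sum(range(1000))

writer.join()  # wait for the write to finish before relying on the file
```

The main thread only blocks at `join()`, which is the one point where it actually needs the file to exist.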
As multi-CPU systems continue to climb rapidly into 8-core systems and above, hopefully more software development tools will be made available so that developers, and by extension users, can take advantage of them. For now, it feels like software developers are playing catch-up. After years of software developers banging on their doors for more processing power, the hardware guys must be grinning.
GPGPU
GPGPU, or "General Purpose computing on Graphics Processing Units", is the art of doing non-graphics related tasks on a graphics device. Graphics processors (or GPUs) have changed over the years from rigid graphics pipelines to programmable graphics shading. Along the way, the precision of these devices has increased from 256 colors, to 24bit color (16 million colors, or one byte per RGB channel), to full 32bit floating point per channel color (I don't know the exact number of colors, but way more than your eye could hope to see, even with pharmaceutical help). Together, this has made them viable platforms for doing more than just computing pixels.
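For the curious, the counts are easy to work out in Python. Treat the floating point figure as an upper bound on distinct bit patterns only: it assumes three 32-bit channels and includes NaNs, infinities and negative values that aren't meaningful colors.

```python
colors_8bit = 2 ** 8            # 256-color palettes
colors_24bit = 2 ** (3 * 8)     # one byte per RGB channel
patterns_float = 2 ** (3 * 32)  # three 32-bit floats: an upper bound, not a count of visible colors

print(colors_8bit, colors_24bit, patterns_float)
```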
Because shading pixels is highly repetitive by nature, it can be done in parallel. Hardware developers exploited this by adding multiple shading units to their GPUs. Starting slowly at around 4-8 units, this has exploded into hundreds, making GPUs very capable number crunchers. For processing huge amounts of data, this would be the perfect device. Well, in theory...
Several years ago, GPU manufacturers realized they had a good product for doing more than just graphics. The marketing teams did what they do best and GPGPU was born; the only problem was that it was a tad premature. There were a lot of limitations placed on the shaders by the hardware, which in turn limited what could be done outside of the graphics realm (but Doom3 sure looked cool - when you could see it). Some of these early issues were:
- Very small shaders, at 256 and then 4096 instructions.
- Reduced precision at 16bit or 24bit floating point, which isn't enough precision to do most general purpose math.
- No integer support above 8 or 16bit, which is generally required for task control.
- Extremely limited branching and looping. If you could actually do it, it was very slow (like "don't even bother" slow).
- Limited number of inputs. Only being able to access 2 or 4 arrays (textures) and a small set of constants at a time from a shader required multiple passes with different shaders, increasing the complexity and slowing the process down.
- Texture sizes were limited to powers of 2, and limited to a maximum size of 2048x2048 or 4096x4096.
- Limited onboard GPU memory. 64 or 128MB equipped boards were already running your OS's framebuffer and all your graphics.
- Bus bandwidth. The AGP bus took some time to get the data to the GPU and back, which added significant overhead.
- Proprietary extensions. There were no real standards for certain workarounds to the above issues, so it was almost like multi-platform development.
- Crappy ass driver support. Some of the features of OpenGL or DirectX just plain didn't work (Oops... personal bias creeping in)
As graphics hardware progressed from the nv6800/ati9800 to the nv8800/ati1900, the situation became much more favourable for GPGPU. PCI-Express improved the bandwidth, DX9 enforced 32bit processing, GPU memory sizes increased dramatically, shader programs' sizes could be much larger and the power-of-2 limitation was relaxed. Additionally, the vertex shader unit and the fragment (pixel) shader unit merged into a single processing unit, inheriting the strengths of both. Some of the restrictions still exist, but aren't quite as severe, making programming on the GPU a more reasonable proposition.
Current hardware (NV 8800, ATI X3800 at the time of writing) provides a massively parallel environment to play in. While CPUs are sitting at around 4 functional cores, GPUs have hundreds. They can run the same program on hundreds of pieces of data simultaneously, in contrast to a CPU, which even with SSE support may only manage dozens. The clock speed of the GPU is a lot slower - 600-900MHz - but the added parallelism is great for large datasets.
The biggest problem is the "distance" of the hardware from the CPU. Besides having to travel along a bus to the GPU and back, all processing must go through a software driver. It's almost like outsourcing - it gets the job done, but you have to package everything up and do a lot of communication in addition to the work that gets done. So, while the process might run 500x faster on the GPU, the transmission back & forth can take a huge amount of time (graphics processing generally only goes one way; GPGPU by necessity has to be two way). You don't want to send away the work unless it's a lot of work (it has to be a lot of data, and it has to do a fair amount per data element). Because of this, most tasks are immediately disqualified.
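A crude cost model makes the outsourcing trade-off easier to see. Every number below is hypothetical (the bus bandwidth, the driver overhead, the amount of data and the 500x figure are placeholders), and the function names are invented. It's a sketch of the reasoning, not a benchmark.

```python
def gpu_time(cpu_seconds, gpu_speedup, bytes_moved, bus_bytes_per_second, driver_seconds):
    """Estimated wall-clock time for shipping a task to the GPU and back."""
    compute = cpu_seconds / gpu_speedup
    transfer = bytes_moved / bus_bytes_per_second  # upload plus readback
    return driver_seconds + transfer + compute


def worth_offloading(cpu_seconds, **gpu_costs):
    """True when the round trip through the GPU beats staying on the CPU."""
    return gpu_time(cpu_seconds, **gpu_costs) < cpu_seconds


# All of these figures are hypothetical.
costs = dict(
    gpu_speedup=500.0,
    bytes_moved=64e6,
    bus_bytes_per_second=2e9,
    driver_seconds=0.002,
)

light_task = worth_offloading(0.01, **costs)  # 10 ms of CPU work
heavy_task = worth_offloading(5.0, **costs)   # 5 s of CPU work
print(light_task, heavy_task)
```

With these made-up figures the 10 ms task loses badly: moving the data takes longer than just doing the work on the CPU. The 5 second task wins easily, even though both enjoy the same 500x compute speedup.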
So the trick is finding a task that runs frequently enough to make a difference when optimized, and runs well on the GPU vs. the CPU. In this case, speedups of 10-20x aren't uncommon. In addition, you can even cleverly have the CPU do something else while waiting for the result, turning the GPU into an additional parallel processor (also known as Asymmetrical Multi-processing).
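The "keep the CPU busy while waiting" pattern can be sketched with a future. No real GPU is involved here: `fake_gpu_job` is a stand-in that runs on a worker thread, and all the names are hypothetical. Only the shape of the hand-off is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_gpu_job(values):
    """Stand-in for work shipped off to the GPU (hypothetical; runs on a plain thread)."""
    return [v * v for v in values]


def cpu_side_work(values):
    """Something useful for the CPU to do in the meantime."""
    return sum(values)


values = list(range(8))
with ThreadPoolExecutor(max_workers=1) as offload:
    pending = offload.submit(fake_gpu_job, values)  # kick off the "GPU" task
    total = cpu_side_work(values)                   # the CPU keeps busy meanwhile
    squares = pending.result()                      # block only when the answer is needed
```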
Being in the 3D business, we're graced with a lot of algorithms that will benefit from the GPU's parallel processing and its vector nature. Still, the biggest weakness of the GPU is that it's not yet general enough for general purpose programming. Most algorithms require more advanced data structures than 1D or 2D arrays, which can be difficult to represent and modify within the GPU. Shaders generally operate on a 1-to-1 basis, making algorithms that create or destroy data inefficient. Looping and branching are still far behind the CPU in terms of performance, and both are very common in advanced algorithms.
GPGPU projects like nVidia's CUDA have begun to appear, which should definitely help with many of these issues. Designed as a general purpose dataset computation toolset, CUDA removes all the graphics API overhead and allows programming in a more generic fashion, with more of the conveniences developers are used to. Also on the horizon are GPUs with 64bit floating point support, needed for serious mathematical computation. AMD/ATI's Fusion project is planning on marrying a GPU with a CPU on a single chip, hopefully eliminating a lot of the data transmission issues. And GPUs will continue to fight over the highest frame rates in whatever games happen to be hot, pushing the performance even higher.
The next few years should be very interesting for GPGPU, after a bit of a slow start. I, for one, just hope the drivers work.
Article continued further down the page
