Continued from Part 1 published last week.

Mind the memory

Aachen, Germany-based MainConcept develops software for encoding and decoding signals such as high-definition video, and it is an accomplished practitioner of such fine-grained locking.

Video processing is a computational challenge; high-definition movies have to be processed in real time, and each frame takes up 1MB of memory, with each slice of the frame requiring extensive mathematical manipulation.

MainConcept has tuned its software to run on systems with dual-core chips, with the cores working on frame slices in parallel.

On dual-core systems, performance improves by a factor of 1.8, says MainConcept CEO Markus Monig, and by another factor of 1.8 when moving to two dual-core processors. Such speedups are close to the linear ideal; for a dual-core system, any gain above 1.5 is considered good.

The software uses and searches "huge areas of memory," Monig says. If the software is carefully constructed, it can use on-chip cache memory for much of its work, speeding processing. MainConcept uses performance-tuning tools from Intel Corp. to tune the software to the hardware architecture.

Intel's VTune Performance Analyzer helps optimise the code, and its Thread Profiler and Thread Checker help balance the work of multiple threads and identify bottlenecks in multithreaded code.

But Monig worries that he won't be able to boost performance linearly as the number of processor cores increases. "We don't expect this for eight-core, 16-core and beyond," he says. "The faster and the more cores there are, the more the memory access is the bottleneck."

Code writers trying to exploit multiple processors or processor cores face three challenges, says James Reinders, director of business development for Intel's software development products. The first is scalability -- how to keep each additional processor busy. A threefold performance boost on a four-processor system is "darn good," he says; anything more is "exceptional."

The second challenge is "correctness" -- how to avoid race conditions, deadlocks and other bugs characteristic of multi-processor applications. Intel's Thread Checker can find threads that share memory but do not synchronise, which, he says, "almost always [indicates] a bug."

The third challenge is "ease of programming," Reinders says. Modern compilers can help by finding and exploiting opportunities for parallel processing in source code, and the programmer can help the compiler by including "a few little hints" in the code, he says.

These "hints" are available in a new standard called OpenMP, a set of specifications for compiler directives, library routines and environment variables that can be used to express parallelism in Fortran, C and C++ programs. "The alternative to using these extensions is to do threading by hand, and that takes some clarity of thought," Reinders says. "So OpenMP can be tremendously helpful."

Kennedy agrees. "My philosophy is the programmer should write the program in the style that's most natural, and the compiler should recognise the properties of the chip that have to be exploited to get reasonable performance," he says.

Tom Halfhill, an analyst for In-Stat's "Microprocessor Report" in San Jose, says some software developers are "tearing their hair out" over the new CMP systems. "Rewriting the software for multi-threading is a lot of work, and it introduces new bugs, new complexities, and the software gets bigger, so there is some resistance to it."

He says Fortran and C++ don't contain parallel constructs natively, whereas Java does, so the move to CMP may boost Java's fortunes.

But the CMP train has left the station, whether software developers like it or not. Intel says 85 percent of its server processors and 70 percent of its PC processors will be dual-core by year's end.

Halfhill predicts that in five years, microprocessor chips in servers will have eight to 16 cores, and desktop machines will have half that number. And, he says, each core will be able to process at least four software threads simultaneously, a technique Intel calls hyperthreading.

The angst today over optimising software for CMPs is a little like the hand-wringing of 20 years ago, when developers obsessed over the amount of memory and disk space available, says Halfhill. Now both resources are so cheap and plentiful that most applications simply assume they will get whatever they need.

"In five to 10 years, we'll get to the same place with processor cores," Halfhill predicts. "There will be so many that the operating system will just dedicate as many cores as the application needs, and you won't worry if some cores are being wasted."

Intel's Reinders says CMPs will give a boost to hardware virtualisation -- by which a computer is made to run multiple operating systems -- with CMPs allowing for a more fine-grained partitioning of a machine. It is possible to carefully control and allocate processing resources by specifying, for example, that a certain application may use two cores and no more, while some higher-priority application gets four cores.

"If you map virtualisation onto individual cores, you can get more predictable response," he says.

CMPs offer performance advantages over systems with multiple, separate processors, because inter-processor and processor-memory communication is much faster when it's on a chip.

Rice University's Kennedy predicts that will lead to hybrid systems consisting of clusters of computers running multi-core processors.

"Then you have two kinds of parallelism: cross-chip parallelism, perhaps with message passing and a shared memory, and on-chip parallelism," he says. Functions that require very high inter-processor bandwidth can be put on a CMP, and those that don't can be distributed across the cluster.

Various types of transaction processing and database systems could make good use of such an architecture, Kennedy says.

While everyone agrees that more processors, more cores and more power can generally be put to good use in big enterprise-wide systems, the future of CMPs on desktops and laptops -- where even single-core processors are idle much of the time -- is not quite so clear.

Multi-threaded game software can put the parallelism to good use, and so perhaps can a few specialised applications, such as speech recognition.

Single-processor-core PCs today can take advantage of multi-tasking, in which one thread, for example, deals with display while another does a long-running computation and another goes out to a server. But what to do with eight processor cores all running at 3.6GHz?

Microsoft's Larus says he knows people are probably having trouble imagining how a single user might take advantage of that kind of system. "To be honest, so are we," he says. "This is a subject of very active discussion here."

The many facets of multi-processing

Intel's Pentium Processor Extreme Edition uses two processor cores, each with its own on-chip cache and each running at the same speed.

Using Intel's Hyper-Threading Technology, each core functions as two logical processors, enabling four-thread functionality, in this example balanced between integer and floating-point arithmetic.

The processor can run multiple applications simultaneously with background tasks such as real-time security and system maintenance. The chip also can use Intel's Virtualisation Technology to run multiple operating systems and/or applications in independent partitions.