I constantly tell myself I’m not nuts. And yes, it’s disconcerting when I answer myself, but when some seemingly inconceivable theory I’ve concocted turns into reality years later, I become grounded again. It took almost 10 years, but Sun Microsystems has made the first moves toward making my most insane vision real.
The vision? That the ideal computer for business apps would be built not with one or two or 32 blazingly fast CPUs in an SMP arrangement, but with a large array of very simple, comparatively slow processors.
By way of illustration, I proposed 64 Zilog Z80 microprocessors. The key to the design would be that memory and I/O buses would match the master CPU clock speed as closely as possible. The ideal implementation -- it seems unreachable, but who knows? -- would create a system in which, under typical load, RAM would operate at the processors’ clock speed. Nirvana would be achieved by synchronising the CPUs like pistons on a crankshaft, with no two attempting to access the same bank of RAM at the same time. With that structure in place, access to external memory would be fast enough to shrink the size of on-chip L2 (Level 2) cache to the point of eventual elimination.
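The crankshaft idea can be sketched as a simple rotating assignment. This is a toy model of my own, not any real machine’s memory controller: on each cycle, CPU *i* is granted exclusive access to bank (*i* + cycle) mod *N*, so with one bank per CPU, no two processors ever contend for the same bank and none ever waits for RAM.

```python
# Toy model of crankshaft-style memory interleaving (illustrative only,
# not a real scheduler): on each cycle, CPU i is granted exclusive
# access to bank (i + cycle) % NUM_BANKS, so accesses rotate like
# pistons firing in sequence and no two CPUs ever collide on a bank.

NUM_CPUS = 64   # the hypothetical array of Z80-class processors
NUM_BANKS = 64  # one bank per CPU keeps the rotation collision-free

def bank_for(cpu: int, cycle: int) -> int:
    """Bank assigned to `cpu` on a given `cycle`."""
    return (cpu + cycle) % NUM_BANKS

def assert_no_collisions(cycles: int = 256) -> None:
    """Check that on every cycle, each CPU lands on a distinct bank."""
    for cycle in range(cycles):
        banks = {bank_for(cpu, cycle) for cpu in range(NUM_CPUS)}
        assert len(banks) == NUM_CPUS, f"collision on cycle {cycle}"

assert_no_collisions()
```

Because the assignment is a pure rotation, every CPU visits every bank in turn, and the proof of no contention is just that the 64 assignments on any cycle are 64 distinct banks.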
What non-gearhead knows or cares about L2 cache and CPU complexity? I liken it to passenger space. A third to a half of the interior of an AMD Opteron CPU is not available for computing. In a car, L2 cache and deep CPU pipelines are the bucket seats, design contours, armrests, air bags, centre console, cup holders, and over-sized boot. If you were to gut your four-door saloon down to its outer shell and rebuild it with nothing but bench seats, you could carpool seven comfortably, nine if everyone in the cabin practises good hygiene. Everybody gets to work on time, and the energy savings are enormous.
A massively parallel system built with slow processors on a fast bus could carry several throughput-constrained tasks to completion, simultaneously, without most of the round-robin stop and go that slows entry-level to midlevel SMP systems. Oodles of slow CPUs that never wait for RAM? I want that. Peripherals that work asynchronously, manage queues of requests, and move data directly to and from memory -- I want that, too.
Although my ideal remains distant, I see more than a silhouette in Sun’s Niagara. Instead of a cluster of discrete CPUs, Niagara burns eight SPARC cores onto one chip, each capable of executing four threads simultaneously. In ideal operation -- again, unreachable, but who knows? -- 32 execution engines can all pump at once with very fast pathways to RAM and peripherals. It’s the culmination of years of design around Sun’s “throughput computing,” a brilliant concept hampered by a necessarily unglamorous execution. If the cores run too fast or get too fancy trying to predict what a thread will do next, parallelism suffers, and the ideal is lost.
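Why slow, heavily threaded cores can win on throughput comes down to hiding memory stalls. Here is a back-of-the-envelope model with made-up numbers (not measurements of Niagara): assume each thread alternates one cycle of useful work with a fixed stall while it waits on memory, and the core switches to a ready thread whenever the current one stalls.

```python
# Toy model of why a 4-way-threaded core keeps its pipeline busier
# than a single-threaded one (illustrative numbers, not Niagara data).
# Each thread alternates 1 cycle of work with MISS_LATENCY idle cycles;
# the core round-robins to any ready thread when the current one stalls.

MISS_LATENCY = 3  # hypothetical memory-stall length, in cycles

def utilization(threads_per_core: int) -> float:
    """Fraction of cycles the pipeline does useful work."""
    # A thread's pattern occupies 1 busy cycle out of every
    # (1 + MISS_LATENCY) cycles. With enough threads, the stalls
    # overlap and the pipeline never goes idle.
    period = 1 + MISS_LATENCY
    return min(1.0, threads_per_core / period)

print(utilization(1))  # 0.25 -- one thread leaves the pipeline idle 75% of the time
print(utilization(4))  # 1.0  -- four threads hide the stalls completely
```

Under these assumed numbers, each core’s four threads are exactly enough to paper over every stall, which is the sense in which Niagara’s 32 execution engines can all pump at once.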
But Sun’s got the right idea. And if you want confirmation from someone other than yours truly, ask Intel; Niagara foreshadows Intel’s strategy. When Intel reminds us that dual core is just its first take on multi-core, it’s telling us that lots of pokey cores on one CPU may not look all that thrilling on paper, but we’ll lose no ground in net performance with current apps. And as developers get serious about multi-threading apps, the parallelism benefits of such architectures will take off.
An armada of slow CPUs? I knew I wasn’t hallucinating.