Andy Bechtolsheim, one of the Stanford University students who founded Sun Microsystems, invented the original Sun workstation and guided many subsequent computers into production. Bechtolsheim left Sun in 1995 to start Granite Systems, a Gigabit Ethernet networking company that Cisco later acquired. He then went on to co-found Kealia, an advanced server start-up that Sun acquired last year.
Since then, as Sun's chief architect and senior vice president of network systems, Bechtolsheim has been busy designing the AMD Opteron-based Sun Fire x64 servers -- better known by their code name, Galaxy. An early model has been reviewed on Techworld.
Just prior to the Galaxy launch, Bechtolsheim talked about Galaxy's design, Sun's server strategy, CPU architectures, virtualisation, and the future of SPARC.

Q: What brought you back to Sun? What was the genesis of the Galaxy design?
A: I left Sun in 1995 to start a networking company that Cisco later acquired. At that time, I was fascinated by the Gigabit Ethernet network opportunity, and I didn't see that Sun, as a server company, could pursue that. Similarly, about a year or two ago, when I first heard about the Opteron CPU, I was [drawn] by the market opportunity that would create. Obviously it's too late for a start-up in the server space, so my little start-up company was pursuing a vertical market segment: video servers. But it was clear that the best use of this technology was within a large server company. When Sun announced publicly that they were going to use the Opteron architecture, it was an obvious match.

Sun's previous efforts in the industry standard space were based on OEMing white boxes from Asia. To build differentiation, or to add value to a server, you have to design something better than the mainstream. Since our return to Sun, that's what we've focused on. The real change to the company is we've added an engineering department to focus on building enhanced systems in the industry standard space that are totally Sun designed.

Q: What did you hope to achieve?
A: What we focused on in this Galaxy system is performance, and we are proud to report that these boxes deliver the industry's best performance for the industry standard architecture. This was achieved through a combination of the dual-core technology from AMD and support for a higher-power version of the Opteron chip in these systems. And [Galaxy] is still vastly more power efficient -- roughly twice as power efficient -- than what the competition has.

This new CPU allows us to get benchmark results that exceed the Xeon MP four-way box. To put this in context, the Intel four-way system is, of course depending on the vendor, a 3U or 4U box. Our box is either 1U or 2U and costs half as much and [consumes] one-third the power. It doesn't look very good for Intel, I have to say, in terms of comparing the Xeon MP to the dual-core Opteron.

Q: Why did you choose Serial Attached SCSI drives over SATA?
A: There's a big difference between SAS and SATA in terms of cost, performance, and market position, and the picture differs again between the 3.5-inch technology and the 2.5-inch. On 3.5-inch, SATA is now up to capacity points of 500GB, which is substantially higher than the SAS or SCSI drives. [SATA drives] run at 7,200 rpm, and there are enterprise-quality drives that are pushing a million hours MTBF [mean time between failures]. Basically, on the 3.5-inch side, there is a whole new category of disks now known as enterprise-class SATA that have a very, very appealing size and performance per dollar.

Going back to 2.5-inch, the only SATA disks available in the 2.5-inch form factor run at 5,400 rpm, and they're basically mobile disk drives. They were never designed for an enterprise class environment. The controller we have can support SATA from a protocol standpoint, but the capacity today is limited to about 120GB and the performance compared to the SAS drives is just not very good. As a result, we didn't see any customer interest in 2.5-inch SATA, even though we see lots of interest in 3.5-inch SATA.

The SFF [Small Form Factor] SAS disks are currently running at 10,000 rpm, but they actually deliver better performance than conventional 3.5-inch disks at 10,000 rpm. Sometime next year these drives will go up to 15,000 rpm, at which point they'll be the fastest disks on the planet, because on the 2.5-inch disk the arm doesn't have to move as far as on the 3.5-inch disk. They come in 36GB and 73GB capacity points today, and they will expand going forward.

But the real reason we picked these SFF-SAS disks is that this allowed us to move the disk drives out of the airflow of the CPU. The entire left section of the box is a perforation pattern that's open for airflow so that the air goes through to cool these hot CPUs. As the industry moves from single-core to dual-core to quad-core, the power is not exactly going down. The power density -- how much power these multi-core chips will draw -- will actually go up, even though the power efficiency -- how much throughput per watt you get -- is doubling every time you add more cores.

Q: Is the Galaxy's Opteron CPU still a 95W part?
A: No, the new parts are actually 120W. The box is designed to handle even higher power in the future as those chips come out. But this is a new thing that AMD did for us -- increasing the power -- because we felt strongly that faster is better.
We did all the math based on power consumption and the cost of electricity in California and all that, but what it comes down to is that if you can go 10 per cent faster by having a faster chip, as a result you need 10 per cent fewer systems to have the same throughput. Under any math, you're better off doing that than using more systems with all the memory and disks and operational costs and running slower. The reason there has always been a premium on high-power CPUs is that the value of that speed to the customer of not having to purchase more software licences or more systems or use more rack space is very significant.
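The arithmetic behind that argument can be made concrete with a small sketch. All of the figures below (target throughput, per-server throughput, the 10 per cent speed-up) are hypothetical, chosen only to illustrate the trade-off Bechtolsheim describes:

```python
# Illustration of the trade-off described above: a CPU that is ~10%
# faster means ~10% fewer servers for the same aggregate throughput.
# All numbers here are hypothetical, not real Galaxy figures.
import math

def servers_needed(total_throughput: float, per_server_throughput: float) -> int:
    """Whole servers required to reach a target aggregate throughput."""
    return math.ceil(total_throughput / per_server_throughput)

target = 1000.0                      # arbitrary aggregate throughput units
baseline_perf = 10.0                 # per-server throughput, baseline CPU
faster_perf = baseline_perf * 1.10   # the 10% faster chip

baseline_count = servers_needed(target, baseline_perf)   # 100 servers
faster_count = servers_needed(target, faster_perf)       # 91 servers

# Even if each faster box draws somewhat more power, the nine servers
# you no longer buy each carried their own memory, disks, rack space,
# software licences and operational cost -- which is the premium on
# high-power CPUs that Bechtolsheim is pointing at.
print(baseline_count, faster_count)
```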

Q: Do you see this stage of AMD's technology as competent to handle heavy virtualisation?
A: We have to talk about the exact software here. VMware has mastered the art of virtualisation, so their software works perfectly well on the AMD Opteron and you can run Windows, Linux, Solaris on top; it all works just beautifully. The open source effort called Xen is in early stages. The Xen effort will be helped by a future hardware enhancement that AMD and Intel will be putting into CPUs coming out in the next calendar year. [Intel VT and AMD Pacifica virtualisation technology] will make it easier for the Xen effort to offer the same kind of capability as VMware software today.

Q: You say that Galaxy will do the same work with 10 per cent fewer servers today. Will that ratio improve dramatically when Pacifica lands?
A: That depends on the number of application kernel interactions. There's no single way to quantify that. The Pacifica architecture makes life easier. It will be easier for the open source Xen or Microsoft Virtual Server to be fully functional and perhaps have some performance advantages [over VMware] in some cases. But I don't think the quantity of performance improvement is what's driving this. Today, or historically, there was a huge cost premium for virtualisation. And still a lot of people chose that route because they could save as much on the hardware. But going forward, I think we could assume that a year from now everybody is going to ship virtualisation as part of the basic offering.

Virtualisation is a very important topic. What's really happening in the market of course is this transition from a two-socket single-core to the two-socket dual-core. AMD has it now; Intel will have it next year. Historically the two-socket market -- the two-core market -- was the sweet spot. But [now] the two-socket dual-core is actually the most cost-effective system, which is really a four-way. To take advantage of the four-way, you want to consolidate more workloads on it. But again, this is a very significant transition in the market. Just look at the percentage market share today: four-way systems are less than 10 per cent of all the systems shipped. The rest are the two-ways and the one-ways. Whereas a year from now, you would expect that 90 per cent of all systems will be four-way systems.

Q: Is Sun going to continue the message it had for SPARC, that it really doesn't matter how fast you toggle the clock?
A: Clock rate [alone] is completely meaningless. What matters is the amount of work that is accomplished. For example, Opteron has three integer pipelines internally; Xeon only has two. So there's a two-to-three conversion in terms of productivity. On top of that, the lower clock rate helps tremendously to lower power consumption. Power consumption is linearly related to clock rate, and the 30-stage pipeline on Xeon was really, really bad from a power-consumption standpoint. Both the Opteron pipeline and AMD's future pipeline are much shorter than that. So I think that Intel simply went the wrong way on these micro-fine pipelines that tried to maximise clock speed.
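The "linearly related" claim follows from the standard CMOS dynamic-power approximation, P = C · V² · f: at a fixed voltage, switching power grows in direct proportion to clock frequency. The capacitance and voltage values in this sketch are illustrative placeholders, not real chip figures:

```python
# Rough sketch of the relation behind the point above: CMOS dynamic
# switching power is approximately P = C * V^2 * f, so at a fixed
# core voltage, power scales linearly with clock rate.
# The values of c and v below are hypothetical, not real chip data.

def dynamic_power(capacitance_f: float, voltage_v: float, freq_hz: float) -> float:
    """Approximate CMOS dynamic switching power in watts."""
    return capacitance_f * voltage_v**2 * freq_hz

c, v = 1e-9, 1.3  # assumed effective switched capacitance (F) and core voltage (V)

p_2ghz = dynamic_power(c, v, 2.0e9)
p_3ghz = dynamic_power(c, v, 3.0e9)

# At the same voltage, a 50% higher clock costs ~50% more dynamic power,
# which is why a shorter pipeline at a lower clock can deliver better
# work-per-watt than a deep pipeline chasing maximum frequency.
print(p_3ghz / p_2ghz)  # ratio is ~1.5
```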

But let me go back to clock rate. For an architecture like Opteron, the scaling we see with increasing clock rate is pretty linear. The memory controller is on chip, so there's no other element like the front side bus that's at a certain speed that doesn't improve.

Q: And the memory controller always runs at the CPU clock speed.
A: Yes, and that has been a surprisingly major improvement for AMD. We were comparing some benchmarks from the Xeon MP with the 8MB cache, and even there the larger cache does not make up for the fact that the memory is so far away from the CPU.

Q: Or that communication with other CPUs and all I/O devices has to run through a single northbridge.
A: Exactly. The I/O performance on Opteron is also very good. Meaning we can support a large number of Fibre Channel adapters or any other kind of I/O -- InfiniBand going forward, 10 Gigabit Ethernet -- at wire speed given the memory bandwidth, which is more difficult, let's just say, on an Intel system. Again, Intel is working overtime to correct things, so I don't want to turn this into an AMD-versus-Intel discussion. But we're very happy with the performance we're getting out of Opteron. Certainly in technical markets, or within any market where the primary decision criterion is performance, Opteron wins hands down.

Q: Tell me about the throughput of the system. What's the speed of the HyperTransport in this box?
A: The systems we're shipping use the full gigahertz HyperTransport. And that makes a difference, by the way. We saw a significant increase over some earlier systems that were not running at the full speed. The memory speed, 400MHz, is also very important, particularly on floating point and memory-intensive applications.

On the enterprise systems we always use two HyperTransports for I/O, and again this made a difference in terms of the total I/O capacity we can get. It's tough to talk about peak performance here -- peak I/O bandwidth -- because I/O bandwidth is limited by the PCI slots, but we are not limited by the internal HyperTransport bandwidth in any scenario.

Q: You were talking earlier about wire speed I/O and interfaces like InfiniBand. How special is your implementation of Ethernet in terms of performance?
A: That's industry standard, but we have wire-speed performance on Gigabit Ethernet of course. When I was talking about InfiniBand, what I really meant to say was that today that's a niche market, but the OpenIB effort -- which is a way to do a horizontal stack that's part of the Linux operating system rather than being provided by third-party vendors -- that's coming along pretty well. We think roughly a year from now there will be a stable, high-functionality, OpenIB stack available, which really means that it will have the full support of the Linux operating system.

We're working on a similar stack for Solaris that should be available next year as well. At that point you could run your storage over InfiniBand, your cluster, your MPI …. It's a great technology for high-performance clustering and high-performance storage. Obviously Fibre Channel is the entrenched interface there, and will be around for many, many years. But quite frankly, InfiniBand provides significantly better cost performance.

Q: Could you have done Galaxy with Intel technology?
A: Intel has 80 per cent or whatever of the market. Lots and lots of people are reselling their technology. We could have built exactly the same product as anybody else, but we couldn't have built a better product. What we have been trying to do here is to deliver superior cost performance to the market, and AMD enabled us to do that. That doesn't mean we're anti-Intel. If Intel has a superior chip in the future, we'll consider using that. But in the meanwhile AMD looks really good.

Q: Moving to in-house engineered systems allows Sun to tune the motherboard and give it unique characteristics while staying within standards. Do you envision future Sun software that takes advantage of your unique engineering?
A: Yes. This is largely in the area of the fault-management architecture, which is how the service processor communicates with the operating system. We can add more features there to Solaris than what's available today under Linux. That doesn't mean it couldn't be added, but somebody has to do the work in the Linux world or on the Microsoft side to perform the information exchange between the main CPU and the management processor. These are things like [allowing] the management processor [to] see the bad pages or the bad memory, disk, etc. that the operating system discovers by being the operating system.

Q: Reports of your return have been framed as a white knight scenario, where you are described as the creator of the server that will save the company. That's ridiculous of course, but how would you describe the mood and spirit of the company right now?
A: At the 100,000-foot level, the vast majority of Sun's business continues to be SPARC. What's exciting there is that the new [multi-threaded] architecture for SPARC -- it's called Niagara internally -- delivers much better throughput than anybody thought would happen. All the benchmarks are coming in, and it's a big positive surprise to people. That new chip will reinvigorate the SPARC side of the house as well, and it confirms the design decisions that were made there, in terms of the multi-core, the multi-threading, to really gain an unbelievable amount of throughput per watt.

Now, SPARC does not have the same single-threaded performance as the Opteron chip, so for technical kinds of applications the Opteron chip is probably the choice. But we see very positive customer response to these new SPARC chips, and on throughput per watt and throughput per rack unit, they beat Opteron hands down. It's not that SPARC is at a dead end here.

Instead, we are focused on enhancing these multi-core, multi-threaded chips to have even more throughput going forward. One of the advantages of the RISC architecture compared to the industry standard architecture is that each core is smaller, so you can have more cores per chip with SPARC than you can with Opteron. That advantage is not going to change. SPARC is alive and well, and the SPARC business is not just stabilising but showing some improvements along these lines.

I think that is improving the mood at Sun, but it is true that if a company sees declining market share against another technology that it hadn't participated in, that does not create a great feeling. The way we are going to get out of this is by having both good products in the industry standard space as well as reinvigorated products on the SPARC side.