As 2013 rolls in and the economy stabilises, many IT organisations are looking to upgrade their computational and storage systems. As with any IT purchasing decision, there are tradeoffs to weigh and choices to make about hardware features and the technology available. When it comes to storage servers, the first step is understanding your CPU options.

Intel vs. AMD

For at least this year, the two server CPU choices remain Intel and AMD. ARM might solve some of the computational parts of some of the problems, but in 2013 ARM won't have enough I/O bandwidth to drive 10 Gigabit Ethernet ports and storage, so it is not a viable alternative. This might change in 2014, but it's too soon to predict because developing PCIe buses with enough performance is complex.

The latest AMD CPUs have 16 cores, but only if you are running integer operations. When it comes to floating-point operations, you have only eight cores. This, combined with the fact that the latest Intel server processors can read and write data from memory significantly faster than AMD processors, means that AMD processors should be relegated to operations with low computational intensity that do not require high memory bandwidth. You might think of things like VMs, but more on why this is not a good idea later.

Communications between CPU sockets

Another place where Intel has a major advantage is communication between CPU sockets. The current crop of Intel server CPUs supports 25.6 gigabits per second (Gbps) of I/O bandwidth between CPU sockets over the QuickPath Interconnect (QPI).

This inter-socket bandwidth, combined with the per-socket memory bandwidth, exceeds what current AMD CPUs deliver. On multi-socket machines, this has a dramatic impact on performance across all of the sockets, because a process might request memory that has been allocated on another socket.

PCIe bus drives Intel ahead

PCIe is where the rubber meets the road on why the latest Intel processors are far ahead of their AMD competitors. The Intel technology on the latest server CPUs runs PCIe 3 with 40 lanes on each CPU.

That means that the PCIe bus and the CPU are capable of 40Gbps of I/O bandwidth. This is far greater than the bandwidth available on AMD processors. So if you need to do a lot of network I/O or disk I/O, PCIe 3 is the better choice: it doubles the bus performance of PCIe 2.0, and the Intel CPU also supports more PCIe lanes.

It's Intel's year but there are still issues

There is one problem with the new Intel CPUs that becomes more noticeable with quad-socket configurations. As mentioned earlier, the PCIe bus is on the CPU socket, so with four sockets you have four PCIe buses with 40 lanes each, for a total of 160 lanes at 1Gbps per lane. That is a lot of I/O bandwidth, but looking a bit deeper there is a problem:

  1. The QPI connection between sockets is a dual-channel link at 12.8Gbps per channel, for a total of 25.6Gbps.
  2. The PCIe bandwidth of a socket is 40 lanes at 1Gbps per lane, or 40Gbps of PCIe bandwidth to the socket.

Problems quickly arise when PCIe traffic exceeds 25.6Gbps and the process requesting access to a PCIe bus is not running on the socket that owns that bus. Some attempted workarounds lock processes to the socket whose PCIe bus needs to be read or written, but this does not work for all applications. For example, applications with data coming in and going out of multiple locations, such as a striped file system, are still affected because you cannot break up a request and move each piece to the socket that owns the relevant PCIe bus.
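The pinning workaround described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method any particular site used: it assumes a Linux host where sysfs exposes a `local_cpulist` file for each PCI device, the helper names are my own, and the device address in the usage comment is a hypothetical placeholder.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-7,16-23" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_local_to_device(pci_address):
    """Return the CPUs on the socket that owns the given PCIe device.

    Assumes Linux sysfs; pci_address looks like "0000:81:00.0".
    """
    path = f"/sys/bus/pci/devices/{pci_address}/local_cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_current_process_to_device(pci_address):
    """Restrict the calling process to CPUs local to the device (Linux only)."""
    cpus = cpus_local_to_device(pci_address)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


# Hypothetical usage; the address is a placeholder for a real HBA or NIC:
# pin_current_process_to_device("0000:81:00.0")
```

Pinning of this kind only helps when a process's I/O goes through a single device. A process that touches devices attached to several sockets, as with a striped file system, will still send part of its traffic across QPI.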

The real-world performance for general-purpose applications running on a four-socket system is likely to be about 90 percent of the QPI bandwidth between sockets (or 23Gbps) unless the data goes out on the socket that owns the PCIe bus. One I/O in four, if they are equally distributed, will run at 40Gbps, so the average performance would be (3 x 23Gbps + 40Gbps)/4, or about 27.25Gbps per socket for a quad-socket system.
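The arithmetic above can be written as a small function. This is a sketch of my back-of-envelope model only, in the same units used in this article: the function and parameter names are illustrative, and the default of 23 is the rounded 90 percent of 25.6 used in the estimate.

```python
def average_socket_bandwidth(sockets, pcie_gbps=40.0, qpi_effective_gbps=23.0):
    """Average per-socket I/O bandwidth when I/O is spread equally over sockets.

    One request in `sockets` lands on the local PCIe bus and runs at full
    PCIe speed; the remaining requests cross QPI and run at the effective
    QPI rate.
    """
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    remote = sockets - 1
    return (remote * qpi_effective_gbps + pcie_gbps) / sockets


print(average_socket_bandwidth(4))  # 27.25
```

The model assumes every remote request is limited by QPI alone and ignores contention on the link, so treat the result as a rough guide rather than a measurement.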

This is, of course, the average based on equal distribution of the processes and I/O across the PCIe buses. Giving a process affinity to the processor that owns its PCIe bus will significantly improve that average, but it is often difficult to architect a system so that every task is tied to one PCIe bus and the process always runs on the CPU with that bus. The probability of hitting this limitation is higher with a quad-socket system than with a dual-socket system.

The diagram below shows an example of a dual-socket system that, though having the same issues, reduces the potential of hitting that architectural limitation.

My estimate for performance for a dual-socket system is (23Gbps + 40Gbps)/2, or an average socket performance of 31.5Gbps. On a dual-socket system it is much easier to architect the system so that you can put the right I/O on the right CPU and achieve near-peak performance.
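To see how much placement matters, the same estimate can be expressed in terms of the fraction of I/O that stays on the local socket. This is a hedged sketch of the model used in this article, not a benchmark; `expected_bandwidth` and `local_fraction` are names I have made up for illustration.

```python
def expected_bandwidth(local_fraction, pcie_gbps=40.0, qpi_effective_gbps=23.0):
    """Expected per-socket bandwidth for a given share of socket-local I/O.

    local_fraction is the share of requests served by the PCIe bus on the
    socket where the process runs; the rest are assumed to cross QPI.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_fraction = 1.0 - local_fraction
    return local_fraction * pcie_gbps + remote_fraction * qpi_effective_gbps


scenarios = [
    ("quad-socket, equal spread", 1 / 4),
    ("dual-socket, equal spread", 1 / 2),
    ("fully pinned to local bus", 1.0),
]
for label, fraction in scenarios:
    print(f"{label}: {expected_bandwidth(fraction):.2f}")
```

With the figures used here, the three scenarios come out at 27.25, 31.50 and 40.00. Equal spread on two sockets already keeps half the I/O local, which is why the dual-socket average starts higher before any tuning is done.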

CPU conclusions are counter-intuitive

New Intel systems have far more I/O bandwidth than previous systems and they have more than anything available from AMD. ARM is not currently competitive if you need to move lots of data in and out of the system.

The current Intel quad-socket systems will average about 27.25Gbps per socket unless significant work is done to architect the system so that processes run on the processors that own the PCIe buses they use. The IOPS performance of the system will, of course, be higher, as IOPS is not affected by the QPI bandwidth limitation.

It is easier to get higher performance from dual-socket systems, and their average per-socket performance is about 4.25Gbps higher (31.5Gbps versus 27.25Gbps). So my conclusion is that you are better off using dual-socket systems rather than quad-socket systems for high I/O bandwidth requirements. This, of course, is counterintuitive, but it is the best strategy given the current Intel architecture.

You will most likely see Ivy Bridge server processors in 2013, and the QPI bandwidth will go way up, so with Ivy Bridge, quad-socket systems will likely make sense. More on this after the Ivy Bridge server processors are released.