This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
The high-performance computing (HPC) scientific/academic sector is accustomed to using commodity server and storage clusters to deliver massive processing power, but comparable large-scale cluster deployments are now found in the high-end enterprise as well.
Large Internet businesses, cloud computing suppliers, media and entertainment organisations, and high-frequency trading environments, for example, now run clusters that are on par with, and in some cases considerably larger than, the top 100 clusters used in HPC.
What differentiates the two environments is the type of networks allied to the application programming models and the problem sets used. In the scientific/academic sector, it is typical to use proprietary solutions to achieve the best performance in terms of latency and bandwidth, while sacrificing aspects of standardisation that simplify support, manageability and closer integration with IT infrastructure. Within the enterprise the use of standards is paramount, and that means heavy reliance upon Ethernet. But plain old Ethernet won't cut it. What we need is a new approach, a new "maverick fabric."
Such a fabric should have a way to eliminate network congestion within a multi-switch Ethernet framework to free up available bandwidth in the network fabric. It should also significantly improve performance by negotiating load-balancing flows between switches with no performance hit, and use a "fairness" algorithm that prioritises packets in the network and ensures that broadcast data or other large-frame traffic, such as localised storage sub-systems, will not unfairly consume bandwidth.
Adaptive routing and loss-less switching
A fundamental problem with legacy Ethernet architecture is congestion, a byproduct of the very nature of conventional large-scale Ethernet switch architectures and also of Ethernet standards. Managing congestion within multi-tiered, standards-based networks is a key requirement to ensure high utilisation of computational and storage capability. The inability to cope with typical network congestion causes:
- Fundamental collapses in network performance, with systems efficiency as low as 10 percent
- Networks that cannot scale in size to match application demands
- Slow and unpredictable network latency, reducing business responsiveness
- Unacceptably high cost of ownership due to bandwidth over-provisioning
But the latency of proprietary server adaptors and standard Ethernet is only one hindrance to achieving the performance necessary for a wider exploitation of Ethernet in HPC environments. Legacy Ethernet switches traditionally have not been conducive to exploitation at large scale given that:
- Underlying standards have not supported loss-less transmission; delivery is best-effort, and recovery is left to TCP packet re-transmission
- Heavyweight algorithms such as Spanning Tree Protocol (STP), used to prevent forwarding loops, have encumbered the use of flat networks and have often required complex tiers of Layer 2 and Layer 3 switches to support scale-out architectures; this imposes significant latency penalties in operation and also necessitates significant bandwidth over-subscription
- Individual switch silicon has imposed severe latency on individual packets and, when compared to the latency at the server side, adds significantly to overall round-trip times in large systems
- Congestion within switches incurred by hotspots in the network can cause catastrophic drop-off in overall bandwidth
Parity among packets
Conventional large-scale Ethernet deployments have relied upon three-tier architectures of access, distribution (or aggregation) and core switches in order to control particular network operations. Such designs have inhibited scalability due to systemic constraints in the architecture: Network resources soon become over-committed, especially in the presence of device-level (east-west) communication.
Such constraints have necessitated a move away from a three-tier model to a flatter network based on leaf-switches, providing access to devices (storage and server), and spine-switches, creating a rich multi-path fabric in which potentially all the available bandwidth can be used to sustain device level communication irrespective of the location of these devices.
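To make the leaf-and-spine arithmetic concrete, the following minimal sketch computes the number of device ports, the number of equal-length paths between any two leaves, and the over-subscription ratio of each leaf. It assumes every leaf has exactly one uplink to every spine; the function name and the port counts and link speeds in the example are hypothetical and are not taken from any particular product.

```python
def leaf_spine_summary(leaves, spines, downlinks_per_leaf, downlink_gbps, uplink_gbps):
    """Summarise a two-tier fabric with one uplink from every leaf to every spine."""
    down_capacity = downlinks_per_leaf * downlink_gbps  # bandwidth facing devices, per leaf
    up_capacity = spines * uplink_gbps                  # bandwidth facing the spines, per leaf
    return {
        "device_ports": leaves * downlinks_per_leaf,
        "paths_between_leaves": spines,                 # one two-hop path through each spine
        "oversubscription": down_capacity / up_capacity,
    }


# Hypothetical fabric: 16 leaves, 4 spines, 48 x 10 Gbps device ports and
# 4 x 40 Gbps uplinks per leaf.
summary = leaf_spine_summary(
    leaves=16, spines=4, downlinks_per_leaf=48, downlink_gbps=10, uplink_gbps=40
)
print(summary)
```

In this example each leaf offers 480 Gbps to devices but only 160 Gbps towards the spines, a ratio of 3:1. A ratio of 1:1 is the condition under which all device bandwidth could, in principle, be sustained between any pair of leaves.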
In practice, however, such networks cannot deliver complete isotropy because they cannot manage congestion as transmit and receive flows change rapidly in operation. Congestion typically forms in one of two places: at an egress port, where the volume of traffic destined for an attached device exceeds the bandwidth available on that egress interface, or within the network, where the aggregate traffic taking a particular path exceeds the bandwidth available on that path.
Both of these scenarios cause traffic to be buffered within the network, leading to head-of-line (HoL) blocking, in which traffic that is not necessarily contributing to the congestion is also affected. The impact can be seen in additional latency or jitter or, worse, frame loss.
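Head-of-line blocking can be illustrated with a toy model of a single shared FIFO buffer. This is a simplified sketch, not a model of any real switch: all packets are assumed to be queued at tick 0, the port names "busy" and "idle" are hypothetical, and the service times stand in for a congested and an uncongested egress port.

```python
from collections import deque


def fifo_delivery_times(packets, service_ticks):
    """Drain one shared FIFO queue and return (port, departure tick) per packet.

    packets: egress port names in arrival order (all queued at tick 0).
    service_ticks: ticks needed to forward one packet to each egress port.
    """
    queue = deque(packets)
    clock = 0
    finished = []
    while queue:
        port = queue.popleft()
        clock += service_ticks[port]  # the head packet holds up everything behind it
        finished.append((port, clock))
    return finished


service = {"busy": 4, "idle": 1}  # the congested port drains four times more slowly
arrivals = ["busy", "busy", "idle", "busy", "idle"]

blocked = fifo_delivery_times(arrivals, service)
idle_times = [tick for port, tick in blocked if port == "idle"]

# The same two "idle" packets with no congested traffic queued ahead of them.
unblocked = fifo_delivery_times([p for p in arrivals if p == "idle"], service)
unblocked_times = [tick for _, tick in unblocked]

print(idle_times, unblocked_times)  # [9, 14] [1, 2]
```

The packets bound for the idle port leave at ticks 9 and 14 rather than 1 and 2, even though their own egress port was never congested.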
A separate issue is that the paths taken by traffic within the network are often based upon a static mapping mechanism that is unaware of network load. Typically this is a hashing mechanism that always sends a given traffic flow along the same path, regardless of any congestion that may lie ahead on that path. To overcome this hurdle, network architects are often forced to over-provision bandwidth, leading to under-utilisation of the available resource, which is both inefficient and costly.
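The difference between static and load-aware path selection can be sketched in a few lines. The code below is an illustration under stated assumptions, not any vendor's algorithm: the static selector hashes a flow identifier with CRC32 (a stand-in for whatever hash a real switch uses), the adaptive selector simply picks the least-loaded path, and the flow names and sizes are hypothetical.

```python
import zlib


def static_path(flow_id, num_paths):
    """Pick a path from a hash of the flow identifier only; load is ignored."""
    return zlib.crc32(flow_id.encode()) % num_paths


def adaptive_path(load_per_path):
    """Pick the currently least-loaded path (ties go to the lowest index)."""
    return min(range(len(load_per_path)), key=lambda i: load_per_path[i])


def place_flows(flows, num_paths, adaptive):
    """Assign each (flow_id, gbps) pair to a path and return the load per path."""
    load = [0] * num_paths
    for flow_id, gbps in flows:
        path = adaptive_path(load) if adaptive else static_path(flow_id, num_paths)
        load[path] += gbps
    return load


# Eight hypothetical flows of 10 Gbps each across four equal-cost paths.
flows = [(f"10.0.0.{i}->10.0.1.{i}", 10) for i in range(8)]

static_load = place_flows(flows, num_paths=4, adaptive=False)
adaptive_load = place_flows(flows, num_paths=4, adaptive=True)

print("static:  ", static_load)
print("adaptive:", adaptive_load)
```

With equal-sized flows the load-aware placement spreads traffic evenly, whereas the hash may place several flows on one path while leaving others lightly used. Because this sketch assigns whole flows to a path, packet ordering within a flow is unaffected; per-packet adaptive routing would need a separate mechanism to preserve ordering.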
When considering the sources of all incoming packets to a switch, conventional Ethernet will allocate outgoing bandwidth fairly among them. However, critically, it pays no regard to the journey these packets have made. An ingress port may have a node attached or it may be the final hop of a large network connecting thousands of nodes. The net result is that large areas of network workloads can be locked out for considerable periods of time and expensive links remain irretrievably idle.
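A short calculation shows how large the imbalance can become. The sketch below assumes a single 10 Gbps egress link shared by two ingress ports, one with a single node attached and one aggregating 1,000 nodes; the figures and function names are hypothetical. The second function shows one possible journey-aware alternative (an equal share per originating node) purely for comparison, and is not claimed to be how any particular fabric implements fairness.

```python
def per_port_fair_share(link_gbps, nodes_behind_port):
    """Split a link equally among ingress ports, then among the nodes behind each port."""
    per_port = link_gbps / len(nodes_behind_port)
    return [per_port / nodes for nodes in nodes_behind_port]


def per_node_fair_share(link_gbps, nodes_behind_port):
    """Split a link equally among all originating nodes, whichever port they arrive on."""
    per_node = link_gbps / sum(nodes_behind_port)
    return [per_node for _ in nodes_behind_port]


nodes = [1, 1000]  # port 0: one attached node; port 1: final hop for 1,000 nodes

port_fair = per_port_fair_share(10.0, nodes)
node_fair = per_node_fair_share(10.0, nodes)

print(port_fair)  # Gbps available to each node behind port 0 and port 1
print(node_fair)
```

Under per-port fairness the single attached node receives 5 Gbps while each of the 1,000 remote nodes receives about 0.005 Gbps, a thousandfold difference that arises only from where the traffic entered the switch.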
Network convergence and congestion avoidance
The principle of convergence, and the trend within the commercial arena to combine storage and networking traffic on the same physical network, raise significant challenges. Can these elements combine further to embrace the protocols necessary to support the cluster and interconnect the LAN and storage in a single unified fabric?
The concept of consolidation is simple, but the commensurate benefits to the end user are considerable - in administration, vendor neutrality, power budget and flexibility, all leading to lower cost and reduced risk for IT departments and cloud providers.
A "maverick fabric" architecture preserves single-switch performance characteristics and throughput by instantiating a common fabric between individual switches connected in an arbitrary topology. A multi-Gigabit Ethernet fabric is maintained throughout the distributed switch deployment comprising, in theory, an infinite number of ports. It also supports multi-path, loop-free routing techniques that combine to provide distributed load-balancing across multi-chassis configurations, effectively removing network congestion while maintaining strict in-sequence parity for reliable and efficient packet delivery.
Embracing the "maverick fabric" concept to achieve a network free from congestion offers considerable advantages for the delivery of HPC-quality services. Avoiding gateways between one fabric and another provides a unique opportunity to move storage from islands of information per cluster to a more distributed enterprise, thus allowing HPC concepts to take a more profitable role in the IT infrastructure of a large-scale business.
The Gnodal ASIC-based Ethernet switch architecture features a congestion-aware performance and workload engine that allows ultra-low-latency transmission. It uses a dynamic, fully adaptive load-balancing mechanism to arbitrate paths equitably for the large data sets, compute-intensive applications and massive storage demands prevalent in HPC, cloud and big data environments, thus utilising virtually 100 percent of the Gnodal fabric's available bandwidth.
For more information, visit www.gnodal.com.