Server virtualisation has, without doubt, taken the IT industry by storm. It provides a cost-effective way to dramatically reduce downtime, increase flexibility, and use hardware much more efficiently.
However, small and medium businesses often find it hard to evaluate whether virtualisation is an appropriate fit and, if it is, how to adopt it with a small IT staff and limited funding. Larger companies with deeper IT benches have an easier time making that call, but it can still be a challenge.
Whether you're big or small, this six-stage virtual case study explores the key considerations and deployment approaches you should examine when virtualising your servers. Each part covers a key stage of the virtualisation process, using the fictional Fergenschmeir to lay out the issues, potential mistakes, and realistic results any company should be aware of -- but that you're not likely to see in a typical white paper or case study.
So follow along with Eric Brown, Fergenschmeir's new infrastructure manager, his boss Brad Richter, upper management, and Eric's IT team to find out what did and didn't work for Fergenschmeir as the company virtualised its server environment.
Stage 1: Determining the rationale
The idea of implementing production server virtualisation came to Fergenschmeir from several directions. In May 2007, infrastructure manager Eric Brown had just hired Mike Beyer, an eager new summer intern. One of the first questions out of Mike’s mouth was, “So how much of your server infrastructure is virtual?” The answer, of course, was none. Although the software development team had been using a smattering of EMC VMware Workstation and Server to aid their development process, they hadn’t previously considered bringing it into production. But an innocent question from an intern made Eric give it more serious thought. So he did some research.
Eric started by talking to his team. He asked about the problems they’d had, and whether virtualisation could be a solution. There were obvious wins to be had, such as the portability of virtual guest servers. Additionally, they would no longer be dependent on specific hardware, and they would be able to consolidate servers and reduce IT overhead.
The actual business motivation came a month later. The server running Fergenschmeir’s unsupported, yet business-critical, CRM application crashed hard. Nobody knew how to reinstall the application, so it took four days of downtime to get the application brought back up. Although the downtime was largely due to the original software developer being defunct, this fiasco was a serious black mark on the IT department as a whole and a terrible start for Eric’s budding career at Fergenschmeir.
The final push toward virtualisation was a result of the fact that Fergenschmeir’s CEO, Bob Tersitan, liked to read IT industry magazines. The result of this pastime was often a tersely worded e-mail to Brad that might read something like, “Hey. I read about this Web portal stuff. Let’s do that. Next month? I’m on my boat -- call cell.” Usually Brad could drag his feet a bit or submit some outlandish budgetary numbers and Bob would move on to something else more productive. In this case, Bob had read a server virtualisation case study he found on Techworld, and the missive was to implement it to solve the problems that Fergenschmeir had been experiencing. Bingo! Eric had already done the research and now had executive OK to go forward. The fiasco turned into an opportunity.
Stage 2: Doing a reality check
But Eric was concerned that there was very little virtualisation experience within his team. The intern, Mike Beyer, was by far the best resource Eric had, but Mike had never designed a new virtualisation architecture from the ground up -- just peripherally administered one.
Eric also faced resistance from his staff. Eric’s server administrators, Ed Blum and Mary Edgerton, had used VMware Server and Microsoft Virtual Server before and weren’t impressed by their performance. Lead DBA Paul Marcos said he’d be unwilling to deploy a database server on a virtual platform because he had read that virtual disk I/O was terrible.
Eric and his CTO Brad Richter had already assured CEO Bob Tersitan that they’d have a proposal within a month, so, despite the obstacles, they went ahead. They started by reading everything they could find on how other companies had built their systems. Eric asked Mike to build a test platform using a trial version of VMware’s ESX platform, as that seemed to be a popular choice in the IT-oriented blogosphere.
Within a few days, Mike had an ESX server built with a few test VMs running on it. Right away, it was clear that virtualisation platforms had different hardware requirements than normal servers did. The 4GB of RAM in the test server wasn’t enough to run more than three or four guest servers concurrently, and the network bandwidth afforded by the two onboard network interfaces might not be sufficient for more virtual servers.
But even with those limits, the test VMs they did deploy were stable and performed significantly better than Eric’s team had expected. Even the previously sceptical Paul was impressed with the disk throughput. He concluded that many of the workgroup applications might be good candidates for virtualisation, even if he was still unsure about using VMs for their mission-critical database servers.
With this testing done, Brad and Eric were confident they could put a plan on Bob’s desk within a few weeks. Now they had to do the critical planning work.
Stage 3: Planning around capacity
Brad started the process by asking his teams to provide him with a list of every server-based application and the servers that they were installed on. From this, Eric developed a dependency tree that showed which servers and applications depended upon each other.
Assessing server roles
As the dependency tree was fleshed out, it became clear to Eric that they wouldn’t want to retain the same application-to-server assignments they had been using. Out of the 60 or so servers in the data centre, four of them were directly responsible for the continued operation of about 20 applications. This was mostly due to a few SQL database servers that had been used as dumping grounds for the databases of many different applications, sometimes forcing an application to use a newer or older version of SQL than it supported.
Furthermore, there were risky dependencies in place. For example, five important applications were installed on the same server. At the same time, Eric and Brad discovered significant inefficiencies, such as five servers all being used redundantly for departmental file sharing.
Eric decided that the virtualised deployment needed to avoid these flaws, so the new architecture had to eliminate unnecessary redundancy while also distributing mission-critical apps across physical servers to minimise the risks of any server failures. That meant a jump from 60 servers to 72 and a commensurate increase in server licences.
Determining virtualisation candidates
With the architecture now determined, Eric had to figure out what could be deployed through virtualisation and what should stay physical. Figuring out the answer to this was more difficult than he initially expected.
One key question was the load on each server, a key determinant of how many physical virtualisation hosts would be needed. It made no sense to virtualise an application load that was already making full use of its hardware platform. The initial testing showed that the VMware hypervisor ate up about 10 percent of a host server’s raw performance, so the real capacity of any virtualised host was 90 percent of its dedicated, unvirtualised counterpart. Any application whose utilisation was above 90 percent would likely see performance degradation and would offer no consolidation benefit.
But getting those utilisation figures was not easy. Using Perfmon on a Windows box, or a tool like SAR on a Linux box, could easily show how busy a given server was within its own microcosm, but it wasn’t as easy to express how that microcosm related to another.
For example, Thanatos -- the server that ran the company’s medical reimbursement and benefit management software -- was a dual-socket, single-core Intel Pentium 4 running at 2.8GHz whose load averaged at 4 percent. Meanwhile, Hermes, the voicemail system, ran on a dual-socket, dual-core AMD Opteron 275 system running at 2.2GHz with an average load of 12 percent. Not only were these two completely different processor architectures, but Hermes had twice as many processor cores as Thanatos. Making things even more complicated, processor utilisation wasn’t the only basic resource that had to be considered; memory, disk, and network utilisation were clearly just as important when planning a virtualised infrastructure.
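The cross-platform comparison Eric needed can be approximated by expressing each server's load in absolute terms rather than as a percentage of its own hardware. As a rough illustration (the server specs come from the text above; the GHz-cores metric is a simplification that deliberately ignores per-clock architectural differences between a Pentium 4 and an Opteron, which is exactly the gap real capacity-planning tools fill):

```python
# Crude "equalised load" comparison: express each server's CPU use in
# GHz-cores consumed, not as a percentage of its own hardware. This is
# a first cut only -- it ignores per-clock architecture differences.

def ghz_cores_used(sockets: int, cores_per_socket: int,
                   ghz: float, avg_utilisation: float) -> float:
    capacity = sockets * cores_per_socket * ghz   # total GHz-cores available
    return capacity * avg_utilisation             # GHz-cores actually in use

# Figures from the article:
thanatos = ghz_cores_used(2, 1, 2.8, 0.04)  # benefits server, 4% average load
hermes   = ghz_cores_used(2, 2, 2.2, 0.12)  # voicemail server, 12% average load

print(f"Thanatos uses ~{thanatos:.2f} GHz-cores")
print(f"Hermes uses   ~{hermes:.2f} GHz-cores")
```

On this crude metric, Hermes consumes nearly five times the absolute CPU of Thanatos, even though both look almost idle in their own monitoring tools.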
Eric quickly learned that this was why so many applications existed for performing capacity evaluations. If he’d had only 10 or 20 servers to consider, it might have been easier and less expensive to crack open Excel and analyse the numbers himself. He could also have virtualised the loads incrementally and measured real-world utilisation as he went, but he knew the inherent budgetary uncertainty wouldn’t appeal to CEO Bob Tersitan and CFO Craig Windham.
So, after doing some research, Eric suggested to Brad that they bring in an outside consulting company to do the capacity planning. Eric asked a local VMware partner to perform the evaluation, only to be told that the process would take a month or two to complete. The consultants said it was impossible to provide a complete, accurate server utilisation analysis without watching the servers for at least a month; anything shorter would fail to capture the load of processes that weren’t always active, such as weekly and month-end report runs.
That delay made good technical sense, but it meant Eric and Brad couldn’t meet Bob’s deadline for the implementation proposal. Fortunately, Craig was pleased that an effort was being made to make the proposal as accurate as possible, and his support eventually made Bob comfortable with the delay.
The delay turned out to be good for Eric and Brad, as many other planning tasks hadn’t come close to completion yet, such as choosing the hardware and software on which they’d run the system. The analysis period gave them breathing room to work and to figure out what they didn’t know.
When the initial capacity planning analysis did arrive some time later, it showed that most of Fergenschmeir’s application servers were running at or below 10 percent equalised capacity, allowing for significant consolidation of the expected 72 server deployments. A sensible configuration would require eight or nine dual-socket, quad-core ESX hosts to comfortably host the existing applications, leave some room for growth, and support the failure of a single host with limited downtime.
Stage 4: Selecting the platforms
The virtualisation engine
It was obvious that any hardware they chose had to be compatible with VMware ESX, the virtualisation software they had tested, so infrastructure manager Eric Brown’s team started checking the VMware hardware compatibility list. But server administrator Mary Edgerton stopped the process with a simple question: “Are we even sure we want to use VMware?”
Nobody had given that question much thought in the analysis and planning done so far. VMware was well known, but there were other virtualisation platforms out there. In hindsight, the only reason Eric’s team had been pursuing VMware was due to the experience that the intern, Mike Beyer, had with it. That deserved some review.
From Eric’s limited point of view, there were four main supported virtualisation platforms to choose from: VMware Virtual Infrastructure (which includes VMware ESX Server), Virtual Iron, XenSource, and Microsoft’s Virtual Server.
Eric wasn’t inclined to go with Microsoft’s technology: from his reading, and from input from the other server administrator, Ed Blum, who had used Microsoft Virtual Server before, it was less mature and didn’t perform as well as VMware. Concerns over XenSource’s maturity also gave Eric pause, and industry talk that XenSource was a potential acquisition target created uncertainty he wanted to avoid. (And indeed it was later acquired.)
Virtual Iron, on the other hand, was a different story. It was much closer to VMware in maturity, from what Eric could tell, and it cost about a quarter as much. That price difference made Eric think twice, so he talked over the pros and cons of each platform with CTO Brad Richter at some length.
In the end they decided to go with VMware as they had originally planned. The decision came down to the greater number of engineers who had experience with the more widely deployed VMware platform and the belief that there would also be more third-party tools available for it. Another factor was that CEO Bob Tersitan and CFO Craig Windham had already heard the name VMware. Going with something different would require a lot of explanation and justification -- a career risk neither Eric nor Brad were willing to take.
The server selection
With the platform question settled and the initial capacity planning analysis in hand -- it indicated the need for eight or nine dual-socket, quad-core ESX hosts -- the IT group turned its focus back to selecting the hardware platform for the revamped data centre. Because Fergenschmeir already owned a lot of Dell and HP hardware, the initial conversation centred on those two vendors. Pretty much everyone on Eric’s team had horror stories about both, so they weren’t entirely sure what to do. The general consensus was that HP’s equipment was better in quality but Dell’s cost less. Eric didn’t much care at an intellectual level -- both worked with VMware’s ESX Server, and his team knew both brands. But Ed and Mary, the two server administrators, loved HP’s management software, so Eric felt more comfortable with that choice.
Before Eric’s team could get down to picking a server model, Bob made his presence known again by sending an e-mail to Brad that read, “Read about blades in Techworld. Goes well with green campaign we’re doing. Get those. On boat; call cell. -- Bob.” It turned out that Bob had made yet another excellent suggestion, given the manageability, power consumption, and air conditioning benefits of a blade server architecture.
Of course, this changed the hardware discussion significantly. Now, the type of storage chosen would matter a lot, given that blade architectures are generally more restrictive about what kinds of interconnects can be used, and in what combination, than standard servers.
For storage, Eric again had to reconsider the skills of his staff. Nobody in his team had worked with any SAN, much less Fibre Channel, before. So he wanted a SAN technology that was cheap, easy to configure, and still high-performance. After reviewing various products, cross-checking the ESX hardware compatibility list, and comparing prices, Eric decided to go with a pair of EqualLogic iSCSI arrays -- one SAS array and one SATA array for high- and medium-performance data, respectively.
This choice then dictated a blade architecture that could support a relatively large number of gigabit Ethernet links per blade. That essentially eliminated Dell from the running, narrowing the choices to HP’s c-Class architecture and Sun’s 6048 chassis. HP got the nod, again due to Mary’s preference for its management software. Each blade would be a dual-socket, quad-core server with 24GB of RAM and six gigabit Ethernet ports. The IT team could add RAM per blade later if the hosts became memory-constrained, but this configuration seemed a good starting point.
The network selection
The next issue to consider was what equipment Eric’s team might need to add to the network. Fergenschmeir’s network core consisted of a pair of older Cisco Catalyst 4503 switches that drew together all of the fibre from the network closets but didn’t provide enough copper density to serve all of the servers in the data centre -- certainly not enough to dual-home all of the servers for redundancy. The previous year, someone had added an off-brand gigabit switch to take up the slack, and that obviously needed to go.
After reviewing some pricing and spec sheets, Eric decided to go with two stacks of Catalyst 3750E switches and push the still-serviceable 4503s out to the network edge. One pair of switches would reside in the telco room near the fibre terminations and perform core routing duties, while the other pair would sit down the hall and switch the server farm.
In an attempt to future-proof the design, Eric decided to get models that could support a pair of 10G links between the two stacks. These switches would ultimately cost almost as much as a single, highly redundant Catalyst 6500-series switch, but going that route would have meant either retaining the massive bundle of copper running from the telco room to the data centre or extending the fibre drops through to the data centre. Neither prospect was appealing.
All told, the virtualisation hardware and software budget was hanging right around $300,000. That included about $110,000 in server hardware, $40,000 in network hardware, $100,000 in storage hardware, and about $50,000 in VMware licensing.
This budget was based on the independent consultant’s capacity planning report, which indicated that this server configuration would conservatively achieve a 10:1 consolidation ratio of virtual to physical servers, meaning eight physical servers to handle the 72 application servers needed. Adding some failover and growth capacity brought Eric up to nine virtualisation hosts and a management blade.
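The arithmetic behind the host count is simple enough to sanity-check. A minimal sketch, using the consultant's conservative 10:1 consolidation ratio and the 72 planned virtual servers from the text (the extra host is the failover-and-growth capacity mentioned above):

```python
import math

VM_COUNT = 72             # planned virtual servers (from the consolidation plan)
CONSOLIDATION_RATIO = 10  # consultant's conservative virtual-to-physical ratio

base_hosts = math.ceil(VM_COUNT / CONSOLIDATION_RATIO)  # hosts to carry the load
total_hosts = base_hosts + 1  # N+1: absorb a single host failure, allow growth

print(base_hosts, total_hosts)  # 8 9
```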
This approach meant that each virtualised server -- including a completely redundant storage and core network infrastructure but excluding labour and software licensing costs -- would cost about $4,200. Given that an average commodity server generally costs somewhere between $5,000 and $6,000, this seemed like a good deal. Factoring in that commodity servers offer no non-application-specific high availability or load balancing capabilities, and are likely to sit more than 90 percent idle, it was an amazing deal.
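The per-server figure checks out against the budget lines above. A sketch, assuming (one plausible reading) that the full $300,000 is spread evenly across the 72 planned virtual servers:

```python
# Sanity check of the roughly $4,200-per-server figure. Assumption: the
# full $300,000 budget (hardware plus VMware licensing) is divided
# evenly across the 72 planned virtual servers.

budget = {
    "server hardware":  110_000,
    "network hardware":  40_000,
    "storage hardware": 100_000,
    "VMware licensing":  50_000,
}

total = sum(budget.values())  # $300,000
per_vm = total / 72           # ~$4,167 -- close to the quoted figure

print(f"total:  ${total:,}")
print(f"per VM: ${per_vm:,.0f}")
```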
Before they knew it, Eric and Brad had gotten Bob’s budget approval and were faxing out purchase orders.
Stage 5: Deploying the virtualised servers
About a month after the purchase orders went out for the hardware and software selected for the server virtualisation project, the Fergenschmeir IT department was up to its elbows in boxes. Literally.
This was because server administrator Mary Edgerton had ordered the chosen HP c-Class blades from a distributor instead of buying them directly from HP or a VAR and having them pre-assembled. This way, she could do the assembly (which she enjoyed) herself, and it would cost less.
As a result of this decision, more than 120 parcels showed up at Fergenschmeir's door. Just breaking down the boxes took Mary and intern Mike Beyer most of a day. Assembling the hardware wasn't particularly difficult; within the first week, they had assembled the blade chassis, installed it in the data centre, and worked with an electrician to get new circuits wired in. Meanwhile, the other administrator, Ed Blum, had been working some late nights to swap out the core network switches.
Before long, they had VMware ESX Server installed on nine of the blades, and VirtualCenter Server installed on the blade they had set aside for management.
Unexpected build-out complexity emerges
It was at this point that things started to go sideways. Up until now, the experience Mike had gained working with VMware ESX at his college had been a great help. He knew how to install ESX Server, and he was well versed in the basics of how to manage it once it was up and running. However, he hadn't watched his college mentor configure the network stack and didn't know how ESX integrated with the SAN.
After a few fits and starts and several days of asking what they'd later realise were silly questions on the VMware online forums, Ed, Mary, and Mike did get things running, but they didn't really believe they had done it correctly. Network and disk performance weren't as good as they had expected, and every so often, they'd lose network connectivity to some VMs. The three had increasing fears that they were in over their heads.
Infrastructure manager Eric Brown realised he'd need to send his team out for extra training or get a second opinion if they were going to have any real confidence in their implementation. The next available VMware classes were a few weeks away, so Eric called in the consultant that had helped with capacity planning to assist with the build out.
Although this was a significant and unplanned expense, it turned out to be well worth it. The consultant teamed up with Mary to configure the first few blades and worked with Ed on how best to mesh the Cisco switches and VMware's fairly complex virtual networking stack. This mentoring and knowledge transfer process proved to be very valuable. Later, while Mary was sitting in her VMware class, she noted that the course curriculum wouldn't have come anywhere near preparing her to build a complete configuration on her own. Virtualisation draws together so many different aspects of networking, server configuration, and storage configuration that it requires a well-seasoned jack-of-all-trades to implement successfully in a small environment.
Bumps along the migration path
Within roughly a month of starting the deployment, Eric's team had thoroughly kicked the tyres, and they were ready to start migrating servers.
Mike had done a fair amount of experimenting with VMware Converter, a physical-to-virtual migration tool that ships with the Virtual Infrastructure suite. For the first few servers they moved over, he used Converter.
But it soon became clear that Converter's speed and ease of use came at a price. The migrations from the old physical servers to the new virtualised blades did eliminate some hardware-related problems that Fergenschmeir had been experiencing, but it also seemed to magnify the bugs that had crept in over years of application installations, upgrades, uninstalls, and generalised Windows rot. Some servers worked relatively well, while others performed worse than they had on the original hardware.
After a bit of digging and testing, it turned out that for Windows servers that weren't recently built, it was better to build the VMs from scratch, reinstall applications, and migrate data than it was to completely port over the existing server lock, stock, and barrel.
The result of this realisation was that the migration would take much longer than planned. Sure, VMware's cloning and deployment tools allowed Ed, Mary, and Mike to deploy a clean server from a base template in four minutes, but that was the easy part. The hard part was digging through application documentation to determine how everything had been installed originally and how it should be installed now. The three spent far more time on the phone with their application vendors than they had spent figuring out how to install and configure VMware.
Another painful result of their naïveté emerged: Although they had checked their hardware against VMware's compatibility list during the project planning, no one had thought to ask the application vendors if they supported a virtualised architecture. In some cases, the vendors simply did not.
These application vendors hadn't denied Fergenschmeir support when their applications had been left running on operating systems that hadn't been patched for years, and they hadn't cared when the underlying hardware was on its last legs. But they feared and distrusted their applications running on a virtualised server.
In some cases, it was simply an issue of the software company not wanting to take responsibility for the configuration of the underlying infrastructure. The IT team understood this concern and accepted the vendors' caution that if any hardware-induced performance problems emerged, they were on their own -- or at least had to reproduce the issue on an unvirtualised server.
In other cases, the vendors were ignorant about virtualisation. Some support contacts would assume that they were talking about VMware Workstation or Server as opposed to a hypervisor-on-hardware product such as VMware ESX. So they learned to identify the less knowledgeable support staff and ask for another technician when this happened.
But one company outright refused to provide installation support on a virtual machine. The solution to this turned out to be hanging up and calling the company back. This time they didn't breathe the word "virtual," and the tech happily helped them through the installation and configuration.
These application vendors' hesitance, ignorance, and downright refusal to support virtualisation didn't make anyone in Fergenschmeir's IT department feel very comfortable, but they hadn't yet seen a problem that they could really attribute to the virtualised hardware. Privately, Eric and CTO Brad Richter discussed the fact that they had unwittingly bought themselves into a fairly large liability, but there wasn't much they could do about that now.
Stage 6: Learning from the experience
In the end, it took about a month and a half to get the VMware environment stable, train the team, and test the environment enough to feel comfortable with it. It took another three months to manually migrate every application while still maintaining a passable level of support to their user base.
Toward the end of the project, infrastructure manager Eric Brown started leaning on outsourced providers for manpower to speed things up, but his team did most of the core work.
In the months following the migration, Eric was pleasantly surprised by how stable the server network had become. Certainly, there had been several curious VMware-specific bugs, mostly with regard to management, but nothing to the degree that they had been dealing with before they rationalised the architecture and migrated to the virtual environment.
The painful act of rebuilding the infrastructure from the ground up also gave Eric’s team an excellent refresher on how the application infrastructure functioned. Eric made sure they capitalised on this by insisting that every step of the rebuild was documented. That way, if another fundamental technology or architecture change was ever needed, they’d be ready.