Harvard University's data network supports 125,000-plus users, its Border Gateway Complex routes about half a million IP addresses, and the network carries around 150TB to 200TB of data per day. Jay Tumas, who oversees the operations centre at the heart of the network, gave us a peek behind the scenes.

Give me a thumbnail sketch of Harvard's network.
The Harvard Core Network (HCN) serves an extremely diverse user population in metro Boston and beyond. We have everything from dual Gigabit Ethernet feeds serving the entire Harvard College network with tens of thousands of clients and a Class B chunk of address space, to a channellised T3 circuit serving remote affiliates in Washington DC, and a T1 serving a remote library repository in central Massachusetts. The [University Information Systems] NOC is the primary maintenance organisation for the Northern Crossroads (NoX), New England's Internet2 aggregation point, which serves over a million users.

With the network's scope encompassing close to 1,000 buildings, we solicit advice from all connecting members to solidify customer demarcs, network ownership and funding models. The 120-plus connecting members may manage their own LANs and data centres, or they may have outsourced everything from network maintenance to Windows client updates to us.

I sometimes hear people refer to organisations such as Harvard as having networks that are like phone company networks. Given your background at New England Telephone, is Harvard's network really like a phone company's?
Data networks in the 90s were notoriously undocumented, with physical plants that looked more like spaghetti than anything an institution would want to trust with its critical data. Harvard and other research and medical institutions began to realise that this network, which was quickly becoming part of their critical infrastructure, was largely an unknown quantity, and that this had to change. So it did.

You will now find that many institutions have carefully documented their physical plant, with tools ranging from GIS systems linking underground conduits to fibre inventories, to tagging each end of every fibre and copper cable in their production network. These cycles can be expensive, especially when faced with the daunting challenge of documenting and inventorying a large-scale production network such as Harvard's, but they are cycles well spent, proving invaluable as we all strive to make our networks as physically robust as our routing protocols are logically robust.

Harvard, as of late, has been exhibiting another telco trait - considering the network as part of the university's critical infrastructure.

As such, its construction is considered during the initial planning phases of building renovation, new construction and campus expansion projects. The data networks being built today, at Harvard and similar institutions, are built to host a variety of IP-based traffic. Nearly every physical-plant control device, whether it be a security camera, chilled water-valve actuator or parking garage card reader, is being designed to work with the IP network. There is no better way for the network to provide ROI to the university than to provide a robust, high-availability piece of physical infrastructure that not only supports the data communications requirements of the research and academic communities but also serves as a platform that fosters convergence of other plants' control and communications requirements.

What lessons did you take from your time at New England Telephone that you've been able to apply at Harvard?
I learned how to maintain a robust network. Here are a few concepts that I brought with me:

A test lab. The telcos had Bellcore (now Telcordia) to ensure their roll-out of critical infrastructure went smoothly. You need a lab, too. There is no better way to ensure your architecture or code upgrades proceed smoothly than to have your own lab environment to test your future configurations. It's best not to cheap out when selecting lab equipment either. You should build a lab that mirrors your production environment to ensure you are comparing apples to apples. A great way to accomplish this is to use your network spares in your lab. This keeps your spare chassis and blades hot, so you know they are good, and ensures that you are testing with configurations compatible with your production environment.

Document everything. This includes assets, processes and procedures. The telcos realised this early on and documented everything from proper office etiquette of the day to power plant maintenance in a voluminous set of manuals called the Bell System Practices. You don't need to go to those extremes, but a document containing current architecture descriptions, maintenance procedures, hardware inventory and access procedures is a good start. We started the NOC document about nine years ago. While its roughly 160 pages cover the bulk of our operational processes, vendor contacts and other information vital to supporting the HCN both on and off hours, there is always more that can be added. You've got to match the sheer size of the document to what your staff can keep current.

Organise your plant. No one did this like Ma Bell. Through the thousands of COs [central offices], tens of thousands of frames and cross-connect systems, and probably millions of miles of cross-wire, a trained CO technician can go to any CO and put his finger on any circuit in the building. This feat was brought to us by an inventory system called TIRKS (Trunk Inventory Record Keeping System). In data networking there is little opportunity to keep a system that complex. However, you should demand that all is inventoried and labelled.

I have moved the NOC data centre twice in 10 years. We performed the first move in the back of our pick-up trucks. So we skimped in that realm; however, we made sure to improve our plant structure by installing overhead cable trays, well-designed data cabinets and cable-management systems. The last move improved our data-centre plant organisation even more with the implementation of multilevel, under-the-floor cable trays and strict cable-installation, tie-down and tagging requirements. We even invested in glass 2-by-2 floor tiles so we can display the results.

Exercise your DR architecture. Perform real-world power-failure scenarios to test your power backup, whether it is emergency power supplied by the building infrastructure or a room or rack UPS system. Disconnect the commercial power and allow the emergency power source to handle the production load as it would in the event of an emergency. Make sure you know how long your emergency sources will supply power to your network equipment, and keep in mind that as you add blades to those chassis the amount of time a UPS will be able to power the attached gear can significantly decrease. Also, if you have a DR plan for your data centre that includes a remote data centre linked back to campus, ensure that you simulate or estimate actual server loads on your connecting infrastructure.
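The runtime arithmetic behind that warning can be sketched in a few lines. This is a back-of-envelope estimate only; the capacity, efficiency and load figures are illustrative, not measurements from any real plant:

```python
def ups_runtime_minutes(capacity_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Rough UPS runtime estimate: usable battery energy divided by draw.

    Real batteries discharge non-linearly and degrade with age, so treat
    this as an optimistic upper bound, not a guarantee.
    """
    if load_w <= 0:
        raise ValueError("load must be positive")
    return capacity_wh * efficiency / load_w * 60

# Adding blades raises the draw and cuts runtime proportionally:
base = ups_runtime_minutes(1000, 300)    # ~180 min at 300 W
loaded = ups_runtime_minutes(1000, 600)  # ~90 min after doubling the load
```

Doubling the attached load halves the estimate, which is exactly why a live power-failure test, not a spreadsheet, should be the final word.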

Keep your customers informed. Come up with agreed-upon notification procedures for your internal and external customers in the event of a network outage, or if emergency maintenance is required and the network will be unstable during a particular window. If you have a customer portal, archive the events so they can be accessed by all who may need to correlate some sort of local failure or access problem to a core network outage.

How do you gain visibility into what's going on in a network of this size?
We have long used SNMP to poll network interfaces and count the octets crossing them; from that data we create real-time bandwidth-capacity graphs as a baseline for measuring our overall network use.
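The octets-to-bandwidth conversion behind such graphs is simple arithmetic. This sketch assumes 32-bit SNMP Counter32 values (Counter32 rolls over at 2^32) and tolerates a single wrap between polls:

```python
def bits_per_second(prev_octets: int, curr_octets: int, interval_s: float,
                    counter_bits: int = 32) -> float:
    """Convert two successive ifInOctets/ifOutOctets samples into a rate.

    The modulo handles exactly one counter wrap between polls; poll often
    enough that a busy interface cannot wrap twice in one interval.
    """
    wrap = 1 << counter_bits
    delta = (curr_octets - prev_octets) % wrap  # octets since last poll
    return delta * 8 / interval_s               # octets -> bits per second

# A counter that wrapped between 60-second polls still yields a sane rate:
rate = bits_per_second(2**32 - 1000, 4000, 60)
```

On faster links the same calculation would normally use 64-bit ifHCInOctets counters, which make wraps a non-issue at these polling intervals.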

This data serves as an auditing tool every time we bring in a vendor with the latest and greatest network accounting suite, because if the application can't detail actual network resource usage, then the rest of its space-age graphics and modelling capabilities are useless.

To complement our locally developed, SNMP-based tool kit, we use commercial applications that rely on other data sources to get at our overall network usage:

  • QRadar from Q1 Labs - It serves as our primary network-traffic anomaly-detection system. It uses flow-based knowledge gained from live traffic surveillance performed out-of-band and presents a real-time analysis of current active threats on the network. It's also intelligent enough to interface with our NOC Portal. So when a network administrator logs into the portal and observes that our IDS infrastructure indicates we may have some compromised systems on a local network, he can log into QRadar and observe all network traffic specific to his address space. QRadar also presents anomalous data in reference to the total traffic, so it can be used secondarily as a traffic accounting system to display utilised resources across the network.

  • Peakflow SP from Arbor Networks - Our primary traffic-capacity planning tool, it derives its information from NetFlow traffic data generated from the University Border Gateway Complex. I look to this app for customer bandwidth statistics across my border. It does an outstanding job of slicing Layer 3-7 traffic data, which assists greatly when customers wonder, 'What does my network's traffic profile look like?' Its traffic-engineering capabilities are enhanced by the fact that the application acts as a BGP peer to the university border. This allows for target [autonomous system] analysis, so when it comes time to look at commercial ISPs, we can make sure that we are selecting a carrier that best serves Harvard's network community.

  • Orion from SolarWinds - This web-based, network fault-management system collects data from SNMP-enabled devices across our network and provides an accurate, low-cost view into it. It pairs nicely with our SNMP-generated traffic graphs and presents us with a wealth of vital info like CPU and memory use, configuration info and interface-specific traffic stats.

We gain all this visibility with out-of-band management architectures, using a variety of vehicles to get at the traffic data. Nothing should be placed in the packet's path that's not absolutely necessary.
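The baseline idea behind flow-based anomaly detection can be illustrated with a deliberately simplified sketch. The three-sigma threshold and the sample volumes below are invented for illustration; they are not how QRadar or Peakflow actually model traffic:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, k: float = 3.0) -> bool:
    """Flag a traffic sample sitting more than k standard deviations
    above its recent baseline -- the core idea of flow-based anomaly
    detection, stripped of seasonality, per-service profiles, etc."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    return current > mu + k * max(sigma, 1e-9)

baseline = [100.0, 110.0, 95.0, 105.0, 102.0]  # flows/sec, illustrative
is_anomalous(baseline, 104.0)   # typical volume -> False
is_anomalous(baseline, 400.0)   # sudden spike  -> True
```

Real systems maintain such baselines per host, per port and per time of day, which is what lets them separate a worm outbreak from a popular download.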

How much of what you're using to manage and secure the network is built in-house vs. bought from vendors?
About 50/50.

Give me a few examples of home-grown tools, how you're using them and why they beat what's available commercially?
SNMPoll is our primary network-monitoring and alerting system. It's a simple Perl program that uses topology-aware SNMP polling for ifOperStatus and sysUptime from more than 450 network devices and 1,500 interfaces every minute. If an anomaly is discovered, the appropriate engineers are alerted via an e-mail to their Treo 650s. The alerting e-mail contains a secure weblink, allowing engineers to quickly request additional information related to the event. The alerts also contain a live link to an application called MobileNOC, a Treo-based version of the NOC Portal specifically for [speeding] information queries and remote troubleshooting.

SNMPoll relies on another program, SNMProwl, to do core-network-wide topology discovery. A variety of shell scripts and applications use SNMProwl's data for other purposes, such as automatically building a private DNS zone for easy management of all core router and switch interfaces. Another Perl program, d3m0n, monitors other SNMP objects of particular interest. They include UPSs, environmental probes, BGP sessions, critical routes, data-centre content switches; power, fan and temperature in our chassis; interface errors and anything else we feel the need to poke at to improve service delivery.
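One of those downstream uses, building DNS records from discovered interfaces, might look roughly like this. The interface names and domain are illustrative, not Harvard's actual scheme:

```python
def build_zone_records(interfaces: dict[str, str], domain: str) -> list[str]:
    """Emit A records mapping interface names to management addresses,
    the sort of zone data discovery output could feed automatically.

    Sorted so repeated runs produce identical zone files, which keeps
    version-control diffs meaningful."""
    return [f"{name}.{domain}. IN A {addr}"
            for name, addr in sorted(interfaces.items())]

records = build_zone_records(
    {"core1-ge0": "10.0.0.1", "core2-ge0": "10.0.0.2"}, "mgmt.example.edu"
)
```

Regenerating the zone from discovery data on every run, rather than editing it by hand, is what keeps a thousand-interface management zone from drifting out of date.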

PacketFence is an open source, network-based solution to the problems posed by open academic networks. It provides passive or in-line operation, network registration, worm/bot detection/isolation, user-directed mitigation and proactive vulnerability scans. Its lineage can be traced to another utility called Mousetrap, a set of Perl scripts developed by the UIS Network Security Team to trap users via DHCP scope manipulation.

The scripts worked quite well until the summer of 2003. As the Blaster and Nachi worms rampaged through the residential networks of academic institutions around the world, and infection rates within many residential networks approached 80 percent, we realised something more was necessary. In September 2003, PacketFence was born. After one year of continuous development, it was recently open-sourced and is in production on several large academic networks. PacketFence operates by manipulating the Address Resolution Protocol (ARP) cache of client systems.
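To illustrate the mechanism, here is a sketch of the kind of raw ARP reply frame such cache manipulation injects. The MAC and IP values are made up, and actually transmitting the frame requires a raw socket and elevated privileges, both omitted here:

```python
import struct

def build_arp_reply(sender_mac: bytes, sender_ip: bytes,
                    target_mac: bytes, target_ip: bytes) -> bytes:
    """Build a raw ARP reply frame: 14-byte Ethernet header plus the
    28-byte ARP payload defined in RFC 826.

    An unsolicited reply like this updates the target's ARP cache,
    redirecting its traffic -- the lever an isolation system pulls."""
    eth = target_mac + sender_mac + struct.pack("!H", 0x0806)  # EtherType = ARP
    arp = struct.pack("!HHBBH",
                      1,       # hardware type: Ethernet
                      0x0800,  # protocol type: IPv4
                      6, 4,    # MAC and IP address lengths
                      2)       # opcode 2: reply
    arp += sender_mac + sender_ip + target_mac + target_ip
    return eth + arp

frame = build_arp_reply(b"\x02\x00\x00\x00\x00\x01", b"\x0a\x00\x00\x01",
                        b"\x02\x00\x00\x00\x00\x02", b"\x0a\x00\x00\x02")
```

The same trick that lets an attacker poison a LAN lets a security team quarantine an infected host without touching switch configurations.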

Our Critical Alerts DashBoard Security Event Manager provides local network security administrators with better overall visibility by delivering archived and real-time security data from the core network [intrusion-detection system], border anomaly-detection systems and centralised syslog infrastructures. The admin receives a graphical representation of the subdomain address space that dynamically changes depending on the "temperature" of their security environment. Just like at a telco - red is bad, and green is good. There's also a recent-alerts listing and an interactive graph displaying overall alert volume for your networks.
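The red/green "temperature" mapping might be sketched like so; the alert-count thresholds are invented for illustration and are not the dashboard's real logic:

```python
def security_temperature(alert_count: int, warn: int = 5, crit: int = 20) -> str:
    """Map a subnet's recent alert volume to a dashboard colour,
    telco-style: green is good, red is bad. Thresholds are illustrative
    and would be tuned per subnet in practice."""
    if alert_count >= crit:
        return "red"
    if alert_count >= warn:
        return "yellow"
    return "green"
```

A three-colour summary sounds crude, but it lets a local admin triage dozens of subnets at a glance before drilling into the raw IDS and syslog data.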

Finally, our NOC portal was developed primarily to streamline customer-service delivery and enhance the information-sharing capabilities of all these other management and accounting tools. Customers use their university logons to access the portal. Depending on who they are, they see a unique view allowing them access to the tools and information they require to manage their organisations' network presence. Everything from current network equipment installation standards to an access-control list/FW ruleset maintenance interface is available for their use. All of our vendor-supported network management systems are portalised.

Tumas has been at Harvard for 10 years since being hired as network operations manager for the University Information Systems (UIS) Network Operations Centre (NOC), the ISP for the university's 100-plus departments, faculties and affiliates. He manages 18 staff across five network operation centre groups: Network Engineering and Planning, Network Security and Incident Response, Systems and Services, Triage and Converged Services. His previous job was as an operations manager in technical support for New England Telephone.