I work in what I'd like to think is a reasonably large and complex SAN environment. Currently that consists of two pairs of EMC 8x30 subsystems and a pair of Sun StorEdge 9980V subsystems. The EMC frames run SRDF and TimeFinder, although there is no mirroring between the two 9980Vs. Total disk space is about 70 terabytes. Our SAN infrastructure consists of sixteen McData 6064 director-class switches with a total of just over 1000 ports. We also have a "legacy" Brocade environment consisting of twelve 2800 switches. The SAN environment currently supports around 300 hosts on platforms including Solaris, AIX and Windows 2000. Configuration of the environment is managed via ECC, ESN Manager and Storage Navigator. Production hosts have a DR failover host located in a remote data centre. There are other parts of the SAN used for SRDF replication and tape connectivity, but these are not particularly relevant to this discussion.
The environment itself is very fluid, with new hosts commissioned each week. There is also a large number of disk allocations and de-allocations performed, some of these as part of the active migration from the Brocade infrastructure.
Our implementation of SANs has effectively placed all our storage resource eggs in one very large basket. The complexity of disk allocations at the host level means that the loss of even a single LUN to a single host could be catastrophic. Managing our environment therefore represents a serious challenge.
We have to ensure that every member of our team knows which disks are free or in use, and which ports are allocated to which hosts, both for fault diagnosis and for allocation and de-allocation work. We also depend on other groups to perform cabling work and to configure the disks to the host, so we need to ensure the information we provide them is always accurate.
So, what do we do to reduce the risk and impact of any change? Firstly, we maintain accurate records. Secondly, we apply numbering and naming standards to port and disk assignments.
Originally, we used spreadsheets to maintain a picture of how disks had been allocated to hosts. While this was suitable for small configurations, spreadsheets proved too cumbersome for the existing disk subsystems, which hold more than 7000 LUNs.
All disk subsystem and SAN configuration information is now stored in an Access database, which is accessed by specially designed forms that map our configuration. This allows us to view the disks assigned to a host and the ports used to connect to the SAN. As a result, we can quickly highlight unused disk space and mark disks as reserved for use by new hosts that are being commissioned. Configuration information is easy to manipulate and can be used for tasks such as capacity planning and billing.
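To illustrate the kind of query such a database supports, here is a minimal sketch using SQLite and an invented two-table schema (`luns` and `allocations`); the actual Access design is not shown in this article, so the table and column names are assumptions:

```python
import sqlite3

# Hypothetical schema standing in for the Access database described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE luns (
    lun_id   TEXT PRIMARY KEY,   -- e.g. subsystem serial + device number
    size_gb  INTEGER NOT NULL
);
CREATE TABLE allocations (
    lun_id   TEXT REFERENCES luns(lun_id),
    host     TEXT,               -- NULL until cabled and configured
    status   TEXT                -- 'in_use' or 'reserved'
);
""")
conn.executemany("INSERT INTO luns VALUES (?, ?)",
                 [("0001", 17), ("0002", 17), ("0003", 34)])
conn.executemany("INSERT INTO allocations VALUES (?, ?, ?)",
                 [("0001", "prodhost1", "in_use"),
                  ("0002", None, "reserved")])

# Free-space report: LUNs with no allocation row at all are
# candidates to reserve for hosts being commissioned.
free = conn.execute("""
    SELECT l.lun_id, l.size_gb
    FROM luns l
    LEFT JOIN allocations a ON a.lun_id = l.lun_id
    WHERE a.lun_id IS NULL
""").fetchall()
print(free)  # [('0003', 34)]
```

The same joins drive the capacity-planning and billing reports: because every LUN, host and port relationship is a row, a report is just another query.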
Such a database, however, is only as useful as the accuracy of the information it contains. We have therefore developed a set of validation scripts. These scripts interrogate the SAN configuration, using CLI tools, and highlight any discrepancies between the actual and documented configurations.
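The core of such a validation pass is a simple diff between two mappings. The sketch below assumes both sides have already been reduced to a dictionary of LUN-to-host assignments; in practice the "actual" side would be parsed from CLI tool output, whose format varies by vendor:

```python
# Compare documented LUN-to-host mappings against what the SAN
# actually reports, and describe every discrepancy found.

def find_discrepancies(documented, actual):
    """Both arguments are dicts of {lun_id: host}."""
    issues = []
    for lun, host in sorted(documented.items()):
        real = actual.get(lun)
        if real is None:
            issues.append(f"LUN {lun}: documented for {host}, not found on subsystem")
        elif real != host:
            issues.append(f"LUN {lun}: documented for {host}, actually masked to {real}")
    for lun, host in sorted(actual.items()):
        if lun not in documented:
            issues.append(f"LUN {lun}: masked to {host} but undocumented")
    return issues

documented = {"0001": "prodhost1", "0002": "prodhost2"}
actual     = {"0001": "prodhost1", "0002": "devhost9", "0003": "prodhost3"}
for line in find_discrepancies(documented, actual):
    print(line)
```

Run regularly, a check like this catches drift early, before a stale record is handed to the cabling or host teams.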
By enforcing standards on port and LUN assignments, we help reduce the risk of configuration errors within our environment. These are most likely to occur when decommissioning rather than allocating resources. Some of the standards we apply include:

1. Development and production hosts are allocated to separate edge switches.
2. Disk subsystem FAs (Fibre Adaptors) are dedicated to accessing either development or production LUNs.
3. Every host-to-FA definition has its own unique zone. Most hosts have their disks allocated to only a single pair of FAs and therefore have just two zones; other hosts have considerably more.
4. For 9980V allocations, each host is assigned a separate Host Group, isolating its LUNs from other hosts defined to use the same FA.
5. Each host is dual-pathed. The port connections to the two fabrics are made to the same port number on each switch, including for the DR host at the remote data centre.

Most important of all, during this change management process, is to ensure that a single product is used to manage the allocations. All zoning and EMC LUN masking is therefore performed in ESN Manager rather than manually via command-line options.
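Standards like these lend themselves to automated checking. As a sketch, the dual-pathing rule above can be verified mechanically; the record format here, a (host, fabric, switch, port) tuple, is an assumption, not the form our database actually uses:

```python
# Check the dual-pathing standard: each host must connect to both
# fabrics, using the same port number on each switch.

def check_dual_pathing(ports):
    """ports: list of (host, fabric, switch, port_number) tuples."""
    by_host = {}
    for host, fabric, switch, port in ports:
        by_host.setdefault(host, []).append((fabric, port))
    violations = []
    for host, conns in sorted(by_host.items()):
        fabrics = {fabric for fabric, _ in conns}
        if fabrics != {"A", "B"}:
            violations.append(f"{host}: not connected to both fabrics")
        elif len({port for _, port in conns}) != 1:
            violations.append(f"{host}: port numbers differ between fabrics")
    return violations

ports = [
    ("prodhost1", "A", "switch01", 12),
    ("prodhost1", "B", "switch02", 12),   # same port number on each fabric: OK
    ("prodhost2", "A", "switch01", 5),
    ("prodhost2", "B", "switch02", 7),    # mismatched port numbers: flagged
]
print(check_dual_pathing(ports))
```

Checks for the other standards, one zone per host-to-FA pair, no development LUNs on production FAs, follow the same pattern: restate the rule as a predicate over the configuration records and report every row that fails it.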
The modern enterprise-class SAN environment centralises storage into a small number of critical subsystems. Timely management of these environments requires accurate configuration information and in-house standards to mitigate the risk of human error.