25 August 2003
After recovering from our disk problem, I've had to get back to the daily routine of delivering storage to new products. It's amazing what a time-consuming process this is, as we have to be 100% correct on our cabling and allocations. We allocate and cable during the day on our production environment. With the rate of change we have, if we didn't do this, we'd never get any work through. Documentation and audit trailing consume the most time, followed by liaison with other teams to co-ordinate our work. We've implemented a number of processes to simplify the interaction with other teams.
The DMX install is definitely happening. As part of ECC5 we will lose the Volume Logix commands and use the new symmacl command to administer LUN Masking. I'll write my views on that once I've got my hands on an active installation.
18 August 2003
One of our disk subsystems suffered a major problem this week. Due to a microcode bug, we lost some disks on a backend fibre loop. It's interesting to see how these problems develop and how they get resolved. Once we'd managed to isolate the real failing disk, we were able to re-create the falsely lost disks and re-establish the subsystem. My primary aim in these instances is to ensure we have no data loss. We were lucky this time, as the failed disks were part of a number of RAID-5 groups and the disk loops span RAID groups. However, data loss is the worst thing that can happen and that has to be avoided at all costs.
A problem like this brings the process of microcode upgrades into the discussion. Should we upgrade regularly or perhaps only when we are likely to suffer a known problem? It's a tricky choice and I'd probably fall on the side of regular updates. We run multiple subsystems that aren't connected so we can upgrade serially and that way detect any problems before the code is applied to all disk arrays.
11 August 2003
The last few weeks have been spent looking at a number of problems. First, we have had minor issues with our SAN. Some problems have been extremely difficult to resolve as we can't determine easily whether the problem is a GBIC issue, a cabling issue or related to the host or disk subsystem. Effectively the process is trial and error and this can be very time consuming.
Although we resolved our problem with ESN Manager and 2 directors that couldn't be added to the configuration, we still have a single director which crashes the configuration. This is a complex problem as the GBIC, port, cabling and even fibre adaptor into the disk subsystem have all been replaced, yet not solved the problem. Putting a fibre channel analyser on this port is the next stage to resolve this problem.
This highlights one very interesting scenario and that is, as the SAN grows we have an increased number of problems and so more issues to resolve. Where those problems affect a large number of hosts, for instance with a core switch, the impact to servers is substantial. This leads me to think there should be a cutoff point on the number of hosts on a single fabric infrastructure and therefore a tradeoff between management ease and problem impact.
Storage demand continues to rise. We are now looking at installing a couple of EMC DMX frames as part of the drive to meet demand. This will be interesting as it requires us to install ECC Version 5 to support the new subsystems. Anyway, ESN Manager, our current management tool, can't hack the number of switches we're throwing at it, so an upgrade is inevitable (and probably fun too).
14 July 2003
We're about to expand our SAN again! The decision has been made and we're going for expanding our single fabric pair. Increasing demand has required a radical increase in the port capacity of our main disk SAN. Currently we have 16 McData 6264 directors in 2 fabrics. We intend to double this to 32 per fabric, with a twin-core switch design. We need to do this as the main core switch we have (we use core/edge design) can't take any more disk connections. We had 2 options, migrate to a 140 port director or install an additional core switch. The preferred route was to use twin core directors. This takes the infrastructure to just over 2000 ports. Having a twin core design wastes another set of ISLs from each edge switch. It is more wasteful than the mammoth core switch concept, however we would have the ability to partition the SAN at a future date if required.
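The port arithmetic behind the twin-core decision can be sketched roughly. The ports-per-director and ISL counts below are illustrative assumptions, not our confirmed design figures:

```python
# Back-of-envelope port arithmetic for the twin-core design. Ports per
# director and ISLs-per-edge are assumed values for illustration.

DIRECTORS = 32              # target director count for the fabric
PORTS_PER_DIRECTOR = 64     # assumed for a McData 6264
CORES = 2                   # twin-core design
ISLS_PER_EDGE_PER_CORE = 2  # assumed ISLs from each edge to each core

edges = DIRECTORS - CORES
raw_ports = DIRECTORS * PORTS_PER_DIRECTOR              # "just over 2000 ports"
isl_ports = edges * ISLS_PER_EDGE_PER_CORE * CORES * 2  # both ends of each link
usable = raw_ports - isl_ports

print(raw_ports, isl_ports, usable)
```

Doubling the cores doubles the ISL overhead, which is the extra "wasted" set of ISLs mentioned above; the price buys the option to partition the SAN later.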
Increasing the infrastructure will be a significant piece of work. Most challenging will be maintaining the configuration. So far we've managed this process with 100 percent success. The first piece of work will be installing the new switches and bringing them into the fabric, testing alerting and connectivity. Timescales are tight - it all has to be done by last week......
1 July 2003
I can't believe we are already half way through the year. Workload seems as great as ever and capacity and demand increase at a tremendous rate. For example, we're now managing nearly 600 hosts compared to 400 8 months ago. These hosts are using nearly 70TB of storage and another 8TB is being ordered this month. As our SAN has grown we are starting to see some difficult-to-solve problems appearing more regularly. I say difficult to solve, but they're not particularly difficult; however, in an environment that is 24x7 and mission critical, taking resources down to swap components is not something you can schedule for prime time on a Monday morning. Instead we are looking at late evenings and weekend slots, with a step-by-step resolution plan that is taking a number of weeks to complete. As the environment grows, this is only to be expected, however it means less time is being spent on projects and more on problem resolution.
So, as the SAN reaches the size of a 200-pound gorilla, do we let it expand to the size of an African elephant or should we give birth to a new SAN child and let them sit side by side together? At the moment I'm not sure. If we expand the current SAN, then we need to introduce additional core directors, which means ports lost to extra ISLs. If we split into two SANs, we create more manageability problems, especially for deciding how to connect our disk subsystems to each infrastructure.
The jury's still out. I think I have some more deliberating on the pros and cons to do before a final decision can be made. Whatever the outcome, the management challenge will remain.
20 June 2003
The last couple of weeks seem to have been dedicated to resolving performance issues on our McData switches. Quality of cable seems to be the main issue. On a number of ports we receive transmission errors of varying types, mostly CRC errors. The McData switches monitor these errors and raise alerts if the error rate exceeds a pre-defined threshold. For a number of ports that serve Notes data on Win2K servers we see server freezes, and at the moment this seems directly related to the errors received on the ports connecting those servers to our disk subsystem. Moving the affected hosts to another disk subsystem connection seems to resolve the problem, confirming the view that cabling is the issue.
That has led to discussions on how we can best locate and remedy the errors before they cause system impact. Obviously we are happy to accept a certain number of transmission errors; however, it seems that ports which display any transmission errors now are likely to raise alerts in the future, simply due to the increased traffic those ports will receive as we grow the infrastructure.
So, we are looking to use the telnet CLI to obtain errors details. This has a number of benefits. First, we can set the output to be in comma delimited format, making it easier to import into a database. Second, we can reset ports stats, so once we are happy we have collected the latest details, we can reset and collect the next day. Regular collection allows us to relate issues to a particular day, or a generic trend we see with a specific port.
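As a sketch of the collection step, assuming a comma-delimited dump with "Port" and "CRCErrors" columns (the real CLI output layout will differ and needs matching up):

```python
# Parse the comma-delimited port statistics pulled via the switch CLI
# and compute per-port deltas against the previous day's collection.
# Column names here are assumptions; adjust to the real output.

import csv
from io import StringIO

def parse_stats(csv_text):
    """Return {port: CRC error count} from a comma-delimited stats dump."""
    return {row["Port"]: int(row["CRCErrors"])
            for row in csv.DictReader(StringIO(csv_text))}

def daily_delta(today, yesterday):
    """Errors accrued since the last collection (or since a counter reset)."""
    return {port: count - yesterday.get(port, 0)
            for port, count in today.items()}

if __name__ == "__main__":
    day1 = parse_stats("Port,CRCErrors\n4,10\n5,0\n")
    day2 = parse_stats("Port,CRCErrors\n4,17\n5,0\n")
    print(daily_delta(day2, day1))  # port 4 is trending upward
```

With the counters reset after each collection, the daily figures drop straight into a database table keyed on port and date, which makes the per-port trend query trivial.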
All of this means more scripting....
6 June 2003
This week has been all about getting our Netapp filers to work and understanding features such as vfilers, SnapMirror and clustering. Although the features on their own are fairly straightforward to implement, the problem is integrating these features together and coming up with an applicable set of standards.
For example, resources such as volumes can be assigned to vfilers, however Snapmirroring (asynchronous data replication between filers) is performed at the physical filer level.
The discovery on SnapMirroring has led us to rethink how we assign our network interfaces to production and management uses. Should we dedicate an interface to SnapMirroring? It's looking like we should. SnapMirroring is certainly fast, even across a 100Mb/s link, so we don't want it impacting production data access. The final test of our configuration will be to failover a clustered filer to the backup filer and replicate from either one to a third filer. That's the challenge for next week.
30 May 2003
I've spent most of this week on support issues and disk allocations. Today alone, I allocated nearly a terabyte of space for new and existing hosts. I've also started to configure the Netapp Filers. I had one interesting issue this week. We'd powered up one of the new machines for testing but didn't have enough power in the installed rack to keep it running, so it was closed down and left for about 10 days. The NVRAM battery had reached a critically low status and when we eventually brought the box back up, the filer shut down until the NVRAM battery was recharged! It appears that the battery is OK and was probably just flat. The vfiler configuration is taking some thought. Each vfiler requires a "/etc" directory and this has to be on a disk device or qtree that will not be destroyed during the life of the vfiler. I think we'll dedicate the root volume for the physical filer as the place for the qtrees and have one per vfiler on that volume.
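The layout idea is simple enough to sketch; the volume and vfiler names below are made up for illustration:

```python
# One qtree per vfiler on the physical filer's root volume, each
# holding that vfiler's /etc, so it survives any data-volume reshuffle.
# ROOT_VOLUME and the vfiler names are illustrative assumptions.

ROOT_VOLUME = "/vol/vol0"

def etc_qtree(vfiler_name):
    """Path of the qtree that will hold this vfiler's /etc directory."""
    return f"{ROOT_VOLUME}/{vfiler_name}_etc"

for vf in ("vfiler_a", "vfiler_b"):
    print(etc_qtree(vf))
```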
23 May 2003
We haven't solved our switch issue. EMC aren't confident that resetting the switches will clear the blockage. They think the problem may be elsewhere on the fabric and is not a simple ESN Manager issue. Even if we wanted to, we're being recommended against power-cycling the switches due to another bug that's been brought to our attention. This causes loss of connectivity to the switch due to some time counter related problem. Flipping between CTPs (the NIC interface) is supposed to resolve it.
Another interesting problem raised its head this week. A Solaris host which is booted from EMC SCSI disks crashed and we couldn't get it back online despite rebuilding the O/S. After numerous combinations of local and EMC disk and cabling swapping, we determined the problem to be an internal SCSI CD-ROM drive, which when replaced, allowed us to boot from the EMC or local O/S. The conclusion so far is that this somehow affected the SCSI bus, although any diagnostics we did showed no hardware problems. At the moment we are waiting to see if we have a recurrence of the problem.
19 May 2003
Now our Netapp filers are installed, it's time to start configuring and investigating how V-filers will work. Virtual filers operate above the physical filer level using a product called Multistore (which is separately licensed and unsurprisingly, not free). I'm still not clear how all this virtualisation is going to work - there are virtual volumes (qtrees), VIFs (Virtual Network Interfaces) and now virtual filers. Additionally, we have to ensure clusters failover to their cluster spare. We have a couple of weeks of testing (playing) to allow us to discover all of the pitfalls of bad configuration and I'm sure we'll discover them all!
16 May 2003
I'm still progressing the problem with ESN Manager and the two problem directors which have changed their IP address. It transpires that there is a procedure to ensure that the World Wide Name of a director remains the same after replacement - but that wasn't done. Deletion and addition of the directors to ESN Manager hasn't resolved the problem. I think we've ended up with a configuration where the two replaced switches have zone sets that don't match the rest of the fabric, but the zone set name on the switches is the same. Consequently, ESN Manager incorrectly believes them to exist on two separate fabrics. I'm hoping we can resolve this problem during the coming week.
12 May 2003
I moved some disks today between our "old" and "new" SANs. The old infrastructure is Brocade based and we're looking to move all hosts to our new McData strategic SAN, which has much more scalability and performance. We also have a problem with our management software, ESN Manager, which is refusing to view two switches which had hardware replacements and have changed World Wide Names. Consequently we can still broadcast new zoning information, as the interswitch links (ISLs) ensure any changes are propagated across the fabrics; however, ESN Manager won't make any Volume Logix assignments to the switches which were discovered incorrectly. That led me to use the fpath command to manually set the Volume Logix assignments. It was a lot simpler than I thought and made me think this may be a better solution than performing a discovery of the environment (which currently takes 20 minutes).
The deletion of old hosts continues to be a problem in ESN Manager. As far as I can tell, ESN Manager looks for potential devices from a number of sources; zoning details, discovered devices in the fibrezone database, Volume Logix assignments and the login history table on the Symmetrix FAs. I've deleted some hosts, being careful to ensure there are no untidy entries, however they keep returning and I can only assume ESN Manager is using the FA login history. The problem is, there appears to be no way to reset this information. If anyone has any ideas....
7 May 2003
Our Netapp filers are due for installation this week. Over the last couple of weeks we've been determining the best configuration strategy for them based on the initial requirement to move data from our existing Solaris NFS servers. Getting the right structure isn't easy; there is the physical layer of disk/RAID to consider, plus configuring virtual filers on top of the physical hardware. More important is ensuring we have configured the systems and established standards to allow for future expansion. One aspect of the filers that I'm hoping we have no problems with is NDMP to our Veritas environment. I envisage performance issues here as we have millions of files per network share to be backed up.
6 May 2003
I've been doing some digging into Netbackup and reviewing our environment, which is now getting quite large. It's part of the usual health check and a long-term goal to develop processes to ensure we are running a tight ship. I've already determined a number of tapes that appear to be expired but not released, although I've yet to create a script to make an exact determination on numbers. I'll post the script once I've written it. I'm sure I'll uncover more as I go on. Netbackup uses two databases to keep track of volumes and I believe the two have become out of sync with each other. The script should compare the output of bpmedialist and vmquery and see what we get.
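The comparison I have in mind could look something like this. The parsing is a placeholder, since both commands actually print multi-column reports that need more careful handling:

```python
# Cross-check the two NetBackup databases: media IDs reported by
# bpmedialist versus volumes held by vmquery. Here each report line is
# assumed to start with a media ID -- the real output needs real parsing.

def media_ids(report):
    """Extract the first whitespace-delimited field of each non-blank line."""
    return {line.split()[0] for line in report.splitlines() if line.split()}

def out_of_sync(bpmedialist_out, vmquery_out):
    """Media IDs present in one database but not the other."""
    return media_ids(bpmedialist_out) ^ media_ids(vmquery_out)

if __name__ == "__main__":
    # Made-up media IDs standing in for captured command output.
    print(sorted(out_of_sync("A00001\nA00002\n", "A00002\nA00003\n")))
```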
5 March 2003
I'm on a course this week. Netapp. Seems a reasonable product. We're planning to migrate the data from an existing Sun NFS server to 2 clustered pairs of filers from Netapp. The Snapshot function seems quite nice, however it's not as unique as described on the course. StorageTek provided a Snapshot feature on their Iceberg subsystem. That allowed a copy of an MVS volume to be created by replicating the pointers to the tracks of the volume to another volume, thereby creating a copy of the data using no additional storage space. Space was only consumed when either copy was updated. The technology was inspired and allowed some very clever Y2K testing to be performed, easily creating and re-creating copies of systems to build test environments. Netapp currently only offer read-only snapshots - it will be a truly impressive product if data can be snapped and made read-write. I'm looking forward to seeing what we can achieve over the coming months.
28 February 2003
I'm almost there with my configuration database. I can account for all my SAN ports, disk space, hosts and cross reference unused disk space, allocations by business unit and produce the chargeback figures automatically. Next I will link this in to scripted output from configuration checking.
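The chargeback roll-up itself is a simple aggregation. Here's a sketch using sqlite3 to stand in for the Access database; the table, columns and rate are all made up for illustration:

```python
# Roll allocated capacity up by business unit and price it at a flat
# rate. Schema and rate are illustrative stand-ins for the real database.

import sqlite3

RATE_PER_GB = 5.0  # hypothetical monthly charge per allocated GB

def chargeback(conn):
    """Return [(business_unit, total_gb, charge), ...] ordered by unit."""
    return conn.execute(
        """SELECT business_unit, SUM(alloc_gb), SUM(alloc_gb) * ?
             FROM allocations
            GROUP BY business_unit
            ORDER BY business_unit""",
        (RATE_PER_GB,),
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE allocations (business_unit TEXT, alloc_gb REAL)")
    conn.executemany("INSERT INTO allocations VALUES (?, ?)",
                     [("Equities", 500), ("FX", 250), ("Equities", 100)])
    print(chargeback(conn))  # [('Equities', 600.0, 3000.0), ('FX', 250.0, 1250.0)]
```

The same grouped query, pointed at the allocations table, is what drives the automated figures; linking it to scripted configuration output is the validation step.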
I've hit a bit of a quandary. Our host settings for Solaris servers are based on EMC recommendations, as we connect to EMC disks. Unfortunately we now have some hosts that connect to both EMC and HDS disk. So, should I take the EMC recommended settings, or the HDS settings? If I have a problem, I can be certain EMC will not support me if their recommended settings are not in place. So what do I do?
21 February 2003
Is tape dead? Fifteen years ago, tape was an integral component of any storage strategy. Disks failed and we didn't have the same proliferation of RAID technology we have today. Tapes had capacities of around 1.2-2.4GB, matching closely the disks they backed up (for example 2.8GB disk sizes). Today, we have tapes of between 40-200GB in capacity but tape access speed has not increased at the same rate. Disk subsystems have, however, changed dramatically. They are now cached, RAID is standard and access times and speeds are dramatically improved. So, can disk replace tape? There have been a number of tape replacement or management techniques to virtualise and redirect tape requests to disk, including VTS products from StorageTek and IBM amongst others. I'd like to see us implement an Open Systems virtual tape subsystem - I will be looking at the options over the coming weeks.
19 February 2003
SANs provide a great way of centralising your DASD allocations into one manageable environment. However, I think we are starting to see the benefits but also encounter some of the problems of centralising storage through a single infrastructure. On the benefits side, we can deliver large volumes of storage quickly and efficiently; for instance, hundreds of gigabytes per request is as easy as 20 or 30GB. On the negative side, having one single huge fabric creates problems. Recently, we have seen a large number of issues when we perform fabric refreshes. We have also started to see problems when we commit SDR changes on our EMC 8730 and 8830 frames. It's difficult to see what is causing this problem as the only change we have recently made is to upgrade our FAs on these frames from 2 to 4 port cards. It is likely that we will have to restrict our work to out-of-hours implementations, which in reality we can't do, due to the demands on delivering new storage requests.
It's interesting to contrast the Mainframe and Open Systems world. Mainframes centralised the storage, tape and processor infrastructure into large DASD farms, tape silos and small numbers of processor boxes. In our Open Systems environment, we have centralised the storage via the SAN, centralised the tape via network backup products like Netbackup, but not centralised the processor, with many different hosts, multiple HBA vendors, different operating system levels, driver levels and software levels, making it extremely difficult to standardise on these discrepancies. It also means that should we need to implement fixes to match a new firmware level on our SAN, we have a huge upgrade task to bring server hosts uplevel as well. This has to be one of our biggest challenges; it must also be the biggest challenge for other large IT environments. Yet again accurate documentation is a must...
12 February 2003
Our storage environment grows daily. Today we ordered another 30TB plus another 100 ports on the SAN. I had thought our storage was growing at around 1TB a month, but in reality we are probably doing double that at the moment. We also have a very heterogeneous and complex environment with Windows 2000, Solaris, AIX and Netware hosts all on the same infrastructure with Brocade and McData switches using EMC and HDS DASD. All we need now is a little IBM for good measure.
Our Netbackup system continues to be a system in high demand. It is getting more difficult to schedule any outage for maintenance and we are going to have to look at other high availability options in order to get any chance to do downtime work. I need to look at the options as we already have a backup server available, however it is not clustered. One irritating thing we discovered this week; Netbackup 4.5 doesn't support the X-Windows client, which means using the Java version. This has proved to be excruciatingly slow even across a 10Mb/s connection and totally unsuitable for remote support. We've decided to look at the Windows client as that may provide what we want. I'll post more details as we discover them.
6 February 2003
Fabric migrations; This is our next big job. Removing our old Brocade SAN and migrating to the new McData infrastructure is proving tricky and there are a lot of dependencies. Cabling is our first big issue, and the shortage of suitable SC/LC convertor cables - which are pricey. In reality the actual host move is not difficult, however with 50 hosts to move and check independently, this is a time consuming process.
How would you identify bottlenecks in your SAN environment? I was posing this question to myself yesterday in order to find a simple way of determining when we have a SAN capacity problem. As our McData switches are non-blocking, they have a port-to-port capacity of 2Gb/s, so we should never max out across a switch. Most of our HBAs are 1Gb/s with some at 2Gb/s, so the HBA may be an issue on older systems, but it will not impact the SAN. As we have a core/edge methodology, the only points in the infrastructure where we have shared resources are Inter-Switch Links (ISLs) and Fibre Adaptors (FAs) on our disk frames. I've decided these should be our monitoring points for any initial performance issues.
3 February 2003
How do you document your storage? Although there are lots of products for mapping what storage is available, the key issue for me is being able to reserve out storage, assign it for future allocations and be able to plan free and in use capacity. I've yet to see anything in the Open Systems world that allows me to do this, so I'm developing a straightforward MS Access database to map the storage and SAN fabric infrastructure we have in place. At some time in the future we can move this to a more central database product and make Storage information available on our intranet. A key feature is validation; knowing the database matches the configuration. We'll see how things go over the coming months.
31 January 2003
A tough couple of days. A SAN fabric refresh caused us to lose a major application due to an HBA failing to recover correctly after the refresh. This caused the host to hang and we had to reboot. Having multiple vendor HBAs, switches and disk subsystems causes lots of support problems. For example, HBA card manufacturers use differing formats to specify WWNs, binding and LUN information. Messages in logs are all different. I suppose it is only to be expected that different vendors will implement their own parameter system. We will focus on a single manufacturer and support headaches will definitely be reduced.
As our SAN grows, the implications of a SAN outage, whether a fabric interruption or device failure, become more serious. Although the SAN gives improvements in terms of centralisation, reduced management overhead and a certain degree of consolidation, there's also the issue of feeling like we're "putting all of our eggs in one basket". I see a requirement to improve our processes and reduce some of the risk.
28 January 2003
Today was a day for Netbackup. We're having problems balancing our workload across multiple storage units and it seems until we upgrade to version 4.5, things will stay that way. Version 4.5 introduces storage pooling; in the meantime, we'll have to use good old-fashioned manual balancing. Getting back to our main Netbackup problem, Netware 6 clusters, we've worked out that changing our class definition to permit multiple streams means we can better track the difference between a backup that fails due to missing resources and one that fails due to open files. The script should be more accurate until we get a more permanent solution (but I'm not holding my breath).
27 January 2003
Well, the migration from 2 to 4 port FA cards on our EMC frame did not go well over the weekend. Despite disks disappearing and re-appearing on the SAN correctly, some hosts did not failover their HBA paths and as a consequence we had to do some data recovery - fortunately on a development host. The lesson learned today is to get to know DMP a lot better - mind you I also doubt that some of our DMP configurations are correct and that gives more headaches.
The Open Systems environment seems to be getting more complex. The interrelationships between components mean there are more pitfalls to catch out even the most wary of Storage Managers. All the talk of bringing together the Storage Environment still seems to me to be a long way off. Even just managing the documentation of our environment takes a huge percentage of the day. Personally, I'd like to see a tool that allows me to both map out my current environment and to reserve disks and ports for future usage so I can capacity plan without having to maintain separate spreadsheets and documentation. Everything today seems to focus on just giving a view of the environment rather than understanding the role the Storage Manager has to perform.
24 January 2003
The migration from the "old" SAN to the "new" SAN continues. Our major issue seems to be the difference between the connector types for the fibre cabling. The older Brocade switches in our existing environment use the SC size connections, however the McDatas use the LC size, introduced to permit more ports to be fitted into a smaller space. This proves to be a pain for the older hosts and HBAs but a big earner for the supplier!
Comatose resources on Netware; Our new Netware 6 cluster keeps losing storage resources when we failover between nodes. We think it may be because we have allocated meta volumes down more than one pair of Fibre Adaptors on our EMC frame. After reconfiguring to a single pair, we wait to see the results. Talking of Netware 6, backups are proving to be a pain. Veritas Netbackup (Datacenter 3.4.1) can't determine where a storage target is located and so we have to attempt to backup all targets from all cluster members then use a script to determine whether all targets have been backed up everywhere. Not an ideal solution, but it works.
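The cross-check script is conceptually simple. Here's a sketch with made-up status data, since the real thing has to parse Netbackup job output:

```python
# Each cluster node attempts every storage target, so a target is
# protected if at least one node's attempt succeeded. The job-status
# data structure is an assumed stand-in for parsed Netbackup output.

def unprotected_targets(attempts):
    """attempts maps target -> list of per-node statuses; return the
    targets that no node backed up successfully."""
    return sorted(target for target, statuses in attempts.items()
                  if "OK" not in statuses)

if __name__ == "__main__":
    jobs = {"VOL1": ["OK", "FAILED"],      # another node held the resource
            "VOL2": ["FAILED", "FAILED"]}  # nobody got it - investigate
    print(unprotected_targets(jobs))  # ['VOL2']
```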