Veritas Netbackup is a powerful network-centric backup product. Like all tools, it requires careful configuration to get the best performance from it.
Take a company, such as mine, that uses Netbackup as its enterprise-wide backup product across all open system platforms (AIX, Solaris and Windows of various flavours). We recently upgraded the infrastructure to double the number of media servers and to add new drives, but we were still seeing performance bottlenecks at certain times of the backup schedule. How should this problem be resolved?
First, let’s set the scene. We use Netbackup 3.4.1 across two datacentres, each with four media servers. Full backups are performed weekly (each Friday) and incremental backups are performed on the other days.
Several aspects of how Netbackup uses available resources have a direct impact on any configuration work we perform. These include:
• A class cannot use more than one media server (a class represents a group of servers).
• A class can only back up one data type, for example Unix filesystem or Oracle database.
• Backups with different expiration dates cannot reside on the same tape unless the administrator specifically overrides this, and mixing them would not be a desirable configuration in any case.
• Netbackup will not request a new tape and drive if an appropriate tape is already mounted and available on a drive. To be appropriate, the tape must match on expiration type and tape pool, but it does not need to match on class.
• Netbackup allows multiple concurrent data streams to a single tape device.
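The tape-sharing rule in the list above can be sketched as a small predicate. This is only a simplified model of the behaviour described here, not Netbackup's actual logic; the `Backup` type, the `can_share_tape` function, and the class and pool names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    backup_class: str  # Netbackup class name (hypothetical examples below)
    pool: str          # volume (tape) pool
    retention: str     # expiration type


def can_share_tape(mounted: Backup, incoming: Backup) -> bool:
    """Return True if `incoming` could use the tape already mounted for `mounted`.

    Simplified model of the rule in the text: pool and expiration type
    must match, while the class is deliberately not compared.
    """
    return mounted.pool == incoming.pool and mounted.retention == incoming.retention


# Illustrative backups; none of these names come from a real configuration.
fs = Backup("unix_fs", "UNIX", "weekly_full")
db = Backup("unix_oracle", "UNIX", "weekly_full")
nt = Backup("windows_fs", "WINDOWS", "weekly_full")

print(can_share_tape(fs, db))  # True: same pool and expiration, different class
print(can_share_tape(fs, nt))  # False: different pool forces a new tape and drive
```

Under this model, every distinct combination of pool and expiration type is a potential extra tape mount, which is why the number of pools matters so much.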
Our aim was to remove the bottlenecks within the configuration and to have as much work going through the system as possible.
A review of existing backups highlighted a number of problems. First, backups for Unix filesystem, database, Netware and Windows 2000 were all using separate volume pools. In addition, the three database types in use (Oracle, Informix and Sybase) were also using separate pools. This meant that almost every backup type requested its own tape and drive, causing significant queuing for certain workloads. Another problem was that too much work was concentrated on a single media server while other media servers sat idle. They could not be used because of the one-to-one relationship between class and media server.
How did we fix the problem?
Our first task was to reduce the number of tape pools. We decided to retain only three pools per site from a previous eleven: one for Unix (both filesystem and database backups), one for Windows and Netware, and one for Lotus Notes.
Next, we reduced the number of classes for ease of management and grouped classes using the same volume pool onto the same media server. This gave us a media server for all Unix backups, a server for Windows/Netware and a third for Lotus Notes. We then had a spare server for future use.
Third, we brought expiration types for backups into line with each other on the same media server, so Unix database and filesystem backups expire at the same time. That way they can share the same tape during backups. Finally, we experimented with increasing the multiplexing of concurrent backups to one drive for each of the backup types. We managed to configure 8 concurrent backups for each type without a negative impact on throughput.
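The combined effect of pool consolidation, aligned expiration types and multiplexing can be illustrated with some rough arithmetic. The sketch below assumes that backups sharing a pool and expiration type can share a drive up to the multiplexing limit; the `drives_needed` function, the pool names and the job counts are invented for illustration and are not taken from our configuration.

```python
from collections import Counter
from math import ceil


def drives_needed(jobs, multiplex):
    """Estimate drives required to run all jobs at once.

    `jobs` is a list of (pool, expiration_type) pairs. Jobs in the same
    group share a drive, up to `multiplex` concurrent streams per drive.
    """
    groups = Counter(jobs)
    return sum(ceil(n / multiplex) for n in groups.values())


# Hypothetical workload before consolidation: four pools, mixed expirations.
before = (
    [("UNIX_FS", "weekly")] * 6
    + [("ORACLE", "monthly")] * 4
    + [("INFORMIX", "weekly")] * 3
    + [("SYBASE", "weekly")] * 3
)
# The same 16 jobs after merging into one pool with one expiration type.
after = [("UNIX", "weekly")] * 16

print(drives_needed(before, multiplex=1))  # 16: one drive per job
print(drives_needed(before, multiplex=8))  # 4: one drive per pool
print(drives_needed(after, multiplex=8))   # 2: 16 jobs over 8 streams each
```

The point of the sketch is that multiplexing alone helps only within a group; merging pools and aligning expiration types is what lets the groups themselves collapse.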
Making these changes to Netbackup dramatically improved our workload throughput, eliminating all of the previous workload bottlenecks. Currently, full backups are all started at 19:00 each Friday. As we allow Netbackup to manage the backup queues, the final Unix backups complete during Saturday or early morning Sunday, which is within an accepted service level.
Although these changes resolved the majority of our performance and throughput problems, one further requirement, not yet discussed, still needed to be implemented.
Most backup work is not time-dependent. However, some backups are integrated into batch scheduling and therefore need to run almost immediately. This is not a simple task to achieve within Netbackup without permanently dedicating tape drive resources, which was something we wanted to avoid. Our approach was to create another set of classes, named after the standard classes but with a “P” suffix to indicate “priority” work. These classes were given a higher priority within Netbackup, causing them to be selected first when the backup queue is processed. Although not a perfect solution, it does allow priority workload to be scheduled more quickly than in the previous configuration.
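The queueing behaviour described above can be sketched as a simple priority queue. This is a toy model, not Netbackup's scheduler: the `BackupQueue` class, the class names and the client names are hypothetical, and deriving priority from a trailing “P” is a shortcut for the example (in practice the priority is set on the class itself).

```python
import heapq
from itertools import count


class BackupQueue:
    """Toy job queue: priority classes are dequeued before standard ones."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that keeps submission order

    def submit(self, class_name, client):
        # Assumption for this sketch: a trailing "P" marks a priority class.
        priority = 1 if class_name.endswith("P") else 0
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), class_name, client))

    def next_job(self):
        _, _, class_name, client = heapq.heappop(self._heap)
        return class_name, client


queue = BackupQueue()
queue.submit("UNIX_FS", "host_a")
queue.submit("UNIX_DB", "host_b")
queue.submit("UNIX_DBP", "host_c")  # priority variant of UNIX_DB

print(queue.next_job())  # ('UNIX_DBP', 'host_c')
print(queue.next_job())  # ('UNIX_FS', 'host_a')
```

As in the real configuration, a priority job only jumps the queue; it does not interrupt work already running, so it may still wait for a drive to become free.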
In summary, we designed our Netbackup configuration to match the restrictions imposed by the product. This allowed us to manage an increasing workload. We now only need to monitor failures and the time the last backup completes to ensure we are meeting our SLAs.