Business Continuance

All too frequently, business continuance and disaster recovery are associated with backups. 

Unfortunately, most people are under the unrealistic impression that if they have a backup of their system, they can recover their system.  All too frequently, people who think they have good backups do not.  Consider the following issues:

  • Keep documentation up to date.  In a production environment, time is of the essence. Configuring a UNIX system from scratch may take days, and it may be impossible to reconfigure it to its original specifications.  One way to keep documentation current is to use an automated documentation system that tracks the configuration.

 

  • Don't rely on 'recovery' tools alone.  Use recovery tools such as Hewlett-Packard's Ignite, but don't count on them to work in every case.  They have limitations, and certain criteria must be met (IP address range, etc.).  They are a very worthwhile asset, but they are not a panacea for system recovery.

 

  • The operating system must be INSTALLED.  A UNIX-based system cannot just be copied back onto a machine - it must be installed. Backup programs must also be reinstalled before any data can be recovered.  Do you have a copy of the media originally used to install the system?  Do you have all the patches? How about all the licenses?

 

  • The configuration of a UNIX system changes frequently.  It changes whenever the system administrator makes any change to the system. Notes are rarely kept in a neat, orderly fashion - especially if the system administrator is busy. Many times, the UNIX operating system itself is modified to accommodate databases, distribution applications and the like. This information must be carried over to the new system.

 

  • UNIX servers are not the same.  In the case of a real disaster, most companies will try to restore at a remote site. However, the equipment there won't be exactly the same. Systems differ even within the same brand or model line: their architectures vary and require different operating system components or patches to function.  If nothing else, the addressing on the drives won't be the same, so merely copying old configuration data (assuming it is complete) won't do the trick.

 

  • Know the relationship between disks and volumes.  Most enterprise-level UNIX servers use a disk management system called a logical volume manager (LVM).  In many cases the volume manager is native to the operating system (as with AIX and HP-UX).  On other systems, third-party products are used. This management system spreads UNIX filesystems across multiple disks, and it maintains the mapping between the filesystems and the physical disks that hold them. Disks will have different "names" on different systems, so merely copying system data won't provide the information needed to set up disks and rebuild filesystems properly.
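
One way to keep the disk-to-volume relationship on record is to store it as data rather than as loose notes. The following is a minimal sketch in Python, not a tool that ships with any volume manager: the class names, the remap_disks and required_mb helpers, and the device names are all hypothetical, and the layout would have to be filled in from your own volume manager's reports.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class LogicalVolume:
    """One logical volume and the filesystem built on it (illustrative fields)."""
    name: str
    size_mb: int
    fs_type: str
    mount_point: str
    mirror_copies: int = 0  # number of extra mirrored copies, 0 = unmirrored


@dataclass(frozen=True)
class VolumeGroup:
    """A volume group: the physical disks it spans and the volumes carved from it."""
    name: str
    physical_disks: List[str]
    logical_volumes: List[LogicalVolume]


def remap_disks(vg: VolumeGroup, device_map: Dict[str, str]) -> VolumeGroup:
    """Return a copy of vg with each disk renamed for the recovery system.

    device_map maps an original device name to its name on the new hardware.
    A disk with no entry raises KeyError, so gaps surface before any rebuild.
    """
    missing = [d for d in vg.physical_disks if d not in device_map]
    if missing:
        raise KeyError("no replacement disk recorded for: " + ", ".join(missing))
    return replace(vg, physical_disks=[device_map[d] for d in vg.physical_disks])


def required_mb(vg: VolumeGroup) -> int:
    """Raw space the recovery site must supply, counting every mirror copy."""
    return sum(lv.size_mb * (1 + lv.mirror_copies) for lv in vg.logical_volumes)


# Hypothetical layout and device names, for illustration only.
vg01 = VolumeGroup(
    name="vg01",
    physical_disks=["/dev/dsk/c1t2d0", "/dev/dsk/c2t2d0"],
    logical_volumes=[
        LogicalVolume("lvol1", 4096, "vxfs", "/u01", mirror_copies=1),
        LogicalVolume("lvol2", 1024, "vxfs", "/u02"),
    ],
)
recovery_vg01 = remap_disks(
    vg01,
    {"/dev/dsk/c1t2d0": "/dev/dsk/c4t0d0", "/dev/dsk/c2t2d0": "/dev/dsk/c5t0d0"},
)
```

With the layout recorded this way, the recovery site only has to supply the table of old and new disk names, and a disk with no replacement is reported before any volume group is rebuilt.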

If you are responsible for the disaster recovery plan for your UNIX system, make sure that you always have current information on:

  • Logical volumes, volume groups and filesystems, including type, size, mirrors, etc.
  • Kernel parameters and system input-output configurations.
  • Configuration files (there are a lot of them) such as the services file.
  • Startup and shutdown information - including scripts, order, etc.
  • cron jobs.
  • Software licenses and licensing information.
  • The other systems, printers, etc. that the server communicates with. Identify them beforehand and make the re-establishment of those communications part of your plan.  You will probably be surprised at the number and variety of systems that the server communicates with.
  • Testing procedures for every application running on your system.
  • Procedures for storing tapes, recalling tapes and personnel, etc. Note: Make sure they work. Just because a procedure is written down doesn't mean it works.
  • The real cost of having your system down for one day. This cost should include lost sales, the cost of idle programmers and support personnel, etc. This figure will help keep the cost of disaster recovery preparation and maintenance in perspective.
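
Keeping this information current is easier when changes are detected automatically. As a minimal sketch (assuming Python is available on the server; the function names and the list of watched files are illustrative, with only the services file taken from the list above), the following records a checksum for each configuration file and reports which files changed between two runs:

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable, List


def build_manifest(paths: Iterable[str]) -> Dict[str, str]:
    """Map each path to the SHA-256 of its contents, or "MISSING" if absent."""
    manifest = {}
    for name in paths:
        p = Path(name)
        if p.is_file():
            manifest[name] = hashlib.sha256(p.read_bytes()).hexdigest()
        else:
            manifest[name] = "MISSING"
    return manifest


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return, in sorted order, every path whose entry differs between manifests."""
    return [
        name
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    ]


if __name__ == "__main__":
    # Example paths only; substitute the files that matter on your system.
    watched = ["/etc/services", "/etc/inittab", "/etc/fstab"]
    print(json.dumps(build_manifest(watched), indent=2))
```

Saving each run's output and comparing it with the previous one using changed_files gives a dated record of when a file changed, which is the point at which the written documentation should be updated as well.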

One final point:  The devil is in the details!

Trident Systems has recovered HP 9000 production systems several times, including recoveries at test sites: Hewlett-Packard's Performance Center in California and Sungard in Philadelphia, PA. Many companies claim they have some mystical shortcut to disaster recovery. Usually these shortcuts come with numerous caveats or unrealistic assumptions. These assumptions are shortcuts to disaster.

Don't wait until disaster strikes! Be prepared!

 
Copyright 1998-2007 Trident Systems, LLC