On Jan 21, 2009, at 4:35 PM, Fred Johanson wrote:
We've got two machines which are identical in every way, same hardware, same AIX, same TSM level, same options set, same storage pools, domains, everything except one has the DB and Log on local disk. On this box things run very slowly: expiration may take a week, filespace deletion creeps, and we see this as the normal behavior
Fred - In my experience, it's seldom the case that two machines of the same model are identical, even if ordered at the same time. Vendors are fond of outfitting computers with "equivalent" components, from multiple suppliers, during manufacturing. It's common to see the same type of memory, cards, and disks, but from different OEMs. The cartons containing the computers may well have come from different manufacturing periods or different assembly plants. There can easily be variations in configuration methods for differing components; and sometimes there are factory errors in placing jumpers on drives and cards. Where the OS perceives hardware elements as different, differing device driver software may be called upon, and that software may have differing characteristics as it evolves.

Your staff will have to perform all the usual configuration and performance reviews involved in chasing a problem like this. Since this is AIX, the first things I would look at are the Error Log, for any irregularities, and 'lscfg -v', for a detailed comparison of the two boxes. You are fortunate to have a comparable system with good performance as a basis for pursuit.

Given that this is local disk, also compare the disk attributes on the two systems, as via 'lsattr -El hdisk4', and particularly check the queue_depth value. For disks which AIX recognizes and has been programmed for, a Queue Depth of 3 or more will be used and you will get good performance; for disks that the OS does not recognize, it will minimize Queue Depth to 1, and performance will be poor. (See discussions of Queue Depth on the Web to get a sense of the impact.)

Your computer implementation people may not have benchmarked or otherwise performance-checked the systems when they came in, where one would perform disk and memory stress testing as part of acceptance verification before committing the boxes to production. If not, consider making small logical volumes on the disks now, to perform such disk performance tests.
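As a rough illustration of the attribute comparison suggested above, here is a minimal sketch. It assumes the output of 'lsattr -El hdisk4' has been saved to a file on each machine, and that each output line begins with the attribute name followed by its value; the function name compare_attrs and the file names are hypothetical, not anything supplied by AIX.

```shell
#!/bin/sh
# Sketch only: compare two saved 'lsattr -El <disk>' outputs, one per machine.
# Assumption: column 1 is the attribute name and column 2 is its value.
# Prints one line per attribute present in both files whose values differ:
#   <attribute> <value in first file> <value in second file>
compare_attrs() {
    awk '
        NR == FNR { first[$1] = $2; next }   # first file: remember name -> value
        ($1 in first) && first[$1] != $2 {   # second file: report mismatches
            printf "%s %s %s\n", $1, first[$1], $2
        }
    ' "$1" "$2"
}

# Hypothetical usage, after running on each box:
#   lsattr -El hdisk4 > hdisk4.attrs
# and copying both files to one place:
#   compare_attrs good_box.hdisk4.attrs slow_box.hdisk4.attrs
```

A queue_depth line in the result would point straight at the situation described above. Attributes that exist on only one of the two disks are not reported by this sketch, so a plain diff of the two files is still worth a look.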
And definitely make sure that all installed memory and processors are online: I've seen such elements rather quietly fail and go offline, resulting in impaired performance whose cause is not apparent.

One thing that I do when performance is hurting on AIX systems is an iptrace/ipreport run, to see what network traffic is hitting the box. A DoS attack, or unintentional equivalent, is identifiable only by such an examination. Inordinate network activity can greatly elevate the interrupt rate on the system (particularly with GigE), which will congest the system bus. Look for and put a stop to any network access which should not be happening.

Beyond all that, you need to perform stepwise examination of the system elements upon which TSM, as an application, runs, including things like I/O path contention from other system activity. Thereafter, examination of TSM elements would be warranted.

Richard Sims
http://people.bu.edu/rbs/