Great discussion. One thing some people have a hard time comprehending about TSM is that it's not like other backup solutions, such as Veritas, that can give you an accurate yes-or-no answer. TSM is a living system: backups may run throughout a 24-hour period, and schedules run at varying intervals (weekly, daily, bi-weekly, monthly, manually). That makes reporting for such an environment very difficult for some to grasp.
One other thing to do is to find the person who is the 'liaison', if you will, and have a sit-down. Describe the basics of how TSM works and its reporting. Knowing that a backup is Started, Missed, or Failed doesn't really tell you what the problem was in some cases. It takes a bit of knowledge.

You can also set up separate reporting that kicks off at different times of the day. That way you could perhaps get the success messages. You can at least find out approximately what time the backups complete and kick your report off near that time.

It's been hard for me, since I was handed a TSM system to admin and had NEVER touched one before. It's been quite a learning experience, and tons of 'fun' trying to describe it to others. The Operational Reporter is the best tool you can get, but it may take some expertise to get it working how you want it. Filters filters filters :) You can also hire a contractor and have them build you some nice reports if it's in your budget.

On 11/9/06, Prather, Wanda <[EMAIL PROTECTED]> wrote:
Yes, it frequently takes some work to customize the report. But if customized properly, it won't be too large to go through reliably, even for 100 clients.

Strategies my customers have used:

1) You need to set up the filters on the ACTIVITY LOG section of the report so it only includes ACTIONABLE messages. For example, you don't need to see all the entries for files that failed to back up because they were in use, since they are listed in a later section of the report as well. So you just exclude that message number. Likewise, you will find a lot of nuisance messages that show up as errors when they were really just a syntax error on a command the admin typed in. You can exclude those as well, and you will get the report down to just the actionable messages.

2) Turn OFF the MISSED FILES SUMMARY section so it isn't produced. That just seems pointless to me; you can't do anything with it. The relevant information is in the MISSED FILE DETAILS section.

3) Turn OFF any other sections that you don't need to see; for example, some of the graphs may not be needed every day.

4) Set the option that says "use collapsable sections"; then the viewer can look at just 1 section at a time. On most days, the only sections that have to be reviewed will be the CUSTOM SUMMARY, the ACTIVITY LOG DETAILS, ADMIN SCHEDULE STATUS, and CLIENT SCHEDULE STATUS. MISSED FILE DETAILS should be reviewed about once a week, and the admin should take action on the files that are being missed regularly.

5) The term "successful completion of ALL backups" doesn't mean much to TSM; there are plenty of sites that run backups continuously, 24 hours a day, and schedules overlap. So which schedules are completed depends on what time of day you ask. What you should do with TOR is have the TSM administrator adjust the client schedule windows so that there is a time gap BETWEEN the schedules where you can run the TOR report.
That will eliminate MOST of the missing reports, unless a client is hung and running hours behind schedule.

6) Consider distributing some of the workload. If a client misses or fails a backup, is your TSM admin supposed to GO to that client and rerun the backup, or do they just notify the owner of the machine? Can you make the client owner responsible for dealing with it? In that case, all you have to do is set up TOR so that it sends an email to the client owner with the schedule status.

FWIW, 100 TSM clients isn't actually all that many in TSM land. I think your TSM admin just needs some help getting a grip on what needs to be addressed.

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:[EMAIL PROTECTED] On Behalf Of Wesley Smith
Sent: Thursday, November 09, 2006 12:01 PM
To: ADSM-L@VM.MARIST.EDU
Subject: FW: [ADSM-L] How do you verify the Completion and Accuracy of Backups and Restores?

Thanks, Wanda. I believe the TOR tool is what they are currently using. I think a large part of their problem is that the reports are so large that it is impossible for one person to go through the report every day with any amount of reliability. I know that they are responsible for handling the backups of well over 100 servers and that it is being done by just one person.

I've seen the report as currently generated and noted a number of problems with it. The report runs at a scheduled time rather than having job triggers that would kick it off after the successful completion of all backups. As a result, the report will show backups that started without showing that they have completed. On some days, there will be very few of these. On other days, quite a few. Throw stuff like that into the mix of the real errors and other "pseudo errors" and you find yourself trying to chase down a lot of non-errors.
I will be passing along to the appropriate people that perhaps there is some additional filtering that could be done to these reports to reduce their size to something more manageable. I'm hoping that we will be able to come up with some filtering and scripting aids that will help to automate this process as much as possible and minimize the need for the Tivoli support person to spend a lot of time every day just reviewing the night's work.

Thanks again for your time and help.

Wesley

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:[EMAIL PROTECTED] On Behalf Of Prather, Wanda
Sent: Wednesday, November 08, 2006 1:35 PM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] How do you verify the Completion and Accuracy of Backups and Restores?

Ditto.

To start, set up the TSM Operational Reporter (TOR). It is free as part of the product. It works really well right out of the box, but can also be customized to do some clever things.

If your TSM folks are running a non-Windows TSM server, they may not be familiar with it, as it is a Windows application. If your TSM server is Windows, TOR gets installed when the server is installed, but you still need to configure it. If your TSM server isn't Windows, you'll need to install TOR separately on a Windows host. But it doesn't have to be a Windows server or anything fancy; you can run it on your desktop.

It will tell you, every day, EXACTLY which backup schedules completed or did not, which clients had missed files, and what those files were. It also scrapes the TSM activity log for any server-end messages that need attention (although there are also frequently nuisance messages that you will want to filter out, using the customization available in TOR). You can have the reports generated as HTML that is available for browsing, or mailed to you.

Sounds like nobody has done this yet. SOMEBODY SHOULD REVIEW THIS REPORT EVERY DAY. AND ACTUALLY ATTEND TO THE THINGS THAT NEED ATTENDING.
You can read about TOR in the "monitoring your server" section of the TSM Administrator's Guide.

If the missed backups you are referring to are databases that are being backed up using a TSM Data Protection agent (backing up through the API), you may have to be creative about gathering the reports from those logs (esp. with Oracle - I think you have to actually view the RMAN logs to guarantee that those worked correctly.) But I have had success writing very small scripts (e.g. perl) that scrape the information out of those logs and send it to be displayed in the TOR Daily report.

Wanda Prather
"I/O, I/O, It's all about I/O" -(me)

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:[EMAIL PROTECTED] On Behalf Of Mark Stapleton
Sent: Wednesday, November 08, 2006 11:44 AM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: How do you verify the Completion and Accuracy of Backups and Restores?

From: ADSM: Dist Stor Manager [mailto:[EMAIL PROTECTED] On Behalf Of Wesley Smith
> My problem is that they (that sister agency) do not seem to have a
> reliable way of verifying that all backups have been properly
> completed. They don't even seem to have a way to know that all files
> (that need to be backed up) are being backed up. I've seen the reports
> that get generated during the backup process and I am definitely
> unimpressed. Backups start and backups complete. There doesn't seem to
> be anything that says how many rows are copied or how large the files
> are or anything else that could be used for verifying the accuracy of
> the backups. They tell my folks that we should trust Tivoli is doing
> the job correctly. Trust is the problem....

Let's start there. When you look at the dsmsched.log file, which contains a record of all scheduled backups and their outcomes, you should have a record of which files are backed up and the size of the files, and the timestamps give an idea of how long it took to back each file up.
(This is assuming that the QUIET option is not present in the client option file or the client option set designated for that TSM client.) If you're using the specialized TSM agents for databases or mail apps, the scheduled backup logs contain fairly granular information about individual file backups. What more do you need?

> We have needed to have restores done on just a few databases in the
> past and the restores were not complete and up to date. In each case
> we were able to rebuild the data using logs maintained within the
> applications but that should not have been necessary. Each recovery was
> done at a point after a backup and before additional processing had been
> done within the apps so they should have been complete. In each case,
> the folks who run Tivoli for us were able to track down and show that
> problems had occurred during the processing of the backups. They did
> this through circumstantial evidence and in each case once again said
> that they have no way of verifying that the backups are actually good.
> I hear a lot about the difficulty of trying to write a program to
> process the Tivoli log files.
>
> I think I'm at wit's end with these folks and the product. I know
> that the people are competent and I suspect that the product (like
> other things available from IBM) really is weak on the reporting and
> verification issue.

While TSM itself does lack some reporting functionality (particularly when it comes to client backups and restores), I have to say this:

On every properly maintained and monitored TSM system I have touched in the 12 years I've administered and engineered this product, I have *never* lost a single byte of information. Period.

If you cannot do a restore because of "lost" data, something is happening during backups that is not being caught at the time of the backups.

> I'm hoping that someone out there in the Big Wide World has already
> solved this problem with an in-house or third-party solution.
> Sorry for being so long winded. Any ideas...?

I think what is needed here is greater familiarity with TSM and its proper administration.

Proper verification of good backups is best done by regular DR practice of planned bare-metal restores of chosen machines. If you can take data backed up by TSM and restore a given machine in a DR environment, and the machine comes back properly, you know the job is being done right. If it doesn't, *then* you dig into *why*.

BTW, there are responses to this thread advocating ServerGraph and Bocada for reporting and monitoring. Be aware that those applications do a fine job of monitoring server operations. (Well, ServerGraph does, anyway.) Their reporting, however, is not granular enough to indicate whether a given file is being backed up properly.

--
Mark Stapleton ([EMAIL PROTECTED])
Senior TSM consultant
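
For anyone who wants to try the log-scraping approach that comes up in this thread (small scripts that pull schedule outcomes out of the client logs), here is a minimal Python sketch. It is not anything TOR or TSM ships. The two regular expressions are assumptions about how lines in dsmsched.log typically look (a "Scheduled event '...' completed successfully" or "failed" line, and "Total number of objects ..." summary lines), and the names `summarize`, `needs_attention`, and `ScheduleReport` are made up for illustration. Check the patterns against your own logs and client version before trusting the output.

```python
import re
from dataclasses import dataclass, field

# ASSUMED line shapes; confirm them against a real dsmsched.log first.
EVENT_RE = re.compile(
    r"Scheduled event '(?P<name>[^']+)' "
    r"(?P<outcome>completed successfully|failed)"
)
COUNT_RE = re.compile(
    r"Total number of objects (?P<what>backed up|failed):\s+(?P<count>[\d,]+)"
)


@dataclass
class ScheduleReport:
    # schedule name -> last outcome seen in the log
    outcomes: dict = field(default_factory=dict)
    # "backed up" / "failed" -> last object count seen in the log
    counts: dict = field(default_factory=dict)


def summarize(lines):
    """Scan log lines and keep the most recent outcome and object counts."""
    report = ScheduleReport()
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            report.outcomes[event.group("name")] = event.group("outcome")
            continue
        count = COUNT_RE.search(line)
        if count:
            # Counts may be written with thousands separators, e.g. "1,204".
            value = int(count.group("count").replace(",", ""))
            report.counts[count.group("what")] = value
    return report


def needs_attention(report):
    """Return the names of schedules whose last recorded outcome was a failure."""
    return sorted(
        name for name, outcome in report.outcomes.items() if outcome == "failed"
    )
```

To use it, pass any iterable of lines, for example `summarize(open(path_to_log))`, and mail or display the result of `needs_attention(...)`. The sketch only keeps the last counts it sees, so it suits a log that is pruned or rotated daily; it does not tell you whether an individual file was backed up correctly, which is the point Mark makes about restore testing.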