Diana,
Sorry to chime in late on this, but you've hit a subject I've been
struggling with for quite some time.
We have some pretty large Windows NT file and print servers using MSCS.
Each server has lots of small files (1.5 to 2.5 million) and total disk
space (the D: drive) between 150GB and 200GB, on Compaq servers with two
400MHz Xeon processors and 400MB of RAM. We have been running TSM on the
mainframe since ADSM version 1 and are currently at 3.7 of the TSM server,
with 3.7.2.01 and 4.1.2 on the NT clients.
Our Windows NT admins have had a concern for quite some time regarding TSM
restore performance and how long it would take to restore that big old D:
drive. They don't see the value in TSM as a whole as compared to the
competition; they just want to know how fast you can recover an entire D:
drive. They decided they wanted to perform weekly full backups to
direct-attached DLT drives using ARCserve and would use the TSM
incrementals to forward-recover during a full volume restore. We finally
had to recover one of those big D: drives this past September. The
ARCserve portion of the recovery took about 10 hours, if I recall
correctly. The TSM forward recovery ran for 36 hours and restored only
about 8.5GB. They were not pleased. It seems all that comparing took quite
some time. I've been trying to get to the root of the bottleneck since
then. I've worked with support on and off over the last few months
performing various traces and the like. At this point we are looking at
mainframe TCP/IP and delays in acknowledgments coming out of the mainframe
during test restores.
If you've worked with TSM for a number of years, then through sources in
IBM/Tivoli and the valuable information from this listserv you learn over
time about all the TSM client and server "knobs" to turn to try to get
maximum performance: things like BUFPOOLSIZE, database cache hit ratios,
housekeeping processes running at the same time as backups/restores and
slowing things down, network issues like auto-negotiate on NICs and MTU
sizes, TSM server database and log disk placement, and tape drive
load/seek times and speeds and feeds. Basically, I think we are pretty
well set on all those important things to consider. This problem we are
having may turn out to be a mainframe TCP/IP issue in the end, but I am
not sure that will be the complete picture.
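For anyone tuning along at home, these are the sorts of entries I mean.
The option names are real client/server options from this era of TSM, but
the values are purely illustrative starting points, not recommendations;
every environment is different:

    * dsm.opt (NT client) - values for illustration only
    TCPWINDOWSIZE     63
    TCPBUFFSIZE       32
    TXNBYTELIMIT      25600

    * dsmserv.opt (TSM server) - values for illustration only
    BUFPOOLSIZE       131072
    TXNGROUPMAX       256
    MOVEBATCHSIZE     1000
    MOVESIZETHRESH    2048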
We have recently installed an AIX TSM server: an H80 two-way with 2GB of
memory, 380GB of EMC 3430 disk, and six Fibre Channel 3590-E1A drives in a
3494, with the TSM server at 4.1.2. We plan to move most of the larger
clients from the TSM OS/390 server to the AIX TSM server; according to
many posts on this listserv over the years, that's a good move for
realizing a performance improvement. I am in the process of testing my NT
"problem children" as quickly as I can to prove this configuration will
address the concerns our NT admins have about restores of large NT
servers. I'm trying to keep them from installing a Veritas SAN solution
and asking them to stick with our enterprise backup strategic direction,
which is to utilize TSM. As you probably know, the SAN-enabled TSM
backup/archive client for NT is not here and may never be, from what I've
heard. My only option at this point is SAN tape library sharing, with the
TSM client and server on the same machine for each of our MSCS servers.
Now I'm sure many of you reading this may be thinking of things like, "why
not break the D: drive into smaller partitions so you can collocate by
filespace and restore all the data concurrently?" No go, guys. They don't
want to change the way they configure their servers just to accommodate
TSM when they feel they would not have to with other products. They feel
that with 144GB single drives around the corner, who is to say what a
"big" NT partition is? NT seems to support these large drives without
issues. (Their words, not mine.)
Back to the issue. Our initial backup tests using our new AIX TSM server
have produced significant improvements in performance. I am just getting
the pieces in place to perform restore tests. My first test a couple of
days ago was to restore part of the data from the server we had the issue
with in September. It took about one hour to lay down just the directories
before restoring any files. That's probably still better than the
mainframe, but not great. My plan for future tests is to perform backups
and restores of the same data to and from both of my TSM servers to
compare performance. I will share the results with you and the rest of the
listserv as I progress.
In general I have always, like many other TSM users, achieved much better
restore/backup rates with larger files than with lots of smaller files.
Assuming you've done all the right tuning, the question that comes to my
mind is: does it really come down to the architecture? The TSM database
makes things very easy for the day-to-day smaller recoveries, which are
the type we perform most. But does the architecture that makes day-to-day
operations easier not lend itself well to backup/recovery of large amounts
of data made up of small files? I have very little experience with
competing products. Do they struggle with lots of small files as well?
Veritas, ARCserve, anyone? If the bottleneck is, as some on the listserv
have suggested, frequent interaction with the client file system, then I
suppose the answer would be yes, the other products have the same problem.
Or is the issue more on the TSM database side due to its design, such that
other products using different architectures may not have this problem?
Maybe the competition's architecture is less bulletproof, but if you're
one of our NT admins you don't seem to care when the client keeps calling
asking how much longer the restore will be running. I know TSM development
is aware of the issues with lots of small files, and I would be curious
what they plan to do about the problems Diana and I have experienced.
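To put the small-file penalty in perspective, here is a quick
back-of-envelope calculation using our September numbers. The 4KB average
file size is my own assumption for illustration, not a measured figure:

    # Implied per-file cost from our September restore: 36 hours to put
    # back about 8.5GB of small files. The 4KB average file size is an
    # assumed figure, not measured.
    restore_seconds = 36 * 3600
    total_bytes = 8.5 * 2**30
    avg_file_size = 4 * 2**10               # assumed average small-file size
    files = total_bytes / avg_file_size     # roughly 2.2 million files
    ms_per_file = 1000.0 * restore_seconds / files
    print("%d files, about %.0f ms per file" % (files, ms_per_file))

That works out to something like 58ms of client, network, server, and
database work per file, which dwarfs the time needed to actually move a
few kilobytes of data.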
The newer client option, RESOURCEUTILIZATION, has helped with backing up
clients with lots of small files more quickly. I would love to see the
same type of automated multi-tasking on restores. I don't know the
specifics of how this actually works, but it seems to me that when I ask
to restore an entire NT drive, for example, the TSM client/server must
sort the file list in some fashion to intelligently request tape volumes
and minimize the mounts required. If that's the case, could they take
things one step further and add an option to the restore specifying the
number of concurrent sessions/mount points to be used to perform the
restore? For example, if I have a node whose collocated data is spread
across twenty tapes and I have six tape drives available for the recovery,
how about an option for the restore command like:

RES -subd=y -nummp=6 d:\*

where the -nummp option would be the number of mount points/tape drives to
be used for the restore. TSM could sort the file list, come up with the
list of tapes to be used for the restore, and perhaps spread the mounts
across six sessions/mount points. I'm sure I've probably made a complex
task sound simple, but this type of option would be very useful. I think
many of us have seen the benefits of running multiple sessions to reduce
recovery elapsed time. I find my current choices for doing so difficult to
implement or politically undesirable.
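For what it's worth, the closest do-it-yourself equivalent I know of today
is hand-splitting the restore into several concurrent dsmc sessions, one
per top-level directory. A rough sketch is below; the directory names are
placeholders, and the split only balances well if the data happens to be
spread evenly across them:

    # Rough sketch: fan one big restore out across several concurrent
    # dsmc sessions, one per top-level directory. Directory names below
    # are placeholders.
    import subprocess

    top_dirs = [r"d:\apps", r"d:\data", r"d:\home", r"d:\shared"]
    sessions = [
        subprocess.Popen(["dsmc", "restore", d + "\\*", "-subdir=yes"])
        for d in top_dirs
    ]
    for s in sessions:
        s.wait()    # block until every session finishes

Of course, without the data broken into separate filespaces, these
sessions can all end up queuing on the same tape volumes, which brings us
right back to the collocation argument the NT admins rejected.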
If others have the same issues with lots of small files, in particular
with Windows NT clients, let's hear from you. Maybe we can come up with
some enhancement requests. I'll pass on the results of my tests as stated
above. I'd be interested in hearing from those of you who have worked with
other products and can tell me whether they have the same performance
problems with lots of small files. If the performance of other products is
impacted in the same way as TSM's, that would be good to know. If it's
more about the Windows NT NTFS file system, I'd be satisfied with that
explanation as well. If it's that lots of interaction with the TSM
database leads to slower performance, even when optimally configured, then
I'd like to know what Tivoli has in the works to address the issue.
Because if it's the TSM database, I could probably install the fattest
Fibre Channel/network pipe with the fastest peripherals and server
hardware around and it might not change a thing.
Thanks
Jeff Connor
Niagara Mohawk Power Corp.
"Diana J.Cline" <[EMAIL PROTECTED]>@VM.MARIST.EDU> on
02/14/2001 10:04:52 AM
Please respond to "ADSM: Dist Stor Manager" <[EMAIL PROTECTED]>
Sent by: "ADSM: Dist Stor Manager" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
cc:
Subject: Performance Large Files vs. Small Files
Using an NT Client and an AIX Server
Does anyone have a TECHNICAL reason why I can back up 30GB of 2GB files
that are stored in one directory so much faster than 30GB of 2KB files
that are stored in a bunch of directories?
I know that this is the case; I just would like to find out why. If the
amount of data is the same and the network data transfer rate is the same
between the two backups, why does it take the TSM server so much longer to
process the files sent by the backup with the larger number of files in
multiple directories?
I sure would like to have the answer to this. We are trying to complete an
incremental backup of an NT server with about 3 million small objects
(according to TSM) in many, many folders, and it can't even get done in 12
hours. The actual amount of data transferred is only about 7GB per night.
We have other backups that can complete 50GB in 5 hours, but they are in
one directory and the number of files is smaller.
Thanks
Network data transfer rate
--------------------------
The average rate at which the network transfers data between
the TSM client and the TSM server, calculated by dividing the
total number of bytes transferred by the time to transfer the
data over the network. The time it takes for TSM to process
objects is not included in the network transfer rate. Therefore,
the network transfer rate is higher than the aggregate transfer
rate.
Aggregate data transfer rate
----------------------------
The average rate at which TSM and the network transfer data
between the TSM client and the TSM server, calculated by
dividing the total number of bytes transferred by the time
that elapses from the beginning to the end of the process.
Both TSM processing and network time are included in the
aggregate transfer rate. Therefore, the aggregate transfer
rate is lower than the network transfer rate.
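Plugging the numbers from the message above into those definitions makes
the gap concrete. These are rough figures, assuming the elapsed times
quoted:

    # Aggregate rates from the numbers above: 7GB of small files in 12+
    # hours versus 50GB of larger files in 5 hours. Rough figures only.
    GB, MB = 2**30, 2**20
    small = 7 * GB / (12 * 3600.0)     # aggregate rate, small-file backup
    large = 50 * GB / (5 * 3600.0)     # aggregate rate, large-file backup
    print("small files: %.2f MB/s" % (small / MB))   # ~0.17 MB/s
    print("large files: %.2f MB/s" % (large / MB))   # ~2.84 MB/s

The network rate could be identical in both cases; the roughly 17x
difference in aggregate rate is all per-object processing time.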