Zoltan:
  I wish I could give you more details about the NAS/storage device 
connections, but either a) I’m not privy to that information; or b) I know it 
only as the SAN fabric.  That is, our largest backups are from systems in our 
server farm that are part of the same SAN fabric as both the system running the 
SP client doing the backups AND the system hosting the TSM server.  There is a 
10 Gb pipe connecting the two physical systems, but that hasn’t ever been the 
bottleneck.  And the system running the SP client is a VM as well.

  Our bigger challenge was filesystems or shares with lots of files.  This is 
where the proxy node strategy came into play.  We were able to work with the 
system admins to split the backup of those filesystems into many smaller 
(in terms of number of files) backups that started deeper in the filesystem.  
That is, instead of running a backup against
\\rams\som\TSM\FC\*
We would have one backup running through PROXY.NODE1 for
\\rams\som\TSM\FC\dir1\*
While another was running through PROXY.NODE2 for
\\rams\som\TSM\FC\dir2\*
And so on and so forth.
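
In case it helps anyone who hasn’t set up proxy nodes before, the only 
server-side plumbing each agent needs is proxy authority over the target 
node, along the lines of (node names here are just the ones from the 
example below):

grant proxynode target=DATANODE agent=PROXY.NODE1

Anything that then runs under PROXY.NODE1 with -asnodename=DATANODE is 
stored under the DATANODE filespaces, so all the pieces land in one place.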

We did this using a set of client schedules that used the “objects” option to 
specify the directory in question:

def sched DOMAIN PROXY.NODE1.HOUR01 action=incr options='-subdir=yes 
-asnodename=DATANODE' objects='"\\rams\som\TSM\FC\dir1\*"' startt=01:00 
dur=1 duru=hour

Where DATANODE is the target for agent PROXY.NODE1.
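
For completeness: each schedule also needs an association to the proxy node 
that runs it, and what the scheduler ends up driving on the client is 
essentially an incremental with those same options.  Roughly (again using 
the names from above):

define association DOMAIN PROXY.NODE1.HOUR01 PROXY.NODE1

dsmc incremental "\\rams\som\TSM\FC\dir1\*" -subdir=yes -asnodename=DATANODE

The other hours and directories are just more schedules cut from the same 
pattern.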

Currently, we are running up to 144 backups (6 Proxy nodes, 24 hourly backups) 
for our largest devices.
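
With that many moving parts, a periodic sanity check that every agent is 
still tied to the right target is worthwhile, e.g.:

query proxynode

which lists the target/agent pairs the server knows about.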

HTH,
Bob

On Jul 16, 2018, at 8:29 AM, Zoltan Forray <zfor...@vcu.edu> wrote:

Robert,

Thanks for the extensive details.  You back up 5 nodes with more data than
we do for 90 nodes.  So, my question is - what kind of connections do
you have to your NAS/storage device to process that much data in such a
short period of time?

I am not sure what benefit a proxy node would provide for us, other than to
manage multiple nodes from one connection/GUI - or am I totally off base on
this?

Our current configuration is as follows:

7 Windows 2016 VMs (adding more to spread out the load)
Each of these 7 VMs handles the backups for 5-30 nodes.  Each node is a
mountpoint for a user/department ISILON DFS mount -
e.g. \\rams\som\TSM\FC\*, \\rams\som\TSM\UR\*, etc.  FWIW, the reason we are
using VMs is that the connection is actually faster than when we were using
physical servers, since those only had gigabit NICs.
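
Each node is essentially just its own client option file pointing DOMAIN at 
the mountpoint in question - stripped down, something like this (illustrative 
only, details trimmed):

* dsm.opt fragment for one ISILON node
NODENAME         ISILON-SOM-SOMADFS2
TCPSERVERADDRESS <our SP server>
PASSWORDACCESS   GENERATE
DOMAIN           \\rams\som\TSM\FC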

Even when we moved the biggest ISILON node (20,000,000+ files) to a new VM
with only 4 other nodes, it still took 4 days to scan and back up 102 GB out
of 32 TB.  Below are recent end-of-session statistics (the current backup
started Friday and is still running).  Note the gap between the network data
transfer rate and the aggregate rate - nearly all of the elapsed time goes
to scanning the filesystem rather than moving data:

07/09/2018 02:00:06 ANE4952I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects inspected:   20,276,912  (SESSION: 21423)
07/09/2018 02:00:06 ANE4954I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects backed up:       26,787  (SESSION: 21423)
07/09/2018 02:00:06 ANE4958I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects updated:             31  (SESSION: 21423)
07/09/2018 02:00:06 ANE4960I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects rebound:              0  (SESSION: 21423)
07/09/2018 02:00:06 ANE4957I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects deleted:              0  (SESSION: 21423)
07/09/2018 02:00:06 ANE4970I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects expired:         20,630  (SESSION: 21423)
07/09/2018 02:00:06 ANE4959I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects failed:              36  (SESSION: 21423)
07/09/2018 02:00:06 ANE4197I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects encrypted:            0  (SESSION: 21423)
07/09/2018 02:00:06 ANE4965I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of subfile objects:              0  (SESSION: 21423)
07/09/2018 02:00:06 ANE4914I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of objects grew:                 0  (SESSION: 21423)
07/09/2018 02:00:06 ANE4916I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of retries:                    124  (SESSION: 21423)
07/09/2018 02:00:06 ANE4977I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of bytes inspected:          31.75 TB  (SESSION: 21423)
07/09/2018 02:00:06 ANE4961I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total number of bytes transferred:       101.90 GB  (SESSION: 21423)
07/09/2018 02:00:06 ANE4963I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Data transfer time:                      115.78 sec  (SESSION: 21423)
07/09/2018 02:00:06 ANE4966I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Network data transfer rate:          922,800.00 KB/sec  (SESSION: 21423)
07/09/2018 02:00:06 ANE4967I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Aggregate data transfer rate:            271.46 KB/sec  (SESSION: 21423)
07/09/2018 02:00:06 ANE4968I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Objects compressed by:                       30%   (SESSION: 21423)
07/09/2018 02:00:06 ANE4976I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Total data reduction ratio:               99.69%   (SESSION: 21423)
07/09/2018 02:00:06 ANE4969I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Subfile objects reduced by:                   0%   (SESSION: 21423)
07/09/2018 02:00:06 ANE4964I (Session: 21423, Node: ISILON-SOM-SOMADFS2)
Elapsed processing time:              109:19:48  (SESSION: 21423)


On Sun, Jul 15, 2018 at 7:30 PM Robert Talda <r...@cornell.edu> wrote:

Zoltan:
Finally get a chance to answer you.  I :think: I understand what you are
getting at…

First, some numbers - recalling that each of these nodes is one storage
device:
Node1: 358,000,000+ files totaling 430 TB of primary occupied space
Node2: 302,000,000+ files totaling 82 TB of primary occupied space
Node3: 79,000,000+ files totaling 75 TB of primary occupied space
Node4: 1,000,000+ files totaling 75 TB of primary occupied space
Node5: 17,000,000+ files totaling 42 TB of primary occupied space
 There are more, but I think this answers your initial question.

Restore requests are handled by the local system admin or, for lack of a
better description, data admin.  (Basically, the research area has a person
dedicated to all the various data issues related to research grants, from
including proper verbiage in grant requests to making sure the necessary
protections are in place).

 We try to make it as simple as we can, because we do concentrate all the
data in one node per storage device (usually a NAS).  So restores are
usually done directly from the node - while all backups are done through
proxies.  Generally, the restores are done without permissions so that the
appropriate permissions can be applied to the restored data.  (Oftentimes,
the data is restored so a different user or set of users can work with it,
so the original permissions aren’t useful.)
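
A restore of that flavor, run directly against the node rather than through 
a proxy, would look something like this (paths made up for illustration; 
skipntpermissions is the Windows client option that leaves the NTFS security 
info behind so fresh permissions can be applied afterward):

dsmc restore "\\somenas\projects\datasetA\*" "\\somenas\projects\restored\" -subdir=yes -skipntpermissions=yes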

 There are some exceptions - of course, as we work at universities, there
are always exceptions - and these we handle as best we can by providing
proxy nodes with restricted privileges.

 Let me know if I can provide more,
Bob


Robert Talda
EZ-Backup Systems Engineer
Cornell University
+1 607-255-8280
r...@cornell.edu


On Jul 11, 2018, at 3:59 PM, Zoltan Forray <zfor...@vcu.edu> wrote:

Robert,

Thanks for the insight/suggestions.  Your scenario is similar to ours but
on a larger scale when it comes to the amount of data/files to process,
thus the issue (assuming such since you didn't list numbers).  Currently
we
have 91 ISILON nodes totaling 140M objects and 230 TB of data.  The largest
(our troublemaker) has over 21M objects and 26 TB of data (this is the one
that takes 4-5 days).  dsminstr.log from a recently finished run shows it
only backed up 15K objects.

We agree that this and other similarly large nodes need to be broken up
into smaller pieces with fewer objects to back up per node.  But the owner
of this large one is balking, since previously this was backed up via a
solitary Windows server using journal-based backups, so everything finished
in a day.

We have never dealt with proxy nodes but might need to head in that
direction, since our current method of allowing users to perform their own
restores relies on the now-deprecated Web Client.  Our current setup is
numerous Windows VM servers with 20-30 nodes defined to each.

How do you handle restore requests?

On Wed, Jul 11, 2018 at 2:56 PM Robert Talda <r...@cornell.edu> wrote:




--
*Zoltan Forray*
Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
Xymon Monitor Administrator
VMware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
www.ucc.vcu.edu
zfor...@vcu.edu - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will
never use email to request that you reply with your password, social
security number or confidential personal information. For more details
visit http://phishing.vcu.edu/
