Canary! I like it!
Richard

-----Original Message-----
From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Skylar 
Thompson
Sent: Thursday, July 19, 2018 10:37 AM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] Looking for suggestions to deal with large backups not 
completing in 24-hours

There are a couple of ways we've gotten around this problem:

1. For NFS backups, we don't let TSM do partial incremental backups, even if we 
have the filesystem split up. Instead, we mount sub-directories of the 
filesystem root on our proxy nodes. This has two advantages: it breaks the 
filesystem into multiple TSM filespaces (giving us directory-level 
backup-status reporting, and parallelism in TSM when we have 
COLLOCG=FILESPACE), and it also gives us parallelism at the NFS level when 
there are multiple NFS targets we can talk to (as is the case with Isilon).
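
As a rough sketch, the proxy setup looks something like this (the export and 
path names here are invented, not our real layout):

    # separate NFS mounts for sub-directories of the filesystem root
    mount -t nfs isilon1:/ifs/projects/lab-a /backup/lab-a
    mount -t nfs isilon2:/ifs/projects/lab-b /backup/lab-b

    * dsm.sys stanza on the proxy node - each mount point shows up
    * as its own TSM filespace
    DOMain              /backup/lab-a /backup/lab-b
    RESOURceutilization 10

Pointing different mounts at different NFS targets is what spreads the load 
on the filer side.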

2. For GPFS backups, in some cases we can set up independent filesets and let 
mmbackup process each as a separate filesystem, though we have some instances 
where the end users want an entire GPFS filesystem to have one inode space so 
they can do atomic moves as renames. In either case, though, mmbackup does its 
own "incremental" backups with filelists passed to "dsmc selective", which 
don't update the last-backup time on the TSM filespace. Our workaround has been 
to run mmbackup via a preschedule command, and to have the actual TSM 
incremental backup be of an empty directory (I call them canary directories in 
our documentation) that's set as a virtual mountpoint. dsmc will only run the 
backup portion of its scheduled task if the preschedule command succeeds, so if 
mmbackup fails, the canary never gets backed up, which raises an alert.
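
To make that concrete, the client options look roughly like this (the 
filesystem and canary paths are examples, not our literal config):

    * dsm.sys, server stanza on the GPFS client
    * mmbackup runs first; dsmc only backs up the canary if it succeeds
    PRESchedulecmd    "/usr/lpp/mmfs/bin/mmbackup /gpfs/fs1 -t incremental"
    VIRTUALMountpoint /gpfs/fs1/.tsm-canary
    DOMain            /gpfs/fs1/.tsm-canary

On the server side, a query along these lines then catches canaries whose 
last-backup date has gone stale:

    select node_name, filespace_name, backup_end from filespaces where days(current_timestamp)-days(backup_end) > 1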

On Wed, Jul 18, 2018 at 03:07:16PM +0200, Lars Henningsen wrote:
> @All
> 
> possibly the biggest issue when backing up massive file systems in parallel 
> with multiple dsmc processes is expiration. Once you back up a directory with 
> "subdir no", a no-longer-existing directory object on that level is 
> expired properly and becomes inactive. However, everything underneath it 
> remains active and doesn't expire (ever) unless you run a "full" 
> incremental on the level above (with "subdir yes") - and that kind of 
> defeats the purpose of parallelisation. Other pitfalls include avoiding 
> swapping, keeping log files consistent (dsmc doesn't do thread awareness 
> when logging - it assumes being alone), handling the local dedup cache, 
> updating backup timestamps for a file space on the server, distributing load 
> evenly across multiple nodes on a scale-out filer, backing up from snapshots, 
> chunking file systems up into even parts automatically so you don't end up 
> with lots of small jobs and one big one, dynamically distributing load across 
> multiple "proxies" if one isn't enough, handling exceptions, handling 
> directories with characters you can't pass to dsmc via the command line, 
> consolidating results in a single, comprehensible overview similar to the 
> summary of a regular incremental, being able to do it all in reverse for a 
> massively parallel restore... the list is quite long.
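> 
> To illustrate the expiration problem with an invented example: say the 
> parallel scheme runs
> 
>     dsmc incremental /fs/ -subdir=no
>     dsmc incremental /fs/a/ -subdir=yes
>     dsmc incremental /fs/b/ -subdir=yes
> 
> If /fs/a is deleted later on, the first job expires the "a" directory 
> object itself, but no job ever descends into /fs/a again, so everything 
> below it stays active until someone runs a full "dsmc incremental /fs/ 
> -subdir=yes" from the top.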
> 
> We developed MAGS (as mentioned by Del) to cope with all that - and more. I 
> can only recommend trying it out for free.
> 
> Regards
> 
> Lars Henningsen
> General Storage

--
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
