On 3/24/2011 5:00 PM, richardtoo...@paradise.net.nz wrote:
> Quoting "Steven R. Gerber" <open...@gerber-systems.com>:
>
>> On 3/24/2011 4:33 PM, richardtoo...@paradise.net.nz wrote:
>>> Quoting "Steven R. Gerber" <open...@gerber-systems.com>:
>>>
>>>> On 3/24/2011 2:36 PM, richardtoo...@paradise.net.nz wrote:
>>>>> Quoting "Steven R. Gerber" <open...@gerber-systems.com>:
>>>>>
>>>>>> -------- Original Message --------
>>>>>> Subject: Re: rdist times out but will not die
>>>>>> Date: Thu, 24 Mar 2011 21:49:01 +1300
>>>>>> From: Richard Toohey <richardtoo...@paradise.net.nz>
>>>>>> To: Steven R. Gerber <sger...@gerber-systems.com>
>>>>>> CC: t...@openbsd.org
>>>>>>
>>>>>> On 24/03/2011, at 4:06 PM, Steven R. Gerber wrote:
>>>>>>
>>>>>>> On 3/20/2011 2:07 PM, Steven R. Gerber wrote:
>>>>>>>> I want to do local/remote mirror/backup (or should that be
>>>>>>>> local-mirror / offsite-backup).
>>>>>>>> So a two-part question:
>>>>>>>> 1. Even if there is a timeout, shouldn't the job/process exit?
>>>>>>>>
>>>>>>>> ******************************************************************************
>>>>>>>> rdist@thedump: thedump: /mnt/mirror2/public/read_only/movies: chown from rdist:operator to cdripper:operator
>>>>>>>> rdist@thedump: thedump: /mnt/mirror2/public/read_only/movies/The_Thomas_Crown_Affair_1999: chown from rdist:operator to root:operator
>>>>>>>> rdist@thedump: /mnt/stripe2/public/read_only/movies/The_Thomas_Crown_Affair_1999/THOMAS_CROWN_AFFAIR_16X9.md5: updating
>>>>>>>> rdist@thedump: /mnt/stripe2/public/read_only/movies/The_Thomas_Crown_Affair_1999/THOMAS_CROWN_AFFAIR_16X9.iso: installing
>>>>>>>> rdist@thedump: LOCAL ERROR: Response time out
>>>>>>>> rdist@thedump: updating of rdist@thedump finished
>>>>>>>> $ ps -ax|grep rdist
>>>>>>>> 26025 ?? I 0:00.00 tee /var/log/rdist/2011-03-20
>>>>>>>> 11059 ?? I 0:00.01 rdist -f /etc/Distfile
>>>>>>>> 28446 ?? I 0:22.99 rdist: update rdist@thedump (rdist)
>>>>>>>> 7795 ?? I 1:10.32 ssh -l rdist thedump r
>>>>>>>> 13045 p0 S+ 0:00.00 grep rdist
>>>>>>>>
>>>>>>>> ******************************************************************************
>>>>>>>> 2. I know that they happen from time to time. How can I
>>>>>>>> avoid/prevent timeouts? The default is 900 sec AKA 15 min?
>>>>>>>> How can this happen between two local machines?
>>>>>>
>>>>>> How big is the file?
>>>>>
>>>>> So, how big is the file that it times out on?
>>>>>
>>>>> More than 2Gb? Guess so if a movie file?
>>>>>
>>>>> I might be barking up the wrong tree, but it will take you two
>>>>> seconds to see if there's anything in this > 2Gb idea and if I'm
>>>>> wrong, move on.
>>>>>
>>>>> Regardless of that, yes, put more debugging on - might give you
>>>>> some more clues.
>>>>>
>>>>> OpenBSD helps those who help themselves.
>>>> Richard,
>>>> Thanks for the help.
>>>> I had already read the IBM note 'LOCAL ERROR: response time out'
>>>> (from 2006). (Google is not my enemy?)
>>>> I had already checked: the file is >2GB (4.4GB).
>>>> I ASSUMED that I can't be the only one who has tried to push large
>>>> files with rdist. I searched the OpenBSD list archives (mine go back
>>>> to 2006) and found nothing significant/useful. Maybe I missed
>>>> something?
>>>> I immediately moved to the misc list per your suggestion.
>>>> I did a (manual) run of rdist with "-D" and got similar results -- I
>>>> am still analyzing those messages.
>>>> I usually do not compile OpenBSD, so it will take a while to review
>>>> the rdist source code (client.c?).
>>>
>>> Thanks ... never assume anything, eh? 8-)
>>>
>>> If your files are > 2Gb, then that IBM link seems to be spot on, and
>>> answers (maybe) number 2 on your list - why would you get a timeout on
>>> a local transfer (if hardware related, you'd expect sftp to fail, or
>>> there to be other noticeable issues)?
>>>
>>> I've not used rdist before, but I don't mind having a look now that I
>>> know your files are > 2Gb. But going to be a quiet (ha!) evening
>>> project, so no promises (and maybe someone else will blow the theory
>>> out of the water and provide a different answer/fix.)
>>>
>>> The IBM note suggests that both client & server need to be amended, IF
>>> I am on the right track.
>>>
>>> This is all purely speculative on my part, but it does SEEM to match
>>> what you are seeing, doesn't it?
>>>
>>> Thanks.
>> [SNIP]
>>
>> You are right on it! Thanks!
>> Not to be greedy, but ...
>> What do you think of the other issue that rdist logs a "finished"
>> message but does not exit?
>>
>> Thanks.
>>
>>
> More guessing (I'm already out on a limb ... the branch is about to
> break) ... "something" is unhappy because of the time out?
>
> What messages are in the debug output - do you see "finish() called" as
> per the code in common.c below? What's the rest of the message(s)?
>
> What happens if you move all the > 2Gb files out the way temporarily and
> re-run (obviously I don't know how practical this is)? Does it finish
> normally?
>
> Or if that doesn't suit, how about creating a test directory with 20
> (<2 Gb each) files in, run it, then drop a big file (>2 Gb) in, re-run.
> If it fails, then I'd say I was on to something (I don't know anything
> about rdist, so I do not know how to set up this test environment.)
> Remove the big file, or truncate it down to < 2Gb and re-run. If that
> works, I get a cookie.
>
> common.c
>
> void
> finish(void)
> {
> 	extern jmp_buf finish_jmpbuf;
>
> 	debugmsg(DM_CALL,
> 		 "finish() called: do_fork = %d amchild = %d isserver = %d",
> 		 do_fork, amchild, isserver);
> 	cleanup(0);
>
> 	/*
> 	 * There's no valid finish_jmpbuf for the rdist master parent.
> 	 */
> 	if (!do_fork || amchild || isserver) {
>
> 		if (!setjmp_ok) {
> #ifdef DEBUG_SETJMP
> 			error("attemping longjmp() without target");
> 			abort();
> #else
> 			exit(1);
> #endif
> 		}
>
> 		longjmp(finish_jmpbuf, 1);
> 		/*NOTREACHED*/
> 		error("Unexpected failure of longjmp() in finish()");
> 		exit(2);
> 	} else
> 		exit(1);
> }
>
> Thanks.
>
I am getting the "finish() called" message, etc. I now have a theory (building on your "something is unhappy" guess): rdist times out, but the child process does not, and it is still trying to reach the end-of-file. The child is basically in an infinite loop: it does not time out because thedump does respond, but it keeps retrieving from the first part of the file -- it never gets past the miscalculated size.
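To show what I mean by a "miscalculated size", here is a minimal sketch of the general failure mode. I have not confirmed that rdist does this anywhere. The helper name size_as_32bit() is made up for illustration, and 4400000000 bytes is only my assumption of what "4.4GB" is in bytes. The point is just that a size above 4GB does not fit in a 32-bit field, so only the low 32 bits survive:

```c
#include <stdint.h>

/*
 * Hypothetical helper (NOT from the rdist source): returns the value a
 * 32-bit size field would hold for a file of real_size bytes.
 * Converting to uint32_t is well defined in C: it keeps only the low
 * 32 bits, i.e. real_size modulo 2^32.
 */
static uint32_t
size_as_32bit(int64_t real_size)
{
	return (uint32_t)real_size;
}
```

With that assumed size, size_as_32bit(4400000000LL) comes out as 105032704, so one side would believe the 4.4GB file is only about 100MB long. Sizes below 2^32 pass through unchanged, which would fit with small files transferring normally.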