On 3/24/2011 10:06 PM, richardtoo...@paradise.net.nz wrote: > Quoting "Steven R. Gerber" <open...@gerber-systems.com>: > >> On 3/24/2011 5:00 PM, richardtoo...@paradise.net.nz wrote: >>> Quoting "Steven R. Gerber" <open...@gerber-systems.com>: >>> >>>> On 3/24/2011 4:33 PM, richardtoo...@paradise.net.nz wrote: >>>>> Quoting "Steven R. Gerber" <open...@gerber-systems.com>: >>>>> >>>>>> On 3/24/2011 2:36 PM, richardtoo...@paradise.net.nz wrote: >>>>>>> Quoting "Steven R. Gerber" <open...@gerber-systems.com>: >>>>>>> >>>>>>>> -------- Original Message -------- >>>>>>>> Subject: Re: rdist times out but will not die >>>>>>>> Date: Thu, 24 Mar 2011 21:49:01 +1300 >>>>>>>> From: Richard Toohey <richardtoo...@paradise.net.nz> >>>>>>>> To: Steven R. Gerber <sger...@gerber-systems.com> >>>>>>>> CC: t...@openbsd.org >>>>>>>> >>>>>>>> On 24/03/2011, at 4:06 PM, Steven R. Gerber wrote: >>>>>>>> >>>>>>>>> On 3/20/2011 2:07 PM, Steven R. Gerber wrote: >>>>>>>>>> I want to do local/remote mirror/backup (or should that be >>>>>>>> local-mirror >>>>>>>>>> / offsite-backup). >>>>>>>>>> So a two-part question: >>>>>>>>>> 1. Even if there is a timeout, shouldn't the job/process exit? 
>>>>>>>>>> >>>>>>>> ************************************************************* >>>>>>>> **************** >>>>>>>> * >>>>>>>>>> rdist@thedump: thedump: /mnt/mirror2/public/read_only/movies: >>>>>> chown >>>>>>>> from >>>>>>>>>> rdist:operator to cdripper:operator >>>>>>>>>> rdist@thedump: thedump: >>>>>>>>>> >>>> /mnt/mirror2/public/read_only/movies/The_Thomas_Crown_Affair_1999: >>>>>>>> chown >>>>>>>>>> from rdist:operator to root:operator >>>>>>>>>> rdist@thedump: >>>>>>>>>> >>>>>>>> /mnt/stripe2/public/read_only/movies/The_Thomas_Crow >>>>>>>> n_Affair_1999/THOMAS_CROW >>>>>>>> N_AFFAIR_16X9.md5: >>>>>>>>>> updating >>>>>>>>>> rdist@thedump: >>>>>>>>>> >>>>>>>> /mnt/stripe2/public/read_only/movies/The_Thomas_Crow >>>>>>>> n_Affair_1999/THOMAS_CROW >>>>>>>> N_AFFAIR_16X9.iso: >>>>>>>>>> installing >>>>>>>>>> rdist@thedump: LOCAL ERROR: Response time out >>>>>>>>>> rdist@thedump: updating of rdist@thedump finished >>>>>>>>>> $ ps -ax|grep rdist >>>>>>>>>> 26025 ?? I 0:00.00 tee /var/log/rdist/2011-03-20 >>>>>>>>>> 11059 ?? I 0:00.01 rdist -f /etc/Distfile >>>>>>>>>> 28446 ?? I 0:22.99 rdist: update rdist@thedump (rdist) >>>>>>>>>> 7795 ?? I 1:10.32 ssh -l rdist thedump r >>>>>>>>>> 13045 p0 S+ 0:00.00 grep rdist >>>>>>>>>> >>>>>>>> ************************************************************* >>>>>>>> **************** >>>>>>>> * >>>>>>>>>> 2. I know that they happen from time to time. How can I >>>>>>>> avoid/prevent >>>>>>>>>> timeouts? The default is 900 sec AKA 15 min? How can this >> happen >>>>>>>>>> between two local machines? >>>>>>>> >>>>>>>> How big is the file? >>>>>>> >>>>>>> So, how big is the file that it times out on? >>>>>>> >>>>>>> More than 2Gb? Guess so if a movie file? >>>>>>> >>>>>>> I might be barking up the wrong tree, but it will take you two >>>> seconds >>>>>> to see if >>>>>>> there's anything in this > 2Gb idea and if I'm wrong, move on. 
>>>>>>> >>>>>>> Regardless of that, yes, put more debugging on - might give you >>>> some >>>>>> more clues. >>>>>>> >>>>>>> OpenBSD helps those who help themselves. >>>>>> Richard, >>>>>> Thanks for the help. >>>>>> I had already read the IBM note 'LOCAL ERROR: response time out' >>>> (from >>>>>> 2006). (Google is not my enemy?) >>>>>> I had already checked: the file is >2GB (4.4GB). >>>>>> I ASSUMED that I can't the only who has tried to push large files >>>> with >>>>>> rdist. I searched the OpenBSD list archives (mine go back to 2006) >>>> and >>>>>> found nothing significant/useful. Maybe I missed something? >>>>>> I immediately moved to the misc list per your suggestion. >>>>>> I did a (manual) run of rdist with "-D" and got similar results -- >> I >>>> am >>>>>> still analyzing those messages. >>>>>> I usually do not compile OpenBSD, so it will take a while to >> review >>>> the >>>>>> rdist source code (client.c?). >>>>> >>>>> Thanks ... never assume anything, eh? 8-) >>>>> >>>>> If your files are > 2Gb, then that IBM link seems to be spot on, >> and >>>> answers >>>>> (maybe) number 2 on your list - why would you get a timeout on a >> local >>>> transfer >>>>> (if hardware related, you'd expect sftp to fail, or there to be >> other >>>> noticeable >>>>> issues)? >>>>> >>>>> I've not used rdist before, but I don't mind having a look now that >> I >>>> know your >>>>> files are > 2Gb. But going to be a quiet (ha!) evening project, so >> no >>>> promises >>>>> (and maybe someone else will blow the theory out of the water and >>>> provide a >>>>> different answer/fix.) >>>>> >>>>> The IBM note suggests that both client & server need to be amended, >> IF >>>> I am on >>>>> the right track. >>>>> >>>>> This is all purely speculative on my part, but it does SEEM to >> match >>>> what you >>>>> are seeing, doesn't it? >>>>> >>>>> Thanks. >>>> [SNIP] >>>> >>>> You are right on it! Thanks! >>>> Not to be greedy, but ... 
>>>> What do you think of the other issue that rdist logs a "finished" >>>> message but does not exit? >>>> >>>> Thanks. >>>> >>>> >>> More guessing (I'm already out on a limb ... the branch is about to >> break) ... >>> "something" is unhappy because of the time out? >>> >>> What messages are in the debug output - do you see "finish() called" >> as per the >>> code in common.c below? What's the rest of the message(s)? >>> >>> What happens if you move all the > 2Gb files out the way temporarily >> and re-run >>> (obviously I don't know how practical this is)? Does it finish >> normally? >>> >>> Or if that doesn't suit, how about creating a test directory with 20 >> (<2 Gb >>> each) files in, run it, then drop a big file (>2 Gb) in, re-run. If it >> fails, >>> then I'd say I was on to something (I don't know anything about rdist, >> so I do >>> not know how to set up this test environment.) Remove the big file, or >> truncate >>> it down to < 2Gb and re-run. If that works, I get a cookie. >>> >>> common.c >>> >>> 154 void >>> 155 finish(void) >>> 156 { >>> 157 extern jmp_buf finish_jmpbuf; >>> 158 >>> 159 debugmsg(DM_CALL, >>> 160 "finish() called: do_fork = %d amchild = %d isserver = %d", >>> 161 do_fork, amchild, isserver); >>> 162 cleanup(0); >>> 163 >>> 164 /* >>> 165 * There's no valid finish_jmpbuf for the rdist master parent. >>> 166 */ >>> 167 if (!do_fork || amchild || isserver) { >>> 168 >>> 169 if (!setjmp_ok) { >>> 170 #ifdef DEBUG_SETJMP >>> 171 error("attemping longjmp() without target"); >>> 172 abort(); >>> 173 #else >>> 174 exit(1); >>> 175 #endif >>> 176 } >>> 177 >>> 178 longjmp(finish_jmpbuf, 1); >>> 179 /*NOTREACHED*/ >>> 180 error("Unexpected failure of longjmp() in finish()"); >>> 181 exit(2); >>> 182 } else >>> 183 exit(1); >>> 184 } >>> >>> Thanks. >>> >>> >>> >> >> I am getting the "finished() called" etc. 
>> I now have a theory (your "something" unhappy guess): rdist times out,
>> but the child process does not and is still trying to get the
>> end-of-file. The child is basically in an infinite loop: it does not
>> time out because the dump does respond but it keeps retrieving from the
>> first part of file -- it never reaches past the miscalculated size.
>>
>>
>
> My diffs will no doubt get mangled by my webmail and I don't know enough
> about rdist (or the rdist protocol) to know if these are correct.
>
> Hopefully they are a step in the right direction.
>
> Basic idea from https://www-304.ibm.com/support/docview.wss?uid=isg1IY85396
>
> (I was going to look at FreeBSD's version for inspiration but looks like
> they ditched rdist in 2003.)
>
> Basically strtol to strtoll, %ld to %lld, and (int)/(long) to (off_t) to
> cope with files bigger than > 2Gb.
>
> Works for me on i386 - without these patches I see the reported
> behaviour, with the patches I see the 4Gb file transferred.
>
> With patches - it works:
>
> $ cat rdist.conf
> HOSTS = (172.16.1.111)
> FILES = (/home/richard.toohey/rdist-test)
> ${FILES} -> ${HOSTS}
>
> $ rdist -f rdist.conf
> 172.16.1.111: updating host 172.16.1.111
> richard.toohey@172.16.1.111's password:
> 172.16.1.111: /home/richard.toohey/rdist-test/zerofile.tst: installing
> 172.16.1.111: updating of 172.16.1.111 finished
>
> zerofile.tst created with:
>
> dd if=/dev/zero of=zerofile.tst bs=1k count=4700000
>
> HTH.
> > /usr/src/usr.bin/rdist/client.c > =============================== > > # diff -uw /home/richard.toohey/obsd-src/usr.bin/rdist/client.c client.c > --- /home/richard.toohey/obsd-src/usr.bin/rdist/client.c Thu Oct 29 > 17:34:06 2009 > +++ client.c Fri Mar 25 14:54:32 2011 > @@ -399,8 +399,8 @@ > */ > ENCODE(ername, rname); > > - (void) sendcmd(C_RECVREG, "%o %04o %ld %ld %ld %s %s %s", > - opts, stb->st_mode & 07777, (long) stb->st_size, > + (void) sendcmd(C_RECVREG, "%o %04o %lld %ld %ld %s %s %s", > + opts, stb->st_mode & 07777, (off_t) stb->st_size, > stb->st_mtime, stb->st_atime, > user, group, ername); > if (response() < 0) { > @@ -409,8 +409,8 @@ > } > > > - debugmsg(DM_MISC, "Send file '%s' %ld bytes\n", rname, > - (long) stb->st_size); > + debugmsg(DM_MISC, "Send file '%s' %lld bytes\n", rname, > + (off_t) stb->st_size); > > /* > * Set remote time out alarm handler. > @@ -666,8 +666,8 @@ > * Gather and send basic link info > */ > ENCODE(ername, rname); > - (void) sendcmd(C_RECVSYMLINK, "%o %04o %ld %ld %ld %s %s %s", > - opts, stb->st_mode & 07777, (long) stb->st_size, > + (void) sendcmd(C_RECVSYMLINK, "%o %04o %lld %ld %ld %s %s %s", > + opts, stb->st_mode & 07777, (off_t) stb->st_size, > stb->st_mtime, stb->st_atime, > user, group, ername); > if (response() < 0) > @@ -682,7 +682,7 @@ > error("%s: readlink failed", target); > err(); > } > - (void) snprintf(tbuf, sizeof(tbuf), "%.*s", (int) stb->st_size, lbuf); > + (void) snprintf(tbuf, sizeof(tbuf), "%.*s", (off_t) stb->st_size, > lbuf); > ENCODE(ername, tbuf); > (void) sendcmd(C_NONE, "%s\n", ername); > > @@ -869,7 +869,7 @@ > /* > * Parse size > */ > - size = (off_t) strtol(cp, (char **)&cp, 10); > + size = (off_t) strtoll(cp, (char **)&cp, 10); > if (*cp++ != ' ') { > error("update: size not delimited"); > return(US_NOTHING); > @@ -921,8 +921,8 @@ > > debugmsg(DM_MISC, "update(%s,) local mode %04o remote mode %04o\n", > rname, lmode, rmode); > - debugmsg(DM_MISC, "update(%s,) size %ld mtime %d owner 
'%s' grp > '%s'\n", > - rname, (long) size, mtime, owner, group); > + debugmsg(DM_MISC, "update(%s,) size %lld mtime %d owner '%s' grp > '%s'\n", > + rname, (off_t) size, mtime, owner, group); > > if (statp->st_mtime != mtime) { > if (statp->st_mtime < mtime && IS_ON(opts, DO_YOUNGER)) { > @@ -935,8 +935,8 @@ > } > > if (statp->st_size != size) { > - debugmsg(DM_MISC, "size does not match (%ld != %ld).\n", > - (long) statp->st_size, (long) size); > + debugmsg(DM_MISC, "size does not match (%lld != %lld).\n", > + (off_t) statp->st_size, (off_t) size); > return(US_OUTDATE); > } > > /usr/src/usr.bin/rdistd/server.c > ================================ > # diff -uw /home/richard.toohey/obsd-src/usr.bin/rdistd/server.c server.c > --- /home/richard.toohey/obsd-src/usr.bin/rdistd/server.c Thu Oct 29 > 17:34:06 2009 > +++ server.c Fri Mar 25 14:49:18 2011 > @@ -391,7 +391,7 @@ > #else > /* > * We use MT_NOTICE instead of MT_CHANGE because this function is > - * sometimes called by other functions that are suppose to return a > + * sometimes called by other functions that are supposed to return a > * single ack() back to the client (rdist). This is a kludge until > * the Rdist protocol is re-done. Sigh. 
> */ > @@ -656,8 +656,8 @@ > case S_IFIFO: > #endif > #endif > - (void) sendcmd(QC_YES, "%ld %ld %o %s %s", > - (long) stb.st_size, stb.st_mtime, > + (void) sendcmd(QC_YES, "%lld %ld %o %s %s", > + (off_t) stb.st_size, stb.st_mtime, > stb.st_mode & 07777, > getusername(stb.st_uid, target, options), > getgroupname(stb.st_gid, target, options)); > @@ -1420,7 +1420,7 @@ > /* > * Get file size > */ > - size = strtol(cp, &cp, 10); > + size = strtoll(cp, &cp, 10); > if (*cp++ != ' ') { > error("recvit: size not delimited"); > return; > @@ -1523,7 +1523,7 @@ > */ > if (min_freespace || min_freefiles) { > /* Convert file size to kilobytes */ > - long fsize = (long) (size / 1024); > + off_t fsize = (off_t) (size / 1024); > > if (getfilesysinfo(target, &freespace, &freefiles) != 0) > return; > > Thanks. > > >
Wow!
I had not seen your message and started editing client.c ...
Your changes are about the same as mine, but ...

Why cast size, statp->st_size, etc. to (off_t) when that is their
defined type? Style?

Is the comparison at line 689 a problem because 'n' is an int?

	if (n != stb->st_size) {

Thanks.
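
To make the two questions above concrete, here is a small sketch. It is not rdist code: parse_size(), format_size() and length_matches() are hypothetical helpers, and it assumes off_t is a 64-bit signed type. A cast to (off_t) on something that is already an off_t changes nothing; what a printf-style call with "%lld" needs is a long long argument, so (long long) is the cast that says so explicitly. By the same reasoning, the "%.*s" precision is taken as an int, so that argument probably wants to stay (int). In a mixed comparison the int side is converted to the wider signed type, so the comparison itself does not truncate anything.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of rdist): parse a decimal size that is
 * followed by a single space. Uses strtoll() so that values above
 * LONG_MAX on a 32-bit long still parse. Returns 0 on success.
 */
static int
parse_size(const char *text, off_t *out)
{
	char *end;
	long long v;

	assert(text != NULL && out != NULL);
	errno = 0;
	v = strtoll(text, &end, 10);
	if (end == text || *end != ' ' || errno == ERANGE || v < 0)
		return (-1);
	*out = (off_t)v;
	if ((long long)*out != v)	/* off_t narrower than long long */
		return (-1);
	return (0);
}

/*
 * Hypothetical helper: format a size as text. "%lld" expects a
 * long long, so the cast names that type instead of relying on
 * off_t happening to be long long.
 */
static int
format_size(char *buf, size_t len, off_t size)
{
	return (snprintf(buf, len, "%lld", (long long)size));
}

/*
 * Mixed comparison: the int operand is converted to the wider signed
 * type before the comparison (assuming off_t is at least as wide as
 * int), so nothing is truncated here.
 */
static int
length_matches(int n, off_t size)
{
	return (n == size);
}
```

Calling parse_size("4812800000 ", &size) covers the dd test file above (4700000 blocks of 1k). Whether 'n' is wide enough for the value it receives is a separate question from the comparison, but for a symlink target length an int should be plenty.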