Re: Does Whimsy need to have a copy of Bills?

Greg Stein Mon, 25 Nov 2019 16:37:18 -0800

One quick answer from earlier in the thread:

> > > > What's most concerning is not just the elapsed time, but that this
> > > > likely means that the call is expensive on the server, which may
> > > > impact other users.
> > > >
> > > >
> > > I see this as premature optimisation; we don't know whether svn list is
> > > more expensive than svn update overall.
> > > There may be other reasons why list is slower. Nor do we know if the
> > > request will impact other users.

"svn list" must generate the listing of 12k+ files, recursively. That takes
some time to process and deliver over the network. I believe it is likely a
PROPFIND which introduces some overheads on both ends (XML construction on
server, parsing on client; sheer network size, too).

"svn up" generates a diff report. "get these 3 files", and that's easily
extracted from the difference between revision-working-copy and
revision-server (plus some other concerns).

So yes: an update is *way* faster, all around.

On Mon, Nov 25, 2019 at 5:15 PM Sam Ruby <ru...@intertwingly.net> wrote:

> Actually copy Greg this time.
>
> ---------- Forwarded message ---------
> From: Sam Ruby <ru...@intertwingly.net>
> Date: Mon, Nov 25, 2019, 4:15 PM
> Subject: Re: Does Whimsy need to have a copy of Bills?
> To: Whimsy dev <dev@whimsical.apache.org>
>
>
> adding Greg to email.
>
> Recap: a change is being proposed whereby whimsy will do the
> equivalent of the following command after every icla is processed:
>
> svn ls https://svn.apache.org/repos/private/documents/iclas --depth infinity
>

Do you really need to use depth=infinity? The directory name is likely
sufficient information.

With depth=immediates (the default for svn ls), the listing takes just a few
seconds.

> Currently this appears to take around forty to sixty elapsed seconds to
> process.
>
> Questions for Greg:
> 1) Does this proposed workload present an unreasonable load on the svn
> server?
>

This should be fine as long as you don't put a "-v" switch in there.
That'll take 10-15 minutes as it reconstructs all the files on the server
and measures their size.

The listing is just a single thread on the server, a fetch of the directory
names, and then assembly/delivery of that result. There really shouldn't be
any contention with other users, or heavy use of the CPU.

> 2) Are there any faster alternatives which get us a list of names but no data?
>

So I experimented with a hack. I did a "full checkout" of the iclas
directory, but stopped it after a single file was checked out. This left a
partial checkout. Subversion will tell the server "I have $these. what am I
missing?" when you run "svn status -u". You'll get a listing of the 12k
missing files. Takes about 7 seconds or so.

Specifically:
$ # kill after reading/printing one line (the first file checked out)
$ svn co https://svn.apache.org/repos/private/documents/iclas |
    python -c 'import signal,os,sys ; print(sys.stdin.readline()) ; os.killpg(os.getpgrp(), signal.SIGHUP)'
A    iclas/jiwei-guo

svn: E200015: Caught signal
svn: E200042: Additional errors:
svn: E200015: Caught signal
Hangup
$ # get a recursive listing via status
$ time svn st -u iclas | wc -l
12818

real    0m6.782s
user    0m3.461s
sys     0m2.358s

The status output should be easy to parse (it is designed as a fixed-width
set of codes, then filename).
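
If you would rather not depend on the exact column widths, a rough sketch in Python is below. Note that it swaps in the XML form of the same report ("svn status -u --xml") rather than parsing the fixed-width text. The function name listed_paths, the sample document, and the second entry name in it are made up for illustration; the sample only approximates what svn emits.

```python
import xml.etree.ElementTree as ET


def listed_paths(status_xml):
    """Return the path of every <entry> in `svn status -u --xml` output.

    Accepts the raw bytes (or text) captured from the svn command.
    """
    root = ET.fromstring(status_xml)
    # Each reported item is an <entry path="..."> element under <target>.
    return [entry.get("path") for entry in root.iter("entry")]


# Hand-written sample approximating the XML report (an assumption, not
# captured output). "iclas/example-person" is a hypothetical name.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<status>
<target path="iclas">
<entry path="iclas/jiwei-guo">
<wc-status props="none" item="none"></wc-status>
<repos-status props="none" item="added"></repos-status>
</entry>
<entry path="iclas/example-person">
<wc-status props="none" item="none"></wc-status>
<repos-status props="none" item="added"></repos-status>
</entry>
<against revision="12345"/>
</target>
</status>
"""

if __name__ == "__main__":
    for path in listed_paths(SAMPLE):
        print(path)
```

The XML form costs a little more parsing, but it sidesteps questions about file names that contain spaces.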

Even if you do a full/normal checkout, note that "svn status -u" may be
useful. Depending on whether you need the content, or just the names, you
may want to migrate to the status-based approach.

Oh! Just realized a better way, which avoids the hack/partial checkout: just
check out the "iclas" directory at the revision in which it was created. It
is an empty directory in that revision (a sibling directory received a bunch
of Member applications, but those won't be in this checkout).

$ svn -r 9696 co https://svn.apache.org/repos/private/documents/iclas
Checked out revision 9696.

The "svn status -u" works the same against the above (empty) working copy,
also at about 8 seconds.

So. In summary, use "svn status -u" against a HEAD checkout, or against an
r9696 checkout for those who don't want the gigabytes of ICLA forms.
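
A minimal Python sketch of that summary, for anyone wiring it into a script. The helper names (checkout_cmd, status_cmd, list_names) and the injectable "run" parameter are my own inventions for illustration; only the svn commands themselves come from the steps above.

```python
import subprocess

ICLAS_URL = "https://svn.apache.org/repos/private/documents/iclas"


def checkout_cmd(url, dest, revision=None):
    """Build the one-time checkout command; revision=None means HEAD."""
    cmd = ["svn", "checkout"]
    if revision is not None:
        cmd += ["-r", str(revision)]
    return cmd + [url, dest]


def status_cmd(working_copy):
    """Build the command that asks the server what the working copy lacks."""
    return ["svn", "status", "-u", working_copy]


def list_names(working_copy, run=subprocess.check_output):
    """Run `svn status -u` and return its raw output lines.

    The lines still carry the status codes (and svn's closing summary
    line); stripping those is left to the caller. `run` is injectable
    so the function can be exercised without a repository.
    """
    out = run(status_cmd(working_copy))
    return out.decode("utf-8", "replace").splitlines()


if __name__ == "__main__":
    # One-time setup: the empty r9696 checkout described above.
    print(" ".join(checkout_cmd(ICLAS_URL, "iclas", 9696)))
    print(" ".join(status_cmd("iclas")))
```

Passing revision=None gives the full HEAD checkout instead, for those who do want the content.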

A similar technique can be used for any of the other Whimsy data
directories, of course. To find when a particular directory was created:

$ # run an "svn log" in reverse order, and limit/stop at the first log entry.
$ svn log --stop-on-copy --limit 1 -r0:HEAD \
    https://svn.apache.org/repos/private/documents/iclas
------------------------------------------------------------------------
r9696 | jim | 2006-11-17 10:35:28 -0600 (Fri, 17 Nov 2006) | 3 lines

Start loading of scanned docs. Start with creating
the dirs and upload the member apps

------------------------------------------------------------------------

Note that sometimes a directory is created with content in that same
revision. The "iclas" directory just happened to be created empty. But I
imagine most directories will be much smaller at their creation than they are
today (iow, don't expect them to always be empty at creation).
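
If you want to script the lookup, here is a small Python sketch that pulls the revision number out of the "svn log" output shown above. The function name first_revision is hypothetical, and it assumes the default (non-XML) log format with an "rNNNN | author | date | N lines" header line.

```python
import re

# Matches the header line of a log entry, e.g. "r9696 | jim | 2006-11-17 ...".
LOG_HEADER = re.compile(r"^r(\d+) \| ", re.MULTILINE)


def first_revision(log_output):
    """Return the revision number of the first entry in `svn log` output."""
    match = LOG_HEADER.search(log_output)
    if match is None:
        raise ValueError("no revision header found in svn log output")
    return int(match.group(1))


# The log entry quoted above, used as sample input.
SAMPLE_LOG = """\
------------------------------------------------------------------------
r9696 | jim | 2006-11-17 10:35:28 -0600 (Fri, 17 Nov 2006) | 3 lines

Start loading of scanned docs. Start with creating
the dirs and upload the member apps

------------------------------------------------------------------------
"""

if __name__ == "__main__":
    print(first_revision(SAMPLE_LOG))
```

The result can then be passed to "svn -r <rev> co" for the directory in question.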

Hope that helps,
-g

> - Sam Ruby
>
> On Mon, Nov 25, 2019 at 3:44 PM sebb <seb...@gmail.com> wrote:
> >
> > On Mon, 25 Nov 2019 at 19:48, Sam Ruby <ru...@intertwingly.net> wrote:
> >
> > > On Mon, Nov 25, 2019 at 2:29 PM sebb <seb...@gmail.com> wrote:
> > > >
> > > > On Mon, 25 Nov 2019 at 17:59, Sam Ruby <ru...@intertwingly.net> wrote:
> > > >
> > > > > On Sun, Nov 24, 2019 at 1:17 PM sebb <seb...@gmail.com> wrote:
> > > > > >
> > > > > > I was thinking of using svn ls to create a listing file which would be
> > > > > > cached locally.
> > > > >
> > > > > Unfortunately, some observations (numbers below are approximate):
> > > > >
> > > > > svn up on a populated iclas directory: one second
> > > > >
> > > > > svn ls on iclas: two seconds, but only returns depth one
> > > > >
> > > > > svn ls on iclas --depth infinity: one minute
> > > > >
> > > > > What's most concerning is not just the elapsed time, but that this
> > > > > likely means that the call is expensive on the server, which may
> > > > > impact other users.
> > > > >
> > > > >
> > > > I see this as premature optimisation; we don't know whether svn list is
> > > > more expensive than svn update overall.
> > > > There may be other reasons why list is slower. Nor do we know if the
> > > > request will impact other users.
> > > >
> > > > Besides, if the code checks SVN info first, it will only need to fetch the
> > > > updated listing when there has been a change.
> > > > Those directories are not busy.
> > > >
> > > > Furthermore, every time a new test installation is set up, there is
> > > > definitely a large load on the server and network.
> > > > This network load in particular must be orders of magnitude greater than
> > > > for a listing.
> > >
> > > Sorry for not being clear.  I would be very concerned if whimsy-vm4
> > > were invoking svn ls --depth infinity every 10 minutes as the current
> > > cron job does.
> >
> >
> > That would not be the case.
> >
> > The job would use 'svn info' on the remote repo and only fetch the listing
> > if necessary.
> >
> > For the repos in question, changes are rare.
> >
> >
> > >   Before any such change is deployed, it would be wise
> > > for us to check both with the infrastructure team and the subversion
> > > team (Greg likely can help with both).
> > >
> > > I'm less concerned about the overhead on development machines, and
> > > there I suspect that most users would be happy with a svn checkout
> > > --depth empty.
> > >
> > >
> > This would not allow testing of the functions that need to know the list of
> > file names.
> >
> >
> > > - Sam Ruby
> > >
>
