Philippe Vaucher wrote:
> > > Is there a limit and/or maintenance going on? Was I put in some
> > > sort of throttle-list?
> >
> > We do have rate limits in place. Because otherwise the general
> > background radiation activity of the Internet will break things.
> >
> > However nothing has changed in regard to the rate limits for a
> > long time. As I look at the logs the last rate limit change was
> > Sat Dec 7 06:31:50 2019 -0500, which seems long enough ago that it
> > isn't anything recent. Meaning that this is probably simply you
> > competing with other users on the Internet for resources.
>
> Until recently I only did up to ~8 concurrent git clones, but
> recently with infrastructure changes I'm able to do much more.
For a high level of parallelism I would definitely try to fan out
using local resources. It would be much more reliable. In electrical
design the fanout and load of a circuit is a design consideration.
One doesn't want to place too much of a load on a remote source. And
a source knows how much fanout it can drive. This is a similar
problem to software mirroring. I would limit the amount of load
placed on the remote end and then keep all of the fanout to local
resources. Then you get to see the effects of both ends of things
locally and have better visibility. Also, since that is on the LAN
side, the network performance is unlikely to bottleneck. I will
resist telling some anecdotal stories of various experiences here and
some managerial problems in this area. :-)

> > Cloning with the https transport uses git-http-backend for the
> > backend. We are using Nginx rate limiting. You can read about how
> > the algorithm works here. It is basically a smoothing process.
>
> Until recently I was cloning git://git.sv.gnu.org/emacs.git, the
> switch to https is an attempt at working around the limitation I
> hit recently. My train of thought was that http:// is easier to
> scale than git://, if you say otherwise I can revert back to git://
> clones.

Well... It's not simple! And also things change with time and server
resources and configuration. So any answer I were to state today
would mutate into a wrong answer at some different point in time.

I understand your reasoning. And if this were an infrastructure at,
say, Amazon AWS EC2, set up to be elastic and simply scaled out, then
your reasoning would be totally on target. Increased load would scale
out to more parallel resources. It is more typical to have a load
balancer in front of http/https protocol servers, so those can be set
up rather straightforwardly. Load balancers could be set up in front
of git:// protocol servers too.

However the GNU Project is dedicated and directed to using only Free
Software. And is hosted in this goal by the FSF. Which means
everything is self-hosted. Funded by the annual fund raising from
donors, just like all of us who donate. And resources are limited.
Note that Github is not Free Software. Therefore we cannot endorse
its use. Same thing with Amazon AWS. Although regardless of this we
know that many users do use them anyway.

Let's talk about the technical details of the differences between git
protocol servers and http/https protocol servers. The abbreviated
version is that git-daemon is running as the 'nobody' user like this:

  git daemon --init-timeout=10 --timeout=28800 --max-connections=15 \
    --export-all --base-path=/srv/git --detach

In this configuration git-daemon acts as the supervisor process
managing its children. The limit values were learned from having too
many connections cause failures and were tuned lower in order to
prevent this. When connections exceed the max-connections limit they
will queue up to the kernel limit /proc/sys/net/core/somaxconn
(defaults to 128) and be serviced as able. Any connection arriving
after that 128-entry queue is full will get a connection failure. The
behavior at that point is client dependent. It might retry.
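If it helps to see whether that queue is actually filling up,
something along these lines could be run on the server. This is just
a sketch; it assumes a stock iproute2 and the standard git port 9418.

  # Kernel-wide accept queue limit (the 128 default mentioned above):
  sysctl net.core.somaxconn

  # The git-daemon listening socket.  On a listening socket ss shows
  # the current accept-queue length in Recv-Q and the configured
  # backlog in Send-Q (on reasonably recent kernels):
  ss -ltn '( sport = :9418 )'

  # Current git protocol connections, to compare against the
  # --max-connections=15 limit (the count includes a header line):
  ss -tn '( sport = :9418 )' | wc -l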
The nginx configuration is this:

  location /git/ {
      autoindex on;
      root /srv;
      location ~ ^/git(/.*/(info/refs|git-upload-pack)$) {
          gzip off;
          include fastcgi_params;
          fastcgi_pass unix:/var/run/fcgiwrap.socket;
          fastcgi_param SCRIPT_FILENAME /usr/local/sbin/git-http-backend;
          fastcgi_param PATH_INFO $1;
          fastcgi_param GIT_HTTP_EXPORT_ALL true;
          fastcgi_param GIT_PROJECT_ROOT /srv/git;
          client_max_body_size 0;
      }
  }

Looking at this now I see there is no rate limit being applied to
this section. Therefore what I mentioned previously applies to the
cgit and gitweb sections, which have been more problematic. With no
rate limits all clients will be attempted. Hmm... I think that may
have been a mistake. It is possible that adding a rate limit would
smooth the resource use and actually improve the situation. The cgit
and gitweb sections use a "limit_req zone=one burst=15;" limit. cgit
in particular is resource intensive for various reasons. I'll need to
do some testing.

This is the "smart" git transfer protocol server using
git-http-backend. At one time the dumb basic file transfer protocol
was used, but due to gripe tickets all of the transfer protocols were
converted to the smart transfer protocol using git-http-backend. This
is an external binary that is executed for each connection. It would
actually be interesting to benchmark real use cases of the different
protocols and see what the performance differences are on the current
system.

When you are seeing proxy gateway failures I think it is most likely
that the system is under resource stress and is unable to launch a
git-http-backend process within the timeouts. This resource stress
can occur as the sum total of everything that is happening on the
server at the same time. It includes git://, http(s)://, and also svn
and bzr and hg. (Notably all of the CVS operations are on a different
VM, though likely on the same host server.) All of those are running
on this system and when all of them coincidentally spike use at the
same time they compete with each other for resources. The system will
run very slowly. I/O is shared. Memory is shared.

Additionally there is an rsync server on the same VM. Mirror sites
may also be hitting the system invoking rsync. Which is also
competing for system resources. And of course all member access is
over the ssh protocol. Since this is member access there are no
limits placed on ssh connections. In general ssh has caused the least
problems because it is authenticated access from member accounts.
I'll note we have fail2ban rules running which look for and ban ssh
abuse too.

We have set up a new VM that is in the pipeline to be used. It has
twice the resources of the current one. It needs provisioning. Which
I could do manually fairly easily. But I was taking the time to
script it so that it would be easier in the future for each of the
next ones.

Among other things the current VM has Linux memory overcommit
enabled. Which means that the OOM (Out of Memory) Killer is triggered
at times. And when that happens there is no longer any guarantee that
the machine is in a happy state. Pretty much it requires a reboot to
ensure that everything is happy after the OOM Killer is invoked. The
new system has more resources and I will be disabling overcommit,
which avoids the OOM Killer. I strongly feel the OOM Killer is
inappropriate for enterprise-level production servers.
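For reference, disabling overcommit is just a couple of sysctl
settings. A minimal sketch; the ratio value here is purely
illustrative and would need tuning for the new VM:

  # Switch from heuristic overcommit (0, the default) to strict
  # accounting (2) so that allocations fail with ENOMEM up front
  # instead of inviting the OOM Killer later:
  sysctl -w vm.overcommit_memory=2

  # Under strict accounting the commit limit is swap plus this
  # percentage of RAM; 80 is only an example value:
  sysctl -w vm.overcommit_ratio=80

  # Put the same two settings in /etc/sysctl.d/ to persist them
  # across reboots.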
(I would have sworn it was already disabled. But looking a bit ago I
saw that it was enabled. Did someone else enable it? Maybe. That's
the problem of cooking in a shared kitchen. Things move around and it
could have been any of the cooks.)

> > Whenever I have set up continuous integration builds I always set
> > up a local mirror of all remote repositories. I poll with a
> > cronjob to refresh those repositories. Since the update is
> > incremental it is pretty efficient for keeping the local mirrors
> > up to date. Then all of the continuous integration builds pull
> > from the local mirror.
> >
> > This has a pretty good result in that the LAN is very robust and
> > only shows infrastructure failures when there is something really
> > catastrophic happening on the local network. Since 100% of
> > everything is local in that case.
> >
> > Transient WAN glitches such as routing problems or server
> > brownouts happen periodically and are unavoidable, but then the
> > local mirror catches up with the next cronjob refresh. You can
> > refresh on a pretty quick cycle. I would do every four hours for
> > example. Plus those errors are much more understandable as a
> > network issue separate from the CI build errors. It's a good way
> > to protect the CI builds and reduce noise from them to a minimum.
>
> It's what I actually used back in the day, the Dockerfile didn't
> clone the repository but did copy the already checked-out
> repository inside the image. That has all the advantages you cited,
> but cloning straight from your repository makes my images more
> trustworthy because the user sees that nothing fishy is going on.

Since git commits are hash ids there should be no difference in the
end. A commit with a given hash id will be the same regardless of how
it arrived there. I don't see how anyone can say anything fishy is
happening. I might liken it to newsgroups. It doesn't matter how an
article arrives; it may have come by any of a number of routes. It
will be the same article regardless. With git the hash id ensures
that the object content is identical.

> Also he can just take my Dockerfile and build it directly without
> having to clone something locally first.

I didn't quite follow the why of this being different. Generally I
would like to see CPU effort distributed so that it is amortized
across all participants as much as possible. As opposed to having it
lumped in one place. However if something can be done once instead of
repeatedly then of course that is better for that reason. Since I
didn't quite follow the detail here I can only comment with a vague
hand-waving response that is without deep meaning.

> To be honest I think my realistic alternatives here are to find the
> right clone limit (4? 8? 20? depending on the hour of the day) and
> use one which is reasonable in terms of time it takes to build and
> abuse of your servers. The images are usually only built once per
> day, and because it's all cached they are only built when the base
> image changes, which is like once per month. So most of the time I
> do *not* clone anything from your repositories... that's when I'd
> like all the images building in parallel, but when suddenly each of
> the images requires a clone then that's where I'd like at most 2
> images building simultaneously to ensure it works.

Another time-honored technique is to wrap "the action" with a retry
loop. Try. If it failed then sleep for a bit and retry. As long as
the retry loop eventually succeeds then report that as a success, not
a failure. Too bad git clone does not include such functionality by
default. But it shouldn't be too hard to apply.
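A minimal sketch of such a wrapper in shell. The URL, retry count,
and sleep interval are only examples; any transport would work the
same way:

  #!/bin/sh
  # Retry the clone a few times, sleeping between attempts.
  url=git://git.sv.gnu.org/emacs.git
  tries=5
  until git clone "$url" emacs; do
      tries=$((tries - 1))
      if [ "$tries" -le 0 ]; then
          echo "clone failed too many times, giving up" >&2
          exit 1
      fi
      # git clone normally cleans up a failed clone itself, but make
      # sure nothing is left behind before trying again.
      rm -rf emacs
      echo "clone failed, retrying in 60 seconds..." >&2
      sleep 60
  done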
Most problems are transient. Either an Internet problem such as a
routing failure, which are unavoidable and happen at a certain rate,
or a server brownout, which happens more often because high-profile
servers are also high-profile targets. I don't know why people take
any joy in bringing servers like this down. The nature of things now
is that if an agent wants to DDoS a site off the network then there
is nothing the site operator can do to prevent it.

> I could also switch to the github mirror
> (https://github.com/emacs-mirror/emacs), because I expect github to
> have so many resources that I can clone from them like crazy. But
> it feels a bit wrong, cloning from the official repo sounds better
> and more trustworthy. I'll probably go the "limit to N clones"
> route, right now it's limited at ~16 (4 concurrent jobs, each of
> them building for 4 architectures).

You see, to me using all of that non-free software also feels wrong.
(shrug)

> I just had this thought that maybe I could play man-in-the-middle
> with /etc/hosts and make-believe git.sv.gnu.org is a local
> repository, and once per day I sync that local repo with the real
> one. That way the Dockerfile would appear to clone the real repo
> yet the caching would be done.

Clever. But is it needed? You could easily have multiple remotes. It
doesn't matter which one you clone from. The hashes will be the same.
You could clone from your local reference and then do a second update
from "the official upstream" and have the same result.

Note that if git-http-backend cannot be invoked for whatever reason
then it doesn't matter whether the transfer is large or small. Even
the smallest transfer would still error because it couldn't get going
at all. It's just that minimizing the transfers places the smallest
load on the upstream server. Which is the most friendly. And it
avoids spending bandwidth transferring data which has already been
transferred repeatedly.

Some random thoughts in this space... Among the arguments between the
git protocol and the https protocol, many people are worried about
agents injecting malicious code into the unencrypted git stream. I am
not sure it is even possible to successfully inject code into a git
protocol stream, due to git's hashing. But with https it is prevented
outright, and due to this there is an effort to use https everywhere.
Using https everywhere has a nice appeal. Until one is trying to sort
things out on the server side and trying to differentiate attacks,
abuse, and valid use. It's all mixed together. And it is all
happening continuously. It is like standing under a waterfall trying
to figure out where the broken water pipe is located.

You might consider using the ssh transfer protocol. Since that is all
authenticated member access. It is encrypted and therefore avoids
injection attacks. It does require a valid member account to hold the
ssh keys.

> > Do the failures you are seeing have a periodic time-of-day cycle
> > where they are more likely to happen in the middle of the US
> > nighttime? If so then that is probably related.
>
> What do you reckon would be the best schedule?

Strangely I would recommend the middle of the day. Which I know is
counterintuitive to many. Because then there are local resources to
observe what is happening and to make tweaks and tuning. Most of the
problems we have currently happen in the middle of the US nighttime,
when it is difficult for any of us to know what is happening by
browsing through the limited logs the day after. Plus in the global
planetary network scheme of things there is no middle of the night.
It's always day to someone who is a user of free software and we
would like it to work for all of them. Also I know that backups
happen at US night. That will have an effect too. Hopefully not too
negative.
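Tying that scheduling advice together with the local mirror idea from
earlier, a rough sketch of what the caching setup could look like.
The paths and the refresh times here are only examples:

  # One-time setup of a local bare mirror:
  git clone --mirror git://git.sv.gnu.org/emacs.git /srv/mirrors/emacs.git

  # Cron entry refreshing it a few times during the US daytime
  # (hours are illustrative):
  # m  h        dom mon dow  command
    0  15,18,21 *   *   *    git -C /srv/mirrors/emacs.git remote update --prune

  # Builds then clone from the local mirror; the commit hashes are
  # identical to what the upstream serves:
  git clone /srv/mirrors/emacs.git emacs

And since the hashes are the same, the official upstream could always
be added as a second remote afterwards and fetched for verification
or to pick up the very latest changes.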
I know this has been a long and rambling email. I salute you for
having reached the end of it. :-)

Bob