Philippe Vaucher wrote:
> > > Is there a limit and/or maintenance going on? Was I put in some
> > > sort of throttle-list?
> >
> > We do have rate limits in place. Because otherwise the general
> > background radiation activity of the Internet will break things.
> >
> > However nothing has changed in regard to the rate limits for a
> > long time. As I look at the logs the last rate limit change was
> > Sat Dec 7 06:31:50 2019 -0500, which seems long enough ago that it
> > isn't anything recent. Meaning that this is probably simply you
> > competing with other users on the Internet for resources.
>
> Until recently I only did up to ~8 concurrent git clones, but
> recently with infrastructure changes I'm able to do much more.
For a high level of parallelism I would definitely try to fan out
using local resources. It would be much more reliable. In electrical
design the fanout and load of a circuit is a design consideration.
One doesn't want to place too much of a load on a remote source. And
a source knows how much fanout it can drive. This is a similar
problem to software mirroring. I would limit the amount of load
placed on the remote end and then keep all of the fanout to local
resources. Then you get to see the effects of both ends of things
locally and have better visibility. Also, since that is on the LAN
side, the network performance is unlikely to bottleneck. I will
resist telling some anecdotal stories of various experiences here and
some managerial problems in this area. :-)

> > Cloning with the https transport uses git-http-backend for the
> > backend. We are using Nginx rate limiting. You can read about how
> > the algorithm works here. It is basically a smoothing process.
>
> Until recently I was cloning git://git.sv.gnu.org/emacs.git, the
> switch to https is an attempt at working around the limitation I
> hit recently. My train of thought was that http:// is easier to
> scale than git://, if you say otherwise I can revert back to git://
> clones.

Well... It's not simple! And also things change with time and server
resources and configuration. So any answer I were to state today
would mutate into a wrong answer at some different point in time.

I understand your reasoning. And if this were an infrastructure at,
say, Amazon AWS EC2, set up to be elastic and simply scaled out, then
your reasoning would be totally on target. Increased load would scale
out to more parallel resources. It is more typical to have a load
balancer in front of http/https protocol servers, so those can be set
up rather straightforwardly. Load balancers could be set up in front
of git:// protocol servers too.

However the GNU Project is dedicated and directed to using only Free
Software. And is hosted in this goal by the FSF. Which means
everything is self-hosted. Funded by the annual fund raising from
donors, just like all of us who donate. And resources are limited.
Note that Github is not Free Software. Therefore we cannot endorse
its use. Same thing with Amazon AWS. Although regardless of this we
know that many users do use them anyway.

Let's talk about the technical details of the differences between git
protocol servers and http/https protocol servers. The abbreviated
version is that git-daemon is running as the 'nobody' user like this:

  git daemon --init-timeout=10 --timeout=28800 --max-connections=15 \
    --export-all --base-path=/srv/git --detach

In this configuration git-daemon acts as the supervisor process
managing its children. The limit values were learned from having too
many connections cause failures and were tuned lower in order to
prevent this. When connections exceed the max-connections limit they
will queue up to the kernel limit /proc/sys/net/core/somaxconn
(defaults to 128) and be serviced as able. Any connection arriving
after that 128-entry queue is full will get a connection failure. The
behavior at that point is client dependent. It might retry.
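If it helps to see whether that queue is actually filling up,
something along these lines could be run on the server. This is just
a sketch; it assumes a stock iproute2 and the standard git port 9418.

  # Kernel-wide accept queue limit (the 128 default mentioned above):
  sysctl net.core.somaxconn

  # The git-daemon listening socket.  On a listening socket ss shows
  # the current accept-queue length in Recv-Q and the configured
  # backlog in Send-Q (on reasonably recent kernels):
  ss -ltn '( sport = :9418 )'

  # Current git protocol connections, to compare against the
  # --max-connections=15 limit (the count includes a header line):
  ss -tn '( sport = :9418 )' | wc -l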
The nginx configuration is this:

  location /git/ {
      autoindex on;
      root /srv;
      location ~ ^/git(/.*/(info/refs|git-upload-pack)$) {
          gzip off;
          include fastcgi_params;
          fastcgi_pass unix:/var/run/fcgiwrap.socket;
          fastcgi_param SCRIPT_FILENAME /usr/local/sbin/git-http-backend;
          fastcgi_param PATH_INFO $1;
          fastcgi_param GIT_HTTP_EXPORT_ALL true;
          fastcgi_param GIT_PROJECT_ROOT /srv/git;
          client_max_body_size 0;
      }
  }

Looking at this now I see there is no rate limit being applied to
this section. Therefore what I mentioned previously applies to the
cgit and gitweb sections, which have been more problematic. With no
rate limits all clients will be attempted. Hmm... I think that may
have been a mistake. It is possible that adding a rate limit would
smooth the resource use and actually improve the situation. The cgit
and gitweb sections use a "limit_req zone=one burst=15;" limit. cgit
in particular is resource intensive for various reasons. I'll need to
do some testing.

This is the "smart" git transfer protocol server using
git-http-backend. At one time the dumb basic file transfer protocol
was used, but due to gripe tickets all of the transfer protocols were
converted to the smart transfer protocol using git-http-backend. This
is an external binary that is executed for each connection. It would
actually be interesting to benchmark real use cases of the different
protocols and see what the performance differences are on the current
system.

When you are seeing proxy gateway failures I think it is most likely
that the system is under resource stress and is unable to launch a
git-http-backend process within the timeouts. This resource stress
can occur as the sum total of everything that is happening on the
server at the same time. It includes git://, http(s)://, and also svn
and bzr and hg. (Notably all of the CVS operations are on a different
VM, though likely on the same host server.) All of those are running
on this system and when all of them coincidentally spike use at the
same time they compete with each other for resources. The system will
run very slowly. I/O is shared. Memory is shared.

Additionally there is an rsync server on the same VM. Mirror sites
may also be hitting the system invoking rsync. Which is also
competing for system resources. And of course all member access is
over the ssh protocol. Since this is member access there are no
limits placed on ssh connections. In general ssh has caused the least
problems because it is authenticated access from member accounts.
I'll note we have fail2ban rules running which look for and ban ssh
abuse too.

We have set up a new VM that is in the pipeline to be used. It has
twice the resources of the current one. It needs provisioning. Which
I could do manually fairly easily. But I was taking the time to
script it so that it would be easier in the future for each of the
next ones.

Among other things the current VM has Linux memory overcommit
enabled. Which means that the OOM (Out of Memory) Killer is triggered
at times. And when that happens there is no longer any guarantee that
the machine is in a happy state. Pretty much it requires a reboot to
ensure that everything is happy after the OOM Killer is invoked. The
new system has more resources and I will be disabling overcommit,
which avoids the OOM Killer. I strongly feel the OOM Killer is
inappropriate for enterprise-level production servers.
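For reference, disabling overcommit is just a couple of sysctl
settings. A minimal sketch; the ratio value here is purely
illustrative and would need tuning for the new VM:

  # Switch from heuristic overcommit (0, the default) to strict
  # accounting (2) so that allocations fail with ENOMEM up front
  # instead of inviting the OOM Killer later:
  sysctl -w vm.overcommit_memory=2

  # Under strict accounting the commit limit is swap plus this
  # percentage of RAM; 80 is only an example value:
  sysctl -w vm.overcommit_ratio=80

  # Put the same two settings in /etc/sysctl.d/ to persist them
  # across reboots.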
(I would have sworn it was already disabled. But looking a bit ago I
saw that it was enabled. Did someone else enable it? Maybe. That's
the problem of cooking in a shared kitchen. Things move around and it
could have been any of the cooks.)

> > Whenever I have set up continuous integration builds I always set
> > up a local mirror of all remote repositories. I poll with a
> > cronjob to refresh those repositories. Since the update is
> > incremental it is pretty efficient for keeping the local mirrors
> > up to date. Then all of the continuous integration builds pull
> > from the local mirror.
> >
> > This has a pretty good result in that the LAN is very robust and
> > only shows infrastructure failures when there is something really
> > catastrophic happening on the local network. Since 100% of
> > everything is local in that case.
> >
> > Transient WAN glitches such as routing problems or server
> > brownouts happen periodically and are unavoidable, but then the
> > local mirror catches up with the next cronjob refresh. You can
> > refresh on a pretty quick cycle. I would do every four hours for
> > example. Plus those errors are much more understandable as a
> > network issue separate from the CI build errors. It's a good way
> > to protect the CI builds and reduce noise from them to a minimum.
>
> It's what I actually used back in the day, the Dockerfile didn't
> clone the repository but did copy the already checked-out
> repository inside the image. That has all the advantages you cited,
> but cloning straight from your repository makes my images more
> trustworthy because the user sees that nothing fishy is going on.

Since git commits are hash ids there should be no difference in the
end. A commit with a given hash id will be the same regardless of how
it arrived there. I don't see how anyone can say anything fishy is
happening. I might liken it to newsgroups. It doesn't matter how an
article arrives; it may have come by any of a number of routes. It
will be the same article regardless. With git the hash id ensures
that the object content is identical.

> Also he can just take my Dockerfile and build it directly without
> having to clone something locally first.

I didn't quite follow the why of this being different. Generally I
would like to see CPU effort distributed so that it is amortized
across all participants as much as possible. As opposed to having it
lumped in one place. However if something can be done once instead of
repeatedly then of course that is better for that reason. Since I
didn't quite follow the detail here I can only comment with a vague
hand-waving response that is without deep meaning.

> To be honest I think my realistic alternatives here are to find the
> right clone limit (4? 8? 20? depending on the hour of the day) and
> use one which is reasonable in terms of time it takes to build and
> abuse of your servers. The images are usually only built once per
> day, and because it's all cached they are only built when the base
> image changes, which is like once per month. So most of the time I
> do *not* clone anything from your repositories... that's when I'd
> like all the images building in parallel, but when suddenly each of
> the images requires a clone then that's where I'd like at most 2
> images building simultaneously to ensure it works.

Another time-honored technique is to wrap "the action" with a retry
loop. Try. If it failed then sleep for a bit and retry. As long as
the retry loop eventually succeeds then report that as a success, not
a failure. Too bad git clone does not include such functionality by
default. But it shouldn't be too hard to apply.
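A minimal sketch of such a wrapper in shell. The URL, retry count,
and sleep interval are only examples; any transport would work the
same way:

  #!/bin/sh
  # Retry the clone a few times, sleeping between attempts.
  url=git://git.sv.gnu.org/emacs.git
  tries=5
  until git clone "$url" emacs; do
      tries=$((tries - 1))
      if [ "$tries" -le 0 ]; then
          echo "clone failed too many times, giving up" >&2
          exit 1
      fi
      # git clone normally cleans up a failed clone itself, but make
      # sure nothing is left behind before trying again.
      rm -rf emacs
      echo "clone failed, retrying in 60 seconds..." >&2
      sleep 60
  done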
Most problems are transient. Either an Internet problem such as a
routing failure, which are unavoidable and happen at a certain rate,
or a server brownout, which happens more often because high-profile
servers are also high-profile targets. I don't know why people take
any joy in bringing servers like this down. The nature of things now
is that if an agent wants to DDoS a site off the network then there
is nothing the site operator can do to prevent it.

> I could also switch to the github mirror
> (https://github.com/emacs-mirror/emacs), because I expect github to
> have so many resources that I can clone from them like crazy. But
> it feels a bit wrong, cloning from the official repo sounds better
> and more trustworthy. I'll probably go the "limit to N clones"
> route, right now it's limited at ~16 (4 concurrent jobs, each of
> them building for 4 architectures).

You see, to me using all of that non-free software also feels wrong.
(shrug)

> I just had this thought that maybe I could play man-in-the-middle
> with /etc/hosts and make-believe git.sv.gnu.org is a local
> repository, and once per day I sync that local repo with the real
> one. That way the Dockerfile would appear to clone the real repo
> yet the caching would be done.

Clever. But is it needed? You could easily have multiple remotes. It
doesn't matter which one you clone from. The hashes will be the same.
You could clone from your local reference and then do a second update
from "the official upstream" and have the same result.

Note that if git-http-backend cannot be invoked for whatever reason
then it doesn't matter whether the transfer is large or small. Even
the smallest transfer would still error because it couldn't get going
at all. It's just that minimizing the transfers places the smallest
load on the upstream server. Which is the most friendly. And it
avoids spending bandwidth transferring data which has already been
transferred repeatedly.

Some random thoughts in this space... Among the arguments between the
git protocol and the https protocol, many people are worried about
agents injecting malicious code into the unencrypted git stream. I am
not sure it is even possible to successfully inject code into a git
protocol stream, due to git's hashing. But with https it is prevented
outright, and due to this there is an effort to use https everywhere.
Using https everywhere has a nice appeal. Until one is trying to sort
things out on the server side and trying to differentiate attacks,
abuse, and valid use. It's all mixed together. And it is all
happening continuously. It is like standing under a waterfall trying
to figure out where the broken water pipe is located.

You might consider using the ssh transfer protocol. Since that is all
authenticated member access. It is encrypted and therefore avoids
injection attacks. It does require a valid member account to hold the
ssh keys.

> > Do the failures you are seeing have a periodic time-of-day cycle
> > where they are more likely to happen in the middle of the US
> > nighttime? If so then that is probably related.
>
> What do you reckon would be the best schedule?

Strangely I would recommend the middle of the day. Which I know is
counterintuitive to many. Because then there are local resources to
observe what is happening and to make tweaks and tuning. Most of the
problems we have currently happen in the middle of the US nighttime,
when it is difficult for any of us to know what is happening by
browsing through the limited logs the day after. Plus in the global
planetary network scheme of things there is no middle of the night.
It's always day to someone who is a user of free software and we
would like it to work for all of them. Also I know that backups
happen at US night. That will have an effect too. Hopefully not too
negative.
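Tying that scheduling advice together with the local mirror idea from
earlier, a rough sketch of what the caching setup could look like.
The paths and the refresh times here are only examples:

  # One-time setup of a local bare mirror:
  git clone --mirror git://git.sv.gnu.org/emacs.git /srv/mirrors/emacs.git

  # Cron entry refreshing it a few times during the US daytime
  # (hours are illustrative):
  # m  h        dom mon dow  command
    0  15,18,21 *   *   *    git -C /srv/mirrors/emacs.git remote update --prune

  # Builds then clone from the local mirror; the commit hashes are
  # identical to what the upstream serves:
  git clone /srv/mirrors/emacs.git emacs

And since the hashes are the same, the official upstream could always
be added as a second remote afterwards and fetched for verification
or to pick up the very latest changes.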
I know this has been a long and rambling email. I salute you for
having reached the end of it. :-)

Bob