Git repo outage

2024-01-30 Thread Tongliang Liao
We’re getting errors when cloning make with `git clone 
'https://git.savannah.gnu.org/git/make.git'` both in CI and locally.
Browser link also leads to nowhere: https://git.savannah.gnu.org/cgit/make.git

Is this removed intentionally, or is it an outage?


Thanks,
Tongliang

Re: Git repo outage

2024-01-30 Thread Bob Proulx
Tongliang Liao wrote:
> We’re getting errors when cloning make with `git clone 
> 'https://git.savannah.gnu.org/git/make.git'` both in CI and locally.
> Browser link also leads to nowhere: https://git.savannah.gnu.org/cgit/make.git
>
> Is this removed intentionally, or is it an outage?

It was an outage.  One of the host systems crashed tonight.  That
took several systems offline.  And due to that there was a cascade
failure which took a while to diagnose and fix.  Things were actually
a little stressful there for a while!  Fixed now.  All services should
be back online.

Please make a bookmark of this URL to get out-of-band status of system
problems.

https://hostux.social/@fsfstatus

Bob



Re: Git repo outage

2024-01-30 Thread Tongliang Liao
Thanks for the info and quick action!

I also saw people asking online regarding the status page, with responses like
“there’s no status page for this site”. It would be great if this URL could be
documented somewhere more visible, or perhaps 302 something like
https://status.savannah.gnu.org here.
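
(Just to illustrate the idea: status.savannah.gnu.org doesn't exist today, so the
hostname below is only the suggestion, but a redirect like that could be checked
from the command line along these lines:)

    # Hypothetical check -- status.savannah.gnu.org is a suggested name only;
    # the commented output sketches what a 302 to the status account would look like.
    curl -sI https://status.savannah.gnu.org | grep -iE '^(HTTP|Location)'
    # HTTP/1.1 302 Found
    # Location: https://hostux.social/@fsfstatus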


Tongliang

> On Jan 30, 2024, at 00:13, Bob Proulx  wrote:
> 
> Tongliang Liao wrote:
>> We’re getting errors when cloning make with `git clone 
>> 'https://git.savannah.gnu.org/git/make.git'` both in CI and locally.
>> Browser link also leads to nowhere: 
>> https://git.savannah.gnu.org/cgit/make.git
>> 
>> Is this removed intentionally, or is it an outage?
> 
> It was an outage.  One of the host systems crashed tonight.  That
> took several systems offline.  And due to that there was a cascade
> failure which took a while to diagnose and fix.  Things were actually
> a little stressful there for a while!  Fixed now.  All services should
> be back online.
> 
> Please make a bookmark of this URL to get out-of-band status of system
> problems.
> 
>    https://hostux.social/@fsfstatus
> 
> Bob



Re: Git repo outage

2024-01-30 Thread Paul Smith
On Tue, 2024-01-30 at 01:13 -0700, Bob Proulx wrote:
> Please make a bookmark of this URL to get out-of-band status of
> system problems.
> 
>     https://hostux.social/@fsfstatus

If you have a Mastodon account you can follow that account as well, if
you prefer.

Bob, note that the last update (7h ago) says things are still down:

> Savannah services git, svn, hg, bzr, download and audio-video.gnu.org
> are unavailable.

I'm not sure if that's a typo, or if someone forgot to add the latest
updates, or...?



Re: Git repo outage

2024-01-30 Thread Bob Proulx
Tongliang Liao wrote:
> I also saw people asking online regarding the status page with
> responses like "there's no status page for this site".  Would be
> great if this URL could be documented somewhere more visible,

Yes.  I know that most of the IRC channels have it in the /topic
banner for public dissemination.  Savannah mentions it at various
times in the web UI news area when there are events.  It's otherwise
buried down this one page only.

   https://savannah.gnu.org/maintenance/NotSavannahAdmins/

I'll try to get it documented more visibly.  I think I will
advertise it on the front page of the servers, such as here.

https://git.savannah.gnu.org/

Note that the Savannah Hackers admin team can do many things with the
Savannah system, but we don't have access to post updates to that
FSF-sysadmin-managed hostux.social account.

> or perhaps 302 something like https://status.savannah.gnu.org here.

The problem is that if there is a larger problem at the Boston
datacenter where everything is hosted, then nothing from that
datacenter works at all.  That's been a problem a few times in the
past.  When a problem like that happens, one that, for example, breaks
all networking to the entire set of machines, then nothing there can
be used to communicate the failure.

That's why they (the FSF sysadmin staff) are using the hostux.social
media account, as it is hosted elsewhere on a completely independent
network.  That way, the likelihood that both sites would be offline at
the same time is very small.

For just the one Savannah web UI system, if there is going to be an
extended downtime then I will usually put up a maintenance page there.
But that's only possible when the problem is limited to that one web
server, which does not happen very often.

Bob



Re: Git repo outage

2024-01-30 Thread Bob Proulx
Paul Smith wrote:
> Bob Proulx wrote:
> > https://hostux.social/@fsfstatus
>
> Bob, note that the last update (7h ago) says things are still down:
>
> > Savannah services git, svn, hg, bzr, download and audio-video.gnu.org
> > are unavailable.
>
> I'm not sure if that's a typo, or if someone forgot to add the latest
> updates, or...?

I don't have access to make hostux.social updates there.  That's the
FSF sysadmin staff only.  I'm just a volunteer out in the field. :-)

I asked Michael to update it.  He has done so in the time it has
taken me to write this email.  It's updated now.  And I see that other
systems were affected last night too.

The problem was that this was happening at midnight in my timezone,
which is 2am US/Eastern in the FSF sysadmin staff timezone.  Michael,
the FSF admin staff person on call, was paged in to get the crashed
kvmhost3 system back online.  He did that, and that got the guest VMs
online.  Which was great!  Thanks Michael!

But then we had this network storage server problem.  That's a part of
the infrastructure which I happen to have full admin access to, and I
was also the person who set up that side of things, making me the
logical person to work through the problem.  (Me looks around the room
shyly, since I am the one who set that up and it is a problem there.
But I blame the specific version of the Trisquel system and Linux
kernel running on that server, because none of the other servers I
admin ever exhibit this problem.)

It being after 2am there, I told Michael to get some sleep, because if
I couldn't fix it by morning then I would need him awake and thinking
straight to tag back into the problem.  I continued to look at the
problem.  Initially it looked to me like a network problem, because I
could see mount requests going from client to server but the server
was not responding to those requests.  That was a red herring and
distracted me for a bit.  At that exact moment I had not realized that
the kvmhost3 crash had also affected the nfs1 server, because nfs1 is
running on a different host system, kvmhost2.  But seemingly kvmhost2
also rebooted, which triggered the nfs1 reboot.  And nfs1 has this
bimodal working/not-working boot behavior; we just need to retire that
entire system and retire that problem with it.
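
(To give a flavor of that kind of debugging, these are roughly the standard
checks for "client sends mount requests, server doesn't answer"; this is a
sketch, not a transcript of what was actually run, and the interface name
below is a placeholder:)

    # Typical NFS-side checks from a client that cannot mount.
    showmount -e nfs1          # ask the server for its export list
    rpcinfo -p nfs1            # confirm mountd/nfsd are registered with rpcbind
    # Watch mount/NFS traffic on the wire (interface name is a placeholder).
    sudo tcpdump -ni eth0 host nfs1 and port 2049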

Just for the record though, it's been about three years (pre-pandemic!)
since the last time an unscheduled crash affecting nfs1 caught us at an
unexpected time with this problem.  (Other crashes on other systems
have happened, but kvmhost2 has been super reliable, rock solid.)
Usually I am explicitly rebooting nfs1, knowing I need to ensure it has
booted into an okay state.  Honestly I was slow to realize that was the
problem this time.  And then when I did realize it, I took a moment to
try to debug it again, since I can only do that when the system is
down, and it was down already.  Unfortunately I wasn't able to diagnose
it, and with it being after midnight for me and people starting to
email in with problem reports, I simply rebooted it, retrying until it
booted up in the okay working state.  It took two more reboots before
that happened.  And then of course it was okay.  It's always okay and
reliable from then forward once it boots into the okay working mode.

Just because I am talking here, I will say that we already have an
nfs2 system set up and running.  It's in production, being used by
other systems, and working great.  If nfs1's problem shifted to where
rebooting it still did not resolve the problem, then we would move the
SAN block devices from nfs1 over to nfs2 and shift to serving the data
from there.  We will do that eventually regardless.  But it isn't a
transparent move, and there would be a mad scramble to update symlinks
and mount points.  I would rather work through that during a scheduled
daylight downtime than after midnight my time following an unexpected
crash.
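
(A rough sketch of what that scramble would involve; the device names,
export paths, and mount points below are invented for illustration and are
not our real layout:)

    # Hypothetical outline only -- all names here are placeholders.
    # On nfs2, after moving the SAN block device over:
    mount /dev/sdX1 /srv/data
    echo '/srv/data  *(rw,sync,no_subtree_check)' >> /etc/exports
    exportfs -ra
    # On each client: repoint the old mount and any symlinks at nfs2.
    umount /mnt/nfs-data
    mount -t nfs nfs2:/srv/data /mnt/nfs-data
    ln -sfn /mnt/nfs-data /srv/old-path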

Just to show how everything is connected: the reason I posted here at
all is that after clearing this problem I ran an anti-spam review of
the mail queues, saw the message from Tongliang Liao to the list,
approved it through, and then, because I had happened to see it, wrote
a response.  There were also a few messages to the Savannah lists,
which was expected, and I responded there.  I am sure there were
messages to other lists that I did not see, and so I didn't make random
postings anywhere else.  :-)

Bob