Re: [OpenAFS] Non-functional fileserver

Stephan Wonczak Thu, 18 Jul 2024 03:57:03 -0700

  Hi Mark,
  Comments inline.

On Thu, 11 Jul 2024, MS Vitale wrote:

Dr. Wonczak,

Thank you for your report.  Please see my interleaved replies below:
On Jul 11, 2024, at 9:50 AM, Stephan Wonczak <[email protected]> wrote:
Today we had a strange problem with two of our test-AFS-Servers. Apart from our normal cell we created two additional cells, each one consisting of a single server that serves as both DB-Server and Fileserver. These servers were created about two years back, and were working fine then. Yesterday we had need to test something new and revisited the servers.
 "bos status" came back fine with "all servers running".
'bos status <host> -long' is useful in this situation, and may report that a core file is present.

Yes, probably. I indeed neglected to use the "long" option. However, the info that a core file is present is not really helpful in itself.

However, "vos listvol -server xxx" resulted in "possible communication failure". Digging a bit, we had numerous log entries in VolSerLog: "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This pointed to the fact that the fssync.sock socket file was missing. Indeed, /var/log/messages showed that the fileserver process had dumped core during startup. Interestingly, though, a fileserver process -was- running, just not really functioning.
 Several unsuccessful hours of debugging, tracing and googling later, I was ready to give up and trash the test cell and create a new one from scratch. During the process of purging the files I thought "OK, /usr/afs/etc/CellServDB for this cell stays the same, so I can keep that." On a hunch, I actually looked at what was inside: Lo and behold! The configured DB-server address for the cell had the wrong IP.
 This is when I remembered that both problematic machines were moved to a 
different network segment. We had corrected the -client- CellServDB during that 
move, but forgot about the server CellServDB.
 Now, the whole point of this story:
 The logs were spectacularly unhelpful in pinpointing this misconfiguration. 
Indeed, I would not have expected the fileserver to dump core instead of 
refusing to run at all. At the very least there should be a log entry that no 
DB-Server could be reached (and CellServDB should be checked).
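
The kind of CellServDB sanity check described above could be sketched roughly as follows. This is only an illustration, not part of OpenAFS: the function names are made up, and `resolve` is a hypothetical callable that returns the set of addresses a hostname currently resolves to. It assumes the usual CellServDB layout (a ">cellname" header line followed by "IP #hostname" lines) and that the comment field actually holds a resolvable hostname.

```python
import ipaddress


def parse_cellservdb(text):
    """Return {cell_name: [(ip, hostname), ...]} from CellServDB text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells[current] = []
        elif current is not None:
            # Server line: "IP   #hostname"; brackets around the IP are stripped.
            addr, _, host = line.partition("#")
            ip = str(ipaddress.ip_address(addr.strip().strip("[]")))
            cells[current].append((ip, host.strip()))
    return cells


def find_mismatches(cells, resolve):
    """Yield (cell, host, listed_ip, resolved_ips) where the listed IP is stale."""
    for cell, servers in cells.items():
        for ip, host in servers:
            actual = resolve(host)  # hypothetical resolver, see note below
            if ip not in actual:
                yield cell, host, ip, sorted(actual)
```

For real use, `resolve` could be something like `lambda h: set(socket.gethostbyname_ex(h)[2])`, run against the contents of /usr/afs/etc/CellServDB on the server after a network move.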
 Recreating this behaviour is easy:
 Take a working single-server cell, and change the IP in
 /usr/afs/etc/CellServDB. Restart the fileserver and watch things go
  south.
I tried this (running master) and was able to reproduce some of your symptoms, as expected - but not all of them.
In this case, when the CSDB has the wrong IP address, the fileserver
will never be fully functional even though it is "running".


  Yes, of course. Failure in this case is expected and correct.

When a fileserver is in this state, the fileserver FSSYNC channel is indeed blocked until the fileserver is able to complete registration with the vlserver. As you observed, this in turn affects any volserver operation that requires the FSSYNC channel.


  Also expected :-)

The fileserver will also be unable to obtain required authorization information from the ptserver.
However, I did NOT experience a fileserver crash.

I tried several times, and each time I had a crash/coredump during startup. This was even in the logs (BosLog):


Thu Jul 11 14:57:29 2024: fs started pid 65412: /usr/afs/bin/salvager
Thu Jul 11 14:57:29 2024: Listening on 0.0.0.0:7007
Thu Jul 11 14:57:29 2024: fs:salv exited with code 0
Thu Jul 11 14:57:29 2024: fs started pid 65423: /usr/afs/bin/fileserver
Thu Jul 11 14:57:29 2024: fs started pid 65424: /usr/afs/bin/volserver
Thu Jul 11 14:58:05 2024: fs:vol exited on signal 15
Thu Jul 11 14:58:05 2024: fs:file exited on signal 3 (core dumped)
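
Entries like the last one above are easy to miss in a long BosLog. A minimal sketch for picking them out, assuming the line format shown in this excerpt (the regular expression and function name are illustrative only, not an OpenAFS interface):

```python
import re

# Matches e.g. "Thu Jul 11 14:58:05 2024: fs:file exited on signal 3 (core dumped)"
EXIT_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} +\d+ [\d:]+ \d{4}): "
    r"(?P<proc>\S+) exited on signal (?P<sig>\d+)(?P<core> \(core dumped\))?"
)


def find_signal_exits(lines):
    """Return (timestamp, process, signal, core_dumped) for signal exits."""
    hits = []
    for line in lines:
        m = EXIT_RE.match(line.strip())
        if m:
            hits.append(
                (m["when"], m["proc"], int(m["sig"]), m["core"] is not None)
            )
    return hits
```

Lines such as "fs:salv exited with code 0" are deliberately ignored, since only exits on a signal are reported.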

And I also see these expected messages in FileLog:
 ...
 Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
 Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
 Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
 ...

Admittedly, these messages are not as helpful as they could be; they should mention which IP addrs it is trying to reach.


  Some hint to "check CellServDB" would be -really- useful here, too.
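
Until the log messages themselves carry such a hint, a small external scanner could supply it. The following is a rough sketch only: it counts the two FileLog messages quoted above by error code and suggests checking CellServDB when code=-1 appears. That heuristic is an assumption drawn from this thread alone (code=-1 was what showed up while CellServDB held the wrong address); the function names are invented for illustration.

```python
import re

PATTERNS = {
    "vl_register": re.compile(r"VL_RegisterAddrs rpc failed.*?\(code=(-?\d+)"),
    "cps": re.compile(r"Couldn't get CPS for AnyUser.*?code=(-?\d+)"),
}


def summarize_filelog(text):
    """Count known failure messages, keyed by (message_kind, code)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    counts = {}
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(flat):
            key = (name, int(m.group(1)))
            counts[key] = counts.get(key, 0) + 1
    return counts


def hint(counts):
    """Return a suggestion string, or None if the heuristic does not apply."""
    # Assumption from this thread: code=-1 went along with a wrong DB-server IP.
    if any(code == -1 for (_kind, code) in counts):
        return "DB server not reachable? Check /usr/afs/etc/CellServDB."
    return None
```

Other codes, such as the 5376 mentioned further down, are counted but produce no hint, since this sketch makes no claim about what they mean.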

What version of OpenAFS are you running?


openafs-1.8.11

I just noticed: There still seems to be something not working correctly. Although everything is working correctly (at least -I- did not find anything amiss), I still get these messages in FileLog every five minutes:

Thu Jul 18 12:36:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:41:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:46:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)


  Any ideas as to that?

        Dipl. Chem. Dr. Stephan Wonczak

        Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
        Universitaet zu Koeln, Weyertal 121, 50931 Koeln
        Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
