Hi Jeffrey,
On Mon, 22 Jul 2024, Jeffrey Altman wrote:
On Jul 18, 2024, at 6:56 AM, Stephan Wonczak
<[email protected]> wrote:
I just noticed: There still seems to be something not working
correctly. Although everything is working correctly (at least -I- did
not find anything amiss), I still get these messages in FileLog every
five minutes:
Thu Jul 18 12:36:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:41:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Thu Jul 18 12:46:59 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
Any ideas as to that?
5376 - no quorum elected
Strange.
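As an aside for anyone who wants to watch for this condition: below is a minimal Python sketch that scans FileLog text for the failure lines quoted above and tallies the error codes. The regular expression is inferred only from the sample lines in this message, so other OpenAFS versions may word the message differently. The function name and the one-entry code table (5376, as explained in Jeffrey's reply) are illustrative assumptions, not part of OpenAFS.

```python
import re
from collections import Counter

# Pattern inferred from the sample FileLog lines in this thread (assumption).
# DOTALL lets the match span a mail-wrapped line break before "periodically".
FAIL_RE = re.compile(
    r"VL_RegisterAddrs rpc failed.*?\(code=(\d+), err=(-?\d+)\)", re.DOTALL
)

# Only the meaning stated in this thread is listed here.
KNOWN_CODES = {5376: "no quorum elected"}


def count_register_failures(log_text):
    """Return a Counter mapping error code -> number of failure lines."""
    return Counter(int(m.group(1)) for m in FAIL_RE.finditer(log_text))


sample = """Thu Jul 18 12:36:59 2024 VL_RegisterAddrs rpc failed; will retry
periodically (code=5376, err=0)
Thu Jul 18 12:41:59 2024 VL_RegisterAddrs rpc failed; will retry
periodically (code=5376, err=0)
"""

for code, n in count_register_failures(sample).items():
    print(code, n, KNOWN_CODES.get(code, "unknown"))
```

In practice the text would be read from the FileLog file rather than from an inline sample.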
Earlier you mentioned that the cell consists of a single machine on which
the DB and FILE services are co-located.
In a single server configuration the UBIK services (vlserver, ptserver, …)
should be operating in single server mode and there should never be an
election. Since the vlserver is returning 5376 it indicates there might
still be a problem with the contents of the server CellServDB and perhaps
the NetInfo/NetRestrict configuration.
Really strange.
I do not have any NetInfo or NetRestrict files, so no problem there.
Here are the contents of /usr/afs/etc/CellServDB:
afstest.uni-koeln.de #Cell name
134.95.13.39 #afstest.rrz.uni-koeln.de
(Yes, really only these two lines!)
What errors are logged to the VLLog?
None at all.
Thu Jul 11 14:57:30 2024 Starting AFS vlserver 4 (/usr/afs/bin/vlserver)
Thu Jul 11 14:57:30 2024 @(#)OpenAFS 1.8.11 2024-06-13
[email protected]
Thu Jul 11 14:58:45 2024 Ubik: I am the sync site
These are the last entries.
What does 'udebug <host> 7003 -long' report?
This is where it gets really weird:
[root@afstest/usr/afs]$ udebug afstest.rrz.uni-koeln.de vl -long
Host's addresses are: 134.95.13.39
Host's 134.95.13.39 time is Fri Jul 26 10:26:26 2024
Local time is Fri Jul 26 10:26:26 2024 (time differential 0 secs)
Last yes vote for 134.95.13.39 was 13 secs ago (sync site);
Last vote started 13 secs ago (at Fri Jul 26 10:26:13 2024)
Local db version is 1610030433.14
I am sync site until 47 secs from now (at Fri Jul 26 10:27:13 2024) (2 servers)
Recovery state 1
The last trans I handled was 1720702725.17056
Sync site's db version is 1610030433.14
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
1279661 secs ago (at Thu Jul 11 14:58:45 2024)
Server (134.95.110.160): (db 0.0)
last vote never rcvd
last beacon never sent
dbcurrent=0, up=0 beaconSince=0
Where does this IP 134.95.110.160 come from?
Well, actually, this was the -old- IP of this machine before it was
moved into another network. But where did this come from? Hmmm...
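A stale peer like this one can be spotted mechanically. The following is a rough Python sketch that pulls the "Server (...)" entries out of saved udebug output and reports any address that is not in the list you expect for the cell. The function name and the idea of passing in an expected-address list are my own illustration; the line format is assumed from the output shown above.

```python
import re

# Matches peer entries such as "Server (134.95.110.160): (db 0.0)".
# Leading whitespace is tolerated in case the output is indented.
PEER_RE = re.compile(r"^[ \t]*Server \((\d+\.\d+\.\d+\.\d+)\)", re.MULTILINE)


def unexpected_peers(udebug_text, expected_addrs):
    """Return peer addresses listed by udebug that are not expected."""
    expected = set(expected_addrs)
    return [p for p in PEER_RE.findall(udebug_text) if p not in expected]


sample = """Host's addresses are: 134.95.13.39
I am sync site until 47 secs from now (at Fri Jul 26 10:27:13 2024) (2 servers)
Server (134.95.110.160): (db 0.0)
last vote never rcvd
last beacon never sent
"""

print(unexpected_peers(sample, ["134.95.13.39"]))
```

For a single-server cell, any address reported here is a sign that the running server process still holds an old configuration.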
I got it.
After correcting the server-CellServDB, I did not reboot the machine. I
just stopped and then restarted both openafs-server and openafs-client.
Evidently, the wrong IP remained in some kernel-resident list. I tried
fixing the issue with "fs newcell", but no luck there.
One reboot later, however, things are looking fine now:
udebug afstest.rrz.uni-koeln.de vl -long
Host's addresses are: 134.95.13.39
Host's 134.95.13.39 time is Fri Jul 26 10:39:31 2024
Local time is Fri Jul 26 10:39:31 2024 (time differential 0 secs)
Last yes vote for 134.95.13.39 was 0 secs ago (sync site);
Last vote started 0 secs ago (at Fri Jul 26 10:39:31 2024)
Local db version is 1610030433.14
I am sync site forever (1 server)
Recovery state 1f
The last trans I handled was 1721983108.0
Sync site's db version is 1610030433.14
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
63 secs ago (at Fri Jul 26 10:38:28 2024)
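The telling difference between the two outputs is the "I am sync site" line: "(2 servers)" with an expiry before, "forever (1 server)" after. A small Python sketch for extracting that summary is below. It is an illustration based only on the two outputs in this message; the function name and the returned dictionary keys are made up for the example.

```python
import re

# Captures the text after "I am sync site" and the server count in the
# trailing "(N server)" / "(N servers)". \s+ and DOTALL tolerate a line
# break introduced by mail wrapping.
SYNC_RE = re.compile(r"I am sync site\s+(.*?)\((\d+)\s+servers?\)", re.DOTALL)


def sync_site_summary(udebug_text):
    """Return {'servers': int, 'forever': bool}, or None if no sync-site line."""
    m = SYNC_RE.search(udebug_text)
    if m is None:
        return None
    return {
        "servers": int(m.group(2)),
        "forever": m.group(1).strip().startswith("forever"),
    }


before = "I am sync site until 47 secs from now (at Fri Jul 26 10:27:13 2024) (2\nservers)"
after = "I am sync site forever (1 server)"

print(sync_site_summary(before))
print(sync_site_summary(after))
```

On a cell that is meant to have exactly one database server, a server count other than 1 is worth investigating.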
Thanks, Jeffrey, for pointing me in the right direction!
(and hopefully someone can learn from my bumbling here :-) )
Dipl. Chem. Dr. Stephan Wonczak
Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
Universitaet zu Koeln, Weyertal 121, 50931 Koeln
Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625