We're seeing a similar issue. We recently migrated all of our dafileservers to 1.8.6 (the three db servers are still on 1.6.24). We're running CentOS 7.9 (kernel 3.10.0-1160.2.2), and these are all VMs on VMware.
The db servers appear to be okay (vos listvldb works, udebug shows recovery state 1f), and the fileservers still *seem* to be serving content (could be cached), but a 'vos partinfo localhost -localauth' returns:

Could not fetch the list of partitions from the server
Possible communication failure
Error in vos listpart command.
Possible communication failure

even though the underlying storage is attached, and 'find /vicepa -ls' can traverse the vice mount and hasn't returned any errors.

I restarted the afs processes on one server, and post restart I'm seeing the following in FileLog:

> Thu Jan 14 07:17:34 2021 File server has terminated normally at Thu Jan 14 07:17:34 2021
> Thu Jan 14 07:17:34 2021 File server starting (/usr/afs/bin/dafileserver -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000 -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:19:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
> Thu Jan 14 07:24:35 2021 File server starting (/usr/afs/bin/dafileserver -L -p 256 -vattachpar 8 -vc 32768 -s 10000 -l 20000 -hr 1 -cb 5000000 -nobusy -udpsize 524288 -rxpck 800 -b 16000)
> Thu Jan 14 07:26:55 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
> Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> Thu Jan 14 07:34:55 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> Thu Jan 14 07:37:10 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
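For comparing FileLog output across several fileservers, it may help to tally which call failed with which numeric code. The following is only a rough sketch, not an OpenAFS tool: the names `tally_failures`, `classify`, `SAMPLE` and the label "GetCPS" are made up for illustration, and it assumes each log entry sits on a single line (mail-wrapped entries would need to be rejoined first). The sample lines are copied from the FileLog excerpt above.

```python
import re
from collections import Counter

# Matches the numeric "code=<n>" field seen in the FileLog entries,
# e.g. "(code=5377, err=0)" or "code=-1."
CODE_RE = re.compile(r"code=(-?\d+)")

# Entries taken from the FileLog excerpt, unwrapped to one entry per line.
SAMPLE = """\
Thu Jan 14 07:19:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5377, err=0)
Thu Jan 14 07:26:55 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
Thu Jan 14 07:30:25 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
Thu Jan 14 07:32:40 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
"""


def classify(line):
    """Return an illustrative label for the kind of failure on this line."""
    if "VL_RegisterAddrs" in line:
        return "VL_RegisterAddrs"
    if "Couldn't get CPS" in line:
        return "GetCPS"  # hypothetical label, not an OpenAFS identifier
    return "other"


def tally_failures(log_text):
    """Count (failure kind, numeric code) pairs across all log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CODE_RE.search(line)
        if match:
            counts[(classify(line), int(match.group(1)))] += 1
    return counts


if __name__ == "__main__":
    for (kind, code), seen in sorted(tally_failures(SAMPLE).items()):
        print(f"{kind:18} code={code:<6} seen {seen}x")
```

In practice the text would come from the FileLog file itself rather than from `SAMPLE`. The numeric codes can then be looked up separately, for example with translate_et where it is installed.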
The dasalvager process keeps exiting (exit code 1), and SalsrvLog shows:

> Thu Jan 14 08:22:57 2021 @(#)OpenAFS 1.8.6 2020-07-15 [email protected]
> Thu Jan 14 08:22:57 2021 Starting OpenAFS Online Salvage Server 2.4 (/usr/afs/bin/salvageserver)
> Thu Jan 14 08:23:43 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)
> Thu Jan 14 08:23:59 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)
> Thu Jan 14 08:24:23 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)
> Thu Jan 14 08:24:55 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)
> Thu Jan 14 08:25:35 2021 SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)
> SYNC_connect failed (giving up!): Connection refused
> Thu Jan 14 08:26:23 2021 Unable to connect to file server; aborted

Really at a loss as to what else to look for.

Best regards,
k-

On Thu, Jan 14, 2021 at 7:45 AM Valtteri Vuorikoski <[email protected]> wrote:
>
> I have a small OpenAFS 1.8.6 setup using the Debian and Ubuntu packages. Last night everything was working fine, this morning machines were timing out trying to talk to volume servers. Database replication was also stuck.
>
> While there is a single backup database and file server, databases and volumes are primarily on a single server. I logged in to that server ("afs1"), made it the only machine in the cell by editing client and server CellServDB and set out trying to restore things.
>
> afs1 is running Debian bullseye. Kernel 5.8 (running at the time when things broke) and 5.10 result in an equally non-functional system. There are no iptables rules on the system.
>
> OpenAFS is almost 100% dead for no apparent reason:
>
> - "pts listentries" and "vos listvldb localhost" work. udebug shows both servers in recovery state 1f, site is sync site and there are no replicas (as expected at this point).
>
> - After restarting services, vos status -localauth -server localhost prints the following:
>
> Could not access status information about the server
> Possible communication failure
> Error in vos status command.
> Possible communication failure
>
> - After a while, vos status no longer prints anything, just hangs. All AFS client access times out.
>
> - There is mostly nothing in the logs. Starting vlserver/ptserver/dafileserver with -d 125 doesn't lead to any extra output. Nothing out of the ordinary (except AFS client errors) appears in dmesg or journalctl -b. After starting dafileserver -L, the following log appears:
>
> Thu Jan 14 11:59:54 2021 File server starting (/usr/lib/openafs/dafileserver -L)
> Thu Jan 14 11:59:54 2021 VL_RegisterAddrs rpc failed; will retry periodically (code=5376, err=0)
> Thu Jan 14 12:01:04 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> Thu Jan 14 12:02:09 2021 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
> [the last message keeps repeating]
>
> - dasalvager appears to run successfully. I'm currently running a voldump to recover data and it's running fine so far. There is plenty of disk space.
>
> - Kerberos appears to be working. kinit works, aklog works, pts/vos commands without -localauth work when a superuser token is present. KDC (Samba) doesn't show any problems related to the afs principal. Clocks are accurate.
>
> - Rebooting the whole system (a qemu VM) makes no difference.
>
> After four hours of debugging, I'm at the end of my wits. Even temporarily removing all databases, restarting ptserver and vlserver and touching NoAuth won't make fileserver/volserver happy. It seems like RX communication is failing somehow, but I have no idea why.
>
> Any ideas what's going on here?
>
> -Valtteri

--
Kendrick Hernandez
*UNIX Systems Administrator*
Division of Information Technology
University of Maryland, Baltimore County
