[Hampshire] Repeated server crash overnight
Hi all

I have an Ubuntu box which is on 24/7/365. It has ufw running allowing nothing from outside my LAN.

A couple of times recently, I've come in to find the machine locked up with a lot of disk access (it can be ping'd but I can't ssh into it and it doesn't respond to mouse or keyboard on the console - only power cycling brings it back). As I say, this has now happened twice in the last 3-4 nights.

It may have been hacked (but I doubt it looking at kern.log and auth.log - and I'm behind a NAT router with no ports open). Does anyone know if Ubuntu (Jammy) does some indexing or some other regular task overnight? The reason I ask is I'm wondering if it's said indexing that's crashed the (very old) system. It's fine for a file server but not really fit for anything else. Incidentally, I've checked crontab and there's nothing in there.

Anything else I should be checking?

Cheers
Rob

--
Please post to: Hampshire@mailman.lug.org.uk
Web Interface: https://mailman.lug.org.uk/mailman/listinfo/hampshire
LUG URL: http://www.hantslug.org.uk
--
Re: [Hampshire] Repeated server crash overnight
Hi Rob,

Anything in /etc/cron.daily which would run at or about midnight? Or any files in /var/spool/cron/crontabs or /etc/cron.d? Or even a self-rescheduling "at" job? ("sudo atq" will list any pending jobs.)

On 13/03/2023 08:02, rmluglist2--- via Hampshire wrote:
> [snip]
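The locations listed in this reply can be checked in one pass. This is only a minimal sketch assuming a stock Ubuntu layout: list_jobs_in is a made-up helper name, and the systemd timer check in the usage comments is an addition not mentioned in the thread (on systemd-based releases some periodic work may be scheduled by timers rather than cron).

```shell
#!/bin/sh
# list_jobs_in is a hypothetical helper: print every file found in each
# directory given as an argument, skipping directories that do not exist.
list_jobs_in() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}

# Usage (run as root so /var/spool/cron/crontabs is readable):
#   list_jobs_in /etc/cron.d /etc/cron.daily /etc/cron.weekly /var/spool/cron/crontabs
#   atq                           # pending "at" jobs
#   systemctl list-timers --all   # systemd timers with their next and last run
```

The times at which the cron.daily scripts actually start are set in /etc/crontab (or by anacron, if that is installed), so it is worth reading that file as well.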
Re: [Hampshire] Repeated server crash overnight
Hi Rob,

You didn't say if you had checked /var/log/syslog. Is there anything indicative of the issue there?

The only indexing task I can think of is updatedb for locate, which I think is a cron.daily thing - haven't used Ubuntu for a few years so may be wrong.

Which filesystem(s)?

Do you have anything like recoll installed?

Best wishes,
Gareth

> On 13 Mar 2023, at 08:03, rmluglist2--- via Hampshire wrote:
> [snip]
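Because the box had to be power cycled, the log entries written just before each boot are the ones most likely to be indicative. A rough sketch, assuming the kernel's first boot message (which contains the text "Linux version") ends up in the log being searched; the function name is hypothetical.

```shell
#!/bin/sh
# lines_before_boot (hypothetical name): print the N lines that precede each
# kernel boot banner in a log file, followed by the banner line itself.
# The banner is matched on the text "Linux version".
lines_before_boot() {
    log="$1"
    n="${2:-20}"
    grep -B "$n" 'Linux version' "$log"
}

# Usage:
#   lines_before_boot /var/log/syslog 40
#   lines_before_boot /var/log/kern.log 40
```

If the lock-up happened before the last log rotation, the same call can be pointed at the rotated file (for example /var/log/syslog.1).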
Re: [Hampshire] Repeated server crash overnight
[snip]
> /var/log/syslog
>
> Is there anything indicative of the issue there?

Nothing that I can see. All I can tell is something called freshclam which I’d never even heard of. Ufw is blocking a lot of requests – but only from two media clients (box in question is my media server) so I don’t think it’s that.

[snip]
> Which filesystem(s)?

I’m assuming it’s / - how do I tell?

> Do you have anything like recoll installed?

No. Never heard of it.

By the looks of it, it’s something to do with:

[system] Failed to activate service 'org.freedesktop.nm_dispatcher': timed out (service_start_timeout=25000ms)

This (from auth.log) is the only thing I can see which isn’t to do with local media clients (minidlna etc).

Cheers
Rob
Re: [Hampshire] Repeated server crash overnight
> On 13 Mar 2023, at 13:31, rmluglist2--- via Hampshire wrote:
>
> [snip]
> Nothing that I can see. All I can tell is something called freshclam which
> I’d never even heard of.

That's the automatic updater for ClamAV (antivirus) definitions/sigs etc.

> [snip]
> > Which filesystem(s)?
>
> I’m assuming it’s / - how do I tell?

Sorry, I meant: ext4? Btrfs? Other?

> [snip]
> By the looks of it, it’s something to do with:
> [system] Failed to activate service 'org.freedesktop.nm_dispatcher': timed out (service_start_timeout=25000ms)

I can't find much that's instructive about that error in isolation from a quick Google/DDG search. Does "sudo journalctl -b -1" (the journal of the previous boot, i.e. the one that locked up) show similar issues, and anything near it?

> [snip]
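On the "how do I tell?" question: the filesystem type behind a mount point can be read from df. A small sketch; fs_type is a made-up helper name, not an existing command.

```shell
#!/bin/sh
# fs_type (hypothetical helper): print the type of the filesystem that holds
# the given path (defaults to /).
# df -T adds the type column; -P keeps each filesystem on a single line.
fs_type() {
    df -PT "${1:-/}" | awk 'NR == 2 { print $2 }'
}

# Usage:
#   fs_type /        # type of the root filesystem (ext4, btrfs, ...)
#   fs_type /home
```

For the journal, "journalctl --list-boots" shows which boots are stored, and "journalctl -b -1 -p warning" limits the previous boot to warnings and worse. Both rely on the journal being kept across reboots, which is an assumption about this machine.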
Re: [Hampshire] Repeated server crash overnight
On Mon, 13 Mar 2023 at 08:03, rmluglist2--- via Hampshire <hampshire@mailman.lug.org.uk> wrote:
> [snip]
> A couple of times recently, I’ve come in to find the machine locked up
> with a lot of disk access (it can be ping’d but I can’t ssh into it and it
> doesn’t respond to mouse or keyboard on the console – only power cycling
> brings it back). As I say, this has now happened twice in the last 3-4
> nights.

I have seen this behaviour sometimes. By default, Linux can block all interactive sessions when disk access is very heavy.

High disk access can be caused by a number of things:
1) Some app actually needs the disk.
2) Faults on the disk, causing many retries.
3) Swap file access.

After a reboot, you can look for faults on the disk with "smartctl -a /dev/sda" and see if there are any log messages there about failed sectors, sector reallocation counts increasing, etc.

If an app needs the disk, it is probably something kicked off by cron. You can force these apps to use a lower priority for I/O with "ionice". Google ionice for suitable ways to run it.

But I think a good diagnosis is probably to disable cron altogether for, say, a week and see if the problem disappears. Then at least you will know whether cron and the apps it runs are the problem.

Another possible cause is an app making the system run low on memory, which results in unpredictable behaviour when memory allocation fails; a lot of programs don't seem to behave well when that happens. This might also cause excessive swap file access.

These are all problems that are difficult to diagnose while they are happening, so the trick is to set up monitoring to watch for each of the cases. E.g. take metrics of free RAM, and when the fault happens you can look at the metrics graph to see if that is the problem. Likewise, take metrics of the disk access on a per-app basis.

Normally the lock-up will not be immediate; the machine will get slow first and then eventually lock up. So at least some metrics are written before the lock-up.

Kind Regards
James
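The free-RAM metric suggested here can be collected without installing anything. A minimal sketch, assuming a Linux /proc filesystem; the function name and the log path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# mem_snapshot (hypothetical name): print one line holding a timestamp, the
# 1-minute load average, and available memory and free swap in kB.
mem_snapshot() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    mem=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    swap=$(awk '/^SwapFree:/ { print $2 }' /proc/meminfo)
    printf '%s load=%s mem_available_kB=%s swap_free_kB=%s\n' \
        "$ts" "$load" "$mem" "$swap"
}

# Usage: append a snapshot every minute and flush it to disk, so that the
# last readings survive a power cycle (the log path is only an example):
#   while true; do mem_snapshot >> /var/tmp/mem.log; sync; sleep 60; done
```

If available memory and free swap both fall steadily in the minutes before the log stops, memory exhaustion is the likely cause. Per-app disk figures need an extra tool; "pidstat -d 60" from the sysstat package is one option, assuming it is installed. For the ionice suggestion, "ionice -c 3 <command>" runs a command in the idle I/O class, though whether that has any effect depends on the I/O scheduler in use.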
Re: [Hampshire] Repeated server crash overnight
G'day all,

On 13/03/2023 14:32, James Dutton via Hampshire wrote:
> [snip]
> But, I think a good diagnosis is probably to disable cron altogether for
> say a week, and see if the problem disappears. Then at least you will then
> know that cron and the apps it runs are the problem.

I've seen this behaviour with ClamAV; in the end I had to remove it. The database gets to a certain point where it won't fit in memory along with the rest of the system; swap doesn't help, you'd need to add RAM to accommodate it.

https://unix.stackexchange.com/questions/114709/how-to-reduce-clamav-memory-usage/278110

> [snip]

HTH
Brad
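To check whether a single process (clamd, for instance) is holding most of the RAM, resident memory can be listed per process. A rough sketch that reads /proc directly, so it assumes Linux; top_rss is a hypothetical helper name.

```shell
#!/bin/sh
# top_rss (hypothetical helper): list the N processes with the largest
# resident set size, as "<rss in kB> <process name>", biggest first.
# Kernel threads have no VmRSS line and are skipped.
top_rss() {
    n="${1:-5}"
    for f in /proc/[0-9]*/status; do
        # A process may exit while the loop runs; ignore the read error.
        awk '/^Name:/ { name = $2 }
             /^VmRSS:/ { rss = $2 }
             END { if (rss) print rss, name }' "$f" 2>/dev/null
    done | sort -rn | head -n "$n"
}

# Usage:
#   top_rss 10
```

Comparing the largest figure against the MemTotal line in /proc/meminfo gives a quick idea of whether that process can fit alongside everything else on the box.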