APIC-Errors+Crashes on GA 586DX, 2.2.17/2.4.3
Hi,

I am using a dual Pentium MMX 200MHz on a Gigabyte 586DX motherboard (with the Intel 430HX chipset) as my Internet gateway. Over the last year I ran Linux 2.2.16 and 2.2.17 on it and had several hangs of the network and ISDN subsystems.

The network dies like this (when copying huge amounts of data):

Feb 3 11:58:03 violin kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
Feb 3 11:58:03 violin kernel: Flags; bus-master 1, full 0; dirty 16 current 16.
Feb 3 11:58:03 violin kernel: Transmit list vs. c13dfa00.
Feb 3 11:58:03 violin kernel: 0: @c13dfa00 length 806e status 0001006e
...

and the ISDN subsystem like this:

Jan 27 00:41:15 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
Jan 27 00:41:28 violin kernel: ippp0: dialing 2 194040...
Jan 27 00:41:28 violin kernel: isdn: HiSax,ch0 cause: E001B

Although there is no direct hint of an APIC problem, I read in several newsgroup articles that these two errors point to APIC errors. The system is still usable after such an error, except that eth0/ISDN is not accessible, even if I reload the modules. The only solution is a reboot.

Some days ago I tried switching to 2.4.3, hoping that these errors would be gone. The first thing I noticed was that I got thousands of lines like this:

Apr 22 16:19:31 violin kernel: APIC error on CPU0: 04(00)

At first I ignored them and started to test the stability by copying huge amounts of data via NFS over eth0. After around 500MB (and 45000 APIC errors!) the ISDN subsystem died:

Apr 18 16:32:12 violin kernel: isdn_tx_timeout dev ippp0 dialstate 0
Apr 18 16:32:12 violin kernel: ippp0: all channels busy - requeuing!
...

After around 10 minutes the whole system crashed, leaving this in the syslog:

Apr 18 16:59:15 violin kernel: APIC error on CPU1: 02(00)
Apr 18 16:59:15 violin kernel: APIC error on CPU0÷230Ý:'217û8226^C^@^@^ A^@H^@R*¬Öf^Máj225L`cÛ½235é^A^@^@^@203$^A^@Å!^C9Å!^C9º þ8^A^@+¢ý8,¢ý8÷230Ý:"217û8^^^K^@^@^A^@H^@aÑõ2009¢B1kp^BĬ:¼
...
I compiled some statistics on the occurrence of the APIC errors (message, then count):

APIC error on CPU0: 01(00)   650
APIC error on CPU0: 02(00)  9037
APIC error on CPU0: 04(00)   916
APIC error on CPU0: 06(00)     1
APIC error on CPU0: 06(04)     1
APIC error on CPU0: 08(00)  5369
APIC error on CPU0: 08(02)     1
APIC error on CPU0: 09(00)     3
APIC error on CPU0: 0a(00)     4
APIC error on CPU0: 0c(00)     1
APIC error on CPU0: 40(00)    60
APIC error on CPU0: 48(00)    23
APIC error on CPU1: 00(00)    13
APIC error on CPU1: 01(00)  5398
APIC error on CPU1: 02(00)  9533
APIC error on CPU1: 02(02)     5
APIC error on CPU1: 04(00)  6861
APIC error on CPU1: 08(00)  7836
APIC error on CPU1: 08(01)     7
APIC error on CPU1: 08(02)     1
APIC error on CPU1: 08(04)     1
APIC error on CPU1: 08(08)     1
APIC error on CPU1: 09(00)     5
APIC error on CPU1: 0a(00)    17
APIC error on CPU1: 0c(00)    23

Following advice that Donald Becker gave in some newsgroup, I rebooted the kernel with the "noapic" parameter. The strange thing is that the APIC errors are still there, although there are far fewer than before; the system seems slower but at least more stable. BTW, why are there still APIC errors although no interrupts are assigned to CPU1 (as seen in /proc/interrupts)?

I next tried to find out what triggers these APIC errors:

Without the "noapic" kernel parameter: the errors are triggered by a certain amount of interrupts, whichever device produces them.

With "noapic": the errors seem to be triggered mostly by NFS. When I copy the same amount of data with FTP, there are far fewer errors (e.g. for 500MB there are 40 with NFS and only 2 with FTP).

What I wonder is why Linux outputs lines like these (with noapic):

<4>Intel MultiProcessor Specification v1.1
<4>Virtual Wire compatibility mode.

although the board seems to be capable of MPS 1.4 (as there is a BIOS option "MPS 1.4 for single Processor").
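A tally like the one above can be produced mechanically from the syslog. The following Python sketch is an illustration and not part of the original post: the names `tally` and `decode` are made up, and the bit names follow the usual description of the local APIC Error Status Register (ESR). It assumes that the first two-digit value in each message is the ESR content at the time of the error, which should be checked against the kernel source in use.

```python
import re
from collections import Counter

# Assumed meaning of the ESR bits (bit number -> description).
ESR_BITS = {
    0: "send checksum error",
    1: "receive checksum error",
    2: "send accept error",
    3: "receive accept error",
    4: "reserved",
    5: "send illegal vector",
    6: "received illegal vector",
    7: "illegal register address",
}

# Matches e.g. "APIC error on CPU0: 04(00)" anywhere in a syslog line.
APIC_RE = re.compile(r"APIC error on CPU(\d+): ([0-9a-f]{2})\(([0-9a-f]{2})\)")


def tally(lines):
    """Count APIC error messages per (cpu, first value, second value)."""
    counts = Counter()
    for line in lines:
        match = APIC_RE.search(line)
        if match:
            cpu, first, second = match.groups()
            counts[(int(cpu), first, second)] += 1
    return counts


def decode(code_hex):
    """Return the names of the ESR bits set in a two-digit hex code."""
    value = int(code_hex, 16)
    return [name for bit, name in ESR_BITS.items() if value & (1 << bit)]
```

Under that assumption, the most frequent codes in the table would decode as `decode("02")` giving a receive checksum error and `decode("08")` giving a receive accept error, i.e. corrupted or rejected messages on the APIC bus rather than a software fault.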
The following hardware is in the system: 3com905b, ISDN AVM Fritz!, 128MB RAM, IBM 36GB HD, some SCSI devices (HD, CDROM, tape in an external case) and a very old monochrome graphics card. Perhaps this old graphics adapter is a problem?

It would be somewhat sad to give away the board or use it as a single-CPU board, so do you perhaps have any clue how to get rid of these problems? If you need any further information that would help to fix this, please tell me.

Best Regards,
Hermann Himmelbauer

--
 ,_,
(O,O)   "There is more to life than increasing its speed."
( )     -- Gandhi
-"-"--

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: APIC errors on 2.4.4
Andy Arvai wrote:
>
> Hi,
>
> I'm having IO-APIC errors with 2.4.4. I spent some time searching the
> web to understand more about this problem and I'm still not sure if
> it is a hardware problem on the motherboard or a problem with the
> kernel. I will try the noapic boot option, but are there any
> patches that might fix this? Here are some of the errors I was
> getting:
>
> May 15 22:47:43 rad kernel: APIC error on CPU0: 02(01)
> May 15 22:48:00 rad kernel: APIC error on CPU0: 01(02)

This is a hardware error. I also have a buggy motherboard (586DX) and experience the same problems. Look at /proc/interrupts; there you can see how many errors have been detected.

The problem is that double errors are not detected - these can lead to system crashes or block network/ISDN cards. If there are only a few APIC errors, it is very unlikely that a double error occurs; if there are very many, the probability is high.

It seems that quite a lot of motherboards have a buggy APIC. There is a patch by Alan Cox in the 2.4.4-ac series that does something about this problem, but I do not know what exactly.

The only thing you can do is boot with the "noapic" option and disable IO interrupts on the second CPU (I assume you have an SMP system?). This will reduce the number of errors, but there is also a performance decrease.

Regards,
Hermann
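As a rough illustration of the /proc/interrupts check mentioned above, the following Python sketch pulls the error counter out of that file's text. It assumes the kernel prints the count on a line labelled "ERR:", which should be verified on the system in question; the function name `apic_error_count` and the sample text are invented for this example.

```python
def apic_error_count(proc_interrupts_text):
    """Return the number on the "ERR:" line, or None if there is no such line."""
    for line in proc_interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "ERR":
            fields = rest.split()
            if fields and fields[0].isdigit():
                return int(fields[0])
    return None


# Made-up sample in the general shape of /proc/interrupts, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0          XT-PIC  timer
NMI:          0
ERR:         42
"""

print(apic_error_count(SAMPLE))
```

On a live system the text would come from reading /proc/interrupts; sampling it twice during a large copy gives a feel for how quickly the errors accumulate.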
Curious boot problem: linux-2.4.4SMP/AsusTek MB
Hi,

A friend of mine uses linux-2.4.4 on his nice dual P3-1000 machine (AsusTek motherboard). The problem is that the kernel gets stuck at the initialization of the APIC during boot. The really weird thing is that if he boots linux-2.2.18 first and 2.4.4 afterwards, the kernel boots properly. So it seems that kernel 2.2 does some "initialization" that helps 2.4.4 to boot.

Does anyone have a clue?

Best Regards,
Hermann
Re: correctable ECC error
root wrote:
> Kernel gurus,
> Is this behavior all familiar to you? If so, please tell me whether
> my memory is failing or not. During the initial machine check before
> the SRM console prompt, I get Memory OK all the time.

Have a look at: http://reality.sgi.com/cbrady_denver/memtest86/

IMHO this is one of the best memory tests I have ever seen.

Regards,
Hermann
Interrupted sound with 2.4.4-ac6
Hi,

I built a nice mp3 player out of an AMD 486-DX133 and a Soundblaster es1371. I always used 2.2.16 and it worked properly. For several reasons I want to switch to 2.4, so I tried my luck with 2.4.4-ac6.

Basically it works, but the sound gets interrupted (around 0.5 - 5 seconds of silence) from time to time although the system is around 44% idle. This never happened with 2.2.16. The interruptions are not periodic: sometimes there are none for a minute, sometimes a single interruption lasts around 5 seconds. I have to state that the data comes from an NFS-mounted directory - perhaps this is a reason?

Another interesting thing is that during those interruptions the processor usage of mpg123 decreases from 53% to around 20%, so it looks as if mpg123 either cannot get data or cannot output data.

Do you have any clues? Are there perhaps some kernel parameters to tune (buffer size, DMA...)?

Best Regards,
Hermann
Interrupted sound with 2.4.4-ac6
Hi,

First, I apologize if you receive this mail twice: I sent it yesterday but never got it back from the list, so it seems to have vanished for some reason and I am trying a second time.

I built a nice mp3 player out of an AMD 486-DX133 and a Soundblaster es1371. I always used 2.2.16 and it worked properly. For several reasons I want to switch to 2.4, so I tried my luck with 2.4.4-ac6.

Basically it works, but the sound gets interrupted (around 0.5 - 5 seconds of silence) from time to time although the system is around 44% idle. This never happened with 2.2.16. The interruptions are not periodic: sometimes there are none for a minute, sometimes a single interruption lasts around 5 seconds. I have to state that the data comes from an NFS-mounted directory - perhaps this is a reason? (There is no network load.)

Another interesting thing is that during those interruptions the processor usage of mpg123 decreases from 53% to around 20%, so it looks as if mpg123 either cannot get data or cannot output data.

Do you have any clues? Are there perhaps some kernel parameters to tune (buffer size, DMA...)?

Best Regards,
Hermann
Random Freezes: Kernel 5.4.34 / "iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY" on Ryzen System (ASRock X470D4U2-2T)
Dear Linux developers,

After searching the internet for countless hours and not finding any solution, I hope you can point me in the right direction for the following problem:

I'm trying to install a 3-node cluster (proxmox/ceph) and experience random freezes. The node can either be completely frozen (no blinking cursor on console, no ping) or can get somewhat blocked / slow etc. This happens most often on node 2 (approx. 3-4 times / day); node 3 never got stuck within 14 days of runtime, node 1 once.

Unfortunately I did not find any way to trigger this behaviour; however, I *think* that it happens most often if I stress the machine in some way (performance test within a virtual machine) and then let it idle.

When the machine freezes completely, there is no logfile. However, if it is partially frozen, the following can be found in the kernel log (dmesg -T):

--- snip --
[Sat Aug 22 21:18:59 2020] perf: interrupt took too long (3134 > 3127), lowering kernel.perf_event_max_sample_rate to 63750
[Fri Aug 28 16:55:06 2020] iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=2b:00.0 pasid=0x0 address=0x2420 flags=0x0080]
[Fri Aug 28 16:55:06 2020] AMD-Vi: DTE[0]: 6007e10a8603
[Fri Aug 28 16:55:06 2020] AMD-Vi: DTE[1]: 0009
[Fri Aug 28 16:55:06 2020] AMD-Vi: DTE[2]: 2007e0d6a011
[Fri Aug 28 16:55:06 2020] AMD-Vi: DTE[3]:
[Fri Aug 28 17:04:34 2020] iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=2b:00.0 pasid=0x0 address=0x5bf0 flags=0x0080]
[Fri Aug 28 17:04:34 2020] AMD-Vi: DTE[0]: 6007e10a8603
[Fri Aug 28 17:04:34 2020] AMD-Vi: DTE[1]: 0009
[Fri Aug 28 17:04:34 2020] AMD-Vi: DTE[2]: 2007e0d6a011
[Fri Aug 28 17:04:34 2020] AMD-Vi: DTE[3]:
[...] same message above multiple times [...]
[Fri Aug 28 17:16:46 2020] BUG: unable to handle page fault for address: ffb8
[Fri Aug 28 17:16:46 2020] #PF: supervisor write access in kernel mode
[Fri Aug 28 17:16:46 2020] #PF: error_code(0x0002) - not-present page
[Fri Aug 28 17:16:46 2020] PGD 2d080e067 P4D 2d080e067 PUD 2d0810067 PMD 0
[Fri Aug 28 17:16:46 2020] Oops: 0002 [#1] SMP NOPTI
[Fri Aug 28 17:16:46 2020] CPU: 0 PID: 1472 Comm: corosync Tainted: P OE 5.4.34-1-pve #1
[Fri Aug 28 17:16:46 2020] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U2-2T, BIOS P3.30 10/03/2019
[Fri Aug 28 17:16:46 2020] RIP: 0010:hrtimer_nanosleep+0x9c/0x190
[Fri Aug 28 17:16:46 2020] Code: 39 c8 0f 8f e1 00 00 00 48 69 c0 00 ca 9a 3b 48 01 d0 48 89 c2 48 89 45 b0 4c 01 e2 0f 88 8d 00 00 00 48 39 c2 0f 8c 84 00 00 <00> 48 b8 ff ff ff ff ff ff ff 7f 48 39 f2 48 0f 4c d0 89 de 4c 89
[Fri Aug 28 17:16:46 2020] RSP: 0018:ac97812a3e88 EFLAGS: 00010286
[Fri Aug 28 17:16:46 2020] RAX: RBX: 0001 RCX:
[Fri Aug 28 17:16:46 2020] RDX: 97092e81dfc0 RSI: RDI: ac97812a3e88
[Fri Aug 28 17:16:46 2020] RBP: ac97812a3ef8 R08: 0338 R09: 0001
[Fri Aug 28 17:16:46 2020] R10: 0006 R11: 970905f54e00 R12:
[Fri Aug 28 17:16:46 2020] R13: ac97812a3e88 R14: ac97812a3f08 R15:
[Fri Aug 28 17:16:46 2020] FS: 7fb728e5a700() GS:97092e80() knlGS:
[Fri Aug 28 17:16:46 2020] CS: 0010 DS: ES: CR0: 80050033
[Fri Aug 28 17:16:46 2020] CR2: ffb8 CR3: 0007c65e8000 CR4: 003406f0
[Fri Aug 28 17:16:46 2020] Call Trace:
[Fri Aug 28 17:16:46 2020] ? hrtimer_init_sleeper+0x90/0x90
[Fri Aug 28 17:16:46 2020] __x64_sys_nanosleep+0x7b/0xa0
[Fri Aug 28 17:16:46 2020] do_syscall_64+0x57/0x190
[Fri Aug 28 17:16:46 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Aug 28 17:16:46 2020] RIP: 0033:0x7fb732eea720
[Fri Aug 28 17:16:46 2020] Code: 00 f0 ff ff 77 44 c3 0f 1f 00 55 48 89 f5 53 48 89 fb 48 83 ec 18 e8 3f 00 04 00 48 89 ee 48 89 df 89 c2 b8 23 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 d7 89 44 24 0c e8 7d 00 04 00 8b 44 24
[Fri Aug 28 17:16:46 2020] RSP: 002b:7fb728e59df0 EFLAGS: 0293 ORIG_RAX: 0023
[Fri Aug 28 17:16:46 2020] RAX: ffda RBX: 7fb728e59e20 RCX: 7fb732eea720
[Fri Aug 28 17:16:46 2020] RDX: RSI: RDI: 7fb728e59e20
[Fri Aug 28 17:16:46 2020] RBP: R08: 7fb72bffe618 R09:
[Fri Aug 28 17:16:46 2020] R10: 000e7c62 R11: 0293 R12: cccd
[Fri Aug 28 17:16:46 2020] R13: 0002 R14: 7fb728e5a700 R15: 0002
[Fri Aug 28 17:16:46 2020] Modules linked in: msr(E) ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6_tables(E) sctp(E) iptable_filter(E) bpfilter(E) bonding(E) softdog(E) nfnetlink_log(E) nfnetlink(E) edac_mce_amd(E) ipmi_ssif(E) kvm_amd(E) ccp(E) kvm(E) zfs(POE) zunicode(POE) zlua(PO
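When such events repeat many times, it can help to extract them from the `dmesg -T` output and group them by device, to see whether one PCI function is always involved. The following Python sketch is only an illustration: the names `parse_events` and `count_by_device` are invented here, and the regular expression is written against the two event lines shown in the log above, so other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the "AMD-Vi: Event logged [...]" lines quoted above.
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>[A-Z_]+)"
    r"\s+device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"\s+pasid=(?P<pasid>0x[0-9a-f]+)"
    r"\s+address=(?P<address>0x[0-9a-f]+)"
    r"\s+flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_events(lines):
    """Return one dict per AMD-Vi event line found in the given log lines."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_device(events):
    """Count events per (event type, PCI device address)."""
    return Counter((e["event"], e["device"]) for e in events)
```

In the log above every event names device 2b:00.0; `lspci -s 2b:00.0` would show which device sits at that address.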
Re: Random Freezes: Kernel 5.4.34 / "iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY" on Ryzen System (ASRock X470D4U2-2T)
Dear developer list,

As I found a solution, and as someone else may well have the same problem, I am posting it here:

I replaced the AMD Ryzen 3 3200G with an AMD Ryzen 5 3600 on one node and with an AMD Ryzen 3 3100 on the two other nodes; now the stability problem is gone.

I don't really know why, but I can think of two reasons:

1) The 3200G does not support ECC, but I use ECC RAM. Maybe this leads to errors (although intensive memory testing with memtest86 did not report anything).

2) The new CPUs do not have integrated graphics. I noticed that the two onboard 10GBit Ethernet adapters now have different PCI addresses with the new CPUs, and with the old CPUs there were problems with these 10G adapters malfunctioning.

As I recently heard from another user that his X470D4U2-2T based setup with a Ryzen 2200G is also unstable, it seems that the cause is (2) - so it seems best to avoid using this mainboard with AMD CPUs that have built-in graphics.

The ASRock Rack X470D4U2-2T is definitely stable now.

Best Regards,
Hermann

On 03.09.20 at 20:05, Hermann Himmelbauer wrote:

Dear Linux developers,

After searching the internet for countless hours and not finding any solution, I hope you can point me in the right direction for the following problem:

I'm trying to install a 3-node cluster (proxmox/ceph) and experience random freezes. The node can either be completely frozen (no blinking cursor on console, no ping) or can get somewhat blocked / slow etc. This happens most often on node 2 (approx. 3-4 times / day); node 3 never got stuck within 14 days of runtime, node 1 once.

Unfortunately I did not find any way to trigger this behaviour; however, I *think* that it happens most often if I stress the machine in some way (performance test within a virtual machine) and then let it idle.

When the machine freezes completely, there is no logfile.