Hello Ian,

your out-of-memory theory seems to be a step in the right direction.

It looks like the problems did not really start with the installation of the new packages, but with setting the Xen kernel parameter dom0_mem=1024M, which I did at approximately the same time as the upgrades. Now that I have removed this option, Dom0 has the complete 12GB to run in and the problem does not occur anymore. The domains are also suspended correctly after calling /etc/init.d/xendomains stop.

Possibly this is also the reason why I could not reproduce the problem with the non-Xen kernel: in that case the memory was not reduced to 1GB either, and the complete 12GB memory pool was used without any restrictions, so possibly the error could not occur.

Using dom0_mem=2048M is not enough to fix the problem for me either. I have tried it, but it also leads to the hang during domain suspension at shutdown. It works correctly only if I omit the dom0_mem parameter completely.

Free memory after increasing dom0_mem to 2048M:

                 total       used       free     shared    buffers     cached
    Mem:       2090832     448092    1642740          0     111600      90908
    -/+ buffers/cache:     245584    1845248
    Swap:       999416          0     999416

So there is basically no problem with the base memory amount; there is enough memory for everything.

Regarding the swiotlb parameter, I have found the following lines in kern.log from the previous reboots:

    Sep 13 17:15:13 alg-puv-xen-1 kernel: [ 3.105461] xen_swiotlb_fixup: buf=ffff880005711000 size=67108864
    Sep 13 17:15:13 alg-puv-xen-1 kernel: [ 3.126345] xen_swiotlb_fixup: buf=ffff880009771000 size=32768

(so the 64MB should be there), but these lines are always repeated there with the same values, regardless of whether dom0_mem was set to 1024M, 2048M or left unset. After I specified swiotlb=65536 on the line with the Xen kernel, the log showed the same thing as if I had done nothing (and the hangs during domain suspension remained).

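To compare the swiotlb size across reboots, the relevant kern.log lines can be pulled out with a short script. This is only a sketch: the regular expression is a guess at the message format based on the two lines quoted above, and the function name is made up for illustration.

```python
import re

# Pattern derived from the two kern.log lines quoted above;
# other kernel versions may format this message differently.
FIXUP_RE = re.compile(r"xen_swiotlb_fixup: buf=([0-9a-f]+) size=(\d+)")

def swiotlb_fixup_sizes(log_text):
    """Return (buffer address, size in bytes) for each xen_swiotlb_fixup line."""
    return [(m.group(1), int(m.group(2))) for m in FIXUP_RE.finditer(log_text)]

sample = (
    "Sep 13 17:15:13 alg-puv-xen-1 kernel: [ 3.105461] "
    "xen_swiotlb_fixup: buf=ffff880005711000 size=67108864\n"
    "Sep 13 17:15:13 alg-puv-xen-1 kernel: [ 3.126345] "
    "xen_swiotlb_fixup: buf=ffff880009771000 size=32768\n"
)

for buf, size in swiotlb_fixup_sizes(sample):
    # 2**20 bytes = 1 MiB
    print(f"buf={buf} size={size} bytes = {size / 2**20:g} MiB")
```

Run against the whole kern.log instead of the sample, this would show whether the reported size differs between boots.
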
If I put this parameter on the Linux kernel (module) line instead, then, as far as I can tell, it did not change the value in the log either:

    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.856096] Kernel command line: root=/dev/md0 ro console=tty0 vga=773 swiotlb=65536
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.856129] PID hash table entries: 4096 (order: 3, 32768 bytes)
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.856512] Initializing CPU#0
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.873864] DMA: Placing 128MB software IO TLB between ffff880005711000 - ffff88000d711000
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.873868] DMA: software IO TLB at phys 0x5711000 - 0xd711000
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.873871] xen_swiotlb_fixup: buf=ffff880005711000 size=134217728
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.915338] xen_swiotlb_fixup: buf=ffff88000d7d1000 size=32768
    Sep 13 18:15:32 alg-puv-xen-1 kernel: [ 3.924636] Memory: 1891528k/2097152k available (3141k kernel code, 432k absent, 205192k reserved, 1905k data, 592k init)

But the reboot went through without the crash! :-)

Where does the swiotlb parameter have to be applied so that the change of the swiotlb memory shows up in the logs?

So it works either when Dom0 runs in "balloon" mode, i.e. with dom0_mem omitted, or, if dom0_mem is specified, when swiotlb=65536 is specified as well.

I have noticed one interesting behavior: during a successful suspension of the domains at shutdown, the first domain being suspended writes three "dots" very quickly, then stops writing dots for some time, and then after a while a lot of (possibly all remaining) dots are written to the screen very quickly. The following suspensions proceed smoothly, dot by dot, without any delays. It looks like the first suspension is waiting for something (memory allocation?).

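On the question of whether swiotlb=65536 had any visible effect, a small calculation relates the parameter to the sizes in the logs. It assumes that the swiotlb value is a number of slabs of 2 KiB each, which is how I understand the kernel parameter documentation; I have not verified this for this particular kernel build. The helper name is invented for illustration.

```python
SLAB_BYTES = 2048  # assumed slab size (2 KiB); not verified for this kernel build

def swiotlb_bytes(slabs, slab_bytes=SLAB_BYTES):
    """Size in bytes of a bounce buffer made of `slabs` slabs."""
    return slabs * slab_bytes

requested = swiotlb_bytes(65536)
print(requested, "bytes =", requested // 2**20, "MiB")

# Compare with the xen_swiotlb_fixup sizes quoted from kern.log.
print(requested == 134217728)             # boot with swiotlb=65536 on the kernel command line
print(swiotlb_bytes(32768) == 67108864)   # earlier boots without the parameter
```

Under that assumption, the 134217728 bytes reported in the last log would correspond to 65536 slabs and the 67108864 bytes from the earlier boots to 32768 slabs, so the parameter may have taken effect after all when given on the Linux kernel command line.
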
Generally, I find it very surprising how the aacraid module works. I am no C or kernel developer, but I would expect that something like this cannot happen: the module should allocate the memory it needs at startup. I would also understand if some specific read or write operation failed because the software RAID does not have enough memory to execute it, but I would never expect this to lead to a hang and freeze of the whole system. This drastically increases the probability of data corruption, and especially so with RAID1, which in most cases is set up to achieve more data safety :-).

With regards,
Artur