Re: Process Hang in __read_seqcount_begin

2012-10-30 Thread Thomas Gleixner
On Tue, 30 Oct 2012, Peter LaDow wrote: > Anyway, based on earlier discussion, is there any reason not to use a > lock (presuming any solution properly takes into account possible > recursion)? I understand that the mainline is protected, but perhaps > in the RT version we can use seqlock (and pre

Re: Process Hang in __read_seqcount_begin

2012-10-30 Thread Peter LaDow
Ok. More of an update. We've managed to create a scenario that exhibits the problem much earlier. We can now cause the lockup to occur within a few hours (rather than the 12 to 24 hours in our other scenario). Our setup is to have a lot of traffic constantly being processed by the netfilt

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Thomas Gleixner
On Fri, 26 Oct 2012, Peter LaDow wrote: > On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet wrote: > If this were safe, we wouldn't be seeing this lockup and your patch > wouldn't be needed. So it seems that your patch doesn't really > address the issue that we are not "sure a thread cannot be interr

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Peter LaDow
On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet wrote: > Do you know what is per cpu data in linux kernel ? I sorta did. But since your response, I did more reading, and now I see what you mean. But I don't think this is a per cpu issue. More below. > Because it's not needed. Really I don't know

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Eric Dumazet
On Fri, 2012-10-26 at 11:51 -0700, Peter LaDow wrote: > (I've added netfilter and linux-rt-users to try to pull in more help). > > On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet wrote: > > Upstream kernel is fine, there is no race, as long as : > > > > local_bh_disable() disables BH and preemption

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Thomas Gleixner
On Fri, 26 Oct 2012, Peter LaDow wrote: > (I've added netfilter and linux-rt-users to try to pull in more help). > > On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet wrote: > > Upstream kernel is fine, there is no race, as long as : > > > > local_bh_disable() disables BH and preemption. > > Looking

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Peter LaDow
(I've added netfilter and linux-rt-users to try to pull in more help). On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet wrote: > Upstream kernel is fine, there is no race, as long as : > > local_bh_disable() disables BH and preemption. Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Eric Dumazet
On Fri, 2012-10-26 at 09:15 -0700, Peter LaDow wrote: > On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet wrote: > > Could you try following patch ? > > So, I applied your patch. And so far, it seems to have fixed the > issue. I've had my systems running for 48 hours, and no lockup in > iptables.

Re: Process Hang in __read_seqcount_begin

2012-10-26 Thread Peter LaDow
On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet wrote: > Could you try following patch ? So, I applied your patch. And so far, it seems to have fixed the issue. I've had my systems running for 48 hours, and no lockup in iptables. Usually, I could get a lockup to occur within 12 to 24 hours, and

Re: Process Hang in __read_seqcount_begin

2012-10-24 Thread Eric Dumazet
On Wed, 2012-10-24 at 09:30 -0700, Peter LaDow wrote: > On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet wrote: > > Could you try following patch ? > > Thanks for the suggestion. But I have a question about the patch below. > > > + /* Note : cmpxchg() is a memory barrier, we dont need smp_wmb

Re: Process Hang in __read_seqcount_begin

2012-10-24 Thread Peter LaDow
On Tue, Oct 23, 2012 at 9:32 PM, Eric Dumazet wrote: > Could you try following patch ? Thanks for the suggestion. But I have a question about the patch below. > + /* Note : cmpxchg() is a memory barrier, we dont need smp_wmb() */ > + if (old != new && cmpxchg(&ptr->sequence, old, new

Re: Process Hang in __read_seqcount_begin

2012-10-23 Thread Eric Dumazet
On Tue, 2012-10-23 at 17:15 -0700, Peter LaDow wrote: > (Sorry for the subject change, but I wanted to try and pull in those > who work on RT issues, and the subject didn't make that obvious. > Please search for the same subject without the RT Linux trailing > text.) > > Well, more information. E

Re: Process Hang in __read_seqcount_begin

2012-10-23 Thread Peter LaDow
(Sorry for the subject change, but I wanted to try and pull in those who work on RT issues, and the subject didn't make that obvious. Please search for the same subject without the RT Linux trailing text.) Well, more information. Even with SMP enabled (and presumably the migrate_enable having cal

Re: Process Hang in __read_seqcount_begin

2012-10-22 Thread Peter LaDow
> Now, is preemption required to be disabled in non-SMP systems? I did more digging, and I found this. In linux/netfilter/x_tables.h, there is the definition of xt_write_recseq_begin. This function updates the sequence number for the sequence locks. This is called in the iptables kernel code.

Re: Process Hang in __read_seqcount_begin

2012-10-22 Thread Peter LaDow
On Mon, Oct 22, 2012 at 10:01 AM, Eric Dumazet wrote: > This looks like a corruption of s->sequence, and its value is odd, even > if no writer is alive. > > Does local_bh_disable() disable preemption on RT ? Hmmm. With PREEMPT_RT_FULL defined (as we have): void local_bh_disable(void) {

Re: Process Hang in __read_seqcount_begin

2012-10-22 Thread Eric Dumazet
On Mon, 2012-10-22 at 09:46 -0700, Peter LaDow wrote: > I posted this problem some time back on the linux-rt-users and > netfilter lists. Since then, we thought we had a workaround to avoid > this problem, so we dropped the issue. But now 5 months later, the > problem has reappeared. And this ti

Process Hang in __read_seqcount_begin

2012-10-22 Thread Peter LaDow
I posted this problem some time back on the linux-rt-users and netfilter lists. Since then, we thought we had a workaround to avoid this problem, so we dropped the issue. But now 5 months later, the problem has reappeared. And this time it is much more serious and much more difficult to re-creat