Hi.
I tried to reproduce the errors in a virtual environment (some VMs on
my notebook).
I've tried to create 1000 client PPPoE sessions from this box via a script:

for i in `seq 1 1000`; do
    pppd plugin rp-pppoe.so user test password test \
        nodefaultroute maxfail 0 persist holdoff 1 noauth eth0
done
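A staggered variant of the same loop may make it easier to correlate a
crash with a particular session; unit and linkname are standard pppd
options, but check that your pppd build accepts them:

for i in `seq 1 1000`; do
    # give each session its own ppp unit number and pid-file name
    pppd plugin rp-pppoe.so user test password test \
        nodefaultroute maxfail 0 persist holdoff 1 noauth \
        unit $i linkname "pppoe$i" eth0
    sleep 0.1    # stagger session creation (use "sleep 1" if your
                 # sleep doesn't take fractions)
done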
And on the VM that is used as the client I got strange random crashes
(they occur only when the server is online, so they look network-related):
http://postimg.org/image/ohr2mu3rj/ - crash is here:
(gdb) list *process_one_work+0x32
0xc10607b2 is in process_one_work
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/workqueue.c:1952).
1947 __releases(&pool->lock)
1948 __acquires(&pool->lock)
1949 {
1950 struct pool_workqueue *pwq = get_work_pwq(work);
1951 struct worker_pool *pool = worker->pool;
1952 bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
1953 int work_color;
1954 struct worker *collision;
1955 #ifdef CONFIG_LOCKDEP
1956 /*
http://postimg.org/image/x9mychssx/ - crash is here (noticed twice):
0xc10658bf is in kthread_data
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:136).
131 * The caller is responsible for ensuring the validity of @task when
132 * calling this function.
133 */
134 void *kthread_data(struct task_struct *task)
135 {
136 return to_kthread(task)->data;
137 }
which is reached from a strange place:
(gdb) list *kthread_create_on_node+0x120
0xc1065340 is in kthread
(/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/kernel/kthread.c:176).
171 {
172 __kthread_parkme(to_kthread(current));
173 }
174
175 static int kthread(void *_create)
176 {
177 /* Copy data: it's on kthread's stack */
178 struct kthread_create_info *create = _create;
179 int (*threadfn)(void *data) = create->threadfn;
180 void *data = create->data;
And earlier:
(gdb) list *ret_from_kernel_thread+0x21
0xc13bb181 is at
/var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1/arch/x86/kernel/entry_32.S:312.
307 popl_cfi %eax
308 pushl_cfi $0x0202 # Reset kernel eflags
309 popfl_cfi
310 movl PT_EBP(%esp),%eax
311 call *PT_EBX(%esp)
312 movl $0,PT_EAX(%esp)
313 jmp syscall_exit
314 CFI_ENDPROC
315 ENDPROC(ret_from_kernel_thread)
316
Stack corruption?..
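For the next dumps it may be faster to pipe the whole oops through
scripts/decode_stacktrace.sh (in the tree since 3.16) instead of
decoding each frame in gdb by hand; roughly, with oops.txt being the
saved dmesg output:

./scripts/decode_stacktrace.sh vmlinux \
    /var/testpoint/LEAF/source/i486-unknown-linux-uclibc/linux/linux-4.1 \
    < oops.txt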
I'll try to set up a test environment on real hardware, and I'll also
test with older kernels.
On 22.11.2015 07:17, Alexander Duyck wrote:
On 11/21/2015 12:16 AM, Andrew wrote:
Memory corruption, if it happens, IMHO shouldn't be hardware-related -
almost all of these boxes, except the H61M-based box from the 1st log,
have worked for a long time with uptimes of more than a year, and only
the software on them was changed; the H61M-based box has run memtest86
for tens of hours without any errors. If it were caused by hardware,
they should have crashed even earlier.
I wasn't saying it was hardware-related. My thought is that it could
be some sort of use-after-free or double-free type issue. Basically
what you end up with is the memory getting corrupted by software that
is accessing regions it shouldn't be.
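One way to catch that class of bug on a 32-bit 4.1 kernel (KASAN is
x86_64-only there) would be SLUB poisoning, e.g.:

# kernel config (build time):
#   CONFIG_SLUB=y
#   CONFIG_SLUB_DEBUG=y
# kernel command line (boot time): enable sanity checks (F),
# red zoning (Z) and object poisoning (P):
#   slub_debug=FZP
# A use-after-free then usually shows up as 0x6b poison bytes in the
# oops, at the cost of some slab overhead.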
Occasionally I saw 'zram decompression error' messages on different
servers (in this case I got such a message on the H61M-based box).
Also, other people who use accel-ppp as BRAS software hit various
kernel panics/bugs/oopses on recent kernels.
I'll try to apply these patches, and I'll try to switch back to
kernels that were stable on some boxes.
If you could bisect this, it would be useful. Basically we just need
to determine where in the git history these issues started popping up
so that we can then narrow down on the root cause.
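For example, a typical run against mainline would look like this (the
good/bad tags below are placeholders - use whatever kernels you know
were stable/broken):

git bisect start
git bisect bad v4.1        # first kernel known to crash (placeholder)
git bisect good v3.18      # last kernel known to be good (placeholder)
# build and boot the kernel git checks out, run the PPPoE load test,
# then mark the result and repeat until git names the first bad commit:
git bisect good            # or: git bisect bad
git bisect reset           # when done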
- Alex