Re: [Discuss-gnuradio] segmentation fault in qa_constellation_receiver_test

Kelly Boswell Fri, 21 Feb 2014 08:31:36 -0800

Thank you, Tom. I'll try that after I'm off of work tonight. And thank you
for the great ideas, Nathan.

On Fri, Feb 21, 2014 at 2:39 AM, West, Nathan
<n...@ostatemail.okstate.edu> wrote:
> On Thu, Feb 20, 2014 at 11:25 PM, Kelly Boswell <krbosw...@gmail.com> wrote:
>> After the make test failed for this module, I decided to poke around to see
>> if there is an easy fix. I made a script that simply executes the test over
>> and over until it seg faults and exits after the core file is created.
>>
>> xxxxx@xxxx:~/src/gnuradio/build/gr-digital/python/digital$ ./runtests.sh
>> Using Volk machine: avx_64_mmx
>> Segmentation fault (core dumped)
>>
>> xxxxx@xxxx:~/src/gnuradio/build/gr-digital/python/digital$ gdb /usr/bin/python2.7 core
>> (gdb) bt
>> #0  0x00007fe8f627fb17 in volk_32fc_32f_dot_prod_32fc_a_avx ()
>>    from /home/kelly/src/gnuradio/build/volk/lib/libvolk.so.0.0.0
>> #1  0x00007fe8f52dd25f in gr::filter::kernel::fir_filter_ccf::filter(std::complex<float> const*) ()
>>    from /home/kelly/src/gnuradio/build/gr-filter/lib/libgnuradio-filter-3.8git.so.0.0.0
>> #2  0x00007fe8f143c45b in gr::digital::pfb_clock_sync_ccf_impl::general_work(int, std::vector<int, std::allocator<int> >&, std::vector<void const*, std::allocator<void const*> >&, std::vector<void*, std::allocator<void*> >&) ()
>>    from /home/kelly/src/gnuradio/build/gr-digital/lib/libgnuradio-digital-3.8git.so.0.0.0
>> #3  0x00007fe8f653809e in gr::block_executor::run_one_iteration() ()
>>    from /home/kelly/src/gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime-3.8git.so.0.0.0
>> #4  0x00007fe8f6573622 in gr::tpb_thread_body::tpb_thread_body(boost::shared_ptr<gr::block>, int) ()
>>    from /home/kelly/src/gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime-3.8git.so.0.0.0
>> #5  0x00007fe8f6565ea1 in boost::detail::function::void_function_obj_invoker0<gr::thread::thread_body_wrapper<gr::tpb_container>, void>::invoke(boost::detail::function::function_buffer&) ()
>>    from /home/kelly/src/gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime-3.8git.so.0.0.0
>> #6  0x00007fe8f6526610 in boost::detail::thread_data<boost::function0<void> >::run() ()
>>    from /home/kelly/src/gnuradio/build/gnuradio-runtime/lib/libgnuradio-runtime-3.8git.so.0.0.0
>> #7  0x00007fe8f9adc94a in ?? ()
>>    from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.53.0
>> #8  0x00007fe8fc8a3f6e in start_thread (arg=0x7fe8e2ffd700)
>>     at pthread_create.c:311
>> #9  0x00007fe8fc5ce9cd in clone ()
>>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>>
>> Of course, I had to recompile it with debugging info to glean anything
>> useful from the stack trace.  So, I did that and I traced the bug to this
>> line:
>>
>> c0Val = _mm256_mul_ps(a0Val, b0Val);
>>
>> I can't dump the values in a0Val or b0Val, though, because they're
>> intermediate values that are optimized away by the optimized kernel code.  I
>> tried stepping through the assembler instructions but I'm not familiar with
>> the various sse and avx extensions. Heck, I'm not even familiar with the
>> x86_64 instruction set.  So I have a huge learning curve ahead of me, there.
>> Is it possible to just dump the values in these __m256 data types to a file
>> so I can debug it that way?  If that's not easy to do, then I'm willing to
>> learn what I have to about the instruction set so I can debug this thing.
>> But I would sure appreciate some help if anyone has some advice to offer.
>>
>> Software version:
>> I rebased to the latest version of the next branch last night before I went
>> to bed at around 1:30 am CDT.
>>
>> Operating System:
>> kelly@octs2:~/src/gnuradio/volk/kernels/volk$ uname -a
>> Linux octs2 3.11.0-17-generic #31-Ubuntu SMP Mon Feb 3 21:52:43 UTC 2014
>> x86_64 x86_64 x86_64 GNU/Linux
>> It's Ubuntu 13.10
>>
>> Hardware: ASUS X750J
>> Intel Quad Core i7 4700HQ 2.4GHz
>>
>> cpuinfo:
>> processor    : 7
>> vendor_id    : GenuineIntel
>> cpu family    : 6
>> model        : 60
>> model name    : Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
>> stepping    : 3
>> microcode    : 0x8
>> cpu MHz        : 2401.000
>> cache size    : 6144 KB
>> physical id    : 0
>> siblings    : 8
>> core id        : 3
>> cpu cores    : 4
>> apicid        : 7
>> initial apicid    : 7
>> fpu        : yes
>> fpu_exception    : yes
>> cpuid level    : 13
>> wp        : yes
>> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
>> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
>> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est
>> tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb
>> xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
>> tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
>> bogomips    : 4789.27
>> clflush size    : 64
>> cache_alignment    : 64
>> address sizes    : 39 bits physical, 48 bits virtual
>> power management:
>>
>
>
> Hi Kelly,
>
> First, this is great debugging, thanks for getting so much info and
> trying to go for a fix on your own.
>
> On to the good stuff. I was able to reproduce this on my i7-4700MQ.
> Here's some additional info for the logs:
>
> * constellation_receiver is a hier block with a fir_filter_ccf inside
> that is calling the volk avx dot product.
> * The avx dot product proto-kernel passes VOLK QA
> * The qa_fir_filter.py is testing a fir_filter_ccf that passes its QA.
> * Just for kicks, I forced VOLK to use the generic kernel and I still
> see the segfault.
>
> A couple of things I'd like to try (and please feel free to give these a try):
> * Go back to a commit just before fir_filter.cc started using
> volk_malloc and volk_free.  (or for bonus points go back to some point
> in time when this test always passes and do a git bisect)
> * Fiddle with the parameters of the test: data length, number of taps in
> the filter, etc.
> * Doubtful this would change, but test on different processors. It
> would be pretty wild if there was something off in the 4700 line, but
> the fact that the generic proto-kernel had the same result and nobody
> else has reported this yet is suspicious. My guess is GCC is actually
> emitting *very* similar code for the generic and avx dot product
> proto-kernels.
>
> Nathan
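
A minimal sketch of how the rerun-until-crash loop and the git bisect
suggestion above could be combined. git bisect run treats exit status 0 as
good, 1 to 127 (except 125) as bad, and 125 as skip, and it aborts on anything
else. A test killed by a signal is reported by a shell as 128 plus the signal
number, so it has to be mapped to a small exit code first. The function names,
the max_runs default, and the choice to skip a commit on non-crash failures are
assumptions made for illustration; none of them come from this thread.

```python
import signal
import subprocess


def died_by_segfault(returncode):
    """Return True if a subprocess return code means the child was killed by SIGSEGV.

    On POSIX, subprocess reports death by signal N as a return code of -N.
    """
    return returncode == -signal.SIGSEGV


def classify(cmd, max_runs=50):
    """Run cmd up to max_runs times and return an exit code for git bisect run.

    1   -> some run was killed by SIGSEGV, so treat the commit as bad
    125 -> the command failed in another way (assumed here to mean "skip")
    0   -> no run crashed, so treat the commit as good
    """
    for _ in range(max_runs):
        rc = subprocess.call(cmd)
        if died_by_segfault(rc):
            return 1
        if rc != 0:
            return 125
    return 0
```

A wrapper script could end with sys.exit(classify(sys.argv[1:])) and be handed
to git bisect run together with the test command. Each bisect step would also
need a rebuild before the test is run, and because the crash is intermittent a
"good" verdict is only as trustworthy as max_runs is large.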



I was having similar issues this week with some AVX boxes. It looks
like it's a problem using posix_memalign (which is called by
volk_malloc if posix_memalign is available). Removing the use of
posix_memalign solves my problem. I'll work with Nathan off-list to
see about fixing this, possibly by removing the use of that version of
malloc.

Tom
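
For background on why the allocator matters here: the _a_ in the kernel name
marks VOLK's aligned variant, and aligned 256-bit AVX loads such as
_mm256_load_ps fault on addresses that are not multiples of 32 bytes. The
sketch below only shows how a buffer address from posix_memalign can be checked
for that alignment. It does not reproduce the crash, and whether alignment is
the actual cause is not established in this thread. The helper names and the
use of ctypes.CDLL(None) to reach libc are assumptions for illustration.

```python
import ctypes

AVX_ALIGNMENT = 32  # bytes; what aligned 256-bit loads require

# Assumes a POSIX system where the C library is reachable this way.
libc = ctypes.CDLL(None)
libc.posix_memalign.argtypes = [
    ctypes.POINTER(ctypes.c_void_p),  # where the allocated pointer is stored
    ctypes.c_size_t,                  # alignment
    ctypes.c_size_t,                  # size in bytes
]
libc.posix_memalign.restype = ctypes.c_int
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def aligned_alloc_address(nbytes, alignment=AVX_ALIGNMENT):
    """Allocate nbytes with posix_memalign and return the address as an int.

    The caller is responsible for releasing it with libc.free(address).
    """
    ptr = ctypes.c_void_p()
    err = libc.posix_memalign(ctypes.byref(ptr), alignment, nbytes)
    if err != 0:
        raise MemoryError("posix_memalign failed with error %d" % err)
    return ptr.value


def is_aligned(address, alignment=AVX_ALIGNMENT):
    """Return True if address is a multiple of alignment."""
    return address % alignment == 0
```

Note that an aligned base address says nothing about pointers derived from it:
base plus one 4-byte float is no longer 32-byte aligned, so an aligned kernel
is only safe on offsets that are themselves multiples of 32 bytes.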

