Hi Gonzalo,

> I installed perf top but I am not sure how to use it. I will investigate it.
Assuming you have built GNU Radio/your application with debugging symbols (for example, by configuring with "cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .."), try something like:

    sudo sysctl kernel/perf_event_paranoid=-1
    perf record -a {your program}
    perf report

Best regards,
Marcus

On 03/07/2016 10:30 PM, Gonzalo Arcos wrote:
> Thanks for your answer.
>
> I installed perf top but I am not sure how to use it. I will investigate it. However, does the program need to be compiled in debug mode for the performance counters to have effect?
>
> As a side question: has anyone managed to profile a GNU Radio application with valgrind/oprofile? I am very interested in getting this to work; when I tried profiling with those tools and then opened KCachegrind, the displayed graph did not contain information about each block, let alone functions inside blocks. It has been several months since I tried this, but I remember that roughly 99.9% of the time was attributed to the start() function of the block, and I could not get any more information than that, which of course was not helpful at all.
>
> 2016-02-29 6:28 GMT-03:00 West, Nathan <n...@ostatemail.okstate.edu>:
>
> > It won't give you time spent, but 'perf top' is a nice tool that gives function-level performance counters for all running code. It comes with linux-tools and uses performance counters built into the kernel. There are also a couple of other perf subtools you can explore.
> >
> > Regarding your full buffers, I think that's a result of GNU Radio's scheduler. If you have a flowgraph with A->B, and B takes a very long time to process all of its samples, then A will always have full output buffers, since it operates much faster. That's not necessarily bad or cause for concern, but performance improvements should focus on B.
> > -nathan
> >
> > On Sun, Feb 28, 2016 at 10:48 PM, Gonzalo Arcos <gonzaloarco...@gmail.com> wrote:
> >
> > > Thanks to all of you for your very informative answers.
> > >
> > > Douglas, I feel good now because you have described perfectly all the things I did/thought of to improve the performance :). I also agree that merging blocks should be a last resort. I have used the performance monitor and managed to improve the performance of the most expensive blocks. What I could not achieve, though, is profiling the program with a mainstream profiler like valgrind or oprofile, or some other profiler for Python. I remember that when visualizing the data, all the time was spent in the start() of the top block, and I could not get information about each block's general work, let alone the functions executed within the block. After discovering the performance monitor, I used it in conjunction with calls to clock() to determine the time spent in each function within each block, to get a rough measurement. But if it is possible to get this information automatically, I am very interested in learning how to do it. Could you help me?
> > >
> > > There is also another interesting aspect of improving performance: blocks being blocked because their output buffer is full. I've tried playing around a bit with the min and max output buffer sizes, but the performance did not seem to be affected. After using the performance monitor to look at the average buffer fullness, I see that most buffers are relatively full; however, I do not know if they are full enough to make an upstream block wait to push data into the buffer.
> > > 2016-02-28 19:39 GMT-03:00 Douglas Geiger <doug.gei...@bioradiation.net>:
> > >
> > > > The phenomenon Sylvain is pointing at is basically the fact that as compilers improve, you should expect the 'optimized' proto-kernels to no longer show as dramatic an improvement over the generic ones. As to your question of 'is it worth it', that comes down to a couple of things: for example, how much of an improvement do you require to be 'worth it' (i.e., how much is your time worth, and/or how much of a performance improvement does your application require)? Similarly, is it worth it to you to get cross-platform improvements (which is one of the features of VOLK)? Or, perhaps, is it worth it to you just to learn how to use VOLK?
> > > >
> > > > A couple of thoughts here: in general, when I have a flowgraph that is not meeting my performance requirements, my first step is to do some coarse profiling (e.g. via gr-perf-monitorx) to determine whether a single block is my primary performance bottleneck. If so, that is the block I concentrate on for optimizations (via VOLK and/or any algorithmic improvements, e.g. can I turn any run-time calculations into a look-up table computed either at compile time or within the constructor?). If there is not a clear bottleneck, I next look a little deeper using perf/oprofile at which functions my flowgraph is spending a lot of time in: can I, for example, create a faster version of some primitive calculation that all my blocks use a lot, and thereby get a speed-up across many blocks, which should translate into a faster overall application?
> > > >
> > > > Finally, if I still need more improvement, I would look at collecting many blocks together into a single, larger block.
> > > > This is generally less desirable, since you now have a (more) application-specific block and it becomes harder to re-use in later projects; but if you have performance requirements that drive you there, it absolutely is an option. At this point you likely have multiple operations being applied to your incoming samples, and it becomes easy to collect all of those into a single, larger VOLK call (and, from there, to create a SIMD-ized proto-kernel that targets your particular platform). So, while re-usability of code drives you away from this scenario, it offers the greatest potential for performance improvements, and is thus where many applications with high performance requirements tend to gravitate. Ideally you can strike a balance between the two: have widely re-usable blocks, but with a set of operations inside them that can take advantage of e.g. SIMD-ized function calls to make them high-performance. If you can, craft the block to be widely re-usable for a certain class of things (e.g. look at how the OFDM blocks are set up to be easily re-configurable for the many ways an OFDM waveform can be crafted). In the long run, having more knobs to turn to customize your existing code base for whatever new scenario you are looking at 1/2/10 years from now is always better than a brittle solution that solves today's problem but is difficult to re-use for tomorrow's.
> > > >
> > > > Hope that was helpful. If you are interested in learning more about how to use VOLK, certainly have a look at libvolk.org; the documentation is (I think) fairly good at introducing the concepts and intent, as well as how the API looks/works. And certainly don't be shy about asking more questions here.
> > > > Good luck,
> > > > Doug
> > > >
> > > > On Sun, Feb 28, 2016 at 1:58 AM, Sylvain Munaut <246...@gmail.com> wrote:
> > > >
> > > > > > Just wanted to ask the more experienced users if you think this idea is worth a shot, or whether the performance improvement will be marginal.
> > > > >
> > > > > The performance improvement is vastly dependent on the operation you're doing.
> > > > >
> > > > > You can get an idea of the improvement by comparing the volk_profile output for the generic kernel (coded in pure C) and the sse/avx ones.
> > > > >
> > > > > For instance, on my laptop, for some very simple operations (like float add) the generic kernel is barely slower than the SIMD one, most likely because it is so simple that even the compiler was able to SIMD-ize it by itself. But for other things (like complex multiply), the SIMD version is 10x faster...
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Sylvain
> > > >
> > > > --
> > > > Doug Geiger
> > > > doug.gei...@bioradiation.net
_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio