On Mon, 2016-02-29 at 00:48 -0300, Gonzalo Arcos wrote:
> Thanks to all of you for your very informative answers.
>
> Douglas, I feel good now because you have described perfectly all the
> things I did / thought of to improve the performance :). I also agree
> that merging blocks should be a last resort. I have used the
> performance monitor and managed to improve the performance of the
> most expensive blocks. What I could not achieve, though, is profiling
> the program with a mainstream profiler like valgrind or oprofile, or
> some other profiler for Python. I remember that when visualizing the
> data, all the time was spent in the start() of the top block, and I
> could not get information pertaining to each block's general work,
> let alone the functions executed within each block. After discovering
> the performance monitor, I used it in conjunction with calls to
> clock() to determine the time spent in each function within each
> block, to get a rough measurement. But if it is possible to get this
> information automatically, I am very interested in learning how to do
> it. Could you help me?

Once upon a time, and unfortunately not long enough ago... A
particularly ugly method is to craft support code around your block and
then call general_work() directly (i.e., exclude most of GNU Radio).
There are many pitfalls to this approach, but it let me analyze the
performance of some blocks across several implementations using the
usual tools.
> There is also another interesting aspect of improving performance,
> which is blocks being blocked because the output buffer is full.
> I've tried playing around a bit with the min and max output buffer
> sizes, but the performance did not seem to be affected. After using
> the performance monitor to analyze the average buffer-full %, I see
> that most of them are relatively full; however, I do not know if
> they are full enough to make an upstream block have to wait to push
> data into the buffer.
>
> 2016-02-28 19:39 GMT-03:00 Douglas Geiger <doug.geiger@bioradiation.net>:
> > The phenomenon Sylvain is pointing at is basically the fact that as
> > compilers improve, you should expect the 'optimized' proto-kernels
> > to no longer have as dramatic an improvement compared with the
> > generic ones. As to your question of 'is it worth it' - that comes
> > down to a couple of things: for example, how much of an improvement
> > do you require to be 'worth it' (i.e., how much is your time worth
> > and/or how much of a performance improvement do you require for
> > your application)? Similarly, is it worth it to you to get
> > cross-platform improvements (which is one of the features of VOLK)?
> > Or, perhaps, is it worth it to you just to learn how to use VOLK?
> >
> > A couple of thoughts here: in general, when I have a flowgraph that
> > is not meeting my performance requirements, my first step is to do
> > some coarse profiling (i.e. via gr-perf-monitorx) to determine if
> > there is a single block that is my primary performance bottleneck.
> > If so, that is the block I will concentrate on for optimizations
> > (both via VOLK, and/or any algorithmic improvements - e.g. can I
> > turn any run-time calculations into a look-up table, calculated
> > either at compile time or within the constructor?).
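The look-up-table trick Doug mentions is usually the first thing I
try. A made-up illustration (hypothetical names, nothing GNU
Radio-specific); the tanh() is just a stand-in for whatever per-sample
math is expensive:

  #include <array>
  #include <cmath>

  class byte_mapper
  {
  public:
      // Pay for the expensive math once, in the constructor.
      byte_mapper()
      {
          for (int i = 0; i < 256; ++i)
              d_table[i] = std::tanh((i - 128) / 32.0f);
      }

      // O(1) table lookup per sample in work().
      float map(unsigned char x) const { return d_table[x]; }

  private:
      std::array<float, 256> d_table; // one entry per possible input byte
  };

This obviously only works when the input domain is small enough to
enumerate (a byte here); for wider inputs you trade table size against
interpolation error.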
> > If there is not a clear bottleneck, then next I look a little
> > deeper using perf/oprofile to see which functions my flowgraph is
> > spending a lot of time in: can I e.g. create a faster version of
> > some primitive calculation that all my blocks use a lot, and
> > thereby get a speed-up across many blocks, which should translate
> > into a faster overall application?
> >
> > Finally, if I still need more improvements, I would look at
> > collecting many blocks together into a single, larger block. This
> > is generally less desirable, since you now have a (more)
> > application-specific block, and it becomes harder to re-use in
> > later projects, but if you have performance requirements that drive
> > you there, then it absolutely is an option. At this point you
> > likely have multiple operations being done to your incoming
> > samples, and it becomes easy to collect all of those into a single
> > larger VOLK call (and from there, create a SIMD-ized proto-kernel
> > that targets your particular platform). So, while re-usability of
> > code drives you away from this scenario, it offers the greatest
> > potential for performance improvements, and is thus where many
> > applications with high performance requirements tend to gravitate.
> > Ideally you can strike a balance between the two: i.e. have widely
> > re-usable blocks, but with a set of operations inside them that can
> > take advantage of e.g. SIMD-ized function calls to make them
> > high-performance. If you can craft the block to be widely
> > re-usable for a certain class of things, so much the better (e.g.
> > look at how the OFDM blocks are set up to be easily re-configurable
> > for the many ways an OFDM waveform can be crafted). In the long
> > run, having more knobs to turn to customize your existing code base
> > for whatever new scenario you are looking at 1/2/10 years from now
> > is always better than a brittle solution that solves today's
> > problem but is difficult to re-use to deal with tomorrow's.
> >
> > Hope that was helpful. If you are interested in learning more about
> > how to use VOLK, certainly have a look at libvolk.org - the
> > documentation is (I think) fairly good at introducing the concepts
> > and intent, as well as how the API looks/works. And certainly don't
> > be shy about asking more questions here.
> >
> > Good luck,
> > Doug
> >
> > On Sun, Feb 28, 2016 at 1:58 AM, Sylvain Munaut <246...@gmail.com>
> > wrote:
> > > > Just wanted to ask the more experienced users if you think this
> > > > idea is worth a shot, or the performance improvement will be
> > > > marginal.
> > >
> > > Performance improvement is vastly dependent on the operation
> > > you're doing.
> > >
> > > You can get an idea of the improvement by comparing the
> > > volk_profile output for the generic kernel (coded in pure C) and
> > > the sse/avx ones.
> > >
> > > For instance, on my laptop, for some very simple ones (like float
> > > add), the generic is barely slower than SIMD, most likely because
> > > they are so simple that even the compiler was able to SIMD-ize
> > > them by itself. But for other things (like complex multiply), the
> > > SIMD version is 10x faster...
> > >
> > > Cheers,
> > >
> > > Sylvain
> >
> > --
> > Doug Geiger
> > doug.gei...@bioradiation.net
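One last thought, to tie Doug's and Sylvain's points together: once
several per-sample operations live in one block, the inner loop often
collapses into a couple of VOLK calls. The complex-multiply kernel
below is real (it is the one where Sylvain sees the 10x); the rest is
just scaffolding to make the sketch self-contained:

  // Multiply two complex vectors with a single VOLK call. volk_malloc
  // returns buffers aligned for the fastest proto-kernel; the
  // dispatcher picks generic/SSE/AVX/... using volk_profile's results.
  #include <volk/volk.h>

  int main()
  {
      const unsigned int N = 100000;
      size_t align = volk_get_alignment();
      lv_32fc_t* a   = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);
      lv_32fc_t* b   = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);
      lv_32fc_t* out = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);

      for (unsigned int i = 0; i < N; ++i) {
          a[i] = lv_cmake(1.0f, 2.0f);
          b[i] = lv_cmake(3.0f, -1.0f);
      }

      // Inside a block's work() this one line would be the inner loop.
      volk_32fc_x2_multiply_32fc(out, a, b, N);

      volk_free(a);
      volk_free(b);
      volk_free(out);
      return 0;
  }

Inside a real block you would hand the kernel the scheduler's buffers
instead of allocating your own; run volk_profile once beforehand so
the dispatcher knows which proto-kernel wins on your machine.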