On Mon, 2016-02-29 at 00:48 -0300, Gonzalo Arcos wrote:
> Thanks to all of you for your very informative answers.
>
> Douglas, I feel good now because you have described perfectly all the
> things I did / thought of to improve the performance :). I also agree
> that merging blocks should be a last resort. I have used the
> performance monitor and managed to improve the performance of the
> most expensive blocks. What I could not achieve, though, is profiling
> the program with a mainstream profiler like valgrind or oprofile, or
> some other profiler for Python. I remember that when visualizing the
> data, all the time was spent in the start() of the top block, and I
> could not get information pertaining to each block's general work,
> let alone the functions executed within each block. After discovering
> the performance monitor, I used it in conjunction with calls to
> clock() to determine the time spent in each function within each
> block, to get a rough measurement. But if it is possible to get this
> information automatically, I am very interested in learning how to do
> it. Could you help me?

Once upon a time, and unfortunately not long enough ago... A
particularly ugly method is to craft support code around your block and
then call general_work() directly (i.e., exclude most of GNU Radio).
There are many pitfalls to this approach, but it let me analyze the
performance of some blocks across several implementations using the
usual tools.
> There is also another interesting aspect of improving performance,
> which is blocks being blocked because the output buffer is full.
> I've tried playing around a bit with the min and max output buffer
> sizes, but the performance did not seem to be affected. After using
> the performance monitor to analyze the average buffer-full %, I see
> that most of them are relatively full; however, I do not know if
> they are full enough to make an upstream block have to wait to push
> data into the buffer.
>
> 2016-02-28 19:39 GMT-03:00 Douglas Geiger <doug.geiger@bioradiation.net>:
> > The phenomenon Sylvain is pointing at is basically the fact that as
> > compilers improve, you should expect the 'optimized' proto-kernels
> > to no longer have as dramatic an improvement compared with the
> > generic ones. As to your question of 'is it worth it' - that comes
> > down to a couple of things: for example, how much of an improvement
> > do you require to be 'worth it' (i.e., how much is your time worth
> > and/or how much of a performance improvement do you require for
> > your application)? Similarly, is it worth it to you to get
> > cross-platform improvements (which is one of the features of VOLK)?
> > Or, perhaps, is it worth it to you just to learn how to use VOLK?
> >
> > A couple of thoughts here: in general, when I have a flowgraph that
> > is not meeting my performance requirements, my first step is to do
> > some coarse profiling (i.e. via gr-perf-monitorx) to determine if
> > there is a single block that is my primary performance bottleneck.
> > If so, that is the block I will concentrate on for optimizations
> > (both via VOLK, and/or any algorithmic improvements - e.g. can I
> > turn any run-time calculations into a look-up table, calculated
> > either at compile time or within the constructor?).
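The look-up-table trick Doug mentions is usually the first thing I
try. A made-up illustration (hypothetical names, nothing GNU
Radio-specific); the tanh() is just a stand-in for whatever per-sample
math is expensive:

  #include <array>
  #include <cmath>

  class byte_mapper
  {
  public:
      // Pay for the expensive math once, in the constructor.
      byte_mapper()
      {
          for (int i = 0; i < 256; ++i)
              d_table[i] = std::tanh((i - 128) / 32.0f);
      }

      // O(1) table lookup per sample in work().
      float map(unsigned char x) const { return d_table[x]; }

  private:
      std::array<float, 256> d_table; // one entry per possible input byte
  };

This obviously only works when the input domain is small enough to
enumerate (a byte here); for wider inputs you trade table size against
interpolation error.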
> > If there is not a clear bottleneck, then next I look a little
> > deeper using perf/oprofile to see which functions my flowgraph is
> > spending a lot of time in: can I e.g. create a faster version of
> > some primitive calculation that all my blocks use a lot, and
> > thereby get a speed-up across many blocks, which should translate
> > into a faster overall application?
> >
> > Finally, if I still need more improvements, I would look at
> > collecting many blocks together into a single, larger block. This
> > is generally less desirable, since you now have a (more)
> > application-specific block, and it becomes harder to re-use in
> > later projects, but if you have performance requirements that drive
> > you there, then it absolutely is an option. At this point you
> > likely have multiple operations being done to your incoming
> > samples, and it becomes easy to collect all of those into a single
> > larger VOLK call (and from there, create a SIMD-ized proto-kernel
> > that targets your particular platform). So, while re-usability of
> > code drives you away from this scenario, it offers the greatest
> > potential for performance improvements, and is thus where many
> > applications with high performance requirements tend to gravitate.
> > Ideally you can strike a balance between the two: i.e. have widely
> > re-usable blocks, but with a set of operations inside them that can
> > take advantage of e.g. SIMD-ized function calls to make them
> > high-performance. If you can craft the block to be widely
> > re-usable for a certain class of things, so much the better (e.g.
> > look at how the OFDM blocks are set up to be easily re-configurable
> > for the many ways an OFDM waveform can be crafted). In the long
> > run, having more knobs to turn to customize your existing code base
> > for whatever new scenario you are looking at 1/2/10 years from now
> > is always better than a brittle solution that solves today's
> > problem but is difficult to re-use to deal with tomorrow's.
> >
> > Hope that was helpful. If you are interested in learning more about
> > how to use VOLK, certainly have a look at libvolk.org - the
> > documentation is (I think) fairly good at introducing the concepts
> > and intent, as well as how the API looks/works. And certainly don't
> > be shy about asking more questions here.
> >
> > Good luck,
> > Doug
> >
> > On Sun, Feb 28, 2016 at 1:58 AM, Sylvain Munaut <246...@gmail.com>
> > wrote:
> > > > Just wanted to ask the more experienced users if you think this
> > > > idea is worth a shot, or the performance improvement will be
> > > > marginal.
> > >
> > > Performance improvement is vastly dependent on the operation
> > > you're doing.
> > >
> > > You can get an idea of the improvement by comparing the
> > > volk_profile output for the generic kernel (coded in pure C) and
> > > the sse/avx ones.
> > >
> > > For instance, on my laptop, for some very simple ones (like float
> > > add), the generic is barely slower than SIMD, most likely because
> > > they are so simple that even the compiler was able to SIMD-ize
> > > them by itself. But for other things (like complex multiply), the
> > > SIMD version is 10x faster...
> > >
> > > Cheers,
> > >
> > > Sylvain
> >
> > --
> > Doug Geiger
> > doug.gei...@bioradiation.net
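One last thought, to tie Doug's and Sylvain's points together: once
several per-sample operations live in one block, the inner loop often
collapses into a couple of VOLK calls. The complex-multiply kernel
below is real (it is the one where Sylvain sees the 10x); the rest is
just scaffolding to make the sketch self-contained:

  // Multiply two complex vectors with a single VOLK call. volk_malloc
  // returns buffers aligned for the fastest proto-kernel; the
  // dispatcher picks generic/SSE/AVX/... using volk_profile's results.
  #include <volk/volk.h>

  int main()
  {
      const unsigned int N = 100000;
      size_t align = volk_get_alignment();
      lv_32fc_t* a   = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);
      lv_32fc_t* b   = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);
      lv_32fc_t* out = (lv_32fc_t*)volk_malloc(N * sizeof(lv_32fc_t), align);

      for (unsigned int i = 0; i < N; ++i) {
          a[i] = lv_cmake(1.0f, 2.0f);
          b[i] = lv_cmake(3.0f, -1.0f);
      }

      // Inside a block's work() this one line would be the inner loop.
      volk_32fc_x2_multiply_32fc(out, a, b, N);

      volk_free(a);
      volk_free(b);
      volk_free(out);
      return 0;
  }

Inside a real block you would hand the kernel the scheduler's buffers
instead of allocating your own; run volk_profile once beforehand so
the dispatcher knows which proto-kernel wins on your machine.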