On Fri, 2015-08-14 at 15:17 -0400, Douglas Geiger wrote:
> On Fri, Aug 14, 2015 at 2:03 PM, Dennis Glatting <[email protected]>
> wrote:
> > Sorry for the HTML...
> >
> > I have been doing some work applying OpenMP to GNURadio and
> > collected some data. This data was collected WITHOUT GNURadio
> > overhead. Specifically, I interfaced directly with my detector
> > passing 30 seconds (30 seconds * 10msps) of data in a buffer (i.e.,
> > I allocated and filled gr_vector_const_void_star, etc.) and
> > calculated the performance at different sensitivities. Herein lies
> > one problem.
> >
> > When applying OpenMP against a buffer there has to be enough data
> > to make it worthwhile, but GNURadio buffers are fairly small. I
> > don't see a reasonable way to increase buffer sizes for a single
> > source->block without modifying the constant in flat_flowgraph.cc
> > which has the side effect of changing the default size for all
> > buffers. Yes?
> >
> Is this a sink block (i.e. no outputs)? In general, IIRC, you have
> more control on output buffer size (since you own them, and input
> buffers are owned by upstream blocks). You can call
> set_output_multiple()/set_min_output_buffer(...)/
> set_min_noutput_items(...) to influence output buffer size (and for
> a sync_block, and therefore a sync_decimator/sync_interpolator, that
> has a corresponding influence on the input buffer size). Others may
> correct me on how much influence sink blocks have in current
> releases...
>
hackRF/BladeRF Source -> Preamble Detector (the block) -> multiple
blocks.
If the preamble detector detects a signal it forwards that set of
samples to a "framer" and a GUI Sink.
> > I am looking for a way to measure GNURadio overhead. There is a certain
> > amount of overhead depending on the number of blocks, set() functions, GUI
> > Sinks, etc. and I'd like to know what that overhead is. Ideas?
> What exactly are you interested in measuring when you say 'overhead'? Are
> you talking about memory usage? CPU usage? Latency (and if you're
> interested in latency, do you mean one-way, two-way)?
>
CPU. Latency. One way.
I have samples coming in at a certain rate into a limited sized buffer.
I need to know the servicing interval (latency) in an attempt to
architect a solution to prevent or reduce SDR overruns.
There is a scheduler decision process to release the block for
execution and the overhead before calling general_work() (e.g., in
block_executor.cc), such as: update read/write pointers, are tags
present, maybe service performance counters, etc. At high sample rates
that can reduce the processing rate of samples.
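One way I could get at the servicing interval without any GPIO pins is to timestamp each entry into general_work() and keep running statistics. A minimal sketch follows. IntervalTracker and its methods are names I made up for illustration, not anything in GNURadio, and it only measures the time between successive calls, not how that time splits between the scheduler and upstream blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: tracks the time between successive calls to mark().
// The idea is to call mark() as the first statement of general_work().
class IntervalTracker {
public:
    using clock = std::chrono::steady_clock;

    // Record one call. The time point is a parameter so it can be tested
    // with synthetic timestamps; by default it samples the steady clock.
    void mark(clock::time_point now = clock::now())
    {
        if (have_last_) {
            const double dt =
                std::chrono::duration<double>(now - last_).count();
            assert(dt >= 0.0);          // steady_clock never goes backwards
            max_ = std::max(max_, dt);
            sum_ += dt;
            ++count_;
        }
        last_ = now;
        have_last_ = true;
    }

    std::size_t intervals() const { return count_; }
    double max_sec() const { return max_; }
    double mean_sec() const { return count_ ? sum_ / count_ : 0.0; }

private:
    clock::time_point last_{};
    bool have_last_ = false;
    std::size_t count_ = 0;
    double sum_ = 0.0;
    double max_ = 0.0;
};
```

The worst-case interval (max_sec) multiplied by the sample rate gives a rough lower bound on how much buffering the source side needs to ride through a stall.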
> > One thought is to set a hardware pin low in the source block and set it
> > high in the detector block then measuring with a scope. The problem is
> > these pins often incur kernel overhead by opening something in /dev,
> > writing a string, then closing the device and waiting for the kernel to get
> > around to actually toggling the pin. Measurements showed this is wildly
> > unpredictable. Another option is to toggle a pin on an SDR but the same
> > problem exists with additional USB transaction delays.
> This sounds like you are interested in one-way latency... maybe?
> >
> > Anyway, in the data below a "signal" buffer is defined as ~1200 samples
> > (i.e., MAXimum message size), or 2x1200=2400 complex number "chunks". I
> > found 2xMAX a reasonable value because it is within a reasonable buffer
> > amount from GNURadio with my alteration to flat_flowgraph.cc. The OpenMP
> > code really looks like this:
> >
> >   #pragma omp parallel for num_threads(ncpu),schedule(dynamic,1) \
> >                            if( work_list.size() > 1 )
> >   for( size_t i = 0; i < work_list.size(); ++i ) {
> >     do work...
> >   }
> >
> > Pretty simple.
> >
> > That said, unless GNURadio can selectively provide a reasonably large
> > number of samples to process, the value of applying OpenMP is
> > probably moot.
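For anyone who wants to try the pattern outside of GNURadio, here is a self-contained sketch of the same parallel-for over a work list. WorkItem, window_energy() and process_work_list() are stand-ins invented for this example (the real detection work is not shown in this thread), and the sum-of-squares is just a placeholder load.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical unit of work: one candidate "signal" window of samples.
struct WorkItem {
    std::vector<float> samples;
    double energy = 0.0;   // result, written by whichever thread gets the item
};

// Placeholder for the per-window detection effort: sum of squares.
static double window_energy(const std::vector<float>& samples)
{
    double acc = 0.0;
    for (float v : samples) {
        acc += static_cast<double>(v) * v;
    }
    return acc;
}

// Process every window. The if() clause keeps a single-item list serial so
// the thread team is not woken up for nothing. Built without -fopenmp the
// pragma is ignored and the loop simply runs on one thread.
void process_work_list(std::vector<WorkItem>& work_list, int ncpu)
{
    assert(ncpu > 0);
    const long n = static_cast<long>(work_list.size());
    #pragma omp parallel for num_threads(ncpu) schedule(dynamic, 1) if (n > 1)
    for (long i = 0; i < n; ++i) {
        // Each iteration touches only its own item, so no locking is needed.
        work_list[i].energy = window_energy(work_list[i].samples);
    }
}
```

schedule(dynamic,1) hands out one window at a time, which suits windows whose cost varies (early rejection versus full detection), at the price of a little more scheduling traffic than a static split.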
> >
> > Below, the term "sensitivity" is a bit of a misnomer because as
> > sensitivity increases signal rejection increases; but the text is
> > what it is. More specifically, there are roughly twelve sets of
> > criteria that need to be met before signal presence is declared.
> > Some of those criteria involve std::log10() and std::pow(10.0,x)
> > operations but interestingly those math operations are a very small
> > amount of the detection effort (1.02% worst case).
> >
> > The numbers in the first block below are the rates, in samples/second,
> > at which I can process samples. For example, "Baseline" "45.0" is
> > "1,662,796" samples per second.
> >
> > From an OpenMP perspective, I have eight cores but limited the effort
> > to five with the idea that GNURadio overhead and other blocks have three
> > cores to do their thing, worst case. OpenMP gave me a >300% performance
> > gain in "OMP 5core,2xMAX" but the theoretical gain is 400%. Not perfect
> > but I'll take it. What these numbers tell me is OpenMP can have
> > significant value in the context of GNURadio.
> >
> > My code was run on an AMD 9590 at 4.7GHz, 5GHz boost -- my development
> > platform. For reasons, I also ran it on a CubieBoard4 (ARM
> > architecture). I should also mention I have seen NO side effects of
> > running OpenMP within GNURadio other than considerable amusement.
>
> So using OpenMP inside a work function is a perfectly reasonable way to try
> to accelerate (via parallelization) that particular work function -
> obviously at some point you are fighting against the thread-per-block
> parallelization of GNURadio, so telling OMP to use fewer cores than your
> machine has is a reasonable way to deal with this. My experience with OMP
> has indicated that the thread-spawning that happens each time you enter the
> work() function has a cost, and therefore instantiating a thread-pool in
> the block constructor may give better results, but in the end the real
> question you have to ask is: what amount of work is required to achieve the
> task at hand.
>
> For example, collapsing the functions of multiple blocks into a single,
> larger (super)block can increase performance because you aren't shuffling
> data in-between blocks. Implementing custom thread pools is another
> strategy. Writing lots of hand-optimized SIMD code (preferably inside
> VOLK!) can help as well. Ultimately the question is: what is the least
> amount of work required to make the thing do what you need 'good-enough',
> where good-enough is some measure(s) of performance on the target platform.
>
> Basically what I'm saying is, there isn't a single answer to the question
> of 'what is the overhead of GNURadio', because not only is that a moving
> target, but it depends on what platform you're targeting, and it depends on
> what measure of 'overhead' you really care about. Not to mention the
> various knobs you have for controlling things like one-way latency or
> computational load (e.g. the different ways blocks can influence
> buffer size).
>
Something to chew on. Thanks.
Not being a DSP person, the math is interesting; and not believable so
I am assuming I have something totally screwed.
10msps = 1e-7 sec/sample.
5GHz = 2e-10 sec/clock.
Or an average of 500 CPU clocks between samples. Assuming 8 clocks per
instruction (an arbitrary and unsupported number) with zero overhead
(e.g., memory access), that is 62 instructions between samples on a
single core. Assuming that math is somewhere near correct, I can't
really be ashamed of my low-end value of 7msps processing rate across
five cores but a best case rejection rate of ~1e-8? That would be one
heck of a fast if() and virtual function call statement; and I don't
believe it but I haven't (yet) found anything to debunk my numbers.
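Spelling that budget arithmetic out as code, mostly as a sanity check on myself. This is only a sketch: the function names are made up, and the 8 clocks per instruction is the same arbitrary, unsupported figure as above.

```cpp
#include <cassert>

// Clock cycles one core has available between consecutive samples.
double cycles_per_sample(double clock_hz, double sample_rate_hz)
{
    assert(clock_hz > 0.0 && sample_rate_hz > 0.0);
    return clock_hz / sample_rate_hz;
}

// Instruction budget per sample for an assumed average cost per instruction.
// Ignores memory stalls, cache misses and everything else that is real.
double instructions_per_sample(double clock_hz, double sample_rate_hz,
                               double clocks_per_instruction)
{
    assert(clocks_per_instruction > 0.0);
    return cycles_per_sample(clock_hz, sample_rate_hz) / clocks_per_instruction;
}
```

With 5e9 Hz and 10e6 samples/sec, cycles_per_sample() gives 500, and at 8 clocks per instruction the budget is 62.5 instructions, which I rounded down to 62 above.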
The test code is pretty simple and meets observed wall clock:
    int noutput_items = (int)s.get()->size();
    gr_vector_int ninput_items { (int)s.get()->size() };
    gr_vector_const_void_star input_items = {
        malloc( sizeof(complex) * noutput_items )
    };
    gr_vector_void_star output_items = {
        malloc( sizeof(complex) * noutput_items )
    };
    ...
    for( float level : full_range ) {
        preamble->set_gain( level );
        t_start = std::chrono::high_resolution_clock::now();
        for( int loop=0; loop < LOOPS; ++loop ) {
            (void)preamble->general_work( noutput_items, ninput_items,
                                          input_items, output_items );
        }
        t_stop = std::chrono::high_resolution_clock::now();
        t_span = std::chrono::duration_cast<std::chrono::duration<double>>
                     ( t_stop - t_start );
        std::cout << "Rx Detect Elapsed: "
                  << t_span.count() << " sec"
                  << ", samp=" << s.get()->size()
                  << ", samp/sec=" << (s.get()
                         ->size()*LOOPS/t_span.count())
                  << ", gain=" << preamble->gain()
                  << std::endl;
    }
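For anyone who wants to reproduce the measurement without my detector, here is a stripped-down, self-contained version of the same timing loop. measure_rate(), RateResult and count_above() are hypothetical names for this sketch; count_above() is a trivial stand-in for general_work(), not my preamble code. I used std::chrono::steady_clock here because high_resolution_clock is not guaranteed to be monotonic on every platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct RateResult {
    std::size_t total_samples;  // samples pushed through work() in total
    double elapsed_sec;         // wall-clock time for all loops
    double samples_per_sec;     // 0.0 if the clock could not resolve the run
};

// Call work(data, n) `loops` times over the same buffer and report throughput.
template <typename Work>
RateResult measure_rate(Work&& work, const std::vector<float>& buf, int loops)
{
    assert(loops > 0);
    using clock = std::chrono::steady_clock;
    const auto t_start = clock::now();
    for (int loop = 0; loop < loops; ++loop) {
        work(buf.data(), buf.size());
    }
    const auto t_stop = clock::now();

    RateResult r;
    r.total_samples = buf.size() * static_cast<std::size_t>(loops);
    r.elapsed_sec = std::chrono::duration<double>(t_stop - t_start).count();
    r.samples_per_sec =
        r.elapsed_sec > 0.0 ? r.total_samples / r.elapsed_sec : 0.0;
    return r;
}

// Stand-in for general_work(): count samples above a threshold. The result
// is stored through a volatile so the optimizer cannot discard the loop.
inline void count_above(const float* in, std::size_t n)
{
    static volatile std::size_t sink = 0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > 0.5f) {
            ++hits;
        }
    }
    sink = hits;
}
```

A call such as measure_rate(count_above, buf, 50) returns the totals. The volatile sink is there because an optimizer that proves the work has no visible effect can skip it, which is one way a harness like this can report implausibly high rates.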
> Doug
> --
> Doug Geiger
> [email protected]
>
>
_______________________________________________
Discuss-gnuradio mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio