On Fri, 2015-08-14 at 15:17 -0400, Douglas Geiger wrote:
> On Fri, Aug 14, 2015 at 2:03 PM, Dennis Glatting <gnura...@pki2.com>
> wrote:
> > Sorry for the HTML...
> > 
> > I have done some work applying OpenMP to GNURadio and collected
> > some data. This data was collected WITHOUT GNURadio overhead.
> > Specifically, I interfaced directly with my detector, passing 30
> > seconds of data (30 seconds * 10 Msps) in a buffer (i.e., I
> > allocated and filled gr_vector_const_void_star, etc.), and
> > calculated the performance at different sensitivities. Herein lies
> > one problem.
> > 
> > When applying OpenMP against a buffer there has to be enough data
> > to make it worthwhile, but GNURadio buffers are fairly small. I
> > don't see a reasonable way to increase buffer sizes for a single
> > source->block without modifying the constant in flat_flowgraph.cc,
> > which has the side effect of changing the default size for all
> > buffers. Yes?
> > 
> Is this a sink block (i.e. no outputs)? In general, IIRC, you have
> more control over output buffer size (since you own the output
> buffers, while input buffers are owned by upstream blocks). You can
> call set_output_multiple(), set_min_output_buffer(...), or
> set_min_noutput_items(...) to influence output buffer size (and for
> a sync_block, and therefore a sync_decimator/sync_interpolator, that
> has a corresponding influence on the input buffer size). Others may
> correct me on how much influence sink blocks have in current
> releases...
>  
hackRF/BladeRF Source -> Preamble Detector (the block) -> multiple
blocks.
If the preamble detector detects a signal it forwards that set of
samples to a "framer" and a GUI Sink.
 
> > I am looking for a way to measure GNURadio overhead. There is a
> > certain amount of overhead depending on the number of blocks, set()
> > functions, GUI Sinks, etc. and I'd like to know what that overhead
> > is. Ideas?
> >
> What exactly are you interested in measuring when you say
> 'overhead'? Are you talking about memory usage? CPU usage? Latency
> (and if you're interested in latency, do you mean one-way, two-way)?
>
CPU. Latency. One way.
I have samples coming in at a certain rate into a limited-size buffer.
I need to know the servicing interval (latency) in an attempt to
architect a solution to prevent or reduce SDR overruns.

There is a scheduler decision process to release the block for
execution, plus overhead before general_work() is called (e.g., in
block_executor.cc): updating read/write pointers, checking whether
tags are present, maybe servicing performance counters, etc. At high
sample rates that can reduce the processing rate of samples.
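
One way to reason about that per-call overhead is a simple amortization
model: each call pays a fixed cost plus a per-sample cost, so larger
buffers spread the fixed cost over more samples. The sketch below is
only that model, not a measurement of GNURadio. The function names and
both cost parameters are hypothetical and would have to be filled in
from real timings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical model: every call into the block costs a fixed overhead
// (scheduler decision, pointer updates, tag checks) plus a per-sample
// cost. Both costs are in seconds and are illustrative inputs only.
double effective_rate(std::size_t samples_per_call,
                      double overhead_per_call_s,
                      double cost_per_sample_s)
{
    assert(samples_per_call > 0);
    const double n = static_cast<double>(samples_per_call);
    const double call_time_s = overhead_per_call_s + n * cost_per_sample_s;
    return n / call_time_s;  // samples per second sustained under the model
}

// Smallest buffer (in samples) that sustains target_rate under this
// model, or 0 if the per-sample cost alone already exceeds the budget.
std::size_t min_samples_for_rate(double target_rate,
                                 double overhead_per_call_s,
                                 double cost_per_sample_s)
{
    assert(target_rate > 0.0);
    const double budget_s = 1.0 / target_rate;        // time per sample
    const double slack_s = budget_s - cost_per_sample_s;
    if (slack_s <= 0.0)
        return 0;                                     // can never keep up
    // overhead / n <= slack  =>  n >= overhead / slack
    return static_cast<std::size_t>(std::ceil(overhead_per_call_s / slack_s));
}
```

Under this model, with no overhead the rate is just 1/cost_per_sample_s
regardless of buffer size; with overhead, the rate approaches that limit
only as samples_per_call grows.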
> > One thought is to set a hardware pin low in the source block and
> > set it high in the detector block, then measure with a scope. The
> > problem is these pins often incur kernel overhead: opening something
> > in /dev, writing a string, then closing the device and waiting for
> > the kernel to get around to actually toggling the pin. Measurements
> > showed this is wildly unpredictable. Another option is to toggle a
> > pin on an SDR, but the same problem exists with additional USB
> > transaction delays.
> >
> This sounds like you are interested in one-way latency... maybe?
>
> > Anyway, in the data below a "signal" buffer is defined as ~1200 samples 
> > (i.e., MAXimum message size), or 2x1200=2400 complex number "chunks". I 
> > found 2xMAX a reasonable value because it is within a reasonable buffer 
> > amount from GNURadio with my alteration to flat_flowgraph.cc. The OpenMP 
> > code really looks like this:
> >
> >   #pragma omp parallel for num_threads(ncpu),schedule(dynamic,1)  \
> >     if( work_list.size() > 1 )
> >   for( size_t i = 0; i < work_list.size(); ++i ) {
> >     // do work...
> >   }
> >
> > Pretty simple.
> >
> > That said, unless GNURadio can provide a selective and reasonably
> > large amount of samples to process then the value of applying
> > OpenMP is probably moot.
> >
> > Below, the term "sensitivity" is a bit of a misnomer because as
> > sensitivity increases signal rejection increases; but the text is
> > what it is. More specifically, there are roughly twelve sets of
> > criteria that need to be met before signal presence is declared.
> > Some of those criteria involve std::log10() and std::pow(10.0,x)
> > operations but interestingly those math operations are a very small
> > part of the detection effort (1.02% worst case).
> >
> > The numbers in the first block below are the rates, in
> > samples/second, at which I can process samples. For example,
> > "Baseline" "45.0" is "1,662,796" samples per second.
> >
> > From an OpenMP perspective, I have eight cores but limited the
> > effort to five with the idea that GNURadio overhead and other
> > blocks have three cores to do their thing, worst case. OpenMP gave
> > me a >300% performance gain in "OMP 5core,2xMAX" but the
> > theoretical gain is 400%. Not perfect but I'll take it. What these
> > numbers tell me is OpenMP can have significant value in the context
> > of GNURadio.
> >
> > My code was run on an AMD 9590 at 4.7GHz, 5GHz boost -- my
> > development platform. For reasons, I also ran it on a CubieBoard4
> > (ARM architecture). I should also mention I have seen NO side
> > effects of running OpenMP within GNURadio other than considerable
> > amusement.
> > 
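For readers who want to try the pattern quoted above, here is a minimal
self-contained sketch of an OpenMP parallel-for over a work list with
the same num_threads, schedule(dynamic,1) and if clauses. The Chunk type
and the energy computation are hypothetical stand-ins for the detector's
real "do work" body, which is not shown in the original.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work item: one candidate "signal" chunk of samples.
struct Chunk {
    std::vector<float> samples;
    double energy = 0.0;  // result slot, written by exactly one iteration
};

// Process every chunk. Each iteration touches only its own Chunk, so the
// iterations are independent and safe to run in parallel. Built without
// -fopenmp the pragma is ignored and the loop runs serially with the
// same result.
void process_chunks(std::vector<Chunk>& work_list, int ncpu)
{
    assert(ncpu > 0);
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(work_list.size());

    #pragma omp parallel for num_threads(ncpu) schedule(dynamic, 1) if(n > 1)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (float s : work_list[i].samples)
            sum += static_cast<double>(s) * s;  // placeholder for real work
        work_list[i].energy = sum;
    }
}
```

A signed loop index is used because older OpenMP versions require one;
schedule(dynamic,1) hands out one chunk at a time, which suits work
items whose cost varies.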

> So using OpenMP inside a work function is a perfectly reasonable way
> to try to accelerate (via parallelization) that particular work
> function. Obviously at some point you are fighting against the
> thread-per-block parallelization of GNURadio, so telling OMP to use
> fewer cores than your machine has is a reasonable way to deal with
> this. My experience with OMP has indicated that the thread-spawning
> that happens each time you enter the work() function has a cost, and
> therefore instantiating a thread-pool in the block constructor may
> give better results, but in the end the real question you have to
> ask is: what amount of work is required to achieve the task at hand.
>
> For example, collapsing the functions of multiple blocks into a
> single, larger (super)block can increase performance because you
> aren't shuffling data in-between blocks. Implementing custom thread
> pools is another strategy. Writing lots of hand-optimized SIMD code
> (preferably inside VOLK!) can help as well. Ultimately the question
> is: what is the least amount of work required to make the thing do
> what you need 'good-enough', where good-enough is some measure(s) of
> performance on the target platform.
>
> Basically what I'm saying is, there isn't a single answer to the
> question of 'what is the overhead of GNURadio', because not only is
> that a moving target, but it depends on what platform you're
> targeting, and it depends on what measure of 'overhead' you really
> care about. Not to mention the various knobs (e.g. the different
> ways blocks can influence buffer size, etc.) you have for
> controlling, e.g., one-way latency or computational load.
>
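The thread-pool suggestion above could look roughly like the sketch
below: workers are started once and reused on every call, so no threads
are spawned inside work(). WorkerPool and parallel_for are hypothetical
names, not GNURadio or OpenMP APIs, and this version assumes only one
thread calls parallel_for() at a time.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: create it once (e.g. from a block's constructor)
// and reuse it from every work() call.
class WorkerPool {
public:
    explicit WorkerPool(unsigned nthreads) {
        assert(nthreads > 0);
        for (unsigned t = 0; t < nthreads; ++t)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    // Run fn(i) for every i in [0, count); returns when all are done.
    void parallel_for(std::size_t count, std::function<void(std::size_t)> fn) {
        std::unique_lock<std::mutex> lk(m_);
        fn_ = std::move(fn);
        next_ = 0; count_ = count; pending_ = count;
        wake_.notify_all();
        done_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            wake_.wait(lk, [this] { return stop_ || next_ < count_; });
            if (stop_) return;
            const std::size_t i = next_++;  // claim one index at a time
            lk.unlock();
            fn_(i);                         // do the work outside the lock
            lk.lock();
            if (--pending_ == 0) done_.notify_one();
        }
    }
    std::mutex m_;
    std::condition_variable wake_, done_;
    std::function<void(std::size_t)> fn_;
    std::size_t next_ = 0, count_ = 0, pending_ = 0;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};
```

Handing out one index at a time mirrors schedule(dynamic,1); whether
this beats the OpenMP runtime (which may itself keep its threads alive
between parallel regions) would need to be measured.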
Something to chew on. Thanks.
Not being a DSP person, I find the math interesting, and not
believable, so I am assuming I have something totally screwed up.

  10 Msps = 1e-7 sec/sample
  5 GHz   = 2e-10 sec/clock

Or an average of 500 CPU clocks between samples. Assuming 8 clocks per
instruction (an arbitrary and unsupported number) with zero overhead
(e.g., memory access), that is 62 instructions between samples on a
single core. Assuming that math is somewhere near correct, I can't
really be ashamed of my low-end value of 7 Msps processing rate across
five cores, but a best case rejection rate of ~1e-8? That would be one
heck of a fast if() and virtual function call statement; and I don't
believe it, but I haven't (yet) found anything to debunk my numbers.
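
The budget arithmetic above can be written down as a couple of small
functions, which makes it easy to redo for other sample rates or clock
speeds. The function names are hypothetical, and the clocks-per-
instruction figure is the same arbitrary assumption as in the text.

```cpp
#include <cassert>

// Clock cycles available per sample on one core.
constexpr double clocks_per_sample(double sample_rate_hz, double clock_hz)
{
    return clock_hz / sample_rate_hz;
}

// Whole instructions available per sample, given an assumed average
// clocks-per-instruction figure (8 in the text, stated as arbitrary).
inline int instructions_per_sample(double sample_rate_hz,
                                   double clock_hz,
                                   double clocks_per_instruction)
{
    assert(sample_rate_hz > 0.0 && clocks_per_instruction > 0.0);
    return static_cast<int>(clocks_per_sample(sample_rate_hz, clock_hz) /
                            clocks_per_instruction);
}
```

With 10e6 samples/sec and a 5e9 Hz clock this gives 500 clocks and,
at 8 clocks per instruction, 62 instructions per sample, matching the
figures above.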
The test code is pretty simple and matches observed wall-clock time:
  int                       noutput_items = (int)s.get()->size();
  gr_vector_int             ninput_items { (int)s.get()->size() };
  gr_vector_const_void_star input_items = {
    malloc( sizeof(complex) * noutput_items )
  };
  gr_vector_void_star       output_items = {
    malloc( sizeof(complex) * noutput_items )
  };
...
  for( float level : full_range ) {
    preamble->set_gain( level );
    t_start = std::chrono::high_resolution_clock::now();
    for( int loop=0; loop < LOOPS; ++loop ) {
      (void)preamble->general_work( noutput_items, ninput_items,
                                    input_items, output_items );
    }
    t_stop = std::chrono::high_resolution_clock::now();
    t_span = std::chrono::duration_cast<std::chrono::duration<double>>
      ( t_stop - t_start );
    std::cout << "Rx Detect Elapsed: "
              << t_span.count() << " sec"
              << ", samp=" << s.get()->size()
              << ", samp/sec="
              << (s.get()->size()*LOOPS/t_span.count())
              << ", gain=" << preamble->gain()
              << std::endl;
  }
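
The timing loop above can be factored into a small reusable helper. This
is a sketch, not the code that produced the numbers in this thread:
measure_rate is a hypothetical name, and it uses
std::chrono::steady_clock rather than high_resolution_clock because
steady_clock is guaranteed monotonic, while high_resolution_clock may be
an alias for the adjustable system clock on some implementations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Call work() `loops` times and return throughput in samples per second,
// where each call is assumed to process samples_per_call samples.
template <typename Work>
double measure_rate(Work&& work, std::size_t samples_per_call, int loops)
{
    assert(loops > 0);
    using clock = std::chrono::steady_clock;

    const auto t_start = clock::now();
    for (int i = 0; i < loops; ++i)
        work();
    const std::chrono::duration<double> t_span = clock::now() - t_start;

    return static_cast<double>(samples_per_call) * loops / t_span.count();
}
```

In the test harness above, work would be a lambda wrapping the
preamble->general_work(...) call and samples_per_call would be
noutput_items.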
> Doug
>
> -- 
> Doug Geiger
> doug.gei...@bioradiation.net

_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
