On Fri, 2015-08-14 at 15:17 -0400, Douglas Geiger wrote:
> On Fri, Aug 14, 2015 at 2:03 PM, Dennis Glatting <gnura...@pki2.com>
> wrote:
> > Sorry for the HTML...
> >
> > I have done some work applying OpenMP to GNURadio and collected
> > some data. This data was collected WITHOUT GNURadio overhead.
> > Specifically, I interfaced directly with my detector, passing 30
> > seconds (30 seconds * 10 Msps) of data in a buffer (i.e., I
> > allocated and filled gr_vector_const_void_star, etc.) and
> > calculated the performance at different sensitivities. Herein lies
> > one problem.
> >
> > When applying OpenMP against a buffer there has to be enough data
> > to make it worthwhile, but GNURadio buffers are fairly small. I
> > don't see a reasonable way to increase buffer sizes for a single
> > source->block without modifying the constant in flat_flowgraph.cc,
> > which has the side effect of changing the default size for all
> > buffers. Yes?
>
> Is this a sink block (i.e. no outputs)? In general, IIRC, you have
> more control on output buffer size (since you own them, and input
> buffers are owned by upstream blocks). You can call
> set_output_multiple()/set_min_output_buffer(...)/set_min_noutput_items(...)
> to influence output buffer size (and for a sync_block, and
> therefore a sync_decimator/sync_interpolator, that has a
> corresponding influence on the input buffer size). Others may correct
> me on how much influence sink blocks have in current releases...

hackRF/BladeRF Source -> Preamble Detector (the block) -> multiple
blocks. If the preamble detector detects a signal it forwards that set
of samples to a "framer" and a GUI Sink.

> > I am looking for a way to measure GNURadio overhead. There is a
> > certain amount of overhead depending on the number of blocks, set()
> > functions, GUI Sinks, etc. and I'd like to know what that overhead
> > is. Ideas?
>
> What exactly are you interested in measuring when you say 'overhead'?
> Are you talking about memory usage?
> CPU usage? Latency (and if you're interested in latency, do you mean
> one-way, two-way)?

CPU. Latency. One way.

I have samples coming in at a certain rate into a limited-size buffer.
I need to know the servicing interval (latency) in an attempt to
architect a solution to prevent or reduce SDR overruns. There is a
scheduler decision process to release the block for execution, and
there is overhead before calling general_work() (e.g., in
block_executor.cc), such as updating read/write pointers, checking
whether tags are present, maybe servicing performance counters, etc.
At high sample rates that can reduce the processing rate of samples.

> > One thought is to set a hardware pin low in the source block and set
> > it high in the detector block, then measure with a scope. The
> > problem is these pins often incur kernel overhead by opening
> > something in /dev, writing a string, then closing the device and
> > waiting for the kernel to get around to actually toggling the pin.
> > Measurements showed this is wildly unpredictable. Another option is
> > to toggle a pin on an SDR but the same problem exists with
> > additional USB transaction delays.
>
> This sounds like you are interested in one-way latency... maybe?
>
> > Anyway, in the data below a "signal" buffer is defined as ~1200
> > samples (i.e., MAXimum message size), or 2x1200=2400 complex number
> > "chunks". I found 2xMAX a reasonable value because it is within a
> > reasonable buffer amount from GNURadio with my alteration to
> > flat_flowgraph.cc. The OpenMP code really looks like this:
> >
> >     #pragma omp parallel for num_threads(ncpu),schedule(dynamic,1) \
> >         if( work_list.size() > 1 )
> >     for( size_t i = 0; i < work_list.size(); ++i ) {
> >
> >         do work...
> >     }
> >
> > Pretty simple.
> >
> > That said, unless GNURadio can provide a selective and reasonably
> > large amount of samples to process then the value of applying
> > OpenMP is probably moot.
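
For readers who want to try the loop pattern quoted above, here is a
minimal, self-contained sketch. It is not the detector code from this
thread: WorkItem, its energy field, and process_work_list() are
hypothetical names invented for illustration, and the "work" is just a
sum of squares. It only assumes that each work item can be processed
independently of the others.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work item: one candidate span of samples to examine.
struct WorkItem {
    std::vector<float> samples;  // stand-in for the real complex samples
    double energy = 0.0;         // result written by whichever thread runs it
};

// Process every item in the list. Iterations are independent, so OpenMP
// may hand them out to threads one at a time (schedule(dynamic, 1)).
// The if() clause keeps a single-item list on the calling thread.
// Built without -fopenmp, the pragma is ignored and the loop runs
// serially with the same results.
inline void process_work_list(std::vector<WorkItem>& work_list, int ncpu) {
    assert(ncpu > 0);
    // A signed loop index is accepted by every OpenMP version.
    const long n = static_cast<long>(work_list.size());
    #pragma omp parallel for num_threads(ncpu) schedule(dynamic, 1) if (n > 1)
    for (long i = 0; i < n; ++i) {
        double sum = 0.0;
        for (float s : work_list[i].samples) {
            sum += static_cast<double>(s) * s;
        }
        work_list[i].energy = sum;  // each iteration writes only its own item
    }
}
```

Calling process_work_list(items, 5) on an eight-core machine mirrors the
"five of eight cores" choice described in this thread, leaving the
remaining cores to the scheduler and other blocks.
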
> > Below, the term "sensitivity" is a bit of a misnomer because as
> > sensitivity increases signal rejection increases; but the text is
> > what it is. More specifically, there are roughly twelve sets of
> > criteria that need to be met before signal presence is declared.
> > Some of those criteria involve std::log10() and std::pow(10.0,x)
> > operations but interestingly those math operations are a very small
> > amount of the detection effort (1.02% worst case).
> >
> > The numbers in the first block below are the rates, in
> > samples/second, at which I can process samples. For example,
> > "Baseline" "45.0" is "1,662,796" samples per second.
> >
> > From an OpenMP perspective, I have eight cores but limited the
> > effort to five with the idea GNURadio overhead and other blocks have
> > three cores to do their thing, worst case. OpenMP gave me a >300%
> > performance gain in "OMP 5core,2xMAX" but the theoretical gain is
> > 400%. Not perfect but I'll take it. What these numbers tell me is
> > OpenMP can have significant value in the context of GNURadio.
> >
> > My code was run on an AMD 9590 at 4.7GHz, 5GHz boost -- my
> > development platform. For reasons, I also ran it on a CubieBoard4
> > (ARM architecture). I should also mention I have seen NO side
> > effects of running OpenMP within GNURadio other than considerable
> > amusement.
>
> So using OpenMP inside a work function is a perfectly reasonable way
> to try to accelerate (via parallelization) that particular work
> function - obviously at some point you are fighting against the
> thread-per-block parallelization of GNURadio, so telling OMP to use
> fewer cores than your machine has is a reasonable way to deal with
> this. My experience with OMP has indicated that the thread-spawning
> that happens each time you enter the work() function has a cost, and
> therefore instantiating a thread-pool in the block constructor may
> give better results, but in the end the real question you have to ask
> is: what amount of work is required to achieve the task at hand.
>
> For example, collapsing the functions of multiple blocks into a
> single, larger (super)block can increase performance because you
> aren't shuffling data in-between blocks. Implementing custom thread
> pools is another strategy. Writing lots of hand-optimized SIMD code
> (preferably inside VOLK!) can help as well. Ultimately the question
> is: what is the least amount of work required to make the thing do
> what you need 'good-enough', where good-enough is some measure(s) of
> performance on the target platform.
>
> Basically what I'm saying is, there isn't a single answer to the
> question of 'what is the overhead of GNURadio', because not only is
> that a moving target, but it depends on what platform you're
> targeting, and it depends on what measure of 'overhead' you really
> care about. Not to mention the various knobs (e.g. the different ways
> blocks can influence buffer-size - etc.) you have to control, e.g.
> one-way latency, or computational load.

Something to chew on. Thanks.

Not being a DSP person, the math is interesting; and not believable,
so I am assuming I have something totally screwed up.

  10 Msps = 1e-7 s/sample.
  5 GHz   = 2e-10 s/clock.

Or an average of 500 CPU clocks between samples.
Assuming 8 clocks per instruction (an arbitrary and unsupported number)
with zero overhead (e.g., memory access), that is about 62 instructions
between samples on a single core.

Assuming that math is somewhere near correct, I can't really be ashamed
of my low-end value of 7 Msps processing rate across five cores, but a
best-case rejection rate of ~1e-8? That would be one heck of a fast if()
and virtual function call statement; and I don't believe it, but I
haven't (yet) found anything to debunk my numbers.

The test code is pretty simple and meets observed wall clock:

    int noutput_items = (int)s.get()->size();
    gr_vector_int ninput_items { (int)s.get()->size() };
    gr_vector_const_void_star input_items =
        { malloc( sizeof(complex) * noutput_items ) };
    gr_vector_void_star output_items =
        { malloc( sizeof(complex) * noutput_items ) };

    ...

    for( float level : full_range ) {

        preamble->set_gain( level );

        t_start = std::chrono::high_resolution_clock::now();
        for( int loop=0; loop < LOOPS; ++loop ) {
            (void)preamble->general_work( noutput_items, ninput_items,
                                          input_items, output_items );
        }
        t_stop = std::chrono::high_resolution_clock::now();

        t_span = std::chrono::duration_cast<std::chrono::duration<double>>
                     ( t_stop - t_start );
        std::cout << "Rx Detect Elapsed: " << t_span.count() << " sec"
                  << ", samp=" << s.get()->size()
                  << ", samp/sec="
                  << (s.get()->size()*LOOPS/t_span.count())
                  << ", gain=" << preamble->gain()
                  << std::endl;
    }

> Doug
>
> --
> Doug Geiger
> doug.gei...@bioradiation.net
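
The per-sample clock budget worked out in this message can be checked
with a few lines of C++. This is only a sketch of that arithmetic; the
function names are made up for illustration, and the clocks-per-
instruction figure is the same arbitrary assumption as in the text, not
a measured value.

```cpp
#include <cassert>

// Clock cycles available per sample on a single core:
// (cycles/second) / (samples/second) = cycles/sample.
constexpr double cycles_per_sample(double clock_hz, double sample_rate_hz) {
    return clock_hz / sample_rate_hz;
}

// Whole instructions that fit in that budget for an assumed
// clocks-per-instruction figure (ignores memory stalls, caches, etc.).
inline long instructions_per_sample(double clock_hz, double sample_rate_hz,
                                    double clocks_per_instruction) {
    assert(sample_rate_hz > 0.0 && clocks_per_instruction > 0.0);
    return static_cast<long>(
        cycles_per_sample(clock_hz, sample_rate_hz) / clocks_per_instruction);
}
```

With clock_hz = 5e9, sample_rate_hz = 1e7 and 8 clocks per instruction,
this gives 500 cycles and 62 whole instructions per sample, matching the
figures above.
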
_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio