You can make this a single program by using a broadcast data set. That streams the file into both the average function and the filter at the same time, but the data on one of the two streams then has to be buffered (and possibly spilled) until the average is available. The collect() approach is probably easier, and it is fine as long as reading the file twice does not kill you.
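
To make the trade-off concrete, here is a rough sketch in plain Python. It is not Flink API code: the function names and the `read` callable (a stand-in for reading the file) are made up for illustration, and I am assuming that "filter" means dropping items below the average. The first variant mirrors the collect() approach (two scans, nothing buffered), the second mirrors the single-program approach (one scan, everything buffered until the average is known).

```python
from typing import Callable, Iterable, Iterator, List


def average(values: Iterable[float]) -> float:
    """Aggregate sum and count in a single scan and return the mean."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    if count == 0:
        raise ValueError("cannot average an empty data set")
    return total / count


def filter_two_scans(read: Callable[[], Iterator[float]]) -> List[float]:
    """collect()-style: scan the source twice, buffer nothing in between."""
    threshold = average(read())  # first scan, result brought back to the caller
    return [v for v in read() if v >= threshold]  # second scan applies the filter


def filter_one_scan(read: Callable[[], Iterator[float]]) -> List[float]:
    """Single-program style: scan once, but hold every record until the
    average is available (this is the part that may have to spill)."""
    buffered = list(read())  # the only scan of the source
    threshold = average(buffered)
    return [v for v in buffered if v >= threshold]
```

Both variants return the same result; the difference is only whether you pay with a second read of the input or with buffering the whole input.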

On Tue, Nov 3, 2015 at 11:19 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Johannes,
>
> that's the way to do it, IMO.
>
> Cheers, Fabian
>
> 2015-11-03 22:39 GMT+01:00 Kirschnick, Johannes <johannes.kirschn...@tu-berlin.de>:
>
> > Hi List,
> >
> > I am stuck on a simple problem and thought somebody might point me in the
> > right direction.
> >
> > Basically I’m trying to
> >
> > - Get a measure of my dataset, say the average of some field
> > - Filter all items which fall below the threshold
> >
> > I’m currently using just a straightforward implementation of
> >
> > - first aggregating
> > - collecting the average
> > - broadcasting this value to the filter
> >
> > This ends up reading the entire dataset twice, so I thought there has got to
> > be a way of either using iterations in a clever way or some other method?
> >
> > Thanks
> >
> > Johannes