Stratified sampling would also be beneficial for the DataSet API. I think it would be best to add this method to DataSetUtils as well, or to make it available via the flink-contrib module. It would be easiest if you created the JIRA issue for this feature yourself, since you know best what you want to add. To do that, register at https://issues.apache.org/jira if you haven't already, and then we can add you as a contributor. Based on the JIRA description, we can then discuss possible implementations.

Cheers,
Till

On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone <par...@kth.se> wrote:

> Hey Do,
>
> I think that more sophisticated samplers could make a better fit in the ML
> library and not in the core API but I am not very familiar with the
> milestones there.
> Maybe the maintainers of the batch ML library could check if sampling
> techniques could be useful there I guess.
>
> Paris
>
> > On 11 Jul 2016, at 16:15, Le Quoc Do <lequo...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Thank you all for your answers.
> > By the way, I also recognized that Flink doesn't support "stratified
> > sampling" function (only simple random sampling) for DataSet.
> > It would be nice if someone can create a Jira for it, and assign the task
> > to me so that I can work for it.
> >
> > Thank you,
> > Do
> >
> > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > vasilikikala...@gmail.com> wrote:
> >
> >> Hi Do,
> >>
> >> Paris and Martha worked on sampling techniques for data streams on Flink
> >> last year. If you want to implement your own samplers, you might find
> >> Martha's master thesis helpful [1].
> >>
> >> -Vasia.
> >>
> >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> >>
> >> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
> >> wrote:
> >>
> >>> Hi Do,
> >>>
> >>> In DataStream you can always implement your own
> >>> sampling function, hopefully without too much effort.
> >>>
> >>> Adding such functionality it to the API could be a good idea.
> >>> But given that in sampling there is no “one-size-fits-all”
> >>> solution (as not every use case needs random sampling and not
> >>> all random samplers fit to all workloads), I am not sure if we
> >>> should start adding different sampling operators.
> >>>
> >>> Thanks,
> >>> Kostas
> >>>
> >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> >>>>
> >>>> Hi Do,
> >>>>
> >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> >>>> @PublicEvolving which is intended for public use, has stable behavior, but
> >>>> method signatures may change. It's also good to limit DataSet to common
> >>>> methods whereas the utility methods tend to be used for specific
> >>>> applications.
> >>>>
> >>>> I don't have the pulse of streaming but this sounds like a useful feature
> >>>> that could be added.
> >>>>
> >>>> Greg
> >>>>
> >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I'm working on approximate computing using sampling techniques. I
> >>>>> recognized that Flink supports the sample function for Dataset
> >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> >>>>> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> >>>>> since the sample function works as a transformation operator?
> >>>>>
> >>>>> The second question is that are you planning to support the sample
> >>>>> function for DataStream (within windows) since I did not see it in
> >>>>> DataStream code ?
> >>>>>
> >>>>> Thank you,
> >>>>> Do
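
For reference, the stratified sampling discussed in this thread could look roughly like the sketch below. It is plain Java over an in-memory list and does not use the Flink API. The class name `StratifiedSampler`, the method `sample`, and all of its parameters are hypothetical names chosen for illustration, and drawing a fixed fraction per stratum without replacement is only one possible design, not something agreed in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: a minimal in-memory stratified sampler.
final class StratifiedSampler {

    private StratifiedSampler() {}

    /**
     * Groups the input by key (one stratum per key) and draws
     * round(fraction * stratumSize) elements without replacement from each stratum.
     */
    static <T, K> List<T> sample(List<T> data, Function<T, K> keyOf,
                                 double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }

        // Build the strata, keeping the order in which keys first appear.
        Map<K, List<T>> strata = new LinkedHashMap<>();
        for (T item : data) {
            strata.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }

        // Shuffle each stratum with a seeded generator and keep a prefix of it.
        Random rnd = new Random(seed);
        List<T> result = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            Collections.shuffle(stratum, rnd);
            int n = (int) Math.round(fraction * stratum.size());
            result.addAll(stratum.subList(0, n));
        }
        return result;
    }
}
```

Unlike simple random sampling over the whole input, this keeps every key represented in proportion to its size, which is the property the JIRA description would presumably need to spell out. A distributed version would have to do the grouping and the per-stratum draw inside the framework rather than on a local list.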