Hi Till,

I have created the JIRA: https://issues.apache.org/jira/browse/FLINK-4205

Thank you,
Do

On Tue, Jul 12, 2016 at 6:05 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> Stratified sampling would also be beneficial for the DataSet API. I think
> it would be best if this method is also added to DataSetUtils or made
> available via the flink-contrib module. Furthermore, I think that it would
> be easiest if you created the JIRA for this feature, because you know what
> you want to add. For that you have to register at
> https://issues.apache.org/jira, if you haven't done this, and then we can
> add you as a contributor. Based on the JIRA description we can
> discuss possible implementations then.
>
> Cheers,
> Till
>
> On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone <par...@kth.se> wrote:
>
> > Hey Do,
> >
> > I think that more sophisticated samplers could make a better fit in the
> > ML library and not in the core API but I am not very familiar with the
> > milestones there.
> > Maybe the maintainers of the batch ML library could check if sampling
> > techniques could be useful there I guess.
> >
> > Paris
> >
> > > On 11 Jul 2016, at 16:15, Le Quoc Do <lequo...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > Thank you all for your answers.
> > > By the way, I also noticed that Flink doesn't support a "stratified
> > > sampling" function (only simple random sampling) for DataSet.
> > > It would be nice if someone can create a Jira for it, and assign the
> > > task to me so that I can work on it.
> > >
> > > Thank you,
> > > Do
> > >
> > > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > > vasilikikala...@gmail.com> wrote:
> > >
> > >> Hi Do,
> > >>
> > >> Paris and Martha worked on sampling techniques for data streams on
> > >> Flink last year. If you want to implement your own samplers, you
> > >> might find Martha's master thesis helpful [1].
> > >>
> > >> -Vasia.
> > >>
> > >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> > >>
> > >> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
> > >> wrote:
> > >>
> > >>> Hi Do,
> > >>>
> > >>> In DataStream you can always implement your own
> > >>> sampling function, hopefully without too much effort.
> > >>>
> > >>> Adding such functionality to the API could be a good idea.
> > >>> But given that in sampling there is no “one-size-fits-all”
> > >>> solution (as not every use case needs random sampling and not
> > >>> all random samplers fit all workloads), I am not sure if we
> > >>> should start adding different sampling operators.
> > >>>
> > >>> Thanks,
> > >>> Kostas
> > >>>
> > >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> > >>>>
> > >>>> Hi Do,
> > >>>>
> > >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> > >>>> @PublicEvolving which is intended for public use, has stable
> > >>>> behavior, but method signatures may change. It's also good to limit
> > >>>> DataSet to common methods whereas the utility methods tend to be
> > >>>> used for specific applications.
> > >>>>
> > >>>> I don't have the pulse of streaming but this sounds like a useful
> > >>>> feature that could be added.
> > >>>>
> > >>>> Greg
> > >>>>
> > >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I'm working on approximate computing using sampling techniques. I
> > >>>>> noticed that Flink supports the sample function for DataSet
> > >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just
> > >>>>> wondering why you didn't merge the function into
> > >>>>> org/apache/flink/api/java/DataSet.java since the sample function
> > >>>>> works as a transformation operator?
> > >>>>>
> > >>>>> The second question is: are you planning to support the sample
> > >>>>> function for DataStream (within windows), since I did not see it
> > >>>>> in the DataStream code?
> > >>>>>
> > >>>>> Thank you,
> > >>>>> Do
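
For readers unfamiliar with the term, here is a rough sketch of what "stratified sampling" means in the thread above, as opposed to simple random sampling over the whole data set. It is plain Java and deliberately does not use any Flink API; the class name `StratifiedSampleSketch`, the method `sample`, and its parameters are hypothetical names chosen for illustration and are not taken from FLINK-4205 or DataSetUtils. It assumes a fixed-size sample per stratum, drawn with reservoir sampling, which is only one of several possible designs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Flink or DataSetUtils.
public class StratifiedSampleSketch {

    /**
     * Draws up to perStratum elements from each stratum in a single pass,
     * keeping one reservoir (Algorithm R) per stratum key.
     */
    public static <T, K> Map<K, List<T>> sample(
            Iterable<T> input, Function<T, K> stratumOf, int perStratum, long seed) {
        Random rnd = new Random(seed);
        Map<K, List<T>> reservoirs = new HashMap<>();
        Map<K, Integer> seen = new HashMap<>(); // elements seen so far per stratum

        for (T element : input) {
            K key = stratumOf.apply(element);
            List<T> reservoir = reservoirs.computeIfAbsent(key, k -> new ArrayList<>());
            int count = seen.merge(key, 1, Integer::sum);

            if (reservoir.size() < perStratum) {
                // Reservoir not yet full: always keep the element.
                reservoir.add(element);
            } else {
                // Replace a random slot with probability perStratum / count.
                int slot = rnd.nextInt(count);
                if (slot < perStratum) {
                    reservoir.set(slot, element);
                }
            }
        }
        return reservoirs;
    }
}
```

Calling `StratifiedSampleSketch.sample(records, r -> r.getCategory(), 10, 42L)` (with a hypothetical `getCategory` accessor) would return at most ten records per category, so rare categories are still represented in the sample, which simple random sampling does not guarantee.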