PR for top: https://github.com/apache/beam/pull/29106
On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev <dev@beam.apache.org> wrote:

> +1 on this idea. Thanks!
>
> On Thu, Oct 19, 2023 at 3:40 PM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Yeah, I already implemented these partitioners for my use case (I just
>> pasted the classnames/docstrings for them) and I used both combiners.Top
>> and combiners.Sample.
>>
>> In fact, before writing these partitioners I had misunderstood those
>> combiners and thought they would partition my pcollections. Not sure if
>> that might be a common pitfall.
>>
>> On Thu, Oct 19, 2023 at 3:32 PM Anand Inguva via dev <dev@beam.apache.org>
>> wrote:
>>
>>> FYI, there is a Top transform[1] that will fetch the greatest n elements
>>> in the Python SDK. It is not a partitioner, but it may be useful for
>>> your reference.
>>>
>>> [1] https://github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191
>>>
>>> On Thu, Oct 19, 2023 at 3:21 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Yes, both need to be small enough to fit into state.
>>>>
>>>> Yeah, a percentage sampler would also be great; we have a bunch of use
>>>> cases for that ourselves. Not sure if it'd be too clever, but I was
>>>> imagining three public sampling partitioners: FixedSample,
>>>> PercentageSample, and Sample. Sample could automatically choose between
>>>> FixedSample and PercentageSample based on whether a percentage is given
>>>> or a large `n` is given.
>>>>
>>>> For `PercentageSample`, I was imagining we'd just take a count of the
>>>> number of elements, assign every element a `rand`, and keep the ones
>>>> whose `rand` is less than `n / Count(inputs)` (or the percentage). For
>>>> runners that have fast counting, it should perform quickly. Open to
>>>> other ideas though.
>>>>
>>>> Cheers,
>>>> Joey
>>>>
>>>> On Thu, Oct 19, 2023 at 3:10 PM Danny McCormick via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I'm interested in adding something like this; I could see these being
>>>>> generally useful for a number of cases (one that immediately comes to
>>>>> mind is partitioning datasets into train/test/validation sets and
>>>>> writing each to a different place).
>>>>>
>>>>> I'm assuming Top (or FixedSample) needs to be small enough to fit into
>>>>> state? I would also be interested in being able to do percentages as
>>>>> well (something like partitioners.Sample(percent=10)), though that
>>>>> might be much more challenging for an unbounded data set (maybe we
>>>>> could do something as simple as a probabilistic target_percentage).
>>>>>
>>>>> Happy to help review a design doc or PR.
>>>>>
>>>>> Thanks,
>>>>> Danny
>>>>>
>>>>> On Thu, Oct 19, 2023 at 10:06 AM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> While writing a few pipelines, I was surprised by how few
>>>>>> partitioners there were in the Python SDK. I wrote a couple that are
>>>>>> pretty generic and possibly generally useful. Just wanted to do a
>>>>>> quick poll to see if they seem useful enough to be in the SDK's
>>>>>> library of transforms. If so, I can put together a PTransform Design
>>>>>> Doc[1] for them. Just wanted to confirm before spending time on the
>>>>>> doc.
>>>>>>
>>>>>> Here are the two that I wrote; I'll just paste the class names and
>>>>>> docstrings:
>>>>>>
>>>>>> class FixedSample(beam.PTransform):
>>>>>>     """
>>>>>>     A PTransform that takes a PCollection and partitions it into two
>>>>>>     PCollections. The first PCollection is a random sample of the
>>>>>>     input PCollection, and the second PCollection is the remaining
>>>>>>     elements of the input PCollection.
>>>>>>
>>>>>>     This is useful for creating holdout / test sets in machine
>>>>>>     learning.
>>>>>>
>>>>>>     Example usage:
>>>>>>
>>>>>>     >>> with beam.Pipeline() as p:
>>>>>>     ...     sample, remaining = (p
>>>>>>     ...         | beam.Create(list(range(10)))
>>>>>>     ...         | partitioners.FixedSample(3))
>>>>>>     ...     # sample will contain three randomly selected elements
>>>>>>     ...     # from the input PCollection
>>>>>>     ...     # remaining will contain the remaining seven elements
>>>>>>     """
>>>>>>
>>>>>> class Top(beam.PTransform):
>>>>>>     """
>>>>>>     A PTransform that takes a PCollection and partitions it into two
>>>>>>     PCollections. The first PCollection contains the largest n
>>>>>>     elements of the input PCollection, and the second PCollection
>>>>>>     contains the remaining elements of the input PCollection.
>>>>>>
>>>>>>     Parameters:
>>>>>>         n: The number of elements to take from the input PCollection.
>>>>>>         key: A function that takes an element of the input
>>>>>>             PCollection and returns a value to compare for the
>>>>>>             purpose of determining the top n elements, similar to
>>>>>>             Python's built-in sorted function.
>>>>>>         reverse: If True, the top n elements will be the n smallest
>>>>>>             elements of the input PCollection.
>>>>>>
>>>>>>     Example usage:
>>>>>>
>>>>>>     >>> with beam.Pipeline() as p:
>>>>>>     ...     top, remaining = (p
>>>>>>     ...         | beam.Create(list(range(10)))
>>>>>>     ...         | partitioners.Top(3))
>>>>>>     ...     # top will contain [7, 8, 9]
>>>>>>     ...     # remaining will contain [0, 1, 2, 3, 4, 5, 6]
>>>>>>     """
>>>>>>
>>>>>> They're basically partitioner versions of the combiners Top and
>>>>>> Sample.
>>>>>>
>>>>>> Best,
>>>>>> Joey
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1NpCipgvT6lMgf1nuuPPwZoKp5KsteplFancGqOgy8OY/edit#heading=h.x9snb54sjlu9
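For illustration, a rough sketch of how a Top partitioner along the lines of the docstring above might be assembled from the existing combiners.Top plus a tagged-output split. The class name TopPartitioner and every detail below are assumptions for illustration only, not the code from the PR linked at the top of the thread, and the membership check glosses over duplicate/tie handling:

import apache_beam as beam
from apache_beam.transforms import combiners


class TopPartitioner(beam.PTransform):
    """Splits a PCollection into (the n largest elements, everything else)."""

    def __init__(self, n, key=None, reverse=False):
        self._n = n
        self._key = key
        self._reverse = reverse

    def expand(self, pcoll):
        # Singleton PCollection holding the list of the n largest elements.
        top_list = pcoll | combiners.Top.Of(
            self._n, key=self._key, reverse=self._reverse)

        def split(element, top):
            # Membership by value mishandles repeated elements; a real
            # implementation would need to account for duplicates and ties.
            tag = 'top' if element in top else 'remaining'
            yield beam.pvalue.TaggedOutput(tag, element)

        results = pcoll | beam.FlatMap(
            split, top=beam.pvalue.AsSingleton(top_list)
        ).with_outputs('top', 'remaining')
        return results['top'], results['remaining']

Usage would mirror the docstring above, e.g. top, remaining = pcoll | TopPartitioner(3).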
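Likewise, a minimal sketch of the count-then-threshold idea described in the thread: count the input, turn a requested sample size n into a fraction n / count, and keep each element in the sample when a uniform random draw falls below that fraction. The name ApproximateFixedSample and the implementation details are assumptions, and the resulting sample size is only approximately n:

import random

import apache_beam as beam
from apache_beam.transforms import combiners


class ApproximateFixedSample(beam.PTransform):
    """Splits a PCollection into (~n sampled elements, the rest)."""

    def __init__(self, n):
        self._n = n

    def expand(self, pcoll):
        # Singleton PCollection holding the total element count.
        total = pcoll | combiners.Count.Globally()

        def split(element, count, n):
            # Keep each element independently with probability n / count,
            # so the sample ends up with roughly n elements.
            fraction = min(1.0, n / count) if count else 0.0
            tag = 'sample' if random.random() < fraction else 'remaining'
            yield beam.pvalue.TaggedOutput(tag, element)

        results = pcoll | beam.FlatMap(
            split, count=beam.pvalue.AsSingleton(total), n=self._n
        ).with_outputs('sample', 'remaining')
        return results['sample'], results['remaining']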