PR for top: https://github.com/apache/beam/pull/29106
On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev <dev@beam.apache.org> wrote:

> +1 on this idea. Thanks!
>
> On Thu, Oct 19, 2023 at 3:40 PM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Yeah, I already implemented these partitioners for my use case (I just
>> pasted the classnames/docstrings for them) and I used both combiners.Top
>> and combiners.Sample.
>>
>> In fact, before writing these partitioners I had misunderstood those
>> combiners and thought they would partition my pcollections. Not sure if
>> that might be a common pitfall.
>>
>> On Thu, Oct 19, 2023 at 3:32 PM Anand Inguva via dev <dev@beam.apache.org>
>> wrote:
>>
>>> FYI, there is a Top transform[1] that will fetch the greatest n elements
>>> in the Python SDK. It is not a partitioner, but it may be useful for
>>> your reference.
>>>
>>> [1] https://github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191
>>>
>>> On Thu, Oct 19, 2023 at 3:21 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Yes, both need to be small enough to fit into state.
>>>>
>>>> Yeah, a percentage sampler would also be great; we have a bunch of use
>>>> cases for that ourselves. Not sure if it'd be too clever, but I was
>>>> imagining three public sampling partitioners: FixedSample,
>>>> PercentageSample, and Sample. Sample could automatically choose between
>>>> FixedSample and PercentageSample based on whether a percentage is given
>>>> or a large `n` is given.
>>>>
>>>> For `PercentageSample`, I was imagining we'd just take a count of the
>>>> number of elements, assign every element a `rand`, and keep the ones
>>>> whose `rand` is less than `n / Count(inputs)` (or the percentage). For
>>>> runners that have fast counting, it should perform quickly. Open to
>>>> other ideas though.
>>>>
>>>> Cheers,
>>>> Joey
>>>>
>>>> On Thu, Oct 19, 2023 at 3:10 PM Danny McCormick via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I'm interested in adding something like this; I could see these being
>>>>> generally useful for a number of cases (one that immediately comes to
>>>>> mind is partitioning datasets into train/test/validation sets and
>>>>> writing each to a different place).
>>>>>
>>>>> I'm assuming Top (or FixedSample) needs to be small enough to fit into
>>>>> state? I would also be interested in being able to do percentages as
>>>>> well (something like partitioners.Sample(percent=10)), though that
>>>>> might be much more challenging for an unbounded data set (maybe we
>>>>> could do something as simple as a probabilistic target_percentage).
>>>>>
>>>>> Happy to help review a design doc or PR.
>>>>>
>>>>> Thanks,
>>>>> Danny
>>>>>
>>>>> On Thu, Oct 19, 2023 at 10:06 AM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> While writing a few pipelines, I was surprised by how few
>>>>>> partitioners there were in the Python SDK. I wrote a couple that are
>>>>>> pretty generic and possibly generally useful. Just wanted to do a
>>>>>> quick poll to see if they seem useful enough to be in the SDK's
>>>>>> library of transforms. If so, I can put together a PTransform Design
>>>>>> Doc[1] for them. Just wanted to confirm before spending time on the
>>>>>> doc.
>>>>>>
>>>>>> Here are the two that I wrote; I'll just paste the class names and
>>>>>> docstrings:
>>>>>>
>>>>>> class FixedSample(beam.PTransform):
>>>>>>     """
>>>>>>     A PTransform that takes a PCollection and partitions it into two
>>>>>>     PCollections. The first PCollection is a random sample of the
>>>>>>     input PCollection, and the second PCollection is the remaining
>>>>>>     elements of the input PCollection.
>>>>>>
>>>>>>     This is useful for creating holdout / test sets in machine
>>>>>>     learning.
>>>>>>
>>>>>>     Example usage:
>>>>>>
>>>>>>     >>> with beam.Pipeline() as p:
>>>>>>     ...     sample, remaining = (p
>>>>>>     ...         | beam.Create(list(range(10)))
>>>>>>     ...         | partitioners.FixedSample(3))
>>>>>>     ...     # sample will contain three randomly selected elements
>>>>>>     ...     # from the input PCollection
>>>>>>     ...     # remaining will contain the remaining seven elements
>>>>>>     """
>>>>>>
>>>>>> class Top(beam.PTransform):
>>>>>>     """
>>>>>>     A PTransform that takes a PCollection and partitions it into two
>>>>>>     PCollections. The first PCollection contains the largest n
>>>>>>     elements of the input PCollection, and the second PCollection
>>>>>>     contains the remaining elements of the input PCollection.
>>>>>>
>>>>>>     Parameters:
>>>>>>         n: The number of elements to take from the input PCollection.
>>>>>>         key: A function that takes an element of the input
>>>>>>             PCollection and returns a value to compare for the
>>>>>>             purpose of determining the top n elements, similar to
>>>>>>             Python's built-in sorted function.
>>>>>>         reverse: If True, the top n elements will be the n smallest
>>>>>>             elements of the input PCollection.
>>>>>>
>>>>>>     Example usage:
>>>>>>
>>>>>>     >>> with beam.Pipeline() as p:
>>>>>>     ...     top, remaining = (p
>>>>>>     ...         | beam.Create(list(range(10)))
>>>>>>     ...         | partitioners.Top(3))
>>>>>>     ...     # top will contain [7, 8, 9]
>>>>>>     ...     # remaining will contain [0, 1, 2, 3, 4, 5, 6]
>>>>>>     """
>>>>>>
>>>>>> They're basically partitioner versions of the combiners Top and
>>>>>> Sample.
>>>>>>
>>>>>> Best,
>>>>>> Joey
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1NpCipgvT6lMgf1nuuPPwZoKp5KsteplFancGqOgy8OY/edit#heading=h.x9snb54sjlu9
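For illustration, a rough sketch of how a Top partitioner along the lines of the docstring above might be assembled from the existing combiners.Top plus a tagged-output split. The class name TopPartitioner and every detail below are assumptions for illustration only, not the code from the PR linked at the top of the thread, and the membership check glosses over duplicate/tie handling:

import apache_beam as beam
from apache_beam.transforms import combiners


class TopPartitioner(beam.PTransform):
    """Splits a PCollection into (the n largest elements, everything else)."""

    def __init__(self, n, key=None, reverse=False):
        self._n = n
        self._key = key
        self._reverse = reverse

    def expand(self, pcoll):
        # Singleton PCollection holding the list of the n largest elements.
        top_list = pcoll | combiners.Top.Of(
            self._n, key=self._key, reverse=self._reverse)

        def split(element, top):
            # Membership by value mishandles repeated elements; a real
            # implementation would need to account for duplicates and ties.
            tag = 'top' if element in top else 'remaining'
            yield beam.pvalue.TaggedOutput(tag, element)

        results = pcoll | beam.FlatMap(
            split, top=beam.pvalue.AsSingleton(top_list)
        ).with_outputs('top', 'remaining')
        return results['top'], results['remaining']

Usage would mirror the docstring above, e.g. top, remaining = pcoll | TopPartitioner(3).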
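Likewise, a minimal sketch of the count-then-threshold idea described in the thread: count the input, turn a requested sample size n into a fraction n / count, and keep each element in the sample when a uniform random draw falls below that fraction. The name ApproximateFixedSample and the implementation details are assumptions, and the resulting sample size is only approximately n:

import random

import apache_beam as beam
from apache_beam.transforms import combiners


class ApproximateFixedSample(beam.PTransform):
    """Splits a PCollection into (~n sampled elements, the rest)."""

    def __init__(self, n):
        self._n = n

    def expand(self, pcoll):
        # Singleton PCollection holding the total element count.
        total = pcoll | combiners.Count.Globally()

        def split(element, count, n):
            # Keep each element independently with probability n / count,
            # so the sample ends up with roughly n elements.
            fraction = min(1.0, n / count) if count else 0.0
            tag = 'sample' if random.random() < fraction else 'remaining'
            yield beam.pvalue.TaggedOutput(tag, element)

        results = pcoll | beam.FlatMap(
            split, count=beam.pvalue.AsSingleton(total), n=self._n
        ).with_outputs('sample', 'remaining')
        return results['sample'], results['remaining']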