Hey all, While writing a few pipelines, I was surprised by how few partitioners there were in the python SDK. I wrote a couple that are pretty generic and possibly generally useful. Just wanted to do a quick poll to see if they seem useful enough to be in the sdk's library of transforms. If so, I can put together a PTransform Design Doc[1] for them. Just wanted to confirm before spending time on the doc.
Here are the two that I wrote, I'll just paste the class names and docstrings: class FixedSample(beam.PTransform): """ A PTransform that takes a PCollection and partitions it into two PCollections. The first PCollection is a random sample of the input PCollection, and the second PCollection is the remaining elements of the input PCollection. This is useful for creating holdout / test sets in machine learning. Example usage: >>> with beam.Pipeline() as p: ... sample, remaining = (p ... | beam.Create(list(range(10))) ... | partitioners.FixedSample(3)) ... # sample will contain three randomly selected elements from the ... # input PCollection ... # remaining will contain the remaining seven elements """ class Top(beam.PTransform): """ A PTransform that takes a PCollection and partitions it into two PCollections. The first PCollection contains the largest n elements of the input PCollection, and the second PCollection contains the remaining elements of the input PCollection. Parameters: n: The number of elements to take from the input PCollection. key: A function that takes an element of the input PCollection and returns a value to compare for the purpose of determining the top n elements, similar to Python's built-in sorted function. reverse: If True, the top n elements will be the n smallest elements of the input PCollection. Example usage: >>> with beam.Pipeline() as p: ... top, remaining = (p ... | beam.Create(list(range(10))) ... | partitioners.Top(3)) ... # top will contain [7, 8, 9] ... # remaining will contain [0, 1, 2, 3, 4, 5, 6] """ They're basically partitioner versions of the aggregationers Top and Sample Best, Joey [1] https://docs.google.com/document/d/1NpCipgvT6lMgf1nuuPPwZoKp5KsteplFancGqOgy8OY/edit#heading=h.x9snb54sjlu9