Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev
I'm wondering if it would make sense to have a built-in Beam transformation for calculating the Cartesian product of PCollections. Just this past week, I've encountered two separate cases where calculating a Cartesian product was a bottleneck. The in-memory option of using something like Python's

Re: Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev
> > > > My team has an internal implementation of a CartesianProduct > transform, based on using hashing to split a pcollection into a finite > number of groups and CoGroupByKey. > > > > Could this be contributed to Beam? > If it would be of broader interest, I would be happy to work on this for t

beam.Create(range(N)) without building a sequence in memory

2022-09-19 Thread Stephan Hoyer via dev
Many of my Beam pipelines start with partitioning over some large, statically known number of inputs that could be created from a list of sequential integers. In Python, these sequential integers can be efficiently represented with a range() object, which stores the start/top and interval. However

Hierarchical fanout with Beam combiners?

2023-05-26 Thread Stephan Hoyer via dev
We have some use-cases where we are combining over very large sets (e.g., computing the average of 1e5 to 1e6 elements, corresponding to hourly weather observations over the past 50 years). "with_hot_key_fanout" seems to be rather essential for performing these calculations, but as far as I can te

Re: Hierarchical fanout with Beam combiners?

2023-05-27 Thread Stephan Hoyer via dev
anout=30, which means combining up to 3 GB of data on a single machine. This is probably fine but doesn't leave a large machine for error. > On Fri, May 26, 2023 at 1:57 PM Stephan Hoyer via dev < > dev@beam.apache.org> wrote: > >> We have some use-cases where we are co

Ensuring a task does not get executed concurrently

2023-06-12 Thread Stephan Hoyer via dev
Can the Beam data model (specifically the Python SDK) support executing functions that are idempotent but not concurrency-safe? I am thinking of a task like setting up a database (or in my case, a Zarr store in Xarray-Beam ) where it is no