I'm wondering if it would make sense to have a built-in Beam transform for calculating the Cartesian product of PCollections.
Just this past week, I've encountered two separate cases where calculating a Cartesian product was a bottleneck. The in-memory option of using something like Python's itertools.product() is convenient, but it only scales to a single node. Unfortunately, implementing a scalable Cartesian product seems to be somewhat non-trivial. I found two versions of this question on StackOverflow, but neither contains a code solution:

https://stackoverflow.com/questions/35008721/how-to-get-the-cartesian-product-of-two-pcollections
https://stackoverflow.com/questions/41050477/how-to-do-a-cartesian-product-of-two-pcollections-in-dataflow/

There's a fair amount of nuance in an efficient and scalable implementation. My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a PCollection into a finite number of groups and CoGroupByKey. On the other hand, if any of the input PCollections are small, using side inputs would probably be the way to go, since that avoids the need for a shuffle. A rough sketch of both approaches is below my signature.

Any thoughts?

Cheers,
Stephan
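P.S. For concreteness, here is a minimal sketch of the two approaches in the Python SDK. This is not our internal implementation, just an illustration: the function names and the num_buckets parameter are made up, and it glosses over real nuances like skew and picking the bucket count.

import random

import apache_beam as beam


def _emit_pairs(kv):
    _, grouped = kv
    rights = list(grouped['right'])  # materialize so we can iterate repeatedly
    for l in grouped['left']:
        for r in rights:
            yield l, r


def cartesian_via_cogroup(left, right, num_buckets=16):
    # Replicate every `left` element into all buckets, and assign each
    # `right` element to a single bucket (randomly here, though a
    # deterministic hash works too). Every (left, right) pair then meets
    # in exactly one CoGroupByKey group.
    lefts = left | 'ReplicateLeft' >> beam.FlatMap(
        lambda x: ((b, x) for b in range(num_buckets)))
    rights = right | 'BucketRight' >> beam.Map(
        lambda x: (random.randrange(num_buckets), x))
    return ({'left': lefts, 'right': rights}
            | 'CoGroup' >> beam.CoGroupByKey()
            | 'EmitPairs' >> beam.FlatMap(_emit_pairs))


def cartesian_via_side_input(large, small):
    # When one input is small enough to broadcast to every worker, a side
    # input avoids the shuffle entirely.
    return large | 'CrossWithSmall' >> beam.FlatMap(
        lambda x, side: ((x, y) for y in side),
        side=beam.pvalue.AsIter(small))

The bucket count in the first version trades off parallelism against replication: more buckets means more parallelism when emitting pairs, but each extra bucket pushes another full copy of the left input through the shuffle.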
