Pandas UDF cogroup.applyInPandas with multiple dataframes

Santosh Pingale Tue, 24 Jan 2023 06:05:10 -0800

Hey all

I have an interesting problem in hand. We have cases where we want to pass 
multiple(20 to 30) data frames to cogroup.applyInPandas function.


RDD currently supports cogroup with upto 4 dataframes (ZippedPartitionsRDD4)  
where as cogroup with pandas can handle only 2 dataframes (with 
ZippedPartitionsRDD2). In our use case, we do not have much control over how 
many data frames we may need in the cogroup.applyInPandas function.

To achieve this, we can:
(a) Implement ZippedPartitionsRDD5, 
ZippedPartitionsRDD..ZippedPartitionsRDD30..ZippedPartitionsRDD50 with 
respective iterators, serializers and so on. This ensures we keep type safety 
intact but a lot more boilerplate code has to be written to achieve this.
(b) Do not use cogroup.applyInPandas, rather use RDD.keyBy.cogroup and then 
getItem in a nested fashion. Then convert data to pandas df in the python 
function. This looks like a good workaround but mistakes are very easy to 
happen. We also don't look at typesafety here from user's point of view.
(c) Implement ZippedPartitionsRDDN and NaryLike with childrenNodes type set to 
Seq[T] which allows for arbitrary number of children to be set. Here we have 
very little boilerplate but we sacrifice type safety.
(d) ... some new suggestions... ?

I have done preliminary work on option (c). It works like a charm but before I 
proceed, is my concern about sacrificed type safety overblown, and do we have 
an approach (d)?
(a) is something that is too much of an investment for it to be useful. (b) is 
okay enough workaround, but it is not very efficient.

signature.asc
Description: Message signed with OpenPGP

Pandas UDF cogroup.applyInPandas with multiple dataframes

Reply via email to