Hey all I have an interesting problem in hand. We have cases where we want to pass multiple(20 to 30) data frames to cogroup.applyInPandas function.
RDD currently supports cogroup with upto 4 dataframes (ZippedPartitionsRDD4) where as cogroup with pandas can handle only 2 dataframes (with ZippedPartitionsRDD2). In our use case, we do not have much control over how many data frames we may need in the cogroup.applyInPandas function. To achieve this, we can: (a) Implement ZippedPartitionsRDD5, ZippedPartitionsRDD..ZippedPartitionsRDD30..ZippedPartitionsRDD50 with respective iterators, serializers and so on. This ensures we keep type safety intact but a lot more boilerplate code has to be written to achieve this. (b) Do not use cogroup.applyInPandas, rather use RDD.keyBy.cogroup and then getItem in a nested fashion. Then convert data to pandas df in the python function. This looks like a good workaround but mistakes are very easy to happen. We also don't look at typesafety here from user's point of view. (c) Implement ZippedPartitionsRDDN and NaryLike with childrenNodes type set to Seq[T] which allows for arbitrary number of children to be set. Here we have very little boilerplate but we sacrifice type safety. (d) ... some new suggestions... ? I have done preliminary work on option (c). It works like a charm but before I proceed, is my concern about sacrificed type safety overblown, and do we have an approach (d)? (a) is something that is too much of an investment for it to be useful. (b) is okay enough workaround, but it is not very efficient.
signature.asc
Description: Message signed with OpenPGP