Yes, there is definitely some similarity to GROUP BY.

What is LogicalTableFunctionScan used for?
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/logical/LogicalTableFunctionScan.html

Can I model a mapPartitions T -> U as a LogicalTableFunctionScan?
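To make the question concrete, here is a minimal plain-Python sketch (no Spark, no Calcite) of the mapPartitions T -> U behaviour I have in mind. The names map_partitions and score_tweets, the toy scoring rule, and the choice to partition by a key column are all assumptions made up for illustration; Spark partitions are physical and need not follow a key. The point is only that the function takes a whole partition and returns a new set of rows, which looks more like a table function than a scalar expression.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def map_partitions(
    rows: Iterable[Row],
    key: str,
    fn: Callable[[Iterator[Row]], Iterable[Row]],
) -> Iterator[Row]:
    """Call fn once per partition (assumption: partitions are defined by `key`).

    fn receives every row of one partition and may emit any number of rows
    with any columns, so both the width and the height of the data can change.
    """
    ordered = sorted(rows, key=itemgetter(key))
    for _, partition in groupby(ordered, key=itemgetter(key)):
        yield from fn(partition)


def score_tweets(partition: Iterator[Row]) -> Iterator[Row]:
    """Hypothetical stand-in for batch inference over one partition."""
    threshold = 0.5  # costly one-time setup (e.g. loading a model) would go here
    for row in partition:
        score = 1.0 if "good" in str(row["tweet"]) else 0.0  # toy scoring rule
        if score >= threshold:
            # Different columns out than in: (id, user, tweet) -> (id, sentiment)
            yield {"id": row["id"], "sentiment": score}


tweets: List[Row] = [
    {"id": 1, "user": "a", "tweet": "good day"},
    {"id": 2, "user": "a", "tweet": "bad day"},
    {"id": 3, "user": "b", "tweet": "good news"},
]

result = list(map_partitions(tweets, "user", score_tweets))
print(result)
```

Here three rows with three columns go in and two rows with two columns come out, which is the h > h' case that makes it feel like a filter and a projection at once.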
On Thu, Mar 4, 2021 at 12:06 PM Rui Wang <[email protected]> wrote:

> I feel like the mapPartitions can be implemented as a SELECT + GROUP BY,
> where GROUP BY is to partition the data, then per partition computation is
> handled by the SELECT.
>
> -Rui
>
> On Thu, Mar 4, 2021 at 11:56 AM Debajyoti Roy <[email protected]> wrote:
>
> > Thanks again Julian.
> >
> > Since, mapPartitions is really a specialized map would it be best to model
> > it as a SELECT (similar to functions inside an expression) ?
> > Barring cases where h > h' and mapPartitions acts like a filter.
> >
> > On Thu, Mar 4, 2021 at 11:41 AM Julian Hyde <[email protected]>
> > wrote:
> >
> > > SQL has equivalents of many functional programming idioms:
> > >
> > > * map is SELECT
> > > * filter is WHERE
> > > * flatMap is similar to CROSS APPLY
> > >
> > > That said, SQL’s strength is that the operations are not optimized for any
> > > particular physical organization of data (e.g. working on sorted or
> > > partitioned data). mapPartitions is in this category. Of course a physical
> > > implementation of one of SQL’s logical operators might use mapPartitions.
> > >
> > > Julian
> > >
> > > > On Mar 4, 2021, at 10:44 AM, Debajyoti Roy <[email protected]> wrote:
> > > >
> > > > Thanks for the responses, adding some more color below.
> > > >
> > > > Spark's API adopted concepts from the functional programming paradigm (map,
> > > > filter, flatmap,...) into data processing. Spark did add several relational
> > > > operators like join, union, select, etc. However, there are certain APIs
> > > > that are really hard to model in terms of standard relational operators.
> > > > Let me take one example of mapPartitions.
> > > >
> > > > mapPartitions( T -> U ):
> > > > w columns and h rows can turn into totally different w' != w columns and h'
> > > > != h rows. Since processing happens per partition, this API is a great
> > > > choice for vectorized heavyweight initialization cost operations e.g. batch
> > > > inferencing.
> > > >
> > > > In terms of relational models, mapPartitions can be modeled just like a
> > > > function inside an expression operator. However, there can be interesting
> > > > cases e.g. h > h' and mapPartitions starts to feel like a filter. Can there
> > > > be other challenges and opportunities in terms of planner and optimizer
> > > > because mapPartitions is definitely NOT like any other function inside an
> > > > expression as shown below:
> > > >
> > > > SELECT name, address, mapPartitions(id, tweet, '{threshold: 0.5}',
> > > > 'sentiment_analysis_4', 10000) FROM my_twitter_data...
> > > >
> > > > So what is a better usage example for mapPartitions expressed as SQL ? I am
> > > > really struggling with that part and I agree with Julian that is the key.
> > > >
> > > > Regards,
> > > > Debajyoti Roy
> > > >
> > > > On Thu, Mar 4, 2021 at 12:01 AM Julian Hyde <[email protected]> wrote:
> > > >
> > > >> I searched for mapPartitions and flatMapGroupsWithState, and it looks as
> > > >> if you are talking about Apache Spark operations. Can you give some
> > > >> examples of typical queries that would use these operations?
> > > >>
> > > >> It’s possible that these operations accomplish things that are not
> > > >> possible in the relational model; but I think it’s more likely that these
> > > >> are algorithms that can implement relational operations such as windowed
> > > >> aggregate functions. If you give some examples, we can see whether they can
> > > >> be expressed in SQL or relational algebra.
> > > >>
> > > >> Julian
> > > >>
> > > >>> On Mar 3, 2021, at 10:54 PM, Rui Wang <[email protected]> wrote:
> > > >>>
> > > >>> Well I think the expected approach is to translate other operations to
> > > >>> relational operators by yourself ;-)
> > > >>>
> > > >>> I think Calcite won't be the place to have extensions for such translation.
> > > >>> The main concern is that those non relational operations are "non-standard".
> > > >>>
> > > >>> -Rui
> > > >>>
> > > >>> On Wed, Mar 3, 2021 at 10:12 PM Debajyoti Roy <[email protected]> wrote:
> > > >>>
> > > >>>> Hi All,
> > > >>>>
> > > >>>> For operators like filter, join, union, aggregate, window the
> > > >>>> logical RelNode choices are obvious within org.apache.calcite.rel.logical
> > > >>>> package.
> > > >>>>
> > > >>>> However, for more complex applications that require operations like
> > > >>>> mapPartitions, flatMapGroupsWithState, etc. what would be some choices for
> > > >>>> logical rel node and relational expression classes in Apache Calcite
> > > >>>> (independent of engine)?
> > > >>>>
> > > >>>> What is a good way to model operators that are not traditionally relational
> > > >>>> and hence do not exist in Calcite (looking for hooks / extension points /
> > > >>>> roadmaps)?
> > > >>>>
> > > >>>> Thanks in advance for any pointers, (disclaimer: I am new to Calcite)
> > > >>>> Debajyoti Roy
