Created JIRA ticket: https://issues.apache.org/jira/browse/SPARK-15533
@Koert - Please keep API feedback coming. One thing - in the future, can you send API feedback to the dev@ list instead of user@?

On Wed, May 25, 2016 at 1:05 PM, Cheng Lian <l...@databricks.com> wrote:

> Agree, since they can be easily replaced by .flatMap (to do explosion) and .select (to rename output columns)
>
> Cheng
>
> On 5/25/16 12:30 PM, Reynold Xin wrote:
>
> Based on this discussion I'm thinking we should deprecate the two explode functions.
>
> On Wednesday, May 25, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> wenchen,
>> that definition of explode seems identical to flatMap, so you dont need it either?
>>
>> michael,
>> i didn't know about the column expression version of explode, that makes sense. i will experiment with that instead.
>>
>> On Wed, May 25, 2016 at 3:03 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>>
>>> I think we only need this version: `def explode[B : Encoder](f: A => TraversableOnce[B]): Dataset[B]`
>>>
>>> For the untyped one, `df.select(explode($"arrayCol").as("item"))` should be the best choice.
>>>
>>> On Wed, May 25, 2016 at 11:55 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> These APIs predate Datasets / encoders, so that is why they are Row instead of objects. We should probably rethink that.
>>>>
>>>> Honestly, I usually end up using the column expression version of explode now that it exists (i.e. explode($"arrayCol").as("item")). It would be great to understand more why you are using these instead.
>>>>
>>>> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> we currently have 2 explode definitions in Dataset:
>>>>>
>>>>> def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame
>>>>>
>>>>> def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame
>>>>>
>>>>> 1) the separation of the functions into their own argument lists is nice, but unfortunately scala's type inference doesn't handle this well, meaning that the generic types always have to be explicitly provided. i assume this was done to allow the "input" to be a varargs in the first method, and then kept the same in the second for reasons of symmetry.
>>>>>
>>>>> 2) i am surprised the first definition returns a DataFrame. this seems to suggest DataFrame usage (so DataFrame to DataFrame), but there is no way to specify the output column names, which limits its usability for DataFrames. i frequently end up using the first definition for DataFrames anyhow because of the need to return more than 1 column (and the data has columns unknown at compile time that i need to carry along, making flatMap on Dataset clumsy/unusable), but relying on the output columns being called _1 and _2 and renaming them afterwards seems like an anti-pattern.
>>>>>
>>>>> 3) using Row objects isn't very pretty. why not f: A => TraversableOnce[B] or something like that for the first definition? how about:
>>>>> def explode[A: TypeTag, B: TypeTag](input: Seq[Column], output: Seq[Column])(f: A => TraversableOnce[B]): DataFrame
>>>>>
>>>>> best,
>>>>> koert
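
For reference, a minimal sketch of the two replacements discussed above: the untyped route via functions.explode plus .as(...) for renaming, and the typed route via Dataset.flatMap. The column names ("id", "words", "word"), the sample data, and the local SparkSession setup are only for illustration, not from the thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    object ExplodeAlternatives {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("explode-alternatives")
          .getOrCreate()
        import spark.implicits._

        // illustrative data: an id column plus an array column to explode
        val df = Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "words")

        // untyped: column-expression explode, output column renamed via .as(...)
        df.select($"id", explode($"words").as("word")).show()

        // typed: flatMap does the same explosion, but carried-along columns
        // have to be re-emitted explicitly and renamed afterwards
        df.as[(Int, Seq[String])]
          .flatMap { case (id, words) => words.map(w => (id, w)) }
          .toDF("id", "word")
          .show()

        spark.stop()
      }
    }

The flatMap version has to re-emit the id column and rename _1/_2 afterwards, which is exactly the clumsiness Koert describes when the carried-along columns are not known at compile time; the explode($"...").as(...) form sidesteps that for the untyped case.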