Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.
Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be
clearer from a user-facing perspective to at least choose a package name for
it that omits "sql". I would also be in favor of adding a separate Spark
Schema module for Spark SQL to rely on, but I imagine that might be too large
a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> (Actually, when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> > supporting structured data efficiently does require quite a bit of the
> > infrastructure in Spark SQL, such as query planning and columnar storage.
> > The intent of Spark SQL, though, is to be more than a SQL server -- it's
> > meant to be a library for manipulating structured data. Since this is
> > possible to build over the core API, it's natural to organize it that way,
> > just as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> I agree. To me this also implies it belongs in Spark core, not SQL.
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak
> >> <michaelma...@yahoo.com.invalid> wrote:
> >>
> >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
> >>> Spark Meetup YouTube video contained a wealth of background information on
> >>> this idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> ________________________________
> >>> From: Patrick Wendell <pwend...@gmail.com>
> >>> To: Reynold Xin <r...@databricks.com>
> >>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>> One thing potentially not clear from this e-mail: there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >>>> Hi,
> >>>>
> >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> >>>> to get the community's opinion.
> >>>>
> >>>> The context is that SchemaRDD is becoming a common data format used
> >>>> for bringing data into Spark from external systems, and used for
> >>>> various components of Spark, e.g. MLlib's new pipeline API. We also
> >>>> expect more and more users to be programming directly against the
> >>>> SchemaRDD API rather than the core RDD API. SchemaRDD, through its less
> >>>> commonly used DSL originally designed for writing test cases, has
> >>>> always had a data-frame-like API. In 1.3, we are redesigning the API to
> >>>> make it usable for end users.
> >>>>
> >>>> There are two motivations for the renaming:
> >>>>
> >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>
> >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> >>>> (even though it would contain some RDD functions like map, flatMap,
> >>>> etc.), and calling it Schema*RDD* while it is not an RDD is highly
> >>>> confusing. Instead, DataFrame.rdd will return the underlying RDD for
> >>>> all RDD methods.
> >>>>
> >>>> My understanding is that very few users program directly against the
> >>>> SchemaRDD API at the moment, because it is not well documented.
> >>>> However, to maintain backward compatibility, we can create a type alias
> >>>> for DataFrame that is still named SchemaRDD. This will maintain source
> >>>> compatibility for Scala. That said, we will have to update all existing
> >>>> materials to use DataFrame rather than SchemaRDD.
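To make Patrick's 1:1 correspondence concrete, here is a minimal round-trip
sketch. Only DataFrame.rdd is described in this thread; the SQLContext setup
and the toDF() conversion below assume the DataFrame API that eventually
shipped in Spark 1.3, so treat them as illustrative rather than settled.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object RoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("SchemaRDDRoundTrip").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._  // provides rdd.toDF() (assumed 1.3 API)

        // RDD -> DataFrame: the schema is inferred from the Person case class
        val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
        val df = people.toDF()

        // DataFrame -> RDD: .rdd exposes the underlying RDD, as described above
        df.rdd.collect().foreach(println)

        sc.stop()
      }
    }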
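Continuing with the df from the sketch above, the data-frame-like DSL Reynold
mentions redesigning looks roughly like this. The column-expression style and
the filter/select/groupBy verbs are assumed from the eventual 1.3 API; the
thread itself does not specify them.

    // Hypothetical queries in the data-frame DSL, using the df built above
    val adults = df.filter(df("age") >= 18).select("name")
    val countsByAge = df.groupBy("age").count()

    adults.show()
    countsByAge.show()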
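The backward-compatibility plan in the last paragraph amounts to a one-line
type alias. A minimal sketch, assuming the alias lives in a package object and
carries a deprecation warning (neither detail is specified in the thread):

    package org.apache.spark

    package object sql {
      // Old sources that name SchemaRDD keep compiling: at the source level,
      // SchemaRDD and DataFrame are now the same type.
      @deprecated("Use DataFrame instead.", "1.3.0")
      type SchemaRDD = DataFrame
    }

As Reynold notes, an alias like this preserves source compatibility for Scala
only; it does not by itself address Java callers or binary compatibility.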