Re: renaming SchemaRDD -> DataFrame

Kushal Datta Mon, 26 Jan 2015 18:03:55 -0800

I want to address the issue that Matei raised about the heavy lifting
required for full SQL support. It is amazing that even after 30 years of
research there is not a single good open source columnar database like
Vertica. There is a column store option in MySQL, but it is not nearly as
sophisticated as Vertica or MonetDB. But there's a true need for such a
system. I wonder why that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza" <[email protected]> wrote:


> Both SchemaRDD and DataFrame sound fine to me, though I like the former
> slightly better because it's more descriptive.
>
> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
> be more clear from a user-facing perspective to at least choose a package
> name for it that omits "sql".
>
> I would also be in favor of adding a separate Spark Schema module for Spark
> SQL to rely on, but I imagine that might be too large a change at this
> point?
>
> -Sandy
>
> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <[email protected]>
> wrote:
>
> > (Actually when we designed Spark SQL we thought of giving it another
> > name, like Spark Schema, but we decided to stick with SQL since that
> > was the most obvious use case to many users.)
> >
> > Matei
> >
> > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <[email protected]> wrote:
> > >
> > > While it might be possible to move this concept to Spark Core
> > > long-term, supporting structured data efficiently does require quite
> > > a bit of the infrastructure in Spark SQL, such as query planning and
> > > columnar storage. The intent of Spark SQL though is to be more than a
> > > SQL server -- it's meant to be a library for manipulating structured
> > > data. Since this is possible to build over the core API, it's pretty
> > > natural to organize it that way, same as Spark Streaming is a library.
> > >
> > > Matei
> > >
> > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <[email protected]> wrote:
> > >>
> > >> "The context is that SchemaRDD is becoming a common data format
> > >> used for bringing data into Spark from external systems, and used
> > >> for various components of Spark, e.g. MLlib's new pipeline API."
> > >>
> > >> i agree. this to me also implies it belongs in spark core, not sql
> > >>
> > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > >> [email protected]> wrote:
> > >>
> > >>> And on the off chance that anyone hasn't seen it yet, the Jan. 13
> > >>> Bay Area Spark Meetup YouTube video contained a wealth of background
> > >>> information on this idea (mostly from Patrick and Reynold :-).
> > >>>
> > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > >>>
> > >>> ________________________________
> > >>> From: Patrick Wendell <[email protected]>
> > >>> To: Reynold Xin <[email protected]>
> > >>> Cc: "[email protected]" <[email protected]>
> > >>> Sent: Monday, January 26, 2015 4:01 PM
> > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > >>>
> > >>>
> > >>> One thing potentially not clear from this e-mail: there will be a 1:1
> > >>> correspondence where you can get an RDD to/from a DataFrame.
> > >>>
> > >>>
> > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <[email protected]> wrote:
> > >>>> Hi,
> > >>>>
> > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> > >>>> wanted to get the community's opinion.
> > >>>>
> > >>>> The context is that SchemaRDD is becoming a common data format used
> > >>>> for bringing data into Spark from external systems, and used for
> > >>>> various components of Spark, e.g. MLlib's new pipeline API. We also
> > >>>> expect more and more users to be programming directly against the
> > >>>> SchemaRDD API rather than the core RDD API. SchemaRDD, through its
> > >>>> less commonly used DSL originally designed for writing test cases,
> > >>>> has always had a data-frame-like API. In 1.3, we are redesigning
> > >>>> the API to make it usable for end users.
> > >>>>
> > >>>>
> > >>>> There are two motivations for the renaming:
> > >>>>
> > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > >>>>
> > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> > >>>> (even though it would contain some RDD functions like map, flatMap,
> > >>>> etc), and calling it Schema*RDD* while it is not an RDD is highly
> > >>>> confusing. Instead, DataFrame.rdd will return the underlying RDD
> > >>>> for all RDD methods.
> > >>>>
> > >>>>
> > >>>> My understanding is that very few users program directly against the
> > >>>> SchemaRDD API at the moment, because it is not well documented.
> > >>>> However, to maintain backward compatibility, we can create a type
> > >>>> alias for DataFrame that is still named SchemaRDD. This will
> > >>>> maintain source compatibility for Scala. That said, we will have to
> > >>>> update all existing materials to use DataFrame rather than
> > >>>> SchemaRDD.
> > >>>
> > >>> ---------------------------------------------------------------------
> > >>> To unsubscribe, e-mail: [email protected]
> > >>> For additional commands, e-mail: [email protected]
> > >>>
> > >>>
> > >>>
> > >
> >
> >
> >
> >
>
