that's great. guess i was looking at a somewhat stale master branch...

On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin <r...@databricks.com> wrote:
> Koert,
>
> As Mark said, I have already refactored the API so that nothing in
> catalyst is exposed (and users won't need them anyway). Data types and the
> Row interface are both outside the catalyst package, in org.apache.spark.sql.
>
> On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> hey matei,
>> i think that stuff such as SchemaRDD, columnar storage and perhaps also
>> query planning can be re-used by many systems that do analysis on
>> structured data. i can imagine pandas-like systems, but also datalog or
>> scalding-like ones (which we use at tresata and i might rebase on SchemaRDD
>> at some point). SchemaRDD should become the interface for all these, and
>> columnar storage abstractions should be re-used between all these.
>>
>> currently the sql tie-in goes way beyond just the (perhaps unfortunate)
>> naming convention. for example, a core part of the SchemaRDD abstraction is
>> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
>> who wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
>> (if i understand it correctly; i have not used catalyst, but it looks
>> neat). i should not need to include a SQL parser just to use structured
>> data in, say, a pandas-like framework.
>>
>> best, koert
>>
>>
>> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> While it might be possible to move this concept to Spark Core long-term,
>>> supporting structured data efficiently does require quite a bit of the
>>> infrastructure in Spark SQL, such as query planning and columnar storage.
>>> The intent of Spark SQL, though, is to be more than a SQL server -- it's
>>> meant to be a library for manipulating structured data. Since this is
>>> possible to build over the core API, it's pretty natural to organize it
>>> that way, same as Spark Streaming is a library.
>>>
>>> Matei
>>>
>>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >
>>> > "The context is that SchemaRDD is becoming a common data format used for
>>> > bringing data into Spark from external systems, and used for various
>>> > components of Spark, e.g. MLlib's new pipeline API."
>>> >
>>> > i agree. this to me also implies it belongs in spark core, not sql
>>> >
>>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>> > michaelma...@yahoo.com.invalid> wrote:
>>> >
>>> >> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>>> >> Spark Meetup YouTube video contained a wealth of background information
>>> >> on this idea (mostly from Patrick and Reynold :-).
>>> >>
>>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>>> >>
>>> >> ________________________________
>>> >> From: Patrick Wendell <pwend...@gmail.com>
>>> >> To: Reynold Xin <r...@databricks.com>
>>> >> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>> >> Sent: Monday, January 26, 2015 4:01 PM
>>> >> Subject: Re: renaming SchemaRDD -> DataFrame
>>> >>
>>> >> One thing potentially not clear from this e-mail: there will be a 1:1
>>> >> correspondence where you can get an RDD to/from a DataFrame.
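
[As an aside, a minimal Scala sketch of what that 1:1 round trip could look like, assuming the API shape described in this thread: a .rdd accessor on DataFrame as Reynold proposes, plus a hypothetical createDataFrame helper on SQLContext for the reverse direction. The actual 1.3 method names may differ.]

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}

    // Sketch only: .rdd is the accessor proposed in this thread; createDataFrame
    // is assumed here for the reverse direction and may be named differently in
    // the released API.
    def roundTrip(sqlContext: SQLContext, df: DataFrame): DataFrame = {
      val rows: RDD[Row] = df.rdd                        // DataFrame -> RDD[Row]
      val nonEmpty = rows.filter(_.length > 0)           // ordinary RDD operations
      sqlContext.createDataFrame(nonEmpty, df.schema)    // RDD[Row] -> DataFrame
    }
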
>>> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>>> >>> get the community's opinion.
>>> >>>
>>> >>> The context is that SchemaRDD is becoming a common data format used for
>>> >>> bringing data into Spark from external systems, and used for various
>>> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>>> >>> and more users to be programming directly against the SchemaRDD API
>>> >>> rather than the core RDD API. SchemaRDD, through its less commonly used
>>> >>> DSL originally designed for writing test cases, has always had a
>>> >>> data-frame-like API. In 1.3, we are redesigning the API to make it usable
>>> >>> for end users.
>>> >>>
>>> >>> There are two motivations for the renaming:
>>> >>>
>>> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>> >>>
>>> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>> >>> though it would contain some RDD functions like map, flatMap, etc.), and
>>> >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
>>> >>> Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
>>> >>>
>>> >>> My understanding is that very few users program directly against the
>>> >>> SchemaRDD API at the moment, because it is not well documented. However,
>>> >>> to maintain backward compatibility, we can create a type alias named
>>> >>> SchemaRDD that points to DataFrame. This will maintain source
>>> >>> compatibility for Scala. That said, we will have to update all existing
>>> >>> materials to use DataFrame rather than SchemaRDD.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: dev-h...@spark.apache.org
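
[For the backward-compatibility point above, a minimal sketch of how such a type alias could look inside the org.apache.spark.sql package object. The exact placement and whether the alias is deprecated right away are assumptions here, not the final implementation.]

    package org.apache.spark

    package object sql {
      // Sketch: keep the old name compiling by aliasing it to the new type,
      // so existing Scala sources that reference SchemaRDD still build.
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame
    }

[With an alias like this, existing user code such as `val people: SchemaRDD = sqlContext.sql("SELECT name, age FROM people")` would keep compiling, while new code and documentation use DataFrame directly.]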