[SQL] Self join with ArrayType columns problems

2015-01-26 Thread PierreB
Using Spark 1.2.0, we are facing some weird behaviour when performing a self join on a table with an ArrayType field (potential bug?). I have set up a minimal non-working example here: https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f

Re: Are there any plans to run Spark on top of Succinct

2015-01-26 Thread Michael Armbrust
There was work being done at Berkeley on prototyping support for Succinct in Spark SQL. Rachit might have more information. On Thu, Jan 22, 2015 at 7:04 AM, Dean Wampler wrote: > Interesting. I was wondering recently if anyone has explored working with > compressed data directly. > > Dean Wampl

renaming SchemaRDD -> DataFrame

2015-01-26 Thread Reynold Xin
Hi, We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion. The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Patrick Wendell
One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame. On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin wrote: > Hi, > > We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to > get the community's opinion.

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Michael Malak
And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell To:

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Koert Kuipers
what i am trying to say is: structured data != sql On Mon, Jan 26, 2015 at 7:26 PM, Koert Kuipers wrote: > "The context is that SchemaRDD is becoming a common data format used for > bringing data into Spark from external systems, and used for various > components of Spark, e.g. MLlib's new pipel

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Koert Kuipers
"The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API." i agree. this to me also implies it belongs in spark core, not sql On Mon, Jan 26, 2015 at 6:11 PM, Mi

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Matei Zaharia
(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei > On Jan 26, 2015, at 5:31 PM, Matei Zaharia wrote: > > While it might be possible to move this concept

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Matei Zaharia
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant t

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Sandy Ryza
Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive. Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Kushal Datta
I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophistic

talk on interface design

2015-01-26 Thread Reynold Xin
Hi all, In Spark, we have done reasonably well historically in interface and API design, especially compared with some other Big Data systems. However, we have also made mistakes along the way. I want to share a talk I gave about interface design at Databricks' internal retreat. https://speakerde

Re: talk on interface design

2015-01-26 Thread Andrew Ash
In addition to the references you have at the end of the presentation, there's a great set of practical examples based on the learnings from Qt posted here: http://www21.in.tum.de/~blanchet/api-design.pdf Chapter 4's way of showing a principle and then an example from Qt is particularly instructio

[VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-26 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec The release files, including signatures, digests, etc. can