[VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-16 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.2! The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.2 [ ] -1 Do not release this package because ...

Spark internal Logging trait potential thread unsafe

2016-06-16 Thread Prajwal Tuladhar
Hi, The way the log instance inside the Logging trait is currently being initialized doesn't seem to be thread safe [1]. The current implementation only guarantees that initializeLogIfNecessary() is initialized in a lazy + thread-safe way. Is there a reason why it can't just be: [2] @transient private lazy val log_ :
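As a self-contained illustration of the alternative the email proposes (my own sketch with made-up names, not Spark's actual Logging trait), a Scala `lazy val` already gives thread-safe, at-most-once initialization:

```scala
// Hypothetical demo: a lazy val is initialized under a monitor the compiler
// generates, so even concurrent first accesses run the initializer only once.
object LazyInitDemo {
  @volatile private var initCount = 0

  trait Logging {
    // Initialized lazily and at most once, even under concurrent access.
    @transient private lazy val log_ : String = {
      initCount += 1
      "logger-instance"
    }
    def log: String = log_
  }

  def main(args: Array[String]): Unit = {
    val l = new Logging {}
    // Hammer the lazy val from several threads; it must initialize exactly once.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable { def run(): Unit = { l.log; () } })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    assert(initCount == 1, s"expected 1 initialization, got $initCount")
    println(initCount)
  }
}
```

The trade-off the thread goes on to discuss is that the lazy-val monitor can contend with other locks, which is presumably why Spark used a hand-rolled scheme in the first place.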

Regarding on the dataframe stat frequent

2016-06-16 Thread Luyi Wang
Hey there: The frequent-items function in the DataFrame stat package seems inaccurate. The documentation does mention that it can return false positives, but the results still seem incorrect. Wondering if this is a known problem or not? Here is a quick example showing the problem. val sqlContext = new SQLContext(sc) i
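To see where the documented false positives come from, here is a minimal single-column, Misra-Gries-style frequent-items sketch (my own illustration, not Spark's implementation): it never misses a truly frequent item, but items below the support threshold can survive in the counter map and be reported anyway.

```scala
// One-pass frequent-items sketch: keeps at most k = ceil(1/support) counters.
// Guarantee: no false negatives. Caveat: infrequent items may also be returned.
object FreqItemsSketch {
  def freqItems[T](items: Seq[T], support: Double): Set[T] = {
    val k = math.ceil(1.0 / support).toInt
    val counts = scala.collection.mutable.Map.empty[T, Long]
    for (item <- items) {
      if (counts.contains(item)) counts(item) += 1
      else if (counts.size < k) counts(item) = 1
      else {
        // Table full: decrement every counter and drop the ones that hit zero.
        counts.keys.toList.foreach { key =>
          counts(key) -= 1
          if (counts(key) == 0) counts.remove(key)
        }
      }
    }
    counts.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    val data = Seq.fill(60)("a") ++ Seq.fill(25)("b") ++ Seq.fill(15)("c")
    val result = freqItems(data, 0.4)
    // "a" (60% > 40% support) is guaranteed to be reported; "b" and "c" may
    // show up too even though they are below the threshold: false positives.
    assert(result.contains("a"))
    println(result)
  }
}
```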

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
It's not throwing away any information from the point of view of the SQL optimizer. The schema preserves all the type information that Catalyst uses. The type information T in Dataset[T] is only used at the API level to ensure compile-time type checks of the user program. On Thu, Jun 16, 20

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
I'm clear on what a type alias is. My question is more that moving from e.g. Dataset[T] to Dataset[Row] involves throwing away information. Reading through code that uses the DataFrame alias, it's a little hard for me to know when that's intentional or not. On Thu, Jun 16, 2016 at 2:50 PM, Tath

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
There are different ways to view this. If it's confusing to think of the Source API as returning DataFrames, it's equivalent to thinking that you are returning a Dataset[Row], and DataFrame is just a shorthand. And DataFrame/Dataset[Row] is to Dataset[String] what a Java Array[Object] is to Array[String

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Richard Marscher
Yea, WRT Options maybe I'm thinking about it incorrectly or misrepresenting it as relating to Encoders or to a pure Option encoder. The semantics I'm thinking of are around the deserialization of a type T and lifting it into Option[T] via the Option.apply function, which converts null to None. Tying ba
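The null-lifting semantics described here can be shown in a few lines of plain Scala (a standalone demo, independent of Spark or Encoders):

```scala
// Option.apply lifts a possibly-null value into Option[T], turning null into
// None and any non-null value into Some -- the deserialization behavior the
// email is describing.
object OptionApplyDemo {
  def main(args: Array[String]): Unit = {
    val fromNull: Option[String]  = Option(null)    // null becomes None
    val fromValue: Option[String] = Option("hello") // non-null becomes Some
    assert(fromNull == None)
    assert(fromValue == Some("hello"))
    println((fromNull, fromValue))
  }
}
```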

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Is this really an internal / external distinction? For a concrete example, Source.getBatch seems to be a public interface, but returns DataFrame. On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das wrote: > DataFrame is a type alias of Dataset[Row], so externally it seems like > Dataset is the main t

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Michael Armbrust
There is no public API for writing encoders at the moment, though we are hoping to open this up in Spark 2.1. What is not working about encoders for options? Which version of Spark are you running? This is working as I would expect? https://databricks-prod-cloudfront.cloud.databricks.com/public

DMTCP and debug a failed stage in spark

2016-06-16 Thread Ovidiu-Cristian MARCU
Hi, I have a TPCDS query that fails in stage 80, which is a ResultStage (SparkSQL). Ideally I would like to ‘checkpoint’ a previous stage which was executed successfully and replay the failed stage for debug purposes. Has anyone managed to do something similar who could point me to some hints? Maybe s

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
DataFrame is a type alias of Dataset[Row], so externally it seems like Dataset is the main type and DataFrame is a derivative type. However, internally, since everything is processed as Rows, everything uses DataFrames; the typed objects in a Dataset are internally converted to rows for processing.
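The alias relationship can be demonstrated with stand-in types (the classes below are made up for this sketch, not Spark's actual Dataset/Row): a type alias is fully interchangeable with the type it names, so every Dataset[Row] *is* a DataFrame and vice versa, with no conversion or subtyping involved.

```scala
// Stand-in types illustrating `type DataFrame = Dataset[Row]`: code declared
// against the alias accepts a Dataset[Row] directly, and vice versa.
object TypeAliasDemo {
  case class Row(values: Seq[Any])
  class Dataset[T](val data: Seq[T])

  type DataFrame = Dataset[Row]

  // Declared against the alias, callable with a Dataset[Row] -- same type.
  def describe(df: DataFrame): Int = df.data.length

  def main(args: Array[String]): Unit = {
    val ds: Dataset[Row] = new Dataset(Seq(Row(Seq(1, "a")), Row(Seq(2, "b"))))
    val df: DataFrame = ds // no cast needed: the alias IS Dataset[Row]
    assert(describe(ds) == 2 && describe(df) == 2)
    println(describe(df))
  }
}
```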

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Sorry, meant DataFrame vs Dataset On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote: > Is there a principled reason why sql.streaming.* and > sql.execution.streaming.* are making extensive use of DataFrame > instead of Datasource? > > Or is that just a holdover from code written before the m

Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Is there a principled reason why sql.streaming.* and sql.execution.streaming.* are making extensive use of DataFrame instead of Datasource? Or is that just a holdover from code written before the move / type alias?

Re: Spark SQL Count Distinct

2016-06-16 Thread Reynold Xin
You should be fine from 1.6 onward. Count distinct doesn't require data to fit in memory there. On Thu, Jun 16, 2016 at 1:57 AM, Avshalom wrote: > Hi all, > > We would like to perform a count distinct query based on a certain filter. > e.g. our data is of the form: > > userId, Name, Restaurant na

Hello

2016-06-16 Thread mylisttech
Dear All, Looking for guidance. I am interested in contributing to Spark MLlib. Could you please take a few minutes to guide me as to what you would consider an ideal path / skills an individual should possess? I know R / Python / Java / C and C++ I have a firm understanding of algorithms

Encoder Guide / Option[T] Encoder

2016-06-16 Thread Richard Marscher
Are there any user or dev guides for writing Encoders? I'm trying to read through the source code to figure out how to write a proper Option[T] encoder, but it's not straightforward without deep spark-sql source knowledge. Is it unexpected for users to need to write their own Encoders with the avai

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-16 Thread Tom Graves
+1 Tom On Wednesday, June 15, 2016 2:01 PM, Reynold Xin wrote: It's been a while and we have accumulated quite a few bug fixes in branch-1.6. I'm thinking about cutting 1.6.2 rc this week. Any patches somebody want to get in last minute? On a related note, I'm thinking about cutting 2

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-16 Thread Jacek Laskowski
That'd be awesome to have another 2.0 RC! I know many people who'd consider it a call to action to play with 2.0. +1000 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskows

Spark SQL Count Distinct

2016-06-16 Thread Avshalom
Hi all, We would like to perform a count distinct query based on a certain filter. e.g. our data is of the form:

userId, Name, Restaurant name, Restaurant Type
100, John, Pizza Hut, Pizza
100, John, Del Pepe, Pasta
100, John,
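The query being described can be sketched with plain Scala collections (not Spark; the extra rows beyond those shown in the email, such as a second user, are invented to make the distinct count non-trivial):

```scala
// Collections sketch of: SELECT rtype, COUNT(DISTINCT userId) ... GROUP BY rtype
object CountDistinctDemo {
  case class Visit(userId: Int, name: String, restaurant: String, rtype: String)

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Visit(100, "John", "Pizza Hut", "Pizza"),
      Visit(100, "John", "Del Pepe", "Pasta"),
      Visit(200, "Dana", "Pizza Hut", "Pizza"), // hypothetical second user
      Visit(100, "John", "Slice Co", "Pizza")   // hypothetical repeat visit
    )

    // For each restaurant type, count distinct visiting users. Repeat visits
    // by the same user (100 visits two pizza places) count once.
    val distinctUsersByType: Map[String, Int] =
      data.groupBy(_.rtype).map { case (t, vs) =>
        t -> vs.map(_.userId).distinct.size
      }

    assert(distinctUsersByType("Pizza") == 2) // users 100 and 200
    assert(distinctUsersByType("Pasta") == 1) // user 100 only
    println(distinctUsersByType)
  }
}
```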