Hi all,
We would like to perform a count distinct query based on a certain filter.
e.g. our data is of the form:
userId, Name, Restaurant name, Restaurant Type
===
100,John, Pizza Hut,Pizza
100,John, Del Pepe, Pasta
100,John,
That'd be awesome to have another 2.0 RC! I know many people who'd
consider it a call to action to play with 2.0.
+1000
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskows
+1
Tom
On Wednesday, June 15, 2016 2:01 PM, Reynold Xin wrote:
It's been a while and we have accumulated quite a few bug fixes in branch-1.6.
I'm thinking about cutting a 1.6.2 RC this week. Are there any patches somebody
wants to get in at the last minute?
On a related note, I'm thinking about cutting 2
Are there any user or dev guides for writing Encoders? I'm trying to read
through the source code to figure out how to write a proper Option[T]
encoder, but it's not straightforward without deep spark-sql source
knowledge. Is it unexpected for users to need to write their own Encoders
with the avai
Dear All,
Looking for guidance.
I am interested in contributing to Spark MLlib. Could you please take a few
minutes to guide me on what you would consider the ideal path / skills an
individual should possess.
I know R / Python / Java / C and C++
I have a firm understanding of algorithms
You should be fine in 1.6 onward. Count distinct doesn't require data to
fit in memory there.
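A minimal sketch of one way to express such a query with the DataFrame API, assuming an existing sqlContext; the column names and filter value are made up for illustration:

import org.apache.spark.sql.functions.countDistinct
import sqlContext.implicits._

// Toy data mirroring the schema from the original post.
val visits = Seq(
  (100, "John", "Pizza Hut", "Pizza"),
  (100, "John", "Del Pepe", "Pasta"),
  (200, "Mary", "Pizza Hut", "Pizza")
).toDF("userId", "name", "restaurant", "restaurantType")

// Count distinct users among the rows that match the filter.
visits
  .filter($"restaurantType" === "Pizza")
  .agg(countDistinct($"userId").as("distinctUsers"))
  .show()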
On Thu, Jun 16, 2016 at 1:57 AM, Avshalom wrote:
> Hi all,
>
> We would like to perform a count distinct query based on a certain filter.
> e.g. our data is of the form:
>
> userId, Name, Restaurant na
Is there a principled reason why sql.streaming.* and
sql.execution.streaming.* are making extensive use of DataFrame
instead of Datasource?
Or is that just a holdover from code written before the move / type alias?
Sorry, meant DataFrame vs Dataset
On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the m
DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames. The typed objects in a Dataset are internally converted to rows
for processing.
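For reference, a minimal sketch of that relationship, assuming a 2.0 SparkSession named spark; the case class is made up for illustration:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// The alias is declared (roughly) as: type DataFrame = Dataset[Row]

case class Restaurant(name: String, kind: String)

val ds: Dataset[Restaurant] = Seq(Restaurant("Del Pepe", "Pasta")).toDS()

// Going to a DataFrame drops the compile-time element type ...
val df: DataFrame = ds.toDF()

// ... and .as[T] reattaches it, provided the schema still matches.
val back: Dataset[Restaurant] = df.as[Restaurant]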
Hi,
I have a TPC-DS query that fails in stage 80, which is a ResultStage
(Spark SQL).
Ideally I would like to ‘checkpoint’ a previous stage which executed
successfully and replay the failed stage for debugging purposes.
Has anyone managed to do something similar and could share some hints?
Maybe s
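Not something I can confirm works for your exact case, but a rough sketch of one common workaround: materialize the intermediate result so the earlier, successful stages are not recomputed while you debug (expensiveIntermediateDf and the checkpoint path are hypothetical names):

// Materialize the output of the earlier stages once.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val intermediate = expensiveIntermediateDf.persist()  // or .cache()
intermediate.count()                                  // force evaluation once

// Optionally write it to stable storage so it survives executor loss,
// then re-run only the failing logic against the materialized data.
val rdd = intermediate.rdd
rdd.checkpoint()
rdd.count()   // the action triggers the actual checkpoint write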
There is no public API for writing encoders at the moment, though we are
hoping to open this up in Spark 2.1.
What is not working about encoders for options? Which version of Spark are
you running? This is working as I would expect:
https://databricks-prod-cloudfront.cloud.databricks.com/public
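For reference, a minimal sketch of the case the built-in encoders already handle in 2.0, an Option-typed field inside a case class, assuming a SparkSession named spark (names are illustrative):

import spark.implicits._

case class User(name: String, age: Option[Int])

// The derived product encoder handles the Option-typed field:
// a null in the underlying row comes back as None.
val ds = Seq(User("ann", Some(34)), User("bob", None)).toDS()

ds.filter(_.age.isEmpty).show()   // bob, null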
Is this really an internal / external distinction?
For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.
On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main t
Yeah, WRT Options maybe I'm thinking about it incorrectly, or misrepresenting
it as relating to Encoders or to a pure Option encoder. The semantics I'm
thinking of are around deserializing a type T and lifting it into
Option[T] via the Option.apply function, which converts null to None. Tying
ba
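In plain Scala, the Option.apply behaviour being referred to:

// Option.apply lifts a possibly-null value into Option,
// converting null to None and anything else to Some(value).
val fromNull: Option[String]  = Option(null: String)   // None
val fromValue: Option[String] = Option("pizza")        // Some("pizza")

assert(fromNull.isEmpty && fromValue.contains("pizza"))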
There are different ways to view this. If it's confusing to think of the
Source API returning DataFrames, it's equivalent to thinking that you are
returning a Dataset[Row], and DataFrame is just a shorthand.
And DataFrame/Dataset[Row] is to Dataset[String] what Java's
Array[Object] is to Array[String
I'm clear on what a type alias is. My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information. Reading through code that uses the Dataframe alias, it's
a little hard for me to know when that's intentional or not.
On Thu, Jun 16, 2016 at 2:50 PM, Tath
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
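A small sketch of that split, assuming a SparkSession named spark and a made-up case class:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Visit(userId: Long, restaurant: String)

val ds: Dataset[Visit] = Seq(Visit(100L, "Del Pepe")).toDS()
val df: DataFrame = ds.toDF()

// Typed API: a misspelled field is a compile-time error.
ds.map(_.restaurant)       // compiles
// ds.map(_.restuarant)    // does not compile

// Untyped API: the same mistake only surfaces at analysis time.
df.select("restaurant")    // fine
// df.select("restuarant") // AnalysisException when the query is analyzed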
On Thu, Jun 16, 20
Hey there:
The frequent items function in the DataFrame stat package seems inaccurate.
The documentation does mention that it can return false positives, but the
result still seems incorrect.
Wondering if this is a known problem or not?
Here is a quick example showing the problem.
val sqlContext = new SQLContext(sc)
i
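The snippet above is cut off, so here is a minimal self-contained sketch of the kind of freqItems call in question (toy data, support value chosen for illustration):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// freqItems uses an approximate single-pass algorithm, so the result
// may contain false positives (items that are not actually frequent).
val df = sc.parallelize(Seq(1, 1, 1, 2, 2, 3, 4, 5)).toDF("value")
val frequent = df.stat.freqItems(Seq("value"), support = 0.4)

frequent.show()   // a single array column; may include extra, non-frequent items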
Hi,
The way the log instance inside the Logging trait is currently being initialized
doesn't seem to be thread safe [1]. The current implementation only guarantees
that initialization via initializeLogIfNecessary() happens in a lazy + thread-safe way.
Is there a reason why it can't be just: [2]
@transient private lazy val log_ :
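For context, a rough sketch of the pattern being proposed; a plain Scala lazy val is initialized under the instance's monitor, so first access is thread safe (trait and logger names are illustrative):

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // A plain lazy val: Scala guards its initialization with the instance's
  // lock, so concurrent first accesses see one fully built Logger.
  @transient private lazy val log_ : Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def log: Logger = log_
}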
Please vote on releasing the following candidate as Apache Spark version
1.6.2!
The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.2
[ ] -1 Do not release this package because ...