Spark SQL Count Distinct

2016-06-16 Thread Avshalom
Hi all,

We would like to perform a count distinct query based on a certain filter.
e.g. our data is of the form:

userId, Name, Restaurant Name, Restaurant Type
===
100, John, Pizza Hut, Pizza
100, John, Del Pepe, Pasta
100, John, Hagen Daz, Ice Cream
100, John, Dominos, Pasta
200, Mandy, Del Pepe, Pasta

And we would like to know the number of distinct Pizza eaters.

The issue is, we have roughly 200 million entries, so even with a large
cluster we could still risk running out of memory if the distinct
implementation has to load all of the data into RAM.
The Spark Core implementation, which reduces each key to 1 and sums, doesn't
have this risk.
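
For concreteness, here is roughly what we mean on both sides (a sketch only;
the Meal case class, the tiny sample, and the variable names are just for
illustration):

// Sketch: assumes an existing SparkContext sc; Meal mirrors the sample rows above.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, countDistinct}

case class Meal(userId: Long, name: String, restaurantName: String, restaurantType: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val meals = sc.parallelize(Seq(
  Meal(100L, "John", "Pizza Hut", "Pizza"),
  Meal(100L, "John", "Del Pepe", "Pasta"),
  Meal(200L, "Mandy", "Del Pepe", "Pasta")))

// Spark SQL side: the count-distinct-with-filter query in question.
meals.toDF()
  .filter(col("restaurantType") === "Pizza")
  .agg(countDistinct(col("userId")).as("distinct_pizza_eaters"))
  .show()

// Spark Core side: the "reduce to 1 and sum" style we are comparing against.
val distinctPizzaEaters = meals
  .filter(_.restaurantType == "Pizza")
  .map(m => (m.userId, 1))
  .reduceByKey((a, _) => a)  // keep one row per user; the shuffle can spill to disk
  .values
  .sum()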

I've found this old thread which compares Spark Core and Spark SQL count
distinct performance:

http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html

  

From reading the source code, it seems the current Spark SQL count distinct
implementation is no longer based on a hash set, but we're still concerned
that it won't be as safe as the Spark Core implementation.
We don't mind waiting a long time for the computation to finish, but we don't
want to run into out-of-memory errors.

Would highly appreciate any input.

Thanks
Avshalom






Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-16 Thread Jacek Laskowski
That'd be awesome to have another 2.0 RC! I know many people who'd
consider it a call to action to play with 2.0.

+1000

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jun 15, 2016 at 9:01 PM, Reynold Xin  wrote:
> It's been a while and we have accumulated quite a few bug fixes in
> branch-1.6. I'm thinking about cutting a 1.6.2 rc this week. Any patches
> somebody wants to get in last minute?
>
> On a related note, I'm thinking about cutting a 2.0.0 rc this week too. I
> looked at the 60 unresolved tickets, and almost all of them look like they
> can be retargeted or are just doc updates. I'm going to be more aggressive
> about pushing individual people to resolve those, in case this drags on
> forever.
>
>
>
>




Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-16 Thread Tom Graves
+1
Tom 

On Wednesday, June 15, 2016 2:01 PM, Reynold Xin wrote:

It's been a while and we have accumulated quite a few bug fixes in
branch-1.6. I'm thinking about cutting a 1.6.2 rc this week. Any patches
somebody wants to get in last minute?

On a related note, I'm thinking about cutting a 2.0.0 rc this week too. I
looked at the 60 unresolved tickets, and almost all of them look like they
can be retargeted or are just doc updates. I'm going to be more aggressive
about pushing individual people to resolve those, in case this drags on
forever.





   

Encoder Guide / Option[T] Encoder

2016-06-16 Thread Richard Marscher
Are there any user or dev guides for writing Encoders? I'm trying to read
through the source code to figure out how to write a proper Option[T]
encoder, but it's not straightforward without deep spark-sql source
knowledge. Is it unexpected for users to need to write their own Encoders
with the availability of ExpressionEncoder.apply and the bean encoder
method?

As additional color for the Option[T] encoder, I have tried using the
ExpressionEncoder, but it does not treat nulls properly and passes them
through. I'm not sure if this is a side effect of
https://github.com/apache/spark/pull/13425, where a beneficial change was
made to have missing parts of joins continue through as nulls instead of the
default value for the data type (like -1 for ints). But my thought is that
this would then also apply to the generated generic Option encoder, as it
would see a null column value and skip passing it into Option.apply.
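
Roughly what I have been attempting, for reference (a sketch that leans on the
internal ExpressionEncoder, so treat it as an assumption rather than a
supported recipe):

// Sketch of the attempt (internal API, not a supported recipe).
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// The hope was that, with an explicit encoder in scope, a null column value
// would come back as None rather than being passed through as null.
implicit val optionIntEncoder = ExpressionEncoder[Option[Int]]()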

-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn



Hello

2016-06-16 Thread mylisttech
Dear All,


Looking for guidance.

I am interested in contributing to Spark MLlib. Could you please take a few
minutes to guide me as to what you would consider an ideal path / skill set an
individual should possess?

I know R / Python / Java / C and C++.

I have a firm understanding of algorithms and machine learning. I do know
Spark at a "workable knowledge" level.

Where should I start, and what should I try to do first (at the Spark
internals level), before picking up items on JIRA or new specifications on
Spark?

R has a great set of packages - would it be difficult to migrate them to the
SparkR set? I could try it with your support, or if it's desired.

I wouldn't mind testing some defects etc. as an initial learning exercise if
that would assist the community.

Please, guide.

Regards,
Harmeet






Re: Spark SQL Count Distinct

2016-06-16 Thread Reynold Xin
You should be fine in 1.6 onward. Count distinct doesn't require data to
fit in memory there.


On Thu, Jun 16, 2016 at 1:57 AM, Avshalom  wrote:

> Hi all,
>
> We would like to perform a count distinct query based on a certain filter.
> e.g. our data is of the form:
>
> userId, Name, Restaurant name, Restaurant Type
> ===
> 100,John, Pizza Hut,Pizza
> 100,John, Del Pepe, Pasta
> 100,John, Hagen Daz,  Ice Cream
> 100,John, Dominos, Pasta
> 200,Mandy,  Del Pepe, Pasta
>
> And we would like to know the number of distinct Pizza eaters.
>
> The issue is, we have roughly ~200 million entries, so even with a large
> cluster, we could still be in a risk of memory overload if the distinct
> implementation has to load all of the data into RAM.
> The Spark Core implementation which uses reduce to 1 and sum doesn't have
> this risk.
>
> I've found this old thread which compares Spark Core and Spark SQL count
> distinct performance:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html
> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html
> >
>
> From reading the source code, seems like the current Spark SQL count
> distinct implementation is not based on a Hash Set anymore, but we're still
> concerned that it won't be as safe as the Spark Core implementation.
> We don't mind waiting a long time for the computation to end, but we don't
> want to reach out of memory errors.
>
> Would highly appreciate any input.
>
> Thanks
> Avshalom
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Count-Distinct-tp17935.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Is there a principled reason why sql.streaming.* and
sql.execution.streaming.* are making extensive use of DataFrame
instead of Datasource?

Or is that just a holdover from code written before the move / type alias?




Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Sorry, meant DataFrame vs Dataset

On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger  wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the move / type alias?




Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames; the typed objects in a Dataset are internally converted to rows
for processing. Therefore, internally, DataFrame is the "main" type that is
used.
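
To make that concrete, a small sketch (the Click class and the session setup
are made up for the example; the alias itself is essentially
type DataFrame = Dataset[Row] in the 2.0 org.apache.spark.sql package object):

// Illustrative sketch only.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Click(user: String, url: String)

val spark = SparkSession.builder().appName("alias-sketch").getOrCreate()
import spark.implicits._

val ds: Dataset[Click] = Seq(Click("a", "/x"), Click("b", "/y")).toDS()
val df: DataFrame = ds.toDF()            // forget the Click type, keep the Row schema
val back: Dataset[Click] = df.as[Click]  // reattach the type via an implicit encoder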

On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger  wrote:

> Sorry, meant DataFrame vs Dataset
>
> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger 
> wrote:
> > Is there a principled reason why sql.streaming.* and
> > sql.execution.streaming.* are making extensive use of DataFrame
> > instead of Datasource?
> >
> > Or is that just a holdover from code written before the move / type
> alias?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


DMTCP and debug a failed stage in spark

2016-06-16 Thread Ovidiu-Cristian MARCU
Hi,

I have a TPCDS query that fails in stage 80, which is a ResultStage
(Spark SQL).
Ideally I would like to ‘checkpoint’ a previous stage that executed
successfully and replay the failed stage for debugging purposes.
Has anyone managed to do something similar and could point me to some hints?
Maybe someone has used a tool like DMTCP [1] that can be applied in this
situation?

[1] http://dmtcp.sourceforge.net/ 

Best,
Ovidiu
  

Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Michael Armbrust
There is no public API for writing encoders at the moment, though we are
hoping to open this up in Spark 2.1.

What is not working about encoders for options?  Which version of Spark are
you running?  This is working as I would expect?

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1073771007111588/2840265927289860/latest.html

On Thu, Jun 16, 2016 at 8:37 AM, Richard Marscher 
wrote:

> Are there any user or dev guides for writing Encoders? I'm trying to read
> through the source code to figure out how to write a proper Option[T]
> encoder, but it's not straightforward without deep spark-sql source
> knowledge. Is it unexpected for users to need to write their own Encoders
> with the availability of ExpressionEncoder.apply and the bean encoder
> method?
>
> As additional color for the Option[T] encoder, I have tried using the
> ExpressionEncoder but it does not treat nulls properly and passes them
> through. I'm not sure if this is a side-effect of
> https://github.com/apache/spark/pull/13425 where a beneficial change was
> made to have missing parts of joins continue through as nulls instead of
> the default value for the data type (like -1 for ints). But my thought is
> that would then also apply for the generic Option encoder generated as it
> would see a null column value and skip passing it into the Option.apply
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>


Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Is this really an internal / external distinction?

For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.

On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
 wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main type and DataFrame is a derivative type.
> However, internally, since everything is processed as Rows, everything uses
> DataFrames, Type classes used in a Dataset is internally converted to rows
> for processing. . Therefore internally DataFrame is like "main" type that is
> used.
>
> On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger  wrote:
>>
>> Sorry, meant DataFrame vs Dataset
>>
>> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger 
>> wrote:
>> > Is there a principled reason why sql.streaming.* and
>> > sql.execution.streaming.* are making extensive use of DataFrame
>> > instead of Datasource?
>> >
>> > Or is that just a holdover from code written before the move / type
>> > alias?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: Encoder Guide / Option[T] Encoder

2016-06-16 Thread Richard Marscher
Yeah, WRT Options, maybe I'm thinking about it incorrectly or misrepresenting
it as relating to Encoders or to a pure Option encoder. The semantics I'm
thinking of are around the deserialization of a type T and lifting it into
Option[T] via the Option.apply function, which converts null to None. This
ties back into joinWith, where you get nulls on outer joins. It's possible to
map after joinWith and use Option.apply in the map function, but there was a
thought it might be able to be done at deserialization. Maybe that's not
possible.
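
A sketch of that map-after-joinWith workaround, for reference (the case
classes and data here are made up; assumes the usual implicit encoders resolve
for them):

// Sketch only: User/Order and the datasets are illustrative.
import org.apache.spark.sql.{Dataset, SparkSession}

case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

val spark = SparkSession.builder().appName("joinwith-sketch").getOrCreate()
import spark.implicits._

val users: Dataset[User] = Seq(User(1L, "a"), User(2L, "b")).toDS()
val orders: Dataset[Order] = Seq(Order(1L, 10.0)).toDS()

val joined: Dataset[(User, Order)] =
  users.joinWith(orders, users("id") === orders("userId"), "left_outer")

// Lift the possibly-null right side into Option after the join;
// Option.apply turns the null produced by the outer join into None.
val lifted: Dataset[(User, Option[Order])] =
  joined.map { case (u, o) => (u, Option(o)) }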

On Thu, Jun 16, 2016 at 3:03 PM, Michael Armbrust 
wrote:

> There is no public API for writing encoders at the moment, though we are
> hoping to open this up in Spark 2.1.
>
> What is not working about encoders for options?  Which version of Spark
> are you running?  This is working as I would expect?
>
>
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1073771007111588/2840265927289860/latest.html
>
> On Thu, Jun 16, 2016 at 8:37 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Are there any user or dev guides for writing Encoders? I'm trying to read
>> through the source code to figure out how to write a proper Option[T]
>> encoder, but it's not straightforward without deep spark-sql source
>> knowledge. Is it unexpected for users to need to write their own Encoders
>> with the availability of ExpressionEncoder.apply and the bean encoder
>> method?
>>
>> As additional color for the Option[T] encoder, I have tried using the
>> ExpressionEncoder but it does not treat nulls properly and passes them
>> through. I'm not sure if this is a side-effect of
>> https://github.com/apache/spark/pull/13425 where a beneficial change was
>> made to have missing parts of joins continue through as nulls instead of
>> the default value for the data type (like -1 for ints). But my thought is
>> that would then also apply for the generic Option encoder generated as it
>> would see a null column value and skip passing it into the Option.apply
>>
>> --
>> *Richard Marscher*
>> Senior Software Engineer
>> Localytics
>> Localytics.com  | Our Blog
>>  | Twitter  |
>> Facebook  | LinkedIn
>> 
>>
>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn



Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
There are different ways to view this. If it's confusing to think of the
Source API returning DataFrames, it's equivalent to thinking that you are
returning a Dataset[Row], and DataFrame is just a shorthand.
DataFrame/Dataset[Row] is to Dataset[String] what Java's Array[Object] is to
Array[String]. DataFrame is more general in a way, as every Dataset can be
boiled down to a DataFrame. So to keep the Source APIs general (and also
source-compatible with 1.x), they return DataFrame.

On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger  wrote:

> Is this really an internal / external distinction?
>
> For a concrete example, Source.getBatch seems to be a public
> interface, but returns DataFrame.
>
> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
>  wrote:
> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> > Dataset is the main type and DataFrame is a derivative type.
> > However, internally, since everything is processed as Rows, everything
> uses
> > DataFrames, Type classes used in a Dataset is internally converted to
> rows
> > for processing. . Therefore internally DataFrame is like "main" type
> that is
> > used.
> >
> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger 
> wrote:
> >>
> >> Sorry, meant DataFrame vs Dataset
> >>
> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger 
> >> wrote:
> >> > Is there a principled reason why sql.streaming.* and
> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> > instead of Datasource?
> >> >
> >> > Or is that just a holdover from code written before the move / type
> >> > alias?
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>


Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
I'm clear on what a type alias is.  My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information.  Reading through code that uses the DataFrame alias, it's
a little hard for me to know when that's intentional or not.


On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
 wrote:
> There are different ways to view this. If its confusing to think that Source
> API returning DataFrames, its equivalent to thinking that you are returning
> a Dataset[Row], and DataFrame is just a shorthand. And
> DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object] is
> to Array[String]. DataFrame is more general in a way, as every Dataset can
> be boiled down to a DataFrame. So to keep the Source APIs general (and also
> source-compatible with 1.x), they return DataFrame.
>
> On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger  wrote:
>>
>> Is this really an internal / external distinction?
>>
>> For a concrete example, Source.getBatch seems to be a public
>> interface, but returns DataFrame.
>>
>> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
>>  wrote:
>> > DataFrame is a type alias of Dataset[Row], so externally it seems like
>> > Dataset is the main type and DataFrame is a derivative type.
>> > However, internally, since everything is processed as Rows, everything
>> > uses
>> > DataFrames, Type classes used in a Dataset is internally converted to
>> > rows
>> > for processing. . Therefore internally DataFrame is like "main" type
>> > that is
>> > used.
>> >
>> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger 
>> > wrote:
>> >>
>> >> Sorry, meant DataFrame vs Dataset
>> >>
>> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger 
>> >> wrote:
>> >> > Is there a principled reason why sql.streaming.* and
>> >> > sql.execution.streaming.* are making extensive use of DataFrame
>> >> > instead of Datasource?
>> >> >
>> >> > Or is that just a holdover from code written before the move / type
>> >> > alias?
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>
>




Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.

On Thu, Jun 16, 2016 at 2:05 PM, Cody Koeninger  wrote:

> I'm clear on what a type alias is.  My question is more that moving
> from e.g. Dataset[T] to Dataset[Row] involves throwing away
> information.  Reading through code that uses the Dataframe alias, it's
> a little hard for me to know when that's intentional or not.
>
>
> On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
>  wrote:
> > There are different ways to view this. If its confusing to think that
> Source
> > API returning DataFrames, its equivalent to thinking that you are
> returning
> > a Dataset[Row], and DataFrame is just a shorthand. And
> > DataFrame/Datasetp[Row] is to Dataset[String] is what java Array[Object]
> is
> > to Array[String]. DataFrame is more general in a way, as every Dataset
> can
> > be boiled down to a DataFrame. So to keep the Source APIs general (and
> also
> > source-compatible with 1.x), they return DataFrame.
> >
> > On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger 
> wrote:
> >>
> >> Is this really an internal / external distinction?
> >>
> >> For a concrete example, Source.getBatch seems to be a public
> >> interface, but returns DataFrame.
> >>
> >> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
> >>  wrote:
> >> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> >> > Dataset is the main type and DataFrame is a derivative type.
> >> > However, internally, since everything is processed as Rows, everything
> >> > uses
> >> > DataFrames, Type classes used in a Dataset is internally converted to
> >> > rows
> >> > for processing. . Therefore internally DataFrame is like "main" type
> >> > that is
> >> > used.
> >> >
> >> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger 
> >> > wrote:
> >> >>
> >> >> Sorry, meant DataFrame vs Dataset
> >> >>
> >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger  >
> >> >> wrote:
> >> >> > Is there a principled reason why sql.streaming.* and
> >> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> >> > instead of Datasource?
> >> >> >
> >> >> > Or is that just a holdover from code written before the move / type
> >> >> > alias?
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>
> >> >
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Regarding on the dataframe stat frequent

2016-06-16 Thread Luyi Wang
Hey there:

The frequent items function in the DataFrame stat package seems inaccurate.
The documentation does mention that it can return false positives, but the
results still seem incorrect.

Wondering if this is a known problem or not?


Here is a quick example showing the problem.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rows = Seq((0, "a"), (1, "c"), (2, "a"), (3, "a"), (4, "b"), (5, "d")).toDF("id", "category")
val history = rows.toDF("id", "category")

history.stat.freqItems(Array("category"), 0.5).show
history.stat.freqItems(Array("category"), 0.3).show
history.stat.freqItems(Array("category"), 0.51).show
history.stat.freqItems(Array("category")).show


Here is the output:

+------------------+
|category_freqItems|
+------------------+
|            [d, a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|               [a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|                []|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|      [b, d, a, c]|
+------------------+



The problem results from the freqItemCounter class's add function, which is
used in the aggregation stage of singlePassFreqItems.

According to the paper, the size of the returned frequent set can't be larger
than 1/minimum_support, which we denote as k here, so in singlePassFreqItems
the counterMap is created with this size.

The logic of the add function is as follows:

To add the count of an item: if the item already exists in the map, its
counter is incremented. If it doesn't exist and the map holds fewer than k
items, it is inserted. If it doesn't exist and the map already holds k items,
the count to be inserted is compared with the current minimum counter. If it
is greater than or equal to the current minimum, the item is inserted and all
items whose counters are less than or equal to the current minimum are
removed (only items with counters strictly larger than the minimum are
retained). If the count to be inserted is smaller than the current minimum,
the item is not inserted, and that count is instead deducted from every
counter in the map.
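
To make that logic easier to follow, here is a small sketch of the add step as
described above (a paraphrase for illustration, not the literal Spark source):

// Paraphrase of the add() logic described above (illustration only).
import scala.collection.mutable

def add(counts: mutable.Map[Any, Long], k: Int)(key: Any, count: Long): Unit = {
  if (counts.contains(key)) {
    counts(key) += count                  // existing item: bump its counter
  } else if (counts.size < k) {
    counts(key) = count                   // room left: just insert
  } else {
    val min = counts.values.min
    if (count >= min) {
      // keep only items strictly above the current minimum, then insert
      counts.retain((_, v) => v > min)
      counts(key) = count
    } else {
      // smaller than the minimum: don't insert, deduct from every counter
      counts.transform((_, v) => v - count)
    }
  }
}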

Problem:

Since only items with counters strictly larger than the current minimum are
retained, if the current minimum happens to equal the counter of the second
most frequent item, that item is removed when the item being inserted has the
same count. A less frequent item can then be inserted into the map afterward
and be returned later.

Here is an example: "a" appears 3 times, "b" and "c" both appear 2 times, and
"d" appears only once, for a total of 8 rows. For minimum support 0.5 the map
is initialized with size 2. The correct answer should be the items that
appear more than 4 times, which is an empty set; however, it returns "a" and
"d". Two items are returned because of the map size. "d" is returned because
"b" and "c" appear the same number of times and more often than "d", but they
get cleaned out along the way once the map is at its size limit, so when "d"
comes in it is inserted and survives.


val rows = Seq((0, "a"), (1, "b"), (2, "a"), (3, "a"), (4, "b"), (5, "c"), (6, "c"), (7, "d")).toDF("id", "category")
val history = rows.toDF("id", "category")

history.stat.freqItems(Array("category"), 0.5).show
history.stat.freqItems(Array("category"), 0.3).show
history.stat.freqItems(Array("category"), 0.51).show
history.stat.freqItems(Array("category")).show


+------------------+
|category_freqItems|
+------------------+
|            [d, a]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|         [b, a, c]|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|                []|
+------------------+

+------------------+
|category_freqItems|
+------------------+
|      [b, d, a, c]|
+------------------+
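
Driving the 8-row example through the add() sketch above (single-row adds in
id order, with k = 2 for minimum support 0.5) ends with exactly "a" and "d"
surviving, which matches the [d, a] output (illustration only):

// Illustration only: replay the 8-row example against the add() sketch above.
val counts = scala.collection.mutable.Map.empty[Any, Long]
Seq("a", "b", "a", "a", "b", "c", "c", "d").foreach(item => add(counts, 2)(item, 1L))
println(counts)  // only "a" and "d" remain (counters 2 and 1), i.e. [d, a]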


Hope this explains the problem.

Thanks.

-Luyi.


Spark internal Logging trait potential thread unsafe

2016-06-16 Thread Prajwal Tuladhar
Hi,

The way the log instance inside the Logging trait is currently being
initialized doesn't seem to be thread safe [1]. The current implementation
only guarantees that initializeLogIfNecessary() is executed in a lazy +
thread-safe way.
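
For context, the pattern I'm referring to looks roughly like this (a
paraphrase from memory of [1], not a verbatim copy; the stand-in members are
just to make the sketch compile):

import org.slf4j.{Logger, LoggerFactory}

trait RoughlyCurrentLogging {
  protected def logName: String = this.getClass.getName.stripSuffix("$")
  protected def initializeLogIfNecessary(isInterpreter: Boolean): Unit = ()  // stand-in

  @transient private var log_ : Logger = null

  protected def log: Logger = {
    if (log_ == null) {                 // unsynchronized check of a mutable field
      initializeLogIfNecessary(false)   // only this part is guarded in the real code
      log_ = LoggerFactory.getLogger(logName)
    }
    log_
  }
}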

Is there a reason why it can't be just: [2]

@transient private lazy val log_ : Logger = {
  initializeLogIfNecessary(false)
  LoggerFactory.getLogger(logName)
}


And with that, initializeLogIfNecessary() could be called without
double-checked locking.

-- 
--
Cheers,
Praj

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L44-L50
[2]
https://github.com/apache/spark/blob/8ef3399aff04bf8b7ab294c0f55bcf195995842b/core/src/main/scala/org/apache/spark/internal/Logging.scala#L35


[VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-16 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.6.2!

The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v1.6.2-rc1
(4168d9c94a9564f6b3e62f5d669acde13a7c7cf6)

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1184

The documentation corresponding to this release can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-1.6.2-rc1-docs/


===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and then
reporting any regressions from 1.6.1.
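
For example, an sbt build can be pointed at the staging repository above with
something like the following (a sketch; adjust the modules to match your
workload):

// Sketch: resolve the RC artifacts from the staging repository listed above.
resolvers += "Apache Spark 1.6.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1184"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.2",
  "org.apache.spark" %% "spark-sql"  % "1.6.2"
)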


== What justifies a -1 vote for this release? ==

This is a maintenance release in the 1.6.x series.  Bugs already present in
1.6.1, missing features, or bugs related to new features will not
necessarily block this release.