Re: Assembly build on spark 2.0.0

2016-08-27 Thread Srikanth Sampath
Found the answer.  This is the reason
https://issues.apache.org/jira/browse/SPARK-11157

-Srikanth

On Sat, Aug 27, 2016 at 8:54 AM, Srikanth Sampath  wrote:

> Hi,
> Thanks Radek.  However, mvn package does not build the uber jar.  I am
> looking for an uber jar, not a distribution.  I have seen references to
> the uber jar here.
> 
>
> What I see in the Spark 2.0 codeline (assembly/pom.xml) builds a
> distribution:
>
> <profile>
>   <id>bigtop-dist</id>
>   <build>
>     <plugins>
>       <plugin>
>         <groupId>org.apache.maven.plugins</groupId>
>         <artifactId>maven-assembly-plugin</artifactId>
>         <executions>
>           <execution>
>             <id>dist</id>
>             <phase>package</phase>
>             <goals>
>               <goal>single</goal>
>             </goals>
>             <configuration>
>               <descriptors>
>                 <descriptor>src/main/assembly/assembly.xml</descriptor>
>               </descriptors>
>             </configuration>
>           </execution>
>         </executions>
>       </plugin>
>       ...
>     </plugins>
>   </build>
> </profile>
>
>
> In src/main/assembly/assembly.xml we see
>
> <assembly>
>   <id>dist</id>
>   <formats>
>     <format>tar.gz</format>
>     <format>dir</format>
>   </formats>
>   <includeBaseDirectory>false</includeBaseDirectory>
>   ...
> </assembly>
>
>
>
> On Sat, Aug 27, 2016 at 1:02 AM, Radoslaw Gruchalski wrote:
>
>> mvn package might be the command you’re looking for.
>>
>> –
>> Best regards,
>> Radek Gruchalski
>> ra...@gruchalski.com
>>
>>
>> On August 26, 2016 at 3:59:24 PM, Srikanth Sampath (
>> ssampath.apa...@gmail.com) wrote:
>>
>> Hi,
>> mvn assembly is creating a .tgz distribution.  How can I create a plain
>> jar archive?  I would like to create a spark-assembly-.jar
>> -Srikanth
>>
>>
>


Re: Assembly build on spark 2.0.0

2016-08-27 Thread Radoslaw Gruchalski
Ah, an uber jar. Normally one would build an uber jar with the Maven Shade plugin.
I haven't looked into the Spark code much recently, but it wouldn't make much sense
to have a separate Maven command that builds an uber jar in addition to the
distribution because, from memory, if you open the tgz file, the uber jar already
sits in the lib directory.
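
For reference, a minimal sketch of a maven-shade-plugin execution that produces an
uber jar during mvn package (this is generic Shade plugin usage, not Spark's own
build configuration):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- merge META-INF/services entries from all dependencies -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>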
-- 
Best regards,
Rad



Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Julien Dumazert
Hi all,

I'm forwarding a question I recently asked on Stack Overflow about benchmarking
Spark performance when working with case classes stored in Parquet files.

I am assessing the performance of different ways of loading Parquet files in 
Spark and the differences are staggering.

In our Parquet files, we have nested case classes of the type:

case class C(/* a dozen attributes */)
case class B(/* a dozen attributes */, cs: Seq[C])
case class A(/* a dozen attributes */, bs: Seq[B])

It takes a while to load them from Parquet files, so I benchmarked different ways
of loading case classes from Parquet files and summing a field, using Spark 1.6
and 2.0.

Here is a summary of the benchmark I did:

val df: DataFrame = sqlContext.read.parquet("path/to/file.gz.parquet").persist()
df.count()

// Spark 1.6

// Play Json
// 63.169s
df.toJSON.flatMap(s => Try(Json.parse(s).as[A]).toOption)
 .map(_.fieldToSum).sum()

// Direct access to field using Spark Row
// 2.811s
df.map(row => row.getAs[Long]("fieldToSum")).sum()

// A small library we developed that accesses fields using Spark Row
// 10.401s
df.toRDD[A].map(_.fieldToSum).sum()

// Dataframe hybrid SQL API
// 0.239s
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)

// Dataset with RDD-style code
// 34.223s
df.as[A].map(_.fieldToSum).reduce(_ + _)

// Dataset with column selection
// 0.176s
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)


// Spark 2.0

// Performance is similar except for:

// Direct access to field using Spark Row
// 23.168s
df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)

// A small library we developed that accesses fields using Spark Row
// 32.898s
f1DF.toRDD[A].map(_.fieldToSum).sum()
I understand why the performance of methods using Spark Row is degraded when 
upgrading to Spark 2.0, since Dataframe is now a mere alias of Dataset[Row]. 
That's the cost of unifying the interfaces, I guess.

On the other hand, I'm quite disappointed that the promise of Dataset is not
kept: performance with RDD-style coding (maps and flatMaps) is worse than when
using Dataset as a DataFrame with the SQL-like DSL.

Basically, to have good performance, we need to give up type safety.

1. What is the reason for such a difference between Dataset used as an RDD and
Dataset used as a DataFrame?

2. Is there a way to improve encoding performance in Dataset so that RDD-style
coding matches the performance of SQL-style coding? For data engineering, it's
much cleaner to have RDD-style coding.

3. Working with the SQL-like DSL would also require flattening our data model
and not using nested case classes. Am I right that good performance is only
achieved with flat data models?

Some more questions:

4. Is the performance regression between Spark 1.6 and Spark 2.0 an identified
problem? Will it be addressed in future releases? Or is the regression very
specific to my case, and should I handle my data differently?

5. Is the performance difference between RDD-style coding and SQL-style coding
with Dataset an identified problem? Will it be addressed in future releases?
Maybe there's no way to do anything about it for reasons I can't see with my
limited understanding of Spark internals. Or should I migrate to the SQL-style
interface, even though it means losing type safety?

Best regards,
Julien





Re: Performance of loading parquet files into case classes in Spark

2016-08-27 Thread Maciej Bryński
2016-08-27 15:27 GMT+02:00 Julien Dumazert :

> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)


I think reduce and sum have very different performance characteristics.
Did you try sql.functions.sum?
Or, if you want to benchmark access to the Row object, then the count() function
would be a better idea.
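
For reference, a sketch of the two measurements being suggested, assuming the same
df, field name, and SparkSession (spark) as in the benchmark above:

import org.apache.spark.sql.functions.sum
import spark.implicits._  // provides the Long encoder needed by map() on Spark 2.0

// 1) Aggregate entirely inside the SQL engine, with no per-row lambda:
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)

// 2) Benchmark only the Row field access; count() keeps the cost of the
//    reduction out of the measurement:
df.map(row => row.getAs[Long]("fieldToSum")).count()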

Regards,
-- 
Maciek Bryński


Cache'ing performance

2016-08-27 Thread Maciej Bryński
Hi,
I did some benchmark of cache function today.

RDD
sc.parallelize(0 until Int.MaxValue).cache().count()

Datasets
spark.range(Int.MaxValue).cache().count()

For me Datasets was 2 times slower.

Results (3 nodes, 20 cores and 48GB RAM each)
RDD - 6 s
Datasets - 13.5 s

Is that expected behavior for Datasets and Encoders?

Regards,
-- 
Maciek Bryński


Re: Structured Streaming with Kafka sources/sinks

2016-08-27 Thread Koert Kuipers
That's great.

Is this effort happening anywhere that is publicly visible? GitHub?

On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin  wrote:

> We (the team at Databricks) are working on one currently.
>
>
> On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger 
> wrote:
>
>> https://issues.apache.org/jira/browse/SPARK-15406
>>
>> I'm not working on it (yet?), never got an answer to the question of
>> who was planning to work on it.
>>
>> On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao 
>> wrote:
>> > Hi all,
>> >
>> >
>> >
>> > I’m trying to write Structured Streaming test code and will deal with
>> Kafka
>> > source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
>> >
>> >
>> >
>> > I found some Databricks slides saying that Kafka sources/sinks will be
>> > implemented in Spark 2.0, so is there anybody working on this? And when
>> > will it be released?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Chenzhao Guo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi,
I think this is a performance issue in both the DataFrame and Dataset cache;
it is not due only to Encoders. The DataFrame version
"spark.range(Int.MaxValue).toDF.cache().count()" is also slow.

While the cache for DataFrame and Dataset is stored in a columnar format with
some compressed data representations, we have found that there is room to
improve performance. We have already created pull requests to address this;
they are under review.
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/14091

We would appreciate your feedback on these pull requests.

Best Regards,
Kazuaki Ishizaki




Re: Cache'ing performance

2016-08-27 Thread linguin . m . s
Hi,

How does the performance difference change when turning off compression?
It is enabled by default.

// maropu

Sent by iPhone



Re: Cache'ing performance

2016-08-27 Thread Kazuaki Ishizaki
Hi,
Good point. I have just measured performance with
"spark.sql.inMemoryColumnarStorage.compressed=false".
It improved performance over the default, but it is still slower than the
RDD version on my environment.
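
For reference, a minimal sketch of that measurement, assuming a Spark 2.0
SparkSession named spark as in the original benchmark:

// Disable compression of the in-memory columnar cache, then rerun the Dataset case:
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
spark.range(Int.MaxValue).cache().count()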

This seems consistent with the PR
https://github.com/apache/spark/pull/11956, which shows room for performance
improvement for float/double values that are not compressed.

Kazuaki Ishizaki


