Re: How to give name to Spark jobs shown in Spark UI

2016-07-27 Thread unk1102
Thanks Rahul, but I think you didn't read the question properly. I have one main Spark
job which I name using the approach you described. As part of the main Spark
job I create multiple threads which essentially become child Spark jobs,
and those jobs have no direct way of being named.
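
For reference, a minimal sketch of one way to label such jobs: SparkContext.setJobGroup
and setJobDescription set thread-local properties, so each worker thread can tag the jobs
it submits before triggering an action (the names below are illustrative, and `sc` is
assumed to be an existing SparkContext):

import org.apache.spark.SparkContext

def runChildJob(sc: SparkContext, name: String): Unit = {
  sc.setJobGroup(name, s"child job: $name")   // or sc.setJobDescription(s"child job: $name")
  sc.parallelize(1 to 100).count()            // this job shows up under the group in the UI
  sc.clearJobGroup()
}

new Thread(new Runnable {
  def run(): Unit = runChildJob(sc, "load-partition-1")
}).start()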

On Jul 27, 2016 11:17, "rahulkumar-aws [via Apache Spark User List]" <
ml-node+s1001560n27414...@n3.nabble.com> wrote:

> You can set the name in SparkConf(), or if you are using spark-submit, set the
> --name flag
>
> val sparkConf = new SparkConf()
>   .setMaster("local[4]")
>   .setAppName("saveFileJob")
> val sc = new SparkContext(sparkConf)
>
>
> or spark-submit :
>
> ./bin/spark-submit --name "FileSaveJob" --master local[4] fileSaver.jar
>
>
>
>
> On Mon, Jul 25, 2016 at 9:46 PM, neil90 [via Apache Spark User List] <[hidden
> email] > wrote:
>
>> As far as I know you can give a name to the SparkContext. I recommend
>> using a cluster monitoring tool like Ganglia to determine where it's slow in
>> your Spark jobs.
>>
>
> Software Developer Sigmoid (SigmoidAnalytics), India
>
>





Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
Structured Streaming in 2.0 is declared as alpha - plenty of bits still
missing:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
I assume that it will be declared stable / GA in a future 2.x release, and
then it will co-exist with DStream for quite a while before someone suggests
starting a deprecation process that will eventually lead to its
removal...
As users, I guess we will each need to apply judgement about when to switch to
Structured Streaming - each of us has a different risk/value tradeoff,
based on our specific situation...

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  wrote:

> Hi guys
>
> Structure Stream is coming with spark 2.0,  but I noticed that DStream is
> still here
>
> What's the future of the DStream, will it be deprecated and removed
> eventually? Or co-existed with  Structure Stream forever?
>
> Thanks
> Chang
>
>


Re:[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300

Congratulations!


On 2016-07-27 14:00:22, "Reynold Xin" wrote:

Hi all,


Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes 2500+ 
patches from 300+ contributors.


To download Spark 2.0, head over to the download page: 
http://spark.apache.org/downloads.html


To view the release notes: 
http://spark.apache.org/releases/spark-release-2-0-0.html





(note: it can take a few hours for everything to be propagated, so you might 
get 404 on some download links.  If you see any issues with the release notes 
or webpage *please contact me directly, off-list*)



Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Gosh,

whether ORC came from this or that, it runs queries in HIVE with TEZ at a
speed that is better than SPARK.

Has anyone heard of KUDA? It's better than Parquet. But I think that someone
might just start saying that KUDA has a difficult lineage as well. After all,
dynastic rules dictate.

Personally I feel that if something stores my data compressed and makes me
access it faster, I do not care where it comes from or how difficult the
childbirth was :)


Regards,
Gourav

On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
sbpothin...@gmail.com> wrote:

> Just correction:
>
> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
> default.
>
> Do not know If Spark leveraging this new repo?
>
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
>
>
>
>
>
>
>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>
> parquet was inspired by dremel but written from the ground up as a library
> with support for a variety of big data systems (hive, pig, impala,
> cascading, etc.). it is also easy to add new support, since its a proper
> library.
>
> orc has been enhanced while deployed at facebook in hive and at yahoo in
> hive. just hive. it didn't really exist by itself. it was part of the big
> java soup that is called hive, without an easy way to extract it. hive does
> not expose proper java apis. it never cared for that.
>
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Interesting opinion, thank you
>>
>> Still, on the website parquet is basically inspired by Dremel (Google)
>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>> [2].
>>
>> Other than this presentation [3], do you guys know any other benchmark?
>>
>> [1]https://parquet.apache.org/documentation/latest/
>> [2]https://orc.apache.org/docs/
>> [3]
>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>
>> when parquet came out it was developed by a community of companies, and
>> was designed as a library to be supported by multiple big data projects.
>> nice
>>
>> orc on the other hand initially only supported hive. it wasn't even
>> designed as a library that can be re-used. even today it brings in the
>> kitchen sink of transitive dependencies. yikes
>>
>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>
>>> I think both are very similar, but with slightly different goals. While
>>> they work transparently for each Hadoop application you need to enable
>>> specific support in the application for predicate push down.
>>> In the end you have to check which application you are using and do some
>>> tests (with correct predicate push down configuration). Keep in mind that
>>> both formats work best if they are sorted on filter columns (which is your
>>> responsibility) and if their optimizations are correctly configured (min
>>> max index, bloom filter, compression etc) .
>>>
>>> If you need to ingest sensor data you may want to store it first in
>>> hbase and then batch process it in large files in Orc or parquet format.
>>>
>>> On 26 Jul 2016, at 04:09, janardhan shetty 
>>> wrote:
>>>
>>> Just wondering advantages and disadvantages to convert data into ORC or
>>> Parquet.
>>>
>>> In the documentation of Spark there are numerous examples of Parquet
>>> format.
>>>
>>> Any strong reasons to choose Parquet over ORC file format?
>>>
>>> Also : current data compression is bzip2
>>>
>>>
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>> This seems biased.
>>>
>>>
>>
>
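
On the predicate push down point quoted above, a quick way to check what a given Spark
build actually pushes into Parquet or ORC is to look at the physical plan. A minimal
sketch, assuming a Spark 2.0 SparkSession named `spark` and illustrative file paths
(reading ORC may additionally require Hive support enabled in 2.0):

import spark.implicits._

val parquetDF = spark.read.parquet("/data/ratings_parquet")
val orcDF     = spark.read.orc("/data/ratings_orc")

// Look for "PushedFilters" in the output; filters on sorted columns benefit most
// from the formats' min/max indexes and bloom filters.
parquetDF.filter($"rating" >= 4).explain()
orcDF.filter($"rating" >= 4).explain()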


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Sorry,

in my email above I was referring to KUDU - and there it goes: how can KUDU
be right if it is first mentioned in forums with a wrong spelling? It's got
a difficult beginning, with people trying to figure out its name.


Regards,
Gourav Sengupta

On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta 
wrote:

> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>>
 I think both are very similar, but with slightly different goals. While
 they work transparently for each Hadoop application you need to enable
 specific support in the application for predicate push down.
 In the end you have to check which application you are using and do
 some tests (with correct predicate push down configuration). Keep in mind
 that both formats work best if they are sorted on filter columns (which is
 your responsibility) and if their optimizations are correctly configured
 (min max index, bloom filter, compression etc) .

 If you need to ingest sensor data you may want to store it first in
 hbase and then batch process it in large files in Orc or parquet format.

 On 26 Jul 2016, at 04:09, janardhan shetty 
 wrote:

 Just wondering advantages and disadvantages to convert data into ORC or
 Parquet.

 In the documentation of Spark there are numerous examples of Parquet
 format.

 Any strong reasons to choose Parquet over ORC file format?

 Also : current data compression is bzip2


 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
 This seems biased.


>>>
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Ofir Manor
Hold the release! There is a minor documentation issue :)
But seriously, congrats all on this massive achievement!

Anyway, I think it would be very helpful to add a link to the Structured
Streaming Developer Guide (Alpha) to both the documentation home page and
the beginning of the "old" Spark Streaming Programming Guide, as I
think many users will look for it. I had a "deep link" to that page, so I
hadn't noticed until now that it is very hard to find. I'm referring to
this page:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 9:00 AM, Reynold Xin  wrote:

> Hi all,
>
> Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes
> 2500+ patches from 300+ contributors.
>
> To download Spark 2.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> http://spark.apache.org/releases/spark-release-2-0-0.html
>
>
> (note: it can take a few hours for everything to be propagated, so you
> might get 404 on some download links.  If you see any issues with the
> release notes or webpage *please contact me directly, off-list*)
>
>


Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and 
will probably be complete in the next few releases, but Spark Streaming will 
continue to exist, because it gives the user more low-level control. It's 
similar to DataFrames vs RDDs (RDDs are the lower-level API for when you want 
control, while DataFrames do more optimizations automatically by restricting 
the computation model).

Matei
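
A small illustration of that split, assuming a SparkSession `spark` and illustrative input
paths: the RDD version runs arbitrary user functions exactly as written, while the
DataFrame version is declarative enough for the optimizer to rearrange.

import spark.implicits._

// RDD API: full low-level control over functions and partitioning
val errorCountsRdd = spark.sparkContext.textFile("/logs/events.csv")
  .map(_.split(","))
  .filter(parts => parts(1) == "error")
  .map(parts => (parts(0), 1L))
  .reduceByKey(_ + _)

// DataFrame API: same result expressed declaratively, optimized automatically
val errorCountsDf = spark.read.option("header", "false").csv("/logs/events.csv")
  .filter($"_c1" === "error")
  .groupBy($"_c0")
  .count()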

> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
> 
> Structured Streaming in 2.0 is declared as alpha - plenty of bits still 
> missing:
>  
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>  
> 
> I assume that it will be declared stable / GA in a future 2.x release, and 
> then it will co-exist with DStream for quite a while before someone will 
> suggest to start a deprecation process that will eventually lead to its 
> removal...
> As a user, I guess we will need to apply judgement about when to switch to 
> Structured Streaming - each of us have a different risk/value tradeoff, based 
> on our specific situation...
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  > wrote:
> Hi guys
> 
> Structure Stream is coming with spark 2.0,  but I noticed that DStream is 
> still here
> 
> What's the future of the DStream, will it be deprecated and removed 
> eventually? Or co-existed with  Structure Stream forever?
> 
> Thanks
> Chang
> 
> 



Re:Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread prosp4300
Thanks for this immediate correction :)


On 2016-07-27 15:17:54, "Gourav Sengupta" wrote:

Sorry, 


in my email above I was referring to KUDU, and there is goes how can KUDU be 
right if it is mentioned in forums first with a wrong spelling. Its got a 
difficult beginning where people were trying to figure out its name.




Regards,
Gourav Sengupta


On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta  
wrote:

Gosh,


whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed 
that is better than SPARK.


Has anyone heard of KUDA? Its better than Parquet. But I think that someone 
might just start saying that KUDA has difficult lineage as well. After all 
dynastic rules dictate.


Personally I feel that if something stores my data compressed and makes me 
access it faster I do not care where it comes from or how difficult the child 
birth was :)




Regards,
Gourav


On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni  
wrote:

Just correction:


ORC Java libraries from Hive are forked into Apache ORC. Vectorization default. 


Do not know If Spark leveraging this new repo?



<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc</artifactId>
  <version>1.1.2</version>
  <type>pom</type>
</dependency>













Sent from my iPhone
On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:


parquet was inspired by dremel but written from the ground up as a library with 
support for a variety of big data systems (hive, pig, impala, cascading, etc.). 
it is also easy to add new support, since its a proper library.


orc has been enhanced while deployed at facebook in hive and at yahoo in hive. 
just hive. it didn't really exist by itself. it was part of the big java soup 
that is called hive, without an easy way to extract it. hive does not expose 
proper java apis. it never cared for that.



On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
 wrote:

Interesting opinion, thank you


Still, on the website parquet is basically inspired by Dremel (Google) [1] and 
part of orc has been enhanced while deployed for Facebook, Yahoo [2].


Other than this presentation [3], do you guys know any other benchmark?


[1]https://parquet.apache.org/documentation/latest/
[2]https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet


On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:



when parquet came out it was developed by a community of companies, and was 
designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even designed as 
a library that can be re-used. even today it brings in the kitchen sink of 
transitive dependencies. yikes



On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:

I think both are very similar, but with slightly different goals. While they 
work transparently for each Hadoop application you need to enable specific 
support in the application for predicate push down. 
In the end you have to check which application you are using and do some tests 
(with correct predicate push down configuration). Keep in mind that both 
formats work best if they are sorted on filter columns (which is your 
responsibility) and if their optimizations are correctly configured (min max 
index, bloom filter, compression etc) . 


If you need to ingest sensor data you may want to store it first in hbase and 
then batch process it in large files in Orc or parquet format.

On 26 Jul 2016, at 04:09, janardhan shetty  wrote:


Just wondering advantages and disadvantages to convert data into ORC or Parquet.


In the documentation of Spark there are numerous examples of Parquet format.



Any strong reasons to choose Parquet over ORC file format?


Also : current data compression is bzip2



http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
This seems biased.










Re: Setting spark.sql.shuffle.partitions Dynamically

2016-07-27 Thread Takeshi Yamamuro
Hi,

How about trying adaptive execution in Spark?
https://issues.apache.org/jira/browse/SPARK-9850
This feature is turned off by default because it is still experimental.

// maropu
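
A minimal sketch of both routes, assuming a Spark 2.0 SparkSession `spark` (the property
names below are my understanding of the 1.6/2.0 settings and should be verified against
your release):

// 1) SQL configs like spark.sql.shuffle.partitions are runtime settings, so they can be
//    changed between queries in the same session once the in-memory size is known
spark.conf.set("spark.sql.shuffle.partitions", "1000")   // sqlContext.setConf(...) in 1.x

// 2) experimental adaptive execution (SPARK-9850) sizes post-shuffle partitions for you
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
  (64 * 1024 * 1024).toString)

As for the original question: spark.sql.shuffle.partitions only controls how many
partitions a SQL/DataFrame shuffle (join, aggregation) produces, while .repartition(n)
explicitly inserts a shuffle to redistribute an existing Dataset into n partitions.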



On Wed, Jul 27, 2016 at 3:26 PM, Brandon White 
wrote:

> Hello,
>
> My platform runs hundreds of Spark jobs every day each with its own
> datasize from 20mb to 20TB. This means that we need to set resources
> dynamically. One major pain point for doing this is
> spark.sql.shuffle.partitions, the number of partitions to use when
> shuffling data for joins or aggregations. It is arbitrarily hard-coded to
> 200 by default. The only way to set this config is in the spark-submit
> command or in the SparkConf before the executor is created.
>
> This creates a lot of problems when I want to set this config dynamically
> based on the in memory size of a dataframe. I only know the in memory size
> of the dataframe halfway through the spark job. So I would need to stop the
> context and recreate it in order to set this config.
>
> Is there any better way to set this? How
> does  spark.sql.shuffle.partitions work differently than .repartition?
>
> Brandon
>



-- 
---
Takeshi Yamamuro


Re:Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300


Additionally, in the paragraph about MLlib, three links are missing; it would be better to
provide the links to give us more information. Thanks a lot.

See this blog post for details
See this talk to learn more
This talk lists many of these new features.


On 2016-07-27 15:18:41, "Ofir Manor" wrote:

Hold the release! There is a minor documentation issue :)
But seriously, congrats all on this massive achievement!


Anyway, I think it would be very helpful to add a link to the Structured 
Streaming Developer Guide (Alpha) to both the documentation home page and from 
the beginning of the "old" Spark Streaming Programming Guide, as I think many 
users will look for them. I had a "deep link" to that page so I haven't noticed 
that it is very hard to find until now. I'm referring to this page:
   
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html








Ofir Manor


Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io



On Wed, Jul 27, 2016 at 9:00 AM, Reynold Xin  wrote:

Hi all,


Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes 2500+ 
patches from 300+ contributors.


To download Spark 2.0, head over to the download page: 
http://spark.apache.org/downloads.html


To view the release notes: 
http://spark.apache.org/releases/spark-release-2-0-0.html





(note: it can take a few hours for everything to be propagated, so you might 
get 404 on some download links.  If you see any issues with the release notes 
or webpage *please contact me directly, off-list*)





Re: The Future Of DStream

2016-07-27 Thread Chang Chen
I don't understand what kind of low-level control DStream can offer that
Structured Streaming cannot.

Thanks
Chang

On Wednesday, July 27, 2016, Matei Zaharia  wrote:

> Yup, they will definitely coexist. Structured Streaming is currently alpha
> and will probably be complete in the next few releases, but Spark Streaming
> will continue to exist, because it gives the user more low-level control.
> It's similar to DataFrames vs RDDs (RDDs are the lower-level API for when
> you want control, while DataFrames do more optimizations automatically by
> restricting the computation model).
>
> Matei
>
> On Jul 27, 2016, at 12:03 AM, Ofir Manor  > wrote:
>
> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
> missing:
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> I assume that it will be declared stable / GA in a future 2.x release, and
> then it will co-exist with DStream for quite a while before someone will
> suggest to start a deprecation process that will eventually lead to its
> removal...
> As a user, I guess we will need to apply judgement about when to switch to
> Structured Streaming - each of us have a different risk/value tradeoff,
> based on our specific situation...
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
> 
>
> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  > wrote:
>
>> Hi guys
>>
>> Structure Stream is coming with spark 2.0,  but I noticed that DStream is
>> still here
>>
>> What's the future of the DStream, will it be deprecated and removed
>> eventually? Or co-existed with  Structure Stream forever?
>>
>> Thanks
>> Chang
>>
>>
>
>


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-27 Thread Nick Pentreath
This is exactly the core problem in the linked issue - normally you would
use the TrainValidationSplit or CrossValidator to do hyper-parameter
selection using cross-validation. You could tune the factor size,
regularization parameter and alpha (for implicit preference data), for
example.

Because of the NaN issue you cannot use the cross-validators currently with
ALS. So you would have to do it yourself manually (dropping the NaNs from
the prediction results as Krishna says).
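
A minimal sketch of doing that manually with the ml API (the column names and the
`ratings` DataFrame are assumptions about the input data):

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setRank(10).setRegParam(0.1)
val model = als.fit(training)

// transform() currently returns NaN for users/items unseen in training (SPARK-14489),
// so drop those rows before evaluating
val predictions = model.transform(test).na.drop(Seq("prediction"))

val rmse = new RegressionEvaluator()
  .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
  .evaluate(predictions)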



On Mon, 25 Jul 2016 at 11:40 Rohit Chaddha 
wrote:

> Hi Krishna,
>
> Great .. I had no idea about this.  I tried your suggestion by using
> na.drop() and got a rmse = 1.5794048211812495
> Any suggestions how this can be reduced and the model improved ?
>
> Regards,
> Rohit
>
> On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar 
> wrote:
>
>> Thanks Nick. I also ran into this issue.
>> VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
>> then use the dataset for the evaluator. In real life, probably detect the
>> NaN and recommend most popular on some window.
>> HTH.
>> Cheers
>> 
>>
>> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> It seems likely that you're running into
>>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when
>>> the test dataset in the train/test split contains users or items that were
>>> not in the training set. Hence the model doesn't have computed factors for
>>> those ids, and ALS 'transform' currently returns NaN for those ids. This in
>>> turn results in NaN for the evaluator result.
>>>
>>> I have a PR open on that issue that will hopefully address this soon.
>>>
>>>
>>> On Sun, 24 Jul 2016 at 17:49 VG  wrote:
>>>
 ping. Anyone has some suggestions/advice for me .
 It will be really helpful.

 VG
 On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:

> Sean,
>
> I did this just to test the model. When I do a split of my data as
> training to 80% and test to be 20%
>
> I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong
>
> Regards,
> VG
>
> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen 
> wrote:
>
>> No, that's certainly not to be expected. ALS works by computing a much
>> lower-rank representation of the input. It would not reproduce the
>> input exactly, and you don't want it to -- this would be seriously
>> overfit. This is why in general you don't evaluate a model on the
>> training set.
>>
>> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
>> > I am trying to run ml.ALS to compute some recommendations.
>> >
>> > Just to test I am using the same dataset for training using
>> ALSModel and for
>> > predicting the results based on the model .
>> >
>> > When I evaluate the result using RegressionEvaluator I get a
>> > Root-mean-square error = 1.5544064263236066
>> >
>> > I think this should be 0. Any suggestions what might be going wrong.
>> >
>> > Regards,
>> > Vipul
>>
>
>
>>


Re: Using flatMap on Dataframes with Spark 2.0

2016-07-27 Thread Julien Nauroy
Just a follow-up on my last question: the RowEncoder has to be defined AFTER 
declaring the columns, or else the new columns won't be serialized and will 
disappear after the flatMap. 
So the code should look like: 
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

var df1 = spark.read.parquet(fileName)

df1 = df1.withColumn("newCol", df1.col("anyExistingCol"))
df1.printSchema() // here newCol exists

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df1.schema)
df1 = df1.flatMap(x => List(x))
df1.printSchema() // newCol still exists!


Julien 



- Original Message -

From: "Julien Nauroy"
To: "Sun Rui"
Cc: user@spark.apache.org
Sent: Sunday, July 24, 2016 12:43:42
Subject: Re: Using flatMap on Dataframes with Spark 2.0

Hi again, 

Just another strange behavior I stumbled upon. Can anybody reproduce it? 
Here's the code snippet in scala: 
var df1 = spark.read.parquet(fileName) 


df1 = df1.withColumn("newCol", df1.col("anyExistingCol")) 
df1.printSchema() // here newCol exists 
df1 = df1.flatMap(x => List(x)) 
df1.printSchema() // newCol has disappeared 

Any idea what I could be doing wrong? Why would newCol disappear? 


Cheers, 
Julien 



- Original Message -

From: "Julien Nauroy"
To: "Sun Rui"
Cc: user@spark.apache.org
Sent: Saturday, July 23, 2016 23:39:08
Subject: Re: Using flatMap on Dataframes with Spark 2.0

Thanks, it works like a charm now! 

Not sure how I could have found it by myself though. 
Maybe the error message when you don't specify the encoder should point to 
RowEncoder. 


Cheers, 
Julien 

- Original Message -

From: "Sun Rui"
To: "Julien Nauroy"
Cc: user@spark.apache.org
Sent: Saturday, July 23, 2016 16:27:43
Subject: Re: Using flatMap on Dataframes with Spark 2.0

You should use : 
import org.apache.spark.sql.catalyst.encoders.RowEncoder 

val df = spark.read.parquet(fileName) 

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) 

val df1 = df.flatMap { x => List(x) } 



On Jul 23, 2016, at 22:01, Julien Nauroy < julien.nau...@u-psud.fr > wrote: 

Thanks for your quick reply. 

I've tried with this encoder: 
implicit def RowEncoder: org.apache.spark.sql.Encoder[Row] = 
org.apache.spark.sql.Encoders.kryo[Row] 
Using a suggestion from 
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset-in-spark-1-6
 

How did you setup your encoder? 


- Original Message -

From: "Sun Rui" < sunrise_...@163.com >
To: "Julien Nauroy" < julien.nau...@u-psud.fr >
Cc: user@spark.apache.org
Sent: Saturday, July 23, 2016 15:55:21
Subject: Re: Using flatMap on Dataframes with Spark 2.0

I gave it a try. The schema after flatMap is the same, which is expected.

What’s your Row encoder? 



On Jul 23, 2016, at 20:36, Julien Nauroy < julien.nau...@u-psud.fr > wrote: 

Hi, 

I'm trying to call flatMap on a Dataframe with Spark 2.0 (rc5). 
The code is the following: 
var data = spark.read.parquet(fileName).flatMap(x => List(x)) 

Of course it's an overly simplified example, but the result is the same. 
The dataframe schema goes from this: 
root 
|-- field1: double (nullable = true) 
|-- field2: integer (nullable = true) 
(etc) 

to this: 
root 
|-- value: binary (nullable = true) 

Plus I have to provide an encoder for Row. 
I expect to get the same schema after calling flatMap. 
Any idea what I could be doing wrong? 


Best regards, 
Julien 
















Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread u...@moosheimer.com
Hi Gourav,

Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory
db with data storage, while Parquet is "only" a columnar storage format.

As I understand it, Kudu is a BI db meant to compete with Exasol or Hana (ok ... that's
more a wish :-).

Regards,
Uwe

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer

> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta :
> 
> Gosh,
> 
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a 
> speed that is better than SPARK.
> 
> Has anyone heard of KUDA? Its better than Parquet. But I think that someone 
> might just start saying that KUDA has difficult lineage as well. After all 
> dynastic rules dictate.
> 
> Personally I feel that if something stores my data compressed and makes me 
> access it faster I do not care where it comes from or how difficult the child 
> birth was :)
> 
> 
> Regards,
> Gourav
> 
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>>  wrote:
>> Just correction:
>> 
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization 
>> default. 
>> 
>> Do not know If Spark leveraging this new repo?
>> 
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Sent from my iPhone
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>> 
>> 
>>> parquet was inspired by dremel but written from the ground up as a library 
>>> with support for a variety of big data systems (hive, pig, impala, 
>>> cascading, etc.). it is also easy to add new support, since its a proper 
>>> library.
>>> 
>>> orc has been enhanced while deployed at facebook in hive and at yahoo in 
>>> hive. just hive. it didn't really exist by itself. it was part of the big 
>>> java soup that is called hive, without an easy way to extract it. hive does 
>>> not expose proper java apis. it never cared for that.
>>> 
 On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
  wrote:
 Interesting opinion, thank you
 
 Still, on the website parquet is basically inspired by Dremel (Google) [1] 
 and part of orc has been enhanced while deployed for Facebook, Yahoo [2].
 
 Other than this presentation [3], do you guys know any other benchmark?
 
 [1]https://parquet.apache.org/documentation/latest/
 [2]https://orc.apache.org/docs/
 [3] 
 http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
 
> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
> 
> when parquet came out it was developed by a community of companies, and 
> was designed as a library to be supported by multiple big data projects. 
> nice
> 
> orc on the other hand initially only supported hive. it wasn't even 
> designed as a library that can be re-used. even today it brings in the 
> kitchen sink of transitive dependencies. yikes
> 
> 
>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>> I think both are very similar, but with slightly different goals. While 
>> they work transparently for each Hadoop application you need to enable 
>> specific support in the application for predicate push down. 
>> In the end you have to check which application you are using and do some 
>> tests (with correct predicate push down configuration). Keep in mind 
>> that both formats work best if they are sorted on filter columns (which 
>> is your responsibility) and if their optimizations are correctly 
>> configured (min max index, bloom filter, compression etc) . 
>> 
>> If you need to ingest sensor data you may want to store it first in 
>> hbase and then batch process it in large files in Orc or parquet format.
>> 
>>> On 26 Jul 2016, at 04:09, janardhan shetty  
>>> wrote:
>>> 
>>> Just wondering advantages and disadvantages to convert data into ORC or 
>>> Parquet. 
>>> 
>>> In the documentation of Spark there are numerous examples of Parquet 
>>> format. 
>>> 
>>> Any strong reasons to choose Parquet over ORC file format?
>>> 
>>> Also : current data compression is bzip2
>>> 
>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>  
>>> This seems biased.
> 


Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
For the 2.0 release, look for "Unsupported Operations" here:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Also, there are bigger gaps - like no Kafka support, no way to plug
user-defined sources or sinks etc

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 11:24 AM, Chang Chen  wrote:

>
> I don't understand what kind of low level control that DStream can do
> while Structure Streaming can not
>
> Thanks
> Chang
>
> On Wednesday, July 27, 2016, Matei Zaharia 
> wrote:
>
>> Yup, they will definitely coexist. Structured Streaming is currently
>> alpha and will probably be complete in the next few releases, but Spark
>> Streaming will continue to exist, because it gives the user more low-level
>> control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API
>> for when you want control, while DataFrames do more optimizations
>> automatically by restricting the computation model).
>>
>> Matei
>>
>> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
>>
>> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
>> missing:
>>
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>> I assume that it will be declared stable / GA in a future 2.x release,
>> and then it will co-exist with DStream for quite a while before someone
>> will suggest to start a deprecation process that will eventually lead to
>> its removal...
>> As a user, I guess we will need to apply judgement about when to switch
>> to Structured Streaming - each of us have a different risk/value tradeoff,
>> based on our specific situation...
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  wrote:
>>
>>> Hi guys
>>>
>>> Structure Stream is coming with spark 2.0,  but I noticed that DStream
>>> is still here
>>>
>>> What's the future of the DStream, will it be deprecated and removed
>>> eventually? Or co-existed with  Structure Stream forever?
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>
>>


spark

2016-07-27 Thread ناهید بهجتی نجف آبادی
Hi!
I have a problem with Spark. Please note that I'm very new to this. I
downloaded the spark-1.6.2 source code and I want to build it. When I try
to build Spark with "/build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-DskipTests clean package" this error shows up:
Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile(scala-compile-first) on
project spark-test-tags-2.10. Execution scala-compile-first of goal
net.alchim31.maven 


Please help me! What can I do to fix this?


thanks


Re: The Future Of DStream

2016-07-27 Thread Chang Chen
Things like Kafka and user-defined sources are not supported yet, just
because Structured Streaming is in the alpha stage.

Things like sort are not supported because of implementation difficulty,
and I don't think DStream can support them either.

What I want to know is the difference between the APIs (or abstractions). For
example, it is quite easy to reuse the same code for processing batch data
because of the unbounded-table abstraction (which comes from Google's Dataflow
paper); that's why the internal engine is based on the logical plan, the Spark plan
and RDDs. In contrast, DStream can't do the same thing easily.

Actually, Dataset supports map, flatMap and reduce, and hence I can do any
user-defined work in theory; that's why I ask what kind of low-level
control DStream offers that Structured Streaming does not.

Thanks
Chang
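
A sketch of the unbounded-table point, using the 2.0 alpha API (the schema and paths are
illustrative): the same DataFrame query works on a bounded and an unbounded source, which
is hard to express directly against DStreams.

import spark.implicits._
import org.apache.spark.sql.types._

val schema = new StructType().add("action", StringType).add("ts", TimestampType)

val staticDF    = spark.read.schema(schema).json("/data/events")        // bounded table
val streamingDF = spark.readStream.schema(schema).json("/data/events")  // unbounded table

// identical logic on both; only the sink/trigger side differs
val batchCounts  = staticDF.groupBy($"action").count()
val streamCounts = streamingDF.groupBy($"action").count()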





On Wed, Jul 27, 2016 at 6:03 PM, Ofir Manor  wrote:

> For the 2.0 release, look for "Unsupported Operations" here:
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
> Also, there are bigger gaps - like no Kafka support, no way to plug
> user-defined sources or sinks etc
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Wed, Jul 27, 2016 at 11:24 AM, Chang Chen  wrote:
>
>>
>> I don't understand what kind of low level control that DStream can do
>> while Structure Streaming can not
>>
>> Thanks
>> Chang
>>
>> On Wednesday, July 27, 2016, Matei Zaharia 
>> wrote:
>>
>>> Yup, they will definitely coexist. Structured Streaming is currently
>>> alpha and will probably be complete in the next few releases, but Spark
>>> Streaming will continue to exist, because it gives the user more low-level
>>> control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API
>>> for when you want control, while DataFrames do more optimizations
>>> automatically by restricting the computation model).
>>>
>>> Matei
>>>
>>> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
>>>
>>> Structured Streaming in 2.0 is declared as alpha - plenty of bits still
>>> missing:
>>>
>>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>>> I assume that it will be declared stable / GA in a future 2.x release,
>>> and then it will co-exist with DStream for quite a while before someone
>>> will suggest to start a deprecation process that will eventually lead to
>>> its removal...
>>> As a user, I guess we will need to apply judgement about when to switch
>>> to Structured Streaming - each of us have a different risk/value tradeoff,
>>> based on our specific situation...
>>>
>>> Ofir Manor
>>>
>>> Co-Founder & CTO | Equalum
>>>
>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>
>>> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen 
>>> wrote:
>>>
 Hi guys

 Structure Stream is coming with spark 2.0,  but I noticed that DStream
 is still here

 What's the future of the DStream, will it be deprecated and removed
 eventually? Or co-existed with  Structure Stream forever?

 Thanks
 Chang


>>>
>>>
>


Re:Re:Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300


The page mentioned before is the release notes page, which is missing the links:
http://spark.apache.org/releases/spark-release-2-0-0.html#mllib


At 2016-07-27 15:56:00, "prosp4300"  wrote:



Additionally, in the paragraph about MLlib, three links missed, it is better to 
provide the links to give us more information, thanks a lot

See this blog post for details
See this talk to learn more
This talk lists many of these new features.


On 2016-07-27 15:18:41, "Ofir Manor" wrote:

Hold the release! There is a minor documentation issue :)
But seriously, congrats all on this massive achievement!


Anyway, I think it would be very helpful to add a link to the Structured 
Streaming Developer Guide (Alpha) to both the documentation home page and from 
the beginning of the "old" Spark Streaming Programming Guide, as I think many 
users will look for them. I had a "deep link" to that page so I haven't noticed 
that it is very hard to find until now. I'm referring to this page:
   
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html








Ofir Manor


Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io



On Wed, Jul 27, 2016 at 9:00 AM, Reynold Xin  wrote:

Hi all,


Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes 2500+ 
patches from 300+ contributors.


To download Spark 2.0, head over to the download page: 
http://spark.apache.org/downloads.html


To view the release notes: 
http://spark.apache.org/releases/spark-release-2-0-0.html





(note: it can take a few hours for everything to be propagated, so you might 
get 404 on some download links.  If you see any issues with the release notes 
or webpage *please contact me directly, off-list*)








 

Re: spark

2016-07-27 Thread Jacek Laskowski
Hi,

Are you on Java 7 or 8? Can you include the error just before this
"Failed to execute"? There was a build issue with spark-test-tags-2.10
once.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jul 27, 2016 at 12:45 PM, ناهید بهجتی نجف آبادی
 wrote:
> Hi!
> I have a problem with spark.It's better you notice that I'm very new in
> these. I downloaded spark-1.6.2 source code and I wana build it. when I try
> to build spark with "/build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
> -DskipTests clean package" this error shows up :
> Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile(scala-compile-first) on
> project spark-test-tags-2.10. Execution scala-compile-first of goal
> net.alchim31.maven 
>
>
> please help me!! what can I do to fix this??!!
>
>
> thanks




Re: libraryDependencies

2016-07-27 Thread Jacek Laskowski
Hi,

How did you reference "sparksample"? If it ended up in
/Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample I
believe it was referenced as a git-based project in sbt. Is that
correct?

Also, when you "provided" Spark libs you won't be able to run Spark
apps in sbt. See
https://github.com/sbt/sbt-assembly#-provided-configuration. The trick
is to create a test app that executes main of your standalone app.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
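
One thing that stands out in the quoted build file (whether or not it is the cause of
this particular error): with sbt, %% appends the Scala binary version to the artifact
name, so "spark-mllib_2.11" combined with %% asks for spark-mllib_2.11_2.11. A sketch of
a consistent dependency list (the version numbers are assumptions to be aligned with your
Spark build):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided",
  // Breeze also comes in transitively via spark-mllib; pin it only if you need a specific version
  "org.scalanlp" %% "breeze"         % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)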


On Tue, Jul 26, 2016 at 9:18 PM, Martin Somers  wrote:
>
> my build file looks like
>
> libraryDependencies  ++= Seq(
>   // other dependencies here
>   "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
>   "org.apache.spark" %% "spark-mllib_2.11" % "1.6.0",
>   "org.scalanlp" % "breeze_2.11" % "0.7",
>   // native libraries are not included by default. add this if
> you want them (as of 0.7)
>   // native libraries greatly improve performance, but increase
> jar sizes.
>   "org.scalanlp" % "breeze-natives_2.11" % "0.7",
> )
>
> not 100% sure on the version numbers if they are indeed correct
> getting an error of
>
> [info] Resolving jline#jline;2.12.1 ...
> [info] Done updating.
> [info] Compiling 1 Scala source to
> /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/target/scala-2.11/classes...
> [error]
> /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample/src/main/scala/MyApp.scala:2:
> object mllib is not a member of package org.apache.spark
> [error] import org.apache.spark.mllib.linalg.distributed.RowMatrix
> 
> ...
>
>
> Im trying to import in
>
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.mllib.linalg.SingularValueDecomposition
>
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>
>
> import breeze.linalg._
> import breeze.linalg.{ Matrix => B_Matrix }
> import breeze.linalg.{ Vector => B_Matrix }
> import breeze.linalg.DenseMatrix
>
> object MyApp {
>   def main(args: Array[String]): Unit = {
> //code here
> }
>
>
> It might not be the correct way of doing this
>
> Anyone got any suggestion
> tks
> M
>
>
>




tpcds for spark2.0

2016-07-27 Thread kevin
Hi all,
I want to test the TPC-DS 99 SQL queries on Spark 2.0.
I use https://github.com/databricks/spark-sql-perf

With the master version, when I run "val tpcds = new TPCDS (sqlContext =
sqlContext)" I get this error:

scala> val tpcds = new TPCDS (sqlContext = sqlContext)
error: missing or invalid dependency detected while loading class file
'Benchmarkable.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
the problematic classpath.)
A full rebuild may help if 'Benchmarkable.class' was compiled against an
incompatible version of com.
error: missing or invalid dependency detected while loading class file
'Benchmarkable.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
the problematic classpath.)
A full rebuild may help if 'Benchmarkable.class' was compiled against an
incompatible version of com.typesafe.

With spark-sql-perf-0.4.3, when I run
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false,
false, false, false) I get this error:

Generating table catalog_sales in database to
hdfs://master1:9000/tpctest/catalog_sales with save mode Overwrite.
16/07/27 18:59:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
slave1): java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD


Re: Is RowMatrix missing in org.apache.spark.ml package?

2016-07-27 Thread Robin East
Can you use the version from mllib? 
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 
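
For reference, a minimal sketch of using the mllib RowMatrix directly (assuming an
existing SparkContext `sc`); as far as I know the distributed matrices have not been
ported to the ml package yet:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(2, computeU = true)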






> On 26 Jul 2016, at 18:20, Rohit Chaddha  wrote:
> 
> It is present in mllib but I don't seem to find it in the ml package.
> Any suggestions please ? 
> 
> -Rohit



Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Jestin Ma
I know that SparkSession is replacing SQLContext and HiveContext, but what
about SparkConf and SparkContext? Are those still relevant in our programs?

Thank you!
Jestin


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
Kudu has, from my impression, been designed to offer something between HBase
and Parquet for write-intensive loads - it is not faster for warehouse-type
querying compared to Parquet (merely slower, because that is not its use case).
I assume this is still its strategy.

For some scenarios it could make sense together with Parquet and ORC. However, I
am not sure what the advantage is over using HBase + Parquet and ORC.

> On 27 Jul 2016, at 11:47, "u...@moosheimer.com"  wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in 
> memory db with data storage while Parquet is "only" a columnar storage format.
> 
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... 
> that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta :
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a 
>> speed that is better than SPARK.
>> 
>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone 
>> might just start saying that KUDA has difficult lineage as well. After all 
>> dynastic rules dictate.
>> 
>> Personally I feel that if something stores my data compressed and makes me 
>> access it faster I do not care where it comes from or how difficult the 
>> child birth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>>>  wrote:
>>> Just correction:
>>> 
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization 
>>> default. 
>>> 
>>> Do not know If Spark leveraging this new repo?
>>> 
>>> <dependency>
>>>   <groupId>org.apache.orc</groupId>
>>>   <artifactId>orc</artifactId>
>>>   <version>1.1.2</version>
>>>   <type>pom</type>
>>> </dependency>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Sent from my iPhone
 On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
 
>>> 
 parquet was inspired by dremel but written from the ground up as a library 
 with support for a variety of big data systems (hive, pig, impala, 
 cascading, etc.). it is also easy to add new support, since its a proper 
 library.
 
 orc has been enhanced while deployed at facebook in hive and at yahoo in 
 hive. just hive. it didn't really exist by itself. it was part of the big 
 java soup that is called hive, without an easy way to extract it. hive 
 does not expose proper java apis. it never cared for that.
 
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>  wrote:
> Interesting opinion, thank you
> 
> Still, on the website parquet is basically inspired by Dremel (Google) 
> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo 
> [2].
> 
> Other than this presentation [3], do you guys know any other benchmark?
> 
> [1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3] 
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
> 
>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>> 
>> when parquet came out it was developed by a community of companies, and 
>> was designed as a library to be supported by multiple big data projects. 
>> nice
>> 
>> orc on the other hand initially only supported hive. it wasn't even 
>> designed as a library that can be re-used. even today it brings in the 
>> kitchen sink of transitive dependencies. yikes
>> 
>> 
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>> I think both are very similar, but with slightly different goals. While 
>>> they work transparently for each Hadoop application you need to enable 
>>> specific support in the application for predicate push down. 
>>> In the end you have to check which application you are using and do 
>>> some tests (with correct predicate push down configuration). Keep in 
>>> mind that both formats work best if they are sorted on filter columns 
>>> (which is your responsibility) and if their optimizations are 
>>> correctly configured (min max index, bloom filter, compression etc) . 
>>> 
>>> If you need to ingest sensor data you may want to store it first in 
>>> hbase and then batch process it in large files in Orc or parquet format.
>>> 
 On 26 Jul 2016, at 04:09, janardhan shetty  
 wrote:
 
 Just wondering advantages and disadvantages to convert data into ORC 
 or Parquet. 
 
 In the documentation of Spark there are numerous examples of Parquet 
 format. 
 
 Any strong reasons to choose Parquet over ORC file format?
 
 Also : current data compression is bzip2
 
 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
  
 This seems biased.
>> 


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
Since everyone here is discussing this ever-changing (for better reasons)
topic of storage formats and serdes: any opinions/thoughts/experience with
Apache Arrow? It sounds like a nice idea, but how ready is it?

On Wed, Jul 27, 2016 at 11:31 PM, Jörn Franke  wrote:

> Kudu has been from my impression be designed to offer somethings between
> hbase and parquet for write intensive loads - it is not faster for
> warehouse type of querying compared to parquet (merely slower, because that
> is not its use case).   I assume this is still the strategy of it.
>
> For some scenarios it could make sense together with parquet and Orc.
> However I am not sure what the advantage towards using hbase + parquet and
> Orc.
>
> On 27 Jul 2016, at 11:47, "u...@moosheimer.com " <
> u...@moosheimer.com > wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
> memory db with data storage while Parquet is "only" a columnar
> storage format.
>
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
> that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta  >:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc has been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>>
 I think both are very similar, but with slightly different goals. While
 they work transparently for each Hadoop application you need to enable
 specific support in the application for predicate push down.
 In the end you have to check which application you are using and do
 some tests (with correct predicate push down configuration). Keep in mind
 that both formats work best if they are sorted on filter columns (which is
 your responsibility) and if their optimizations are correctly configured
 (min max index, bloom filter, compression etc) .

 If you need to ingest sensor data you may want to store it first in
 hbase and then batch process it in large files in Orc or parquet format.

 On 26 Jul 2016, at 04:09, janardhan shetty 
 wrote:

 Just wondering advantages and disadvantages to convert data into ORC or
 Parquet.

 In the documentation of Spark there are numerous examples of Parquet
 format.

 Any strong reasons to choose Parquet over ORC file format?

 Also : current data compression is bzip2


 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
 This seems biased.


>>>
>>
>


-- 
Best Re

Re: Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Sun Rui
If you want to keep using RDD API, then you still need to create SparkContext 
first.

If you want to use just Dataset/DataFrame/SQL API, then you can directly create 
a SparkSession. Generally the SparkContext is hidden although it is internally 
created and held within the SparkSession. Anytime you need the SparkContext, 
you can get it from SparkSession.sparkContext.   while SparkConf is accepted 
when creating a SparkSession, the formal way to set/get configurations for a 
SparkSession is through SparkSession.conf.set()/get()
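
For reference, a minimal PySpark sketch of the above (Spark 2.0 API; the master, app name and config key are only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("example-app") \
    .getOrCreate()

# The SparkContext is created and held internally; grab it when you need the RDD API.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])

# Runtime configuration goes through spark.conf rather than a SparkConf object.
spark.conf.set("spark.sql.shuffle.partitions", "16")
print(spark.conf.get("spark.sql.shuffle.partitions"))
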
> On Jul 27, 2016, at 21:02, Jestin Ma  wrote:
> 
> I know that Sparksession is replacing the SQL and HiveContexts, but what 
> about SparkConf and SparkContext? Are those still relevant in our programs?
> 
> Thank you!
> Jestin



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Timothy Potter
Take this simple join:

SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
solr.movie_id = m.movie_id ORDER BY aggCount DESC

I would like the ability to push the inner sub-query aliased as "solr"
down into the data source engine, in this case Solr as it will
greatly reduce the amount of data that has to be transferred from
Solr into Spark. I would imagine this issue comes up frequently if the
underlying engine is a JDBC data source as well ...

Is this possible? Of course, my example is a bit cherry-picked so
determining if a sub-query can be pushed down into the data source
engine is probably not a trivial task, but I'm wondering if Spark has
the hooks to allow me to try ;-)

Cheers,
Tim

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Marco Colombo
Why don't you create a filtered dataframe, register it as a temporary table and
then use it in your query? You can also cache it, if multiple queries on
the same inner query are expected.
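
For illustration, a rough PySpark sketch of that suggestion (Spark 2.0 API; the movies and ratings DataFrames are assumed to be loaded from the data source already):

# Pre-aggregate the inner query on the Spark side and expose it as a temporary view.
top_rated = (ratings.filter(ratings.rating >= 4)
                    .groupBy("movie_id")
                    .count()
                    .withColumnRenamed("count", "aggCount")
                    .orderBy("aggCount", ascending=False)
                    .limit(10)
                    .cache())
top_rated.createOrReplaceTempView("solr")     # registerTempTable("solr") on Spark 1.x
movies.createOrReplaceTempView("movies")

result = spark.sql("""
    SELECT m.title AS title, solr.aggCount AS aggCount
    FROM movies m JOIN solr ON solr.movie_id = m.movie_id
    ORDER BY aggCount DESC
""")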

On Wednesday, 27 July 2016, Timothy Potter  wrote:

> Take this simple join:
>
> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>
> I would like the ability to push the inner sub-query aliased as "solr"
> down into the data source engine, in this case Solr as it will
> greatlly reduce the amount of data that has to be transferred from
> Solr into Spark. I would imagine this issue comes up frequently if the
> underlying engine is a JDBC data source as well ...
>
> Is this possible? Of course, my example is a bit cherry-picked so
> determining if a sub-query can be pushed down into the data source
> engine is probably not a trivial task, but I'm wondering if Spark has
> the hooks to allow me to try ;-)
>
> Cheers,
> Tim
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>
>

-- 
Ing. Marco Colombo


Building Spark 2 from source that does not include the Hive jars

2016-07-27 Thread Mich Talebzadeh
Hi,

This has worked before, including with 1.6.1 etc.

I build Spark without the Hive jars; the idea is to use Spark as the Hive
execution engine.

There are some notes on Hive on Spark: Getting Started


The usual process is to do

dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"

However, now I am getting this warning
[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 10:08 min (Wall Clock)
[INFO] Finished at: 2016-07-27T15:07:11+01:00
[INFO] Final Memory: 98M/1909M
[INFO]

+ rm -rf /data6/hduser/spark-2.0.0/dist
+ mkdir -p /data6/hduser/spark-2.0.0/dist/jars
+ echo 'Spark [WARNING] The requested profile "parquet-provided" could not
be activated because it does not exist. built for Hadoop [WARNING] The
requested profile "parquet-provided" could not be activated because it does
not exist.'
+ echo 'Build flags: -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided'


And this is the only tgz file I see

./spark-[WARNING] The requested profile "parquet-provided" could not be
activated because it does not exist.-bin-hadoop2-without-hive.tgz

Any clues as to what is happening and the correct way of creating the build?

My interest is to extract a jar file, similar to the one below, from the build

 spark-assembly-1.3.1-hadoop2.4.0.jar

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Timothy Potter
I'm not looking for a one-off solution for a specific query that can
be solved on the client side as you suggest, but rather a generic
solution that can be implemented within the DataSource impl itself
when it knows a sub-query can be pushed down into the engine. In other
words, I'd like to intercept the query planning process to be able to
push-down computation into the engine when it makes sense.

On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
 wrote:
> Why don't you create a dataframe filtered, map it as temporary table and
> then use it in your query? You can also cache it, of multiple queries on the
> same inner queries are requested.
>
>
> Il mercoledì 27 luglio 2016, Timothy Potter  ha
> scritto:
>>
>> Take this simple join:
>>
>> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>>
>> I would like the ability to push the inner sub-query aliased as "solr"
>> down into the data source engine, in this case Solr as it will
>> greatlly reduce the amount of data that has to be transferred from
>> Solr into Spark. I would imagine this issue comes up frequently if the
>> underlying engine is a JDBC data source as well ...
>>
>> Is this possible? Of course, my example is a bit cherry-picked so
>> determining if a sub-query can be pushed down into the data source
>> engine is probably not a trivial task, but I'm wondering if Spark has
>> the hooks to allow me to try ;-)
>>
>> Cheers,
>> Tim
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
> --
> Ing. Marco Colombo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: read only specific jsons

2016-07-27 Thread vr spark
Hi,
I tried it and am still getting an exception. Any other suggestions?

clickDF = cDF.filter(cDF['request.clientIP'].isNotNull())

It fails for some cases and errors out with the below message:

AnalysisException: u'No such struct field clientIP in cookies,
nscClientIP1, nscClientIP2, uAgent;'

On Tue, Jul 26, 2016 at 12:05 PM, Cody Koeninger  wrote:

> Have you tried filtering out corrupt records with something along the
> lines of
>
>  df.filter(df("_corrupt_record").isNull)
>
> On Tue, Jul 26, 2016 at 1:53 PM, vr spark  wrote:
> > i am reading data from kafka using spark streaming.
> >
> > I am reading json and creating dataframe.
> > I am using pyspark
> >
> > kvs = KafkaUtils.createDirectStream(ssc, kafkaTopic1, kafkaParams)
> >
> > lines = kvs.map(lambda x: x[1])
> >
> > lines.foreachRDD(mReport)
> >
> > def mReport(clickRDD):
> >
> >clickDF = sqlContext.jsonRDD(clickRDD)
> >
> >clickDF.registerTempTable("clickstream")
> >
> >PagesDF = sqlContext.sql(
> >
> > "SELECT   request.clientIP as ip "
> >
> > "FROM clickstream "
> >
> > "WHERE request.clientIP is not null "
> >
> > " limit 2000 "
> >
> >
> > The problem is that not all the jsons from the stream have the same
> format.
> >
> > It works when it reads a json which has ip.
> >
> > Some of the json strings do not have client ip in their schema.
> >
> > So i am getting error and my job is failing when it encounters such a
> json.
> >
> > How do read only those json which has ip in their schema?
> >
> > Please suggest.
>


Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Jestin Ma
Hi, I'm doing performance testing and currently have 1 master node and 4
worker nodes and am submitting in client mode from a 6th cluster node.

I know we can have a master and worker on the same node. Speaking in terms
of performance and practicality, is it possible/suggested to have another
worker running on either the 6th node or the master node?

Thank you!


Re: Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Mich Talebzadeh
Hi Jestin.

As I understand it, you are using Spark in standalone mode, meaning that you
start your master and slave/worker processes yourself.

You can specify the number of workers for each node in the
$SPARK_HOME/conf/spark-env.sh file as below:

# Options for the daemons used in the standalone deploy mode
export SPARK_WORKER_INSTANCES=3 ## to set the number of worker processes
per node

And you specify the hosts for the master and slaves in the conf/slaves file.

When you run start-master.sh and start-slaves.sh, you will see the worker
processes.

Now if you have localhost in the slaves file, you will start worker processes on
your master node, so to speak. There is nothing wrong with that as long as
your master node has resources for the Spark app.

Once you have started them, you will see something like below using the jps command:

21697 Worker
18242 Master
21496 Worker
21597 Worker

Where is your edge node (the machine from which you are submitting your Spark app)?


HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 July 2016 at 18:19, Jestin Ma  wrote:

> Hi, I'm doing performance testing and currently have 1 master node and 4
> worker nodes and am submitting in client mode from a 6th cluster node.
>
> I know we can have a master and worker on the same node. Speaking in terms
> of performance and practicality, is it possible/suggested to have another
> working running on either the 6th node or the master node?
>
> Thank you!
>
>


Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-27 Thread janardhan shetty
1.  Any links or blogs to develop *custom* transformers ? ex: Tokenizer

2. Any links or blogs to develop *custom* estimators ? ex: any ml algorithm


spark 1.6.0 read s3 files error.

2016-07-27 Thread freedafeng
cdh 5.7.1. pyspark. 

codes: ===
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)

myRdd =
sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")

count = myRdd.count()
print "The count is", count

===
standalone mode: command line:

AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit
--driver-memory 4G  --master  spark://master:7077 --conf
"spark.default.parallelism=70"  /root/workspace/test/s3.py

Error:
)
16/07/27 17:27:26 INFO spark.SparkContext: Created broadcast 0 from textFile
at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "/root/workspace/test/s3.py", line 12, in 
count = myRdd.count()
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 1004, in count
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 995, in sum
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 869, in fold
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py",
line 771, in collect
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 813, in __call__
  File
"/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py",
line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
   
org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
@155: invokevirtual
  Reason:
Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is
not assignable to 'org/jets3t/service/model/StorageObject'
  Current Frame:
bci: @155
flags: { }
locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore',
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object'
}
stack: { 'org/jets3t/service/S3Service', 'java/lang/String',
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object',
integer }
  Bytecode:
000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
010: 5659 b701 5713 0192 b601 5b2b b601 5b13
020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
030: b400 7db6 00e7 b601 5bb6 015e b901 9802
040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
0a0: 000a 4e2a 2d2b b700 c7b1   
  Exception Handler Table:
bci [0, 116] => handler: 162
bci [117, 159] => handler: 162
  Stackmap Table:
same_frame_extended(@65)
same_frame(@117)
same_locals_1_stack_item_frame(@162,Object[#139])
same_frame(@169)

at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)

.

TIA




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Andy Davidson
Congratulations on releasing 2.0!



spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However,
http://spark.apache.org/docs/latest/index.html has a link to the spark-ec2
GitHub repo https://github.com/amplab/spark-ec2



Is this the right group to discuss spark-ec2?

Any idea how stable spark-ec2 is on spark-2.0?

Should we use master or branch-2.0? It looks like the default might be the
branch-1.6 ?

Thanks

Andy


P.S. The new standalone documentation is a big improvement. I have a much
better idea of what spark-ec2 does and how to upgrade my system.















Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-27 Thread Robert Goodman
I tried to run the JavaAFTSurvivalRegressionExample on Spark 2.0 and the
example doesn't work. It looks like the problem is that the example is
using the MLlib Vector/VectorUDT to create the Dataset, which needs to be
converted using MLUtils before being used in the model. I haven't actually tried
this yet.

When I run the example (/bin/run-example
ml.JavaAFTSurvivalRegressionExample), I get the following stack trace

Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
at scala.Predef$.require(Predef.scala:224)
at
org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at
org.apache.spark.ml.regression.AFTSurvivalRegressionParams$class.validateAndTransformSchema(AFTSurvivalRegression.scala:106)
at
org.apache.spark.ml.regression.AFTSurvivalRegression.validateAndTransformSchema(AFTSurvivalRegression.scala:126)
at
org.apache.spark.ml.regression.AFTSurvivalRegression.fit(AFTSurvivalRegression.scala:199)
at
org.apache.spark.examples.ml.JavaAFTSurvivalRegressionExample.main(JavaAFTSurvivalRegressionExample.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Are you supposed to be able to use the ML version of VectorUDT? The Spark 2.0
API docs for Java don't show the class, but I was able to import the class
into a Java program.

   Thanks
 Bob
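
For comparison, a minimal PySpark sketch that builds the input with the new ml-package vectors, which is what the 2.0 estimator expects (the data values are made up):

from pyspark.ml.linalg import Vectors          # note: pyspark.ml.linalg, not pyspark.mllib.linalg
from pyspark.ml.regression import AFTSurvivalRegression

training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
], ["label", "censor", "features"])

aft = AFTSurvivalRegression(quantileProbabilities=[0.3, 0.6], quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show(truncate=False)
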


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-27 Thread Steve Rowe
You can see the source for my transformer configurable bridge to Lucene 
analysis components here, in my company Lucidworks’ spark-solr project: 
.

Here’s a blog I wrote about using this transformer, as well as non-ML-context 
use in Spark of the underlying analysis component, here: 
.
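
For a bare-bones starting point, independent of the Lucene bridge above, a custom PySpark Transformer can be sketched roughly as below (the class and column names are made up). A custom estimator follows the same pattern but extends Estimator and returns a model from _fit().

from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class WhitespaceTokenizer(Transformer, HasInputCol, HasOutputCol):
    """Toy custom transformer: splits a string column on whitespace."""

    def __init__(self, inputCol, outputCol):
        super(WhitespaceTokenizer, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        split = udf(lambda s: s.split() if s is not None else [],
                    ArrayType(StringType()))
        return dataset.withColumn(self.getOutputCol(),
                                  split(dataset[self.getInputCol()]))

# usage: WhitespaceTokenizer(inputCol="text", outputCol="words").transform(df)
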

--
Steve
www.lucidworks.com

> On Jul 27, 2016, at 1:31 PM, janardhan shetty  wrote:
> 
> 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
> 
> 2. Any links or blogs to develop custom estimators ? ex: any ml algorithm


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark 1.6.0 read s3 files error.

2016-07-27 Thread Andy Davidson
Hi Freedafeng

The following works for me; df will be a data frame. fullPath is a list
of various part files stored in s3.

fullPath = 
['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r
-0-a2121800-fa5b-44b1-a994-67795' ]

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('json').load(fullPath).select("key1") #.cache()


My files are raw JSON. You might have to tweak the above statement to work
with gz files.

Andy


From:  freedafeng 
Date:  Wednesday, July 27, 2016 at 10:36 AM
To:  "user @spark" 
Subject:  spark 1.6.0 read s3 files error.

> cdh 5.7.1. pyspark.
> 
> codes: ===
> from pyspark import SparkContext, SparkConf
> 
> conf = SparkConf().setAppName('s3 ---')
> sc = SparkContext(conf=conf)
> 
> myRdd =
> sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4
> e65-b749-b4a2694b9199.json.gz")
> 
> count = myRdd.count()
> print "The count is", count
> 
> ===
> standalone mode: command line:
> 
> AWS_ACCESS_KEY_ID=??? AWS_SECRET_ACCESS_KEY=??? ./bin/spark-submit
> --driver-memory 4G  --master  spark://master:7077 --conf
> "spark.default.parallelism=70"  /root/workspace/test/s3.py
> 
> Error:
> )
> 16/07/27 17:27:26 INFO spark.SparkContext: Created broadcast 0 from textFile
> at NativeMethodAccessorImpl.java:-2
> Traceback (most recent call last):
>   File "/root/workspace/test/s3.py", line 12, in 
> count = myRdd.count()
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark
> .zip/pyspark/rdd.py",
> line 1004, in count
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark
> .zip/pyspark/rdd.py",
> line 995, in sum
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark
> .zip/pyspark/rdd.py",
> line 869, in fold
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/pyspark
> .zip/pyspark/rdd.py",
> line 771, in collect
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.
> 9-src.zip/py4j/java_gateway.py",
> line 813, in __call__
>   File
> "/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/python/lib/py4j-0.
> 9-src.zip/py4j/protocol.py",
> line 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
>
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/Stri
> ng;Ljava/lang/String;)V
> @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is
> not assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore',
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object'
> }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String',
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object',
> integer }
>   Bytecode:
> 000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
> 
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3Fi
> leSystem.java:338)
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem
> .java:328)
> 
> .
> 
> TIA
> 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-
> error-tp27417.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 




Re: spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Nicholas Chammas
Yes, spark-ec2 has been removed from the main project, as called out in the
Release Notes:

http://spark.apache.org/releases/spark-release-2-0-0.html#removals

You can still discuss spark-ec2 here or on Stack Overflow, as before. Bug
reports and the like should now go on that AMPLab GitHub project as opposed
to JIRA, though.

You should use branch-2.0.

On Wed, Jul 27, 2016 at 2:30 PM Andy Davidson 
wrote:

> Congratulations on releasing 2.0!
>
>
> spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script How ever
>  http://spark.apache.org/docs/latest/index.html  has a link to the
> spark-ec2 github repo https://github.com/amplab/spark-ec2
>
>
> Is this the right group to discuss spark-ec2?
>
> Any idea how stable spark-ec2 is on spark-2.0?
>
> Should we use master or branch-2.0? It looks like the default might be the
> branch-1.6 ?
>
> Thanks
>
> Andy
>
>
> P.s. The new stand alone documentation is a big improvement. I have a
> much better idea of what spark-ec2 does and how to upgrade my system.
>
>
>
>
>
>
>
>
>
>
>
>


Re: Spark Web UI port 4040 not working

2016-07-27 Thread Marius Soutier
That's to be expected - the application UI is not started by the master, but by 
the driver. So the UI will run on the machine that submits the job.


> On 26.07.2016, at 15:49, Jestin Ma  wrote:
> 
> I did netstat -apn | grep 4040 on machine 6, and I see
> 
> tcp0  0 :::4040 :::*
> LISTEN  30597/java
> 
> What does this mean?
> 
> On Tue, Jul 26, 2016 at 6:47 AM, Jestin Ma  > wrote:
> I do not deploy using cluster mode and I don't use EC2.
> 
> I just read that launching as client mode: "the driver is launched directly 
> within the spark-submit process which acts as a client to the cluster."
> 
> My current setup is that I have cluster machines 1, 2, 3, 4, 5, with 1 being 
> the master. 
> I submit from another cluster machine 6 in client mode. So I'm taking that 
> the driver is launched in my machine 6.
> 
> On Tue, Jul 26, 2016 at 6:38 AM, Jacek Laskowski  > wrote:
> Hi,
> 
> Do you perhaps deploy using cluster mode? Is this EC2? You'd need to
> figure out where the driver runs and use the machine's IP.
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/ 
> Mastering Apache Spark http://bit.ly/mastering-apache-spark 
> 
> Follow me at https://twitter.com/jaceklaskowski 
> 
> 
> 
> On Tue, Jul 26, 2016 at 3:36 PM, Jestin Ma  > wrote:
> > I tried doing that on my master node.
> > I got nothing.
> > However, I grep'd port 8080 and I got the standalone UI.
> >
> > On Tue, Jul 26, 2016 at 12:39 AM, Chanh Le  > > wrote:
> >>
> >> You’re running in StandAlone Mode?
> >> Usually inside active task it will show the address of current job.
> >> or you can check in master node by using netstat -apn | grep 4040
> >>
> >>
> >>
> >> > On Jul 26, 2016, at 8:21 AM, Jestin Ma  >> > >
> >> > wrote:
> >> >
> >> > Hello, when running spark jobs, I can access the master UI (port 8080
> >> > one) no problem. However, I'm confused as to how to access the web UI to 
> >> > see
> >> > jobs/tasks/stages/etc.
> >> >
> >> > I can access the master UI at http://:8080. But port 4040
> >> > gives me a -connection cannot be reached-.
> >> >
> >> > Is the web UI http:// with a port of 4040?
> >> >
> >> > I'm running my Spark job on a cluster machine and submitting it to a
> >> > master node part of the cluster. I heard of ssh tunneling; is that 
> >> > relevant
> >> > here?
> >> >
> >> > Thank you!
> >>
> >
> 
> 



Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
Kudu has, from my impression, been designed to offer something between HBase
and Parquet for write-intensive loads - it is not faster for warehouse-type
querying compared to Parquet (merely slower, because that is not its use case).
I assume this is still its strategy.

For some scenarios it could make sense together with Parquet and ORC. However, I
am not sure what the advantage is compared to using HBase + Parquet and ORC.

> On 27 Jul 2016, at 11:47, "u...@moosheimer.com"  wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in 
> memory db with data storage while Parquet is "only" a columnar storage format.
> 
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... 
> that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta :
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a 
>> speed that is better than SPARK.
>> 
>> Has anyone heard of KUDA? Its better than Parquet. But I think that someone 
>> might just start saying that KUDA has difficult lineage as well. After all 
>> dynastic rules dictate.
>> 
>> Personally I feel that if something stores my data compressed and makes me 
>> access it faster I do not care where it comes from or how difficult the 
>> child birth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>>>  wrote:
>>> Just correction:
>>> 
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization 
>>> default. 
>>> 
>>> Do not know If Spark leveraging this new repo?
>>> 
>>> <dependency>
>>>   <groupId>org.apache.orc</groupId>
>>>   <artifactId>orc</artifactId>
>>>   <version>1.1.2</version>
>>>   <type>pom</type>
>>> </dependency>
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Sent from my iPhone
 On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
 
>>> 
 parquet was inspired by dremel but written from the ground up as a library 
 with support for a variety of big data systems (hive, pig, impala, 
 cascading, etc.). it is also easy to add new support, since its a proper 
 library.
 
 orc bas been enhanced while deployed at facebook in hive and at yahoo in 
 hive. just hive. it didn't really exist by itself. it was part of the big 
 java soup that is called hive, without an easy way to extract it. hive 
 does not expose proper java apis. it never cared for that.
 
> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>  wrote:
> Interesting opinion, thank you
> 
> Still, on the website parquet is basically inspired by Dremel (Google) 
> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo 
> [2].
> 
> Other than this presentation [3], do you guys know any other benchmark?
> 
> [1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3] 
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
> 
>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>> 
>> when parquet came out it was developed by a community of companies, and 
>> was designed as a library to be supported by multiple big data projects. 
>> nice
>> 
>> orc on the other hand initially only supported hive. it wasn't even 
>> designed as a library that can be re-used. even today it brings in the 
>> kitchen sink of transitive dependencies. yikes
>> 
>> 
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>> I think both are very similar, but with slightly different goals. While 
>>> they work transparently for each Hadoop application you need to enable 
>>> specific support in the application for predicate push down. 
>>> In the end you have to check which application you are using and do 
>>> some tests (with correct predicate push down configuration). Keep in 
>>> mind that both formats work best if they are sorted on filter columns 
>>> (which is your responsibility) and if their optimatizations are 
>>> correctly configured (min max index, bloom filter, compression etc) . 
>>> 
>>> If you need to ingest sensor data you may want to store it first in 
>>> hbase and then batch process it in large files in Orc or parquet format.
>>> 
 On 26 Jul 2016, at 04:09, janardhan shetty  
 wrote:
 
 Just wondering advantages and disadvantages to convert data into ORC 
 or Parquet. 
 
 In the documentation of Spark there are numerous examples of Parquet 
 format. 
 
 Any strong reasons to chose Parquet over ORC file format ?
 
 Also : current data compression is bzip2
 
 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
  
 This seems like biased.
>> 


Run times for Spark 1.6.2 compared to 2.1.0?

2016-07-27 Thread Colin Beckingham
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It 
calculates a logistic model using MLlib. I compiled 2.1.0 today from 
source and used 1.6.2 as a precompiled version built with Hadoop. The 
odd thing is that on 1.6.2 the project produces an answer in 350 sec and 
the 2.1.0 takes 990 sec. Identical code using pyspark. I'm wondering if 
there is something in the setup params for 1.6 and 2.1, say number of 
executors or memory allocation, which might account for this? I'm using 
just the 4 cores of my machine as master and executors.
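
One quick way to check for such differences is to dump the effective configuration under each version and diff the output, e.g.:

# Run this under 1.6.2 and under the 2.x build, then diff the two outputs.
for key, value in sorted(sc.getConf().getAll()):
    print(key + "=" + value)
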


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark-2.x what is the default version of java ?

2016-07-27 Thread Andy Davidson
I currently have to configure spark-1.x to use Java 8 and python 3.x. I
noticed that 
http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions
java 7 is deprecated.

Is the default now Java 8 ?

Thanks 

Andy

Deprecations
The following features have been deprecated in Spark 2.0, and might be
removed in future versions of Spark 2.x:
* Fine-grained mode in Apache Mesos
* Support for Java 7
* Support for Python 2.6




Re: spark-2.x what is the default version of java ?

2016-07-27 Thread Jacek Laskowski
Hi,

The default version of Java is 7. It's being discussed when to settle on 8
as the default version. Nobody knows when that will happen.

Jacek

On 27 Jul 2016 11:00 p.m., "Andy Davidson" 
wrote:

> I currently have to configure spark-1.x to use Java 8 and python 3.x. I
> noticed that
> http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions
> java 7 is deprecated.
>
> Is the default now Java 8 ?
>
> Thanks
>
> Andy
>
> Deprecations
>
> The following features have been deprecated in Spark 2.0, and might be
> removed in future versions of Spark 2.x:
>
>- Fine-grained mode in Apache Mesos
>- Support for Java 7
>- Support for Python 2.6
>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread janardhan shetty
It seems like the Parquet format is comparatively better than ORC when the dataset
is log data without nested structures? Is this a fair understanding?
On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:

> Kudu has been from my impression be designed to offer somethings between
> hbase and parquet for write intensive loads - it is not faster for
> warehouse type of querying compared to parquet (merely slower, because that
> is not its use case).   I assume this is still the strategy of it.
>
> For some scenarios it could make sense together with parquet and Orc.
> However I am not sure what the advantage towards using hbase + parquet and
> Orc.
>
> On 27 Jul 2016, at 11:47, "u...@moosheimer.com " <
> u...@moosheimer.com > wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
> memory db with data storage while Parquet is "only" a columnar
> storage format.
>
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
> that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta  >:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> <dependency>
>>   <groupId>org.apache.orc</groupId>
>>   <artifactId>orc</artifactId>
>>   <version>1.1.2</version>
>>   <type>pom</type>
>> </dependency>
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
>>>
 I think both are very similar, but with slightly different goals. While
 they work transparently for each Hadoop application you need to enable
 specific support in the application for predicate push down.
 In the end you have to check which application you are using and do
 some tests (with correct predicate push down configuration). Keep in mind
 that both formats work best if they are sorted on filter columns (which is
 your responsibility) and if their optimatizations are correctly configured
 (min max index, bloom filter, compression etc) .

 If you need to ingest sensor data you may want to store it first in
 hbase and then batch process it in large files in Orc or parquet format.

 On 26 Jul 2016, at 04:09, janardhan shetty 
 wrote:

 Just wondering advantages and disadvantages to convert data into ORC or
 Parquet.

 In the documentation of Spark there are numerous examples of Parquet
 format.

 Any strong reasons to chose Parquet over ORC file format ?

 Also : current data compression is bzip2


 http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
 This seems like biased.


>>>
>>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Sudhir Babu Pothineni
It depends on what you are doing; here is a recent comparison of ORC and Parquet:

https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet

Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as the System of 
Record on our Cloudera HDFS cluster, and our experience is so far good.

Parquet is backed by Cloudera, which has more installations of Hadoop. ORC is 
backed by Hortonworks, so the battle of file formats continues...

Sent from my iPhone

> On Jul 27, 2016, at 4:54 PM, janardhan shetty  wrote:
> 
> Seems like parquet format is better comparatively to orc when the dataset is 
> log data without nested structures? Is this fair understanding ?
> 
>> On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:
>> Kudu has been from my impression be designed to offer somethings between 
>> hbase and parquet for write intensive loads - it is not faster for warehouse 
>> type of querying compared to parquet (merely slower, because that is not its 
>> use case).   I assume this is still the strategy of it.
>> 
>> For some scenarios it could make sense together with parquet and Orc. 
>> However I am not sure what the advantage towards using hbase + parquet and 
>> Orc.
>> 
>>> On 27 Jul 2016, at 11:47, "u...@moosheimer.com"  wrote:
>>> 
>>> Hi Gourav,
>>> 
>>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in 
>>> memory db with data storage while Parquet is "only" a columnar storage 
>>> format.
>>> 
>>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ... 
>>> that's more a wish :-).
>>> 
>>> Regards,
>>> Uwe
>>> 
>>> Mit freundlichen Grüßen / best regards
>>> Kay-Uwe Moosheimer
>>> 
 Am 27.07.2016 um 09:15 schrieb Gourav Sengupta :
 
 Gosh,
 
 whether ORC came from this or that, it runs queries in HIVE with TEZ at a 
 speed that is better than SPARK.
 
 Has anyone heard of KUDA? Its better than Parquet. But I think that 
 someone might just start saying that KUDA has difficult lineage as well. 
 After all dynastic rules dictate.
 
 Personally I feel that if something stores my data compressed and makes me 
 access it faster I do not care where it comes from or how difficult the 
 child birth was :)
 
 
 Regards,
 Gourav
 
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>  wrote:
> Just correction:
> 
> ORC Java libraries from Hive are forked into Apache ORC. Vectorization 
> default. 
> 
> Do not know If Spark leveraging this new repo?
> 
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
> 
> 
> 
> 
> 
> 
> 
> 
> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>> 
> 
>> parquet was inspired by dremel but written from the ground up as a 
>> library with support for a variety of big data systems (hive, pig, 
>> impala, cascading, etc.). it is also easy to add new support, since its 
>> a proper library.
>> 
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in 
>> hive. just hive. it didn't really exist by itself. it was part of the 
>> big java soup that is called hive, without an easy way to extract it. 
>> hive does not expose proper java apis. it never cared for that.
>> 
>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>>>  wrote:
>>> Interesting opinion, thank you
>>> 
>>> Still, on the website parquet is basically inspired by Dremel (Google) 
>>> [1] and part of orc has been enhanced while deployed for Facebook, 
>>> Yahoo [2].
>>> 
>>> Other than this presentation [3], do you guys know any other benchmark?
>>> 
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3] 
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>> 
 On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:
 
 when parquet came out it was developed by a community of companies, 
 and was designed as a library to be supported by multiple big data 
 projects. nice
 
 orc on the other hand initially only supported hive. it wasn't even 
 designed as a library that can be re-used. even today it brings in the 
 kitchen sink of transitive dependencies. yikes
 
 
> On Jul 26, 2016 5:09 AM, "Jörn Franke"  wrote:
> I think both are very similar, but with slightly different goals. 
> While they work transparently for each Hadoop application you need to 
> enable specific support in the application for predicate push down. 
> In the end you have to check which application you are using and do 
> some tests (with correct predicate push down configuration). Keep in 
> mind that both formats work best if they are sorted on filter columns 
>

how to copy local files to hdfs quickly?

2016-07-27 Thread Andy Davidson
I have a spark streaming app that saves JSON files to s3:// . It works fine

Now I need to calculate some basic summary stats and am running into
horrible performance problems.

I want to run a test to see if reading from hdfs instead of s3 makes a
difference. I am able to quickly copy the data from s3 to a machine in my
cluster, however 'hadoop fs -put' is painfully slow. Is there a better way to
copy large data to hdfs?

I should mention I am not using EMR, i.e. according to AWS support there is
no way to have '$aws s3' copy a directory to hdfs://

Hadoop distcp cannot copy files from the local file system.

Thanks in advance

Andy








Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Mich Talebzadeh
And frankly this is becoming some sort of religious argument now.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
wrote:

> It depends on what you are dong, here is the recent comparison of ORC,
> Parquet
>
>
> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> Although from ORC authors, I thought fair comparison, We use ORC as System
> of Record on our Cloudera HDFS cluster, our experience is so far good.
>
> Perquet is backed by Cloudera, which has more installations of Hadoop. ORC
> is by Hortonworks, so battle of file format continues...
>
> Sent from my iPhone
>
> On Jul 27, 2016, at 4:54 PM, janardhan shetty 
> wrote:
>
> Seems like parquet format is better comparatively to orc when the dataset
> is log data without nested structures? Is this fair understanding ?
> On Jul 27, 2016 1:30 PM, "Jörn Franke"  wrote:
>
>> Kudu has been from my impression be designed to offer somethings between
>> hbase and parquet for write intensive loads - it is not faster for
>> warehouse type of querying compared to parquet (merely slower, because that
>> is not its use case).   I assume this is still the strategy of it.
>>
>> For some scenarios it could make sense together with parquet and Orc.
>> However I am not sure what the advantage towards using hbase + parquet and
>> Orc.
>>
>> On 27 Jul 2016, at 11:47, "u...@moosheimer.com " <
>> u...@moosheimer.com > wrote:
>>
>> Hi Gourav,
>>
>> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
>> memory db with data storage while Parquet is "only" a columnar
>> storage format.
>>
>> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
>> that's more a wish :-).
>>
>> Regards,
>> Uwe
>>
>> Mit freundlichen Grüßen / best regards
>> Kay-Uwe Moosheimer
>>
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta > >:
>>
>> Gosh,
>>
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
>> speed that is better than SPARK.
>>
>> Has anyone heard of KUDA? Its better than Parquet. But I think that
>> someone might just start saying that KUDA has difficult lineage as well.
>> After all dynastic rules dictate.
>>
>> Personally I feel that if something stores my data compressed and makes
>> me access it faster I do not care where it comes from or how difficult the
>> child birth was :)
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
>> sbpothin...@gmail.com> wrote:
>>
>>> Just correction:
>>>
>>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>>> default.
>>>
>>> Do not know If Spark leveraging this new repo?
>>>
>>> <dependency>
>>>   <groupId>org.apache.orc</groupId>
>>>   <artifactId>orc</artifactId>
>>>   <version>1.1.2</version>
>>>   <type>pom</type>
>>> </dependency>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Sent from my iPhone
>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers  wrote:
>>>
>>> parquet was inspired by dremel but written from the ground up as a
>>> library with support for a variety of big data systems (hive, pig, impala,
>>> cascading, etc.). it is also easy to add new support, since its a proper
>>> library.
>>>
>>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>>> hive. just hive. it didn't really exist by itself. it was part of the big
>>> java soup that is called hive, without an easy way to extract it. hive does
>>> not expose proper java apis. it never cared for that.
>>>
>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
 Interesting opinion, thank you

 Still, on the website parquet is basically inspired by Dremel (Google)
 [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
 [2].

 Other than this presentation [3], do you guys know any other benchmark?

 [1]https://parquet.apache.org/documentation/latest/
 [2]https://orc.apache.org/docs/
 [3]
 http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

 On 26 Jul 2016, at 15:19, Koert Kuipers  wrote:

 when parquet came out it was developed by a community of companies, and
 was designed as a library to be supported by multiple big data projects.
 nice

 orc on the other hand initially only supported hive. it wasn't even
 designed as a library that can be re-used. even today it brings in the
 kitchen sink of transitive dependencies. yikes

 On Jul 26, 2016 5:09 AM, "Jör

How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
How do I download 2.0? The main download page isn't showing it? And all the
other download links point to the same single download page.

This is the one I end up at:
http://spark.apache.org/downloads.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-download-2-0-The-main-download-page-isn-t-showing-it-tp27420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
Nevermind, it literally just appeared right after I posted this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-download-2-0-The-main-download-page-isn-t-showing-it-tp27420p27421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Andrew Ash
You sometimes have to hard refresh to get the page to update.

On Wed, Jul 27, 2016 at 5:12 PM, Jim O'Flaherty 
wrote:

> Nevermind, it literally just appeared right after I posted this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-download-2-0-The-main-download-page-isn-t-showing-it-tp27420p27421.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-27 Thread Ascot Moss
Hi,

Please help!

When saving the model, I got the following error and cannot save the model to
hdfs:

(my source code, my spark is v1.6.2)
my_model.save(sc, "/my_model")

-
16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose tasks
have all completed, from pool

16/07/28 08:36:19 INFO DAGScheduler: ResultStage 69 (saveAsTextFile at
treeEnsembleModels.scala:447) finished in 0.901 s

16/07/28 08:36:19 INFO DAGScheduler: Job 38 finished: saveAsTextFile at
treeEnsembleModels.scala:447, took 2.513396 s

Killed
-


Q1: Is there any limitation on saveAsTextFile?
Q2: or where to find the error log file location?

Regards


DecisionTree currently only supports maxDepth <= 30

2016-07-27 Thread Ascot Moss
Hi,

Is there any reason behind the limit maxDepth <= 30? Can it be deeper?


Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: DecisionTree currently only supports maxDepth <= 30, but was given
maxDepth = 50.

at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:169)


Regards


A question about Spark Cluster vs Local Mode

2016-07-27 Thread Ascot Moss
Hi,

If I submit the same job to Spark in cluster mode, does that mean it will be run
in the cluster's memory pool and will fail if it runs out
of the cluster's memory?

--driver-memory 64g \

--executor-memory 16g \

Regards


Re: read only specific jsons

2016-07-27 Thread Cody Koeninger
No, I literally meant filter on _corrupt_record, which has a magic
meaning in dataframe api to identify lines that didn't match the
schema.
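
A rough PySpark illustration of that (assuming the JSON is loaded through the DataFrame reader, which by default collects unparseable lines into a _corrupt_record column):

# lines is an RDD of raw JSON strings, e.g. from the Kafka DStream
df = sqlContext.read.json(lines)

# Keep only the rows that parsed cleanly against the inferred schema.
if "_corrupt_record" in df.columns:
    good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
else:
    good = df

good.registerTempTable("clickstream")
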

On Wed, Jul 27, 2016 at 12:19 PM, vr spark  wrote:
> HI ,
> I tried and getting exception still..any other suggestion?
>
> clickDF = cDF.filter(cDF['request.clientIP'].isNotNull())
>
> It fails for some cases and errors our with below message
>
> AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1,
> nscClientIP2, uAgent;'
>
>
> On Tue, Jul 26, 2016 at 12:05 PM, Cody Koeninger  wrote:
>>
>> Have you tried filtering out corrupt records with something along the
>> lines of
>>
>>  df.filter(df("_corrupt_record").isNull)
>>
>> On Tue, Jul 26, 2016 at 1:53 PM, vr spark  wrote:
>> > i am reading data from kafka using spark streaming.
>> >
>> > I am reading json and creating dataframe.
>> > I am using pyspark
>> >
>> > kvs = KafkaUtils.createDirectStream(ssc, kafkaTopic1, kafkaParams)
>> >
>> > lines = kvs.map(lambda x: x[1])
>> >
>> > lines.foreachRDD(mReport)
>> >
>> > def mReport(clickRDD):
>> >
>> >clickDF = sqlContext.jsonRDD(clickRDD)
>> >
>> >clickDF.registerTempTable("clickstream")
>> >
>> >PagesDF = sqlContext.sql(
>> >
>> > "SELECT   request.clientIP as ip "
>> >
>> > "FROM clickstream "
>> >
>> > "WHERE request.clientIP is not null "
>> >
>> > " limit 2000 "
>> >
>> >
>> > The problem is that not all the jsons from the stream have the same
>> > format.
>> >
>> > It works when it reads a json which has ip.
>> >
>> > Some of the json strings do not have client ip in their schema.
>> >
>> > So i am getting error and my job is failing when it encounters such a
>> > json.
>> >
>> > How do read only those json which has ip in their schema?
>> >
>> > Please suggest.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Andy Davidson
Hi Ascot

When you run in cluster mode, it means your cluster manager will cause your
driver to execute on one of the workers in your cluster.

The advantage of this is you can log on to a machine in your cluster and
submit your application and then log out. The application will continue to
run.

Here is part of shell script I use to start a streaming app in cluster mode.
This app has been running for several months now

numCores=2 # must be at least 2 else the streaming app will not get any data;
           # overall we are using 3 cores



# --executor-memory=1G # default is supposed to be 1G. If we did not set it, we
# were seeing 6G

executorMem=1G



$SPARK_ROOT/bin/spark-submit \

--class "com.pws.sparkStreaming.collector.StreamingKafkaCollector" \

--master $MASTER_URL \

--deploy-mode cluster \

--total-executor-cores $numCores \

--executor-memory $executorMem \

$jarPath --clusterMode $*



From:  Ascot Moss 
Date:  Wednesday, July 27, 2016 at 6:48 PM
To:  "user @spark" 
Subject:  A question about Spark Cluster vs Local Mode

> Hi,
> 
> If I submit the same job to spark in cluster mode, does it mean in cluster
> mode it will be run in cluster memory pool and it will fail if it runs out of
> cluster's memory?
> 
> 
> --driver-memory 64g \
> 
> --executor-memory 16g \
> 
> Regards




performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
I have a relatively small data set however it is split into many small JSON
files. Each file is between maybe 4K and 400K.
This is probably a very common issue for anyone using spark streaming. My
streaming app works fine, however my batch application takes several hours
to run. 

All I am doing is calling count(). Currently I am trying to read the files
from s3. When I look at the app UI it looks like spark is blocked probably
on IO? Adding additional workers and memory does not improve performance.

I am able to copy the files from s3 to a worker relatively quickly. So I do
not think s3 read time is the problem.

In the past when I had similar data sets stored on HDFS I was able to use
coalesce() to reduce the number of partitions from 200K to 30. This made a
big improvement in processing time. However, when I read from s3 coalesce()
does not improve performance.
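
For reference, a one-off compaction along those lines could be sketched in PySpark roughly as below (the paths and the target partition count are made up):

# Read the many small JSON files once, then rewrite them as ~30 larger files.
raw = sqlContext.read.json("s3n://my-bucket/json/2016-07-10/")
raw.coalesce(30).write.mode("overwrite").json("hdfs:///compacted/2016-07-10")

compacted = sqlContext.read.json("hdfs:///compacted/2016-07-10")
print(compacted.count())
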

I tried copying the files to a normal file system and then using 'hadoop fs
-put' to copy the files to hdfs, however this takes several hours and is
nowhere near completion. It appears hdfs does not deal with small files well.

I am considering copying the files from s3 to a normal file system on one of
my workers and then concatenating the files into a few much larger files,
then using 'hadoop fs -put' to move them to hdfs. Do you think this would
improve the spark count() performance issue?

Does anyone know of heuristics for determining the number or size of the
concatenated files?

Thanks in advance

Andy




Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Yu Wei
If the cluster runs out of memory, it seems that the executor will be restarted by
the cluster manager.


Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


From: Ascot Moss 
Sent: Thursday, July 28, 2016 9:48:13 AM
To: user @spark
Subject: A question about Spark Cluster vs Local Mode

Hi,

If I submit the same job to Spark in cluster mode, does it mean that in cluster mode 
it will run in the cluster memory pool and fail if it runs out of the cluster's 
memory?


--driver-memory 64g \

--executor-memory 16g \

Regards


Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
There are a few blog posts that detail one possible/likely issue, for
example:
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

TLDR: The Hadoop libraries Spark uses assume that their input comes from a
file system (which works with HDFS); however, S3 is a key-value store, not a
file system. Somewhere along the line, this makes things very slow. Below I
describe their approach and a library I am working on to solve this problem.

(Much) Longer Version (with a shiny new library in development):
So far in my reading of the source code, Hadoop attempts to actually read
from S3, which can be expensive, particularly since it does so from a single
driver core (this is different from listing files; it actually reads them. I
can find the source code and link it later if you would like). The concept
explained above is to instead use the AWS SDK to list the files, then
distribute the file names as a collection with sc.parallelize, then read them
in parallel. I found this worked, but it was lacking in a few ways, so I
started this project: https://github.com/EntilZha/spark-s3
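
To make that concrete, here is a minimal sketch of the list-then-parallelize
pattern (this is not code from the blog post or from spark-s3): the bucket and
prefix names are hypothetical, an existing SparkContext `sc` is assumed, the
AWS Java SDK (aws-java-sdk-s3 1.x) is assumed to be on the classpath, and
paging of the listing and error handling are omitted.

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._
import scala.io.Source

val bucket = "my-bucket"             // hypothetical
val prefix = "streaming-output/"     // hypothetical

// List the keys on the driver (listing is fast), shipping only key names to executors.
val keys = AmazonS3ClientBuilder.standard().build()
  .listObjects(bucket, prefix).getObjectSummaries.asScala.map(_.getKey).toList

val contents = sc.parallelize(keys, 30).map { key =>
  // Each task builds its own client, since S3 clients are not serializable.
  val s3 = AmazonS3ClientBuilder.standard().build()
  Source.fromInputStream(s3.getObject(bucket, key).getObjectContent).mkString
}

println(contents.count())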

The spark-s3 project takes that idea further by:
1. Rather than sc.parallelize, implement the RDD interface where each
partition is defined by the files it needs to read (haven't gotten to
DataFrames yet)
2. At the driver node, use the AWS SDK to list all the files with their
size (listing is fast), then run the Least Processing Time Algorithm to
sift the files into roughly balanced partitions by size (a sketch of this
heuristic follows the list)
3. API: S3Context(sc).textFileByPrefix("bucket", "file1",
"folder2").regularRDDOperationsHere or import implicits and do
sc.s3.textFileByPrefix
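
For reference, a minimal sketch of the Least Processing Time (LPT) heuristic
mentioned in item 2. The names and types are illustrative, not the library's
internals; it simply sorts the files by descending size and keeps assigning
the next file to the partition with the smallest running total (numPartitions
is assumed to be at least 1).

import scala.collection.mutable

case class S3File(key: String, size: Long)   // hypothetical stand-in for a listed S3 object

def lptPartitions(files: Seq[S3File], numPartitions: Int): Seq[Seq[S3File]] = {
  val totals  = Array.fill(numPartitions)(0L)
  val buckets = Array.fill(numPartitions)(mutable.Buffer.empty[S3File])
  for (f <- files.sortBy(-_.size)) {
    val idx = totals.indexOf(totals.min)     // partition with the fewest bytes so far
    buckets(idx) += f
    totals(idx)  += f.size
  }
  buckets.map(_.toSeq).toSeq
}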

At present, I am battle testing and benchmarking it at my current job, and
results are promising, with significant improvements to jobs dealing with
many files (especially many small files) and to jobs whose input is
unbalanced to start with. Jobs perform better because: 1) there isn't a long
stall at the driver while Hadoop decides how to split S3 files, and 2) the
partitions end up nearly perfectly balanced because of the LPT algorithm.

Since I hadn't intended to advertise this quite yet, the documentation is
not super polished, but it exists here:
http://spark-s3.entilzha.io/latest/api/#io.entilzha.spark.s3.S3Context

I am completing the Sonatype process for publishing artifacts on Maven
Central (this should be done by tomorrow, so referencing
"io.entilzha:spark-s3_2.10:0.0.0" should work very soon). I would love to
hear whether this library solution works; otherwise, I hope the blog post
above is illuminating.

Pedro

On Wed, Jul 27, 2016 at 8:19 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> I have a relatively small data set however it is split into many small
> JSON files. Each file is between maybe 4K and 400K
> This is probably a very common issue for anyone using spark streaming. My
> streaming app works fine, how ever my batch application takes several hours
> to run.
>
> All I am doing is calling count(). Currently I am trying to read the files
> from s3. When I look at the app UI it looks like spark is blocked probably
> on IO? Adding additional workers and memory does not improve performance.
>
> I am able to copy the files from s3 to a worker relatively quickly. So I
> do not think s3 read time is the problem.
>
> In the past when I had similar data sets stored on HDFS I was able to use
> coalesce() to reduce the number of partition from 200K to 30. This made a
> big improvement in processing time. How ever when I read from s3 coalesce()
> does not improve performance.
>
> I tried copying the files to a normal file system and then using ‘hadoop
> fs put’ to copy the files to hdfs how ever this takes several hours and is
> no where near completion. It appears hdfs does not deal with small files
> well.
>
> I am considering copying the files from s3 to a normal file system on one
> of my workers and then concatenating the files into a few much large files,
> then using ‘hadoop fs put’ to move them to hdfs. Do you think this would
> improve the spark count() performance issue?
>
> Does anyone know of heuristics for determining the number or size of the
> concatenated files?
>
> Thanks in advance
>
> Andy
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Mich Talebzadeh
Hi

These are my notes on this topic.


- *YARN Cluster Mode*: the Spark driver runs inside an application master
  process which is managed by YARN on the cluster, and the client can go away
  after initiating the application. This is invoked with --master yarn and
  --deploy-mode cluster.

- *YARN Client Mode*: the driver runs in the client process, and the
  application master is only used for requesting resources from YARN. Unlike
  Spark standalone mode, in which the master's address is specified in the
  --master parameter, in YARN mode the ResourceManager's address is picked up
  from the Hadoop configuration. Thus, the --master parameter is yarn. This is
  invoked with --deploy-mode client.



Yarn Cluster and Client Considerations



- Client mode requires that the process that launched the application remains
  alive, meaning the host where it lives has to stay alive, and it may not be
  super-friendly to ssh sessions dying, for example, unless you use nohup.

- Client mode driver logs are printed to stderr by default (granted, you can
  change that). In contrast, in cluster mode they are all collected by YARN
  without any user intervention.

- If your edge node (from where the app is launched) is not part of the
  cluster (e.g., lives in an outside network with firewalls or higher
  latency), you may run into issues.

- In cluster mode, your driver's CPU and memory usage is accounted for in
  YARN. This matters if your edge node is part of the cluster (and could be
  running YARN containers), since in client mode your driver will potentially
  use a lot of CPU/memory.

- In cluster mode YARN can restart your application without user
  intervention. This is useful for things that need to stay up (think a long
  running streaming job, for example).

- If your client is not close to the cluster (e.g. your PC) then you
  definitely want to go cluster mode to improve performance.

- If your client is close to the cluster (e.g. an edge node) then you could go
  either client or cluster. Note that by going client, more resources are
  going to be used on the edge node.


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 04:39, Yu Wei  wrote:

> If cluster runs out of memory, it seems that the executor will be
> restarted by cluster manager.
>
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
> --
> *From:* Ascot Moss 
> *Sent:* Thursday, July 28, 2016 9:48:13 AM
> *To:* user @spark
> *Subject:* A question about Spark Cluster vs Local Mode
>
> Hi,
>
> If I submit the same job to spark in cluster mode, does it mean in cluster
> mode it will be run in cluster memory pool and it will fail if it runs out
> of cluster's memory?
>
> --driver-memory 64g \
>
> --executor-memory 16g \
>
> Regards
>