Re: Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-07 Thread Felix C
Or you could build an uber jar (you could Google that):

https://eradiating.wordpress.com/2015/02/15/getting-spark-streaming-on-kafka-to-work/
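
For reference, the uber-jar route with sbt-assembly looks roughly like the
sketch below. This is a minimal illustration, not the exact setup from the blog
post; the versions, and the assumption that the sbt-assembly plugin is already
enabled in project/plugins.sbt, are mine.

// build.sbt (sketch): mark Spark itself as "provided" so that only the Kafka
// connector and its transitive dependencies (kafka, zkclient, metrics-core)
// end up in the jar produced by `sbt assembly`.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.1"
)

That single jar can then be passed to spark-submit without any extra --jars flags.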

--- Original Message ---

From: "Akhil Das" 
Sent: April 4, 2015 11:52 PM
To: "Priya Ch" 
Cc: u...@spark.apache.org, "dev" 
Subject: Re: Spark streaming with Kafka - couldn't find KafkaUtils

How are you submitting the application? Use a standard build tool like
Maven or sbt to build your project; it will download all the dependency
jars. When you submit with spark-submit, use the --jars option to add the
jars that are causing the ClassNotFoundException. If you are running as a
standalone application without spark-submit, then while creating the
SparkContext, use sc.addJar() to add those dependency jars.

For Kafka streaming with sbt, these are the jars that are required:


sc.addJar("/root/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.1.0.jar")
   
sc.addJar("/root/.ivy2/cache/com.yammer.metrics/metrics-core/jars/metrics-core-2.2.0.jar")
   
sc.addJar("/root/.ivy2/cache/org.apache.kafka/kafka_2.10/jars/kafka_2.10-0.8.0.jar")
   sc.addJar("/root/.ivy2/cache/com.101tec/zkclient/jars/zkclient-0.3.jar")




Thanks
Best Regards

On Sun, Apr 5, 2015 at 12:00 PM, Priya Ch 
wrote:

> Hi All,
>
>   I configured a Kafka cluster on a single node, and I have a streaming
> application which reads data from a Kafka topic using KafkaUtils. When I
> execute the code in local mode from the IDE, the application runs fine.
>
> But when I submit the same application to the Spark cluster in standalone
> mode, I end up with the following exception:
> java.lang.ClassNotFoundException:
> org/apache/spark/streaming/kafka/KafkaUtils.
>
> I am using spark-1.2.1. When I checked the streaming source files, the
> source files related to Kafka are missing. Are these not included in the
> spark-1.3.0 and spark-1.2.1 distributions?
>
> Do I have to include these manually?
>
> Regards,
> Padma Ch
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.2

2015-04-07 Thread Sean Owen
I think that's close enough for a +1:

Signatures and hashes are good.
LICENSE, NOTICE still check out.
Compiles for a Hadoop 2.6 + YARN + Hive profile.

JIRAs with target version = 1.2.x look legitimate; no blockers.

I still observe several Hive test failures with:
mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
-DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
-Phive-0.13.1 -Dhadoop.version=2.6.0 test
.. though again I think these are not regressions but known issues in
older branches.

FYI there are 16 Critical issues still open for 1.2.x:

SPARK-6209,ExecutorClassLoader can leak connections after failing to
load classes from the REPL class server,Josh Rosen,In Progress,4/5/15
SPARK-5098,Number of running tasks become negative after tasks
lost,,Open,1/14/15
SPARK-4888,"Spark EC2 doesn't mount local disks for i2.8xlarge
instances",,Open,1/27/15
SPARK-4879,Missing output partitions after job completes with
speculative execution,Josh Rosen,Open,3/5/15
SPARK-4568,Publish release candidates under $VERSION-RCX instead of
$VERSION,Patrick Wendell,Open,11/24/14
SPARK-4520,SparkSQL exception when reading certain columns from a
parquet file,sadhan sood,Open,1/21/15
SPARK-4514,SparkContext localProperties does not inherit property
updates across thread reuse,Josh Rosen,Open,3/31/15
SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
SPARK-4452,Shuffle data structures can starve others on the same
thread for memory,Tianshuo Deng,Open,1/24/15
SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14
SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15
SPARK-4194,Exceptions thrown during SparkContext or SparkEnv
construction might lead to resource leaks or corrupted global
state,,In Progress,4/2/15
SPARK-4159,"Maven build doesn't run JUnit test suites",Sean Owen,Open,1/11/15
SPARK-4106,Shuffle write and spill to disk metrics are incorrect,,Open,10/28/14
SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14
SPARK-3461,Support external groupByKey using
repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14
SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14
SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15
SPARK-1312,Batch should read based on the batch interval provided in
the StreamingContext,Tathagata Das,Open,12/24/14

On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.2!
>
> The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1DCNddt
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.2-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1082/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.2!
>
> The vote is open until Thursday, April 08, at 00:30 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



not in gzip format

2015-04-07 Thread prabeesh k
Please check the apache mirror
http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.0/spark-1.3.0.tgz
file. It is not in the gzip format.


Re: not in gzip format

2015-04-07 Thread Sean Owen
Er, click the link? It is indeed a redirector HTML page. This is how all
Apache releases are served.
On Apr 7, 2015 8:32 AM, "prabeesh k"  wrote:

> Please check the apache mirror
> http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.0/spark-1.3.0.tgz
> file. It is not in the gzip format.
>


Re: not in gzip format

2015-04-07 Thread prabeesh k
but the name is just confusing

On 7 April 2015 at 16:35, Sean Owen  wrote:

> Er, click the link? It is indeed a redirector HTML page. This is how all
> Apache releases are served.
> On Apr 7, 2015 8:32 AM, "prabeesh k"  wrote:
>
>> Please check the apache mirror
>> http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.0/spark-1.3.0.tgz
>> file. It is not in the gzip format.
>>
>


Re: [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Marcelo Vanzin
+1 (non-binding)

Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
without the external shuffle service in yarn mode.

On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.3.1!
>
> The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1080
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.1!
>
> The vote is open until Wednesday, April 08, at 01:10 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Regularization in MLlib

2015-04-07 Thread Ulanov, Alexander
Hi,

Could anyone elaborate on the regularization in Spark? I've found that L1 and
L2 are implemented with Updaters (L1Updater, SquaredL2Updater).
1) Why is the loss reported by L2 (0.5 * regParam * norm * norm), where norm is
Norm(weights, 2.0)? It should be 0.5*regParam*norm (the 0.5 is there to
disappear after differentiation). It seems to be mixed up with mean squared error.
2) Why are all weights regularized? I think we should leave the bias weights
(aka free or intercept weights) untouched if we don't assume that the data is
centered.
3) Are there any short-term plans to move regularization from the updater to a
more convenient place?

Best regards, Alexander



Re: Regularization in MLlib

2015-04-07 Thread DB Tsai
1) Norm(weights, N) returns (w_1^N + w_2^N + ...)^(1/N), so norm * norm is
required to get the squared norm.

2) This is a bug, as you said. I intended to fix it using weighted
regularization, with the intercept term regularized with weight zero:
https://github.com/apache/spark/pull/1518. But I never actually had time to
finish it. In the meantime, I'm fixing this in the new ML pipeline framework,
outside of this framework.

3) I think in the long term we need a weighted regularizer instead of an
updater, which couples regularization with the adaptive step-size update for
GD and is not needed by other optimization packages.
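
To make (1) and (2) concrete, here is a small plain-Scala sketch (an
illustration only, not the actual MLlib Updater code):

object RegularizationSketch {
  // Euclidean norm: sqrt(w_1^2 + ... + w_n^2), i.e. the UNsquared norm, which is
  // why the reported L2 loss multiplies it by itself:
  //   0.5 * regParam * norm * norm == (regParam / 2) * sum_i w_i^2
  def norm2(w: Array[Double]): Double = math.sqrt(w.map(x => x * x).sum)

  def l2Loss(w: Array[Double], regParam: Double): Double = {
    val n = norm2(w)
    0.5 * regParam * n * n
  }

  // Weighted regularization in the spirit of (2): per-coordinate weights, with
  // the intercept coordinate given weight 0.0 so it is left unpenalized.
  def weightedL2Loss(w: Array[Double], regWeights: Array[Double], regParam: Double): Double =
    0.5 * regParam * w.zip(regWeights).map { case (wi, ri) => ri * wi * wi }.sum
}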

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Apr 7, 2015 at 3:03 PM, Ulanov, Alexander
 wrote:
> Hi,
>
> Could anyone elaborate on the regularization in Spark? I've found that L1 and 
> L2 are implemented with Updaters (L1Updater, SquaredL2Updater).
> 1)Why the loss reported by L2 is (0.5 * regParam * norm * norm) where norm is 
> Norm(weights, 2.0)? It should be 0.5*regParam*norm (0.5 to disappear after 
> differentiation). It seems that it is mixed up with mean squared error.
> 2)Why all weights are regularized? I think we should leave the bias weights 
> (aka free or intercept) untouched if we don't assume that the data is 
> centralized.
> 3)Are there any short-term plans to move regularization from updater to a 
> more convenient place?
>
> Best regards, Alexander
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Regularization in MLlib

2015-04-07 Thread Ulanov, Alexander
Hi DB,

Thank you!

In the general case (not only for regression), I think that the Regularizer
should be tightly coupled with the Gradient; otherwise it will have no idea
which weights are the bias (intercept).

Best regards, Alexander
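
A sketch of the coupling being described, with a hypothetical interface (this
is not an MLlib API): the regularizer has to be told which coordinate is the
intercept, information that naturally lives next to the Gradient / model layout.

trait WeightedRegularizer {
  def gradient(weights: Array[Double]): Array[Double]
}

// L2 penalty on every weight except the intercept; d/dw of (regParam/2) * w^2
// is regParam * w, and the intercept coordinate contributes a zero gradient.
class L2ExceptIntercept(regParam: Double, interceptIndex: Int) extends WeightedRegularizer {
  def gradient(weights: Array[Double]): Array[Double] =
    weights.zipWithIndex.map { case (w, i) =>
      if (i == interceptIndex) 0.0 else regParam * w
    }
}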

-Original Message-
From: DB Tsai [mailto:dbt...@dbtsai.com] 
Sent: Tuesday, April 07, 2015 3:28 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Regularization in MLlib

1)  Norm(weights, N) will return (w_1^N + w_2^N +)^(1/N), so norm
* norm is required.

2) This is bug as you said. I intend to fix this using weighted regularization, 
and intercept term will be regularized with weight zero. 
https://github.com/apache/spark/pull/1518 But I never actually have time to 
finish it. In the meantime, I'm fixing this without this framework in new ML 
pipeline framework.

3) I think in the long term, we need weighted regularizer instead of updater 
which couples regularization and adaptive step size update for GD which is not 
needed in other optimization package.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Apr 7, 2015 at 3:03 PM, Ulanov, Alexander  
wrote:
> Hi,
>
> Could anyone elaborate on the regularization in Spark? I've found that L1 and 
> L2 are implemented with Updaters (L1Updater, SquaredL2Updater).
> 1)Why the loss reported by L2 is (0.5 * regParam * norm * norm) where norm is 
> Norm(weights, 2.0)? It should be 0.5*regParam*norm (0.5 to disappear after 
> differentiation). It seems that it is mixed up with mean squared error.
> 2)Why all weights are regularized? I think we should leave the bias weights 
> (aka free or intercept) untouched if we don't assume that the data is 
> centralized.
> 3)Are there any short-term plans to move regularization from updater to a 
> more convenient place?
>
> Best regards, Alexander
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Patrick Wendell
Hey All,

Today SPARK-6737 came to my attention. This is a bug that causes a
memory leak for any long running program that repeatedly saves data
out to a Hadoop FileSystem. For that reason, it is problematic for
Spark Streaming.

My sense is that this is severe enough to cut another RC once the fix
is merged (which is imminent):

https://issues.apache.org/jira/browse/SPARK-6737

I'll leave a bit of time for others to comment, in particular if
people feel we should not wait for this fix.

- Patrick

On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin  wrote:
> +1 (non-binding)
>
> Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
> without the external shuffle service in yarn mode.
>
> On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.3.1!
>>
>> The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>>
>> The list of fixes present in this release can be found at:
>> http://bit.ly/1C2nVPY
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1080
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.1!
>>
>> The vote is open until Wednesday, April 08, at 01:10 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-07 Thread shane knapp
reminder!  this is happening thursday morning.

On Fri, Apr 3, 2015 at 9:59 AM, shane knapp  wrote:

> welcome to python2.7+, java 8 and more!  :)
>
> i'll be doing a major upgrade to our build system next thursday morning.
>  here's a quick list of what's going on:
>
> * installation of anaconda python on all worker nodes
>
> * installation of pypy 2.5.1 (python 2.7) on all nodes
>
> * matching installation of python modules for the current system python
> (2.6), and anaconda python (2.6, 2.7 and 3.4)
>   - anaconda python 2.7 will be the default for all workers (this has
> stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
> and i've noticed no test failures)
>   - you can now use anaconda environments to specify which version of
> python to use in your tests:  http://www.continuum.io/blog/conda
>
> * installation of new python 2.7 modules:  pymongo, requests, six, and
> python-crontab
>
> * bare-bones mongodb installation on all workers
>
> * installation of java 1.6 and 1.8 internal to jenkins
>   - jobs will default to the system java, which is 1.7.0_75
>   - if you want to run your tests w/java 6 or 8, you can select the JDK
> version of your choice in the job configuration page (it'll be towards the
> top)
>
> these changes have actually all been tested against a variety of builds
> (yay staging!) and while i'm certain that i have all of the kinks worked
> out, i'm going to schedule a longer downtime so that i have a chance to
> identify and squash any problems that surface.
>
> thanks to josh rosen, k. shankari and davies liu for helping me test all
> of this and get it working.
>
> shane
>


Re: [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Josh Rosen
The leak will impact long running streaming jobs even if they don't write 
Hadoop files, although the problem may take much longer to manifest itself for 
those jobs.

I think we currently leak an empty HashMap per stage submitted in the common 
case, so it could take a very long time for this to trigger an OOM.  On the 
other hand, the worst case behavior is quite bad for streaming jobs, so we 
should probably fix this so that 1.2.x streaming users can more safely upgrade 
to 1.3.x.
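
For readers who haven't looked at the JIRA, the failure mode is roughly the
pattern sketched below; this is a schematic illustration of a per-stage map
that is never cleaned up, not the actual OutputCommitCoordinator code.

import scala.collection.mutable

class LeakySketch {
  // Populated on every stage start; if nothing removes entries on stage end,
  // the map grows for the lifetime of the driver, which is what bites
  // long-running streaming applications that submit many small stages.
  private val perStageState = mutable.Map[Int, mutable.Map[Int, Long]]()

  def onStageStart(stageId: Int): Unit =
    perStageState(stageId) = mutable.Map.empty

  def onStageEnd(stageId: Int): Unit = {
    // perStageState.remove(stageId)   // the missing cleanup
  }
}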

- Josh

Sent from my phone

> On Apr 7, 2015, at 4:13 PM, Patrick Wendell  wrote:
> 
> Hey All,
> 
> Today SPARK-6737 came to my attention. This is a bug that causes a
> memory leak for any long running program that repeatedly saves data
> out to a Hadoop FileSystem. For that reason, it is problematic for
> Spark Streaming.
> 
> My sense is that this is severe enough to cut another RC once the fix
> is merged (which is imminent):
> 
> https://issues.apache.org/jira/browse/SPARK-6737
> 
> I'll leave a bit of time for others to comment, in particular if
> people feel we should not wait for this fix.
> 
> - Patrick
> 
>> On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin  wrote:
>> +1 (non-binding)
>> 
>> Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
>> without the external shuffle service in yarn mode.
>> 
>>> On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell  wrote:
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 1.3.1!
>>> 
>>> The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>>> 
>>> The list of fixes present in this release can be found at:
>>> http://bit.ly/1C2nVPY
>>> 
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.3.1-rc1/
>>> 
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>> 
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1080
>>> 
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
>>> 
>>> Please vote on releasing this package as Apache Spark 1.3.1!
>>> 
>>> The vote is open until Wednesday, April 08, at 01:10 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>> 
>>> [ ] +1 Release this package as Apache Spark 1.3.1
>>> [ ] -1 Do not release this package because ...
>>> 
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> 
>>> - Patrick
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> 
>> 
>> --
>> Marcelo
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Contributor CLAs

2015-04-07 Thread Nicholas Chammas
I've seen many other OSS projects ask contributors to sign CLAs. I've never
seen us do that.

I assume it's not an issue, since people opening PRs generally understand
what it means. But legally I'm sure there's some danger in taking an
implied vs. explicit license to do something.

So: Do we need to make people sign contributor CLAs?

I'm betting Sean Owen knows something about this... :)

Nick


Re: Contributor CLAs

2015-04-07 Thread Sean Owen
Yeah, this is why this pops up when you open a PR:
https://github.com/apache/spark/blob/master/CONTRIBUTING.md

Mostly, I want to take all reasonable steps to ensure that when
somebody offers a code contribution, they are fine with the ways in
which it is actually used (redistributed under the terms of the AL2),
whether or not they understand the intricacies. In good faith, I'm all
but sure that all contributors either think they're giving the
contribution to the project anyway, or at least, do understand it to
be their own work licensed under the same terms as all of the project
contributions are.

IANAL, but in stricter legal terms, the project license is plain and
clear, and the intricacies are signposted and easy to read when you
contribute. You would have a very hard time arguing that you made a
contribution, didn't state anything about the license, but did not
intend somehow that the work could be licensed as the rest of the
project is. For reference Apache projects do not in general require a
CLA.

On Tue, Apr 7, 2015 at 8:59 PM, Nicholas Chammas
 wrote:
> I've seen many other OSS projects ask contributors to sign CLAs. I've never
> seen us do that.
>
> I assume it's not an issue, since people opening PRs generally understand
> what it means. But legally I'm sure there's some danger in taking an
> implied vs. explicit license to do something.
>
> So: Do we need to make people sign contributor CLAs?
>
> I'm betting Sean Owen knows something about this... :)
>
> Nick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark + Kinesis

2015-04-07 Thread Vadim Bichutskiy
Hey y'all,

While I haven't been able to get Spark + Kinesis integration working, I
pivoted to plan B: I now push data to S3 where I set up a DStream to
monitor an S3 bucket with textFileStream, and that works great.

I <3 Spark!

Best,
Vadim
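
For anyone curious what that looks like, here is a minimal sketch of monitoring
an S3 prefix with textFileStream (the bucket name, path, and batch interval are
placeholders, not Vadim's actual settings):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3MonitorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3MonitorSketch")
    val ssc  = new StreamingContext(conf, Seconds(60))
    // New text files that appear under this prefix are picked up each batch.
    val lines = ssc.textFileStream("s3n://my-bucket/incoming/")
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}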



On Mon, Apr 6, 2015 at 12:23 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:

> Hi all,
>
> I am wondering, has anyone on this list been able to successfully
> implement Spark on top of Kinesis?
>
> Best,
> Vadim
>
> On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy <
> vadim.bichuts...@gmail.com> wrote:
>
>> Hi all,
>>
>> Below is the output that I am getting. My Kinesis stream has 1 shard, and
>> my Spark cluster on EC2 has 2 slaves (I think that's fine?).
>> I should mention that my Kinesis producer is written in Python where I
>> followed the example
>> http://blogs.aws.amazon.com/bigdata/post/Tx2Z24D4T99AN35/Snakes-in-the-Stream-Feeding-and-Eating-Amazon-Kinesis-Streams-with-Python
>>
>> I also wrote a Python consumer, again using the example at the above
>> link, that works fine. But I am unable to display output from my Spark
>> consumer.
>>
>> I'd appreciate any help.
>>
>> Thanks,
>> Vadim
>>
>> ---
>>
>> Time: 142825409 ms
>>
>> ---
>>
>>
>> 15/04/05 17:14:50 INFO scheduler.JobScheduler: Finished job streaming job
>> 142825409 ms.0 from job set of time 142825409 ms
>>
>> 15/04/05 17:14:50 INFO scheduler.JobScheduler: Total delay: 0.099 s for
>> time 142825409 ms (execution: 0.090 s)
>>
>> 15/04/05 17:14:50 INFO rdd.ShuffledRDD: Removing RDD 63 from persistence
>> list
>>
>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 63
>>
>> 15/04/05 17:14:50 INFO rdd.MapPartitionsRDD: Removing RDD 62 from
>> persistence list
>>
>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 62
>>
>> 15/04/05 17:14:50 INFO rdd.MapPartitionsRDD: Removing RDD 61 from
>> persistence list
>>
>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 61
>>
>> 15/04/05 17:14:50 INFO rdd.UnionRDD: Removing RDD 60 from persistence list
>>
>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 60
>>
>> 15/04/05 17:14:50 INFO rdd.BlockRDD: Removing RDD 59 from persistence list
>>
>> 15/04/05 17:14:50 INFO storage.BlockManager: Removing RDD 59
>>
>> 15/04/05 17:14:50 INFO dstream.PluggableInputDStream: Removing blocks of
>> RDD BlockRDD[59] at createStream at MyConsumer.scala:56 of time
>> 142825409 ms
>>
>> ***
>>
>> 15/04/05 17:14:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
>> ArrayBuffer(142825407 ms)
>> On Sat, Apr 4, 2015 at 3:13 PM, Vadim Bichutskiy <
>> vadim.bichuts...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> More good news! I was able to utilize mergeStrategy to assembly my
>>> Kinesis consumer into an "uber jar"
>>>
>>> Here's what I added to build.sbt:
>>>
>>> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
>>>   {
>>>     case PathList("com", "esotericsoftware", "minlog", xs @ _*) =>
>>>       MergeStrategy.first
>>>     case PathList("com", "google", "common", "base", xs @ _*) =>
>>>       MergeStrategy.first
>>>     case PathList("org", "apache", "commons", xs @ _*) =>
>>>       MergeStrategy.last
>>>     case PathList("org", "apache", "hadoop", xs @ _*) =>
>>>       MergeStrategy.first
>>>     case PathList("org", "apache", "spark", "unused", xs @ _*) =>
>>>       MergeStrategy.first
>>>     case x => old(x)
>>>   }
>>> }
>>>
>>> Everything appears to be working fine. Right now my producer is pushing
>>> simple strings through Kinesis,
>>> which my consumer is trying to print (using Spark's print() method for
>>> now).
>>>
>>> However, instead of displaying my strings, I get the following:
>>>
>>> 15/04/04 18:57:32 INFO scheduler.ReceivedBlockTracker: Deleting batches
>>> ArrayBuffer(1428173848000 ms)
>>>
>>> Any idea on what might be going on?
>>>
>>> Thanks,
>>>
>>> Vadim
>>>
>>> Here's my consumer code (adapted from the WordCount example):
>>>
>>> private object MyConsumer extends Logging {
>>>   def main(args: Array[String]) {
>>>     /* Check that all required args were passed in. */
>>>     if (args.length < 2) {
>>>       System.err.println(
>>>         """
>>>           |Usage: KinesisWordCount
>>>           | is the name of the Kinesis stream
>>>           | is the endpoint of the Kinesis service
>>>           |   (e.g. https://kinesis.us-east-1.amazonaws.com)
>>>         """.stripMargin)
>>>       System.exit(1)
>>>     }
>>>     /* Populate the appropriate variables from the given args */
>>>     val Array(streamName, endpointUrl) 

Re: 1.3 Build Error with Scala-2.11

2015-04-07 Thread Imran Rashid
did you run

dev/change-version-to-2.11.sh

before compiling?  When I ran this on current master, it mostly worked:

dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Pscala-2.11 -DskipTests clean package

There was a failure in building catalyst, but core built just fine for me.
The error I got was:

[INFO]


[INFO] Building Spark Project Catalyst 1.4.0-SNAPSHOT

[INFO]


[WARNING] The POM for org.scalamacros:quasiquotes_2.11:jar:2.0.1 is
missing, no dependency information available


I'm not sure if catalyst is supposed to work w/ scala-2.11 or not ... I
wouldn't be surprised if the way macros should be used has changed, but it's
not listed explicitly in the docs as being incompatible:

http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211




On Tue, Apr 7, 2015 at 12:00 AM, mjhb  wrote:

> I even deleted my local maven repository (.m2) but still stuck when
> attempting to build w/ Scala-2.11:
>
> [ERROR] Failed to execute goal on project spark-core_2.11: Could not
> resolve
> dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
> artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
> find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
> in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal on project spark-core_2.11: Could not resolve dependencies for project
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
> artifacts
> could not be resolved:
> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
> find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
> in apache.snapshots (http://repository.apache.org/snapshots)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11449.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Contributor CLAs

2015-04-07 Thread Nicholas Chammas
SGTM.

On Tue, Apr 7, 2015 at 9:11 PM Sean Owen  wrote:

> Yeah, this is why this pops up when you open a PR:
> https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>
> Mostly, I want to take all reasonable steps to ensure that when
> somebody offers a code contribution, that they are fine with the ways
> in which it actually used (redistributed under the terms of the AL2),
> whether or not they understand the intricacies. In good faith, I'm all
> but sure that all contributors either think they're giving the
> contribution to the project anyway, or at least, do understand it to
> be their own work licensed under the same terms as all of the project
> contributions are.
>
> IANAL, but in stricter legal terms, the project license is plain and
> clear, and the intricacies are signposted and easy to read when you
> contribute. You would have a very hard time arguing that you made a
> contribution, didn't state anything about the license, but did not
> intend somehow that the work could be licensed as the rest of the
> project is. For reference Apache projects do not in general require a
> CLA.
>
> On Tue, Apr 7, 2015 at 8:59 PM, Nicholas Chammas
>  wrote:
> > I've seen many other OSS projects ask contributors to sign CLAs. I've
> never
> > seen us do that.
> >
> > I assume it's not an issue, since people opening PRs generally understand
> > what it means. But legally I'm sure there's some danger in taking an
> > implied vs. explicit license to do something.
> >
> > So: Do we need to make people sign contributor CLAs?
> >
> > I'm betting Sean Owen knows something about this... :)
> >
> > Nick
>


Re: Contributor CLAs

2015-04-07 Thread Matei Zaharia
You do actually sign a CLA when you become a committer, and in general, we 
should ask for CLAs from anyone who contributes a large piece of code. This is 
the individual CLA: https://www.apache.org/licenses/icla.txt. Some people have 
sent them proactively because their employer asks them too.

Matei

> On Apr 7, 2015, at 10:19 PM, Nicholas Chammas  
> wrote:
> 
> SGTM.
> 
> On Tue, Apr 7, 2015 at 9:11 PM Sean Owen  wrote:
> 
>> Yeah, this is why this pops up when you open a PR:
>> https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>> 
>> Mostly, I want to take all reasonable steps to ensure that when
>> somebody offers a code contribution, that they are fine with the ways
>> in which it actually used (redistributed under the terms of the AL2),
>> whether or not they understand the intricacies. In good faith, I'm all
>> but sure that all contributors either think they're giving the
>> contribution to the project anyway, or at least, do understand it to
>> be their own work licensed under the same terms as all of the project
>> contributions are.
>> 
>> IANAL, but in stricter legal terms, the project license is plain and
>> clear, and the intricacies are signposted and easy to read when you
>> contribute. You would have a very hard time arguing that you made a
>> contribution, didn't state anything about the license, but did not
>> intend somehow that the work could be licensed as the rest of the
>> project is. For reference Apache projects do not in general require a
>> CLA.
>> 
>> On Tue, Apr 7, 2015 at 8:59 PM, Nicholas Chammas
>>  wrote:
>>> I've seen many other OSS projects ask contributors to sign CLAs. I've
>> never
>>> seen us do that.
>>> 
>>> I assume it's not an issue, since people opening PRs generally understand
>>> what it means. But legally I'm sure there's some danger in taking an
>>> implied vs. explicit license to do something.
>>> 
>>> So: Do we need to make people sign contributor CLAs?
>>> 
>>> I'm betting Sean Owen knows something about this... :)
>>> 
>>> Nick
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-07 Thread Marty Bower
Yes - I ran dev/change-version-to-2.11.sh.

But I was missing -Dscala-2.11 on the mvn command after a -2.10 build. It is
building successfully again now after adding that.

On Tue, Apr 7, 2015 at 7:04 PM Imran Rashid  wrote:

> did you run
>
> dev/change-version-to-2.11.sh
>
> before compiling?  When I ran this on current master, it mostly worked:
>
> dev/change-version-to-2.11.sh
> mvn -Pyarn -Phadoop-2.4 -Pscala-2.11 -DskipTests clean package
>
> There was a failure in building catalyst, but core built just fine for
> me.  The error I got was:
>
> [INFO]
> 
>
> [INFO] Building Spark Project Catalyst 1.4.0-SNAPSHOT
>
> [INFO]
> 
>
> [WARNING] The POM for org.scalamacros:quasiquotes_2.11:jar:2.0.1 is
> missing, no dependency information available
>
>
> I'm not sure if catalyst is supposed to work w/ scala-2.11 or not ... I
> wouldn't be surprised if the way macros should be used has changed, but its
> not listed explicitly in the docs as being incompatible:
>
>
> http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
>
>
>
>
> On Tue, Apr 7, 2015 at 12:00 AM, mjhb  wrote:
>
>> I even deleted my local maven repository (.m2) but still stuck when
>> attempting to build w/ Scala-2.11:
>>
> [ERROR] Failed to execute goal on project spark-core_2.11: Could not
>> resolve
>> dependencies for project
>>
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
>> artifacts
>> could not be resolved:
>
>
>> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
>> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
>> find artifact
>> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
>>
> in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]
>
>
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal on project spark-core_2.11: Could not resolve dependencies for
>> project
>>
> org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
>> artifacts
>> could not be resolved:
>
>
>> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
>> org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
>> find artifact
>> org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
>> in apache.snapshots (http://repository.apache.org/snapshots)
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11449.html
>
>
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>


[RESULT] [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Tue, Apr 7, 2015 at 8:13 PM, Josh Rosen  wrote:
> The leak will impact long running streaming jobs even if they don't write 
> Hadoop files, although the problem may take much longer to manifest itself 
> for those jobs.
>
> I think we currently leak an empty HashMap per stage submitted in the common 
> case, so it could take a very long time for this to trigger an OOM.  On the 
> other hand, the worst case behavior is quite bad for streaming jobs, so we 
> should probably fix this so that 1.2.x streaming users can more safely 
> upgrade to 1.3.x.
>
> - Josh
>
> Sent from my phone
>
>> On Apr 7, 2015, at 4:13 PM, Patrick Wendell  wrote:
>>
>> Hey All,
>>
>> Today SPARK-6737 came to my attention. This is a bug that causes a
>> memory leak for any long running program that repeatedly saves data
>> out to a Hadoop FileSystem. For that reason, it is problematic for
>> Spark Streaming.
>>
>> My sense is that this is severe enough to cut another RC once the fix
>> is merged (which is imminent):
>>
>> https://issues.apache.org/jira/browse/SPARK-6737
>>
>> I'll leave a bit of time for others to comment, in particular if
>> people feel we should not wait for this fix.
>>
>> - Patrick
>>
>>> On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin  wrote:
>>> +1 (non-binding)
>>>
>>> Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
>>> without the external shuffle service in yarn mode.
>>>
 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell  wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-07 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1083/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/

The patches on top of RC1 are:

[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397

[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302

[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
NoClassDefFoundError
https://github.com/apache/spark/pull/4933

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Saturday, April 11, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org