Re: Requirements for Spark cluster

2014-07-09 Thread Sandy Ryza
Hi Robert,

If you're running Spark against YARN, you don't need to install anything
Spark-specific on all the nodes.  For each application, the client will
copy the Spark jar to HDFS where the Spark processes can fetch it.  For
faster app startup, you can copy the Spark jar to a public location on HDFS
and point to it there.  The YARN NodeManagers will cache it on each node
after it's fetched the first time.
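For reference, a rough sketch of that setup (the paths and the exact assembly jar
name are placeholders; in this era the Spark-on-YARN docs let you point the
SPARK_JAR environment variable at an hdfs:// location):

  # stage the assembly once in a world-readable HDFS location
  hdfs dfs -mkdir -p /user/spark/share/lib
  hdfs dfs -put spark-assembly-1.0.0-hadoop2.2.0.jar /user/spark/share/lib/
  # then point applications at it so the client skips the per-application upload
  export SPARK_JAR=hdfs:///user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar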

-Sandy


On Tue, Jul 8, 2014 at 6:24 PM, Robert James  wrote:

> I have a Spark app which runs well on local master.  I'm now ready to
> put it on a cluster.  What needs to be installed on the master? What
> needs to be installed on the workers?
>
> If the cluster already has Hadoop or YARN or Cloudera, does it still
> need an install of Spark?
>


How to clear the list of Completed Applications in Spark web UI?

2014-07-09 Thread Haopu Wang
Besides restarting the Master, is there any other way to clear the
Completed Applications in Master web UI?


Re: Spark Streaming using File Stream in Java

2014-07-09 Thread Akhil Das
Try this out:

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;

JavaStreamingContext ssc = new JavaStreamingContext(...);
JavaDStream<String> lines = ssc.textFileStream("whatever");
JavaDStream<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

JavaPairDStream<String, Integer> ones = words.map(   // on Spark 1.0+ use mapToPair
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });

JavaPairDStream<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  });


​Actually modified from
https://spark.apache.org/docs/0.9.1/java-programming-guide.html#example

Thanks
Best Regards


On Wed, Jul 9, 2014 at 6:03 AM, Aravind  wrote:

> Hi all,
>
> I am trying to run the NetworkWordCount.java file in the streaming
> examples.
> The example shows how to read from a network socket. But my use case is
> that I have a local log file which is a stream and continuously updated (say
> /Users/.../Desktop/mylog.log).
>
> I would like to write the same NetworkWordCount.java using this filestream
>
> jssc.fileStream(dataDirectory);
>
> Question:
> 1. How do I write a mapreduce function for the above to measure wordcounts
> (in java, not scala)?
>
> 2. Also does the streaming application stop if the file is not updating or
> does it continuously poll for the file updates?
>
> I am a new user of Apache Spark Streaming. Kindly help me as I am totally
> stuck
>
> Thanks in advance.
>
> Regards
> Aravind
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-File-Stream-in-Java-tp9115.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: error when spark access hdfs with Kerberos enable

2014-07-09 Thread Sandy Ryza
Hi Cheney,

I haven't heard of anybody deploying non-secure YARN on top of secure HDFS.
 It's conceivable that you might be able to get it to work, but my guess is that
you'd run into some issues.  Also, without authentication turned on in YARN, you
could be leaving your HDFS tokens exposed, which others could steal and use to get
to your data.

-Sandy


On Tue, Jul 8, 2014 at 7:28 PM, Cheney Sun  wrote:

> Hi Sandy,
>
> We are also going to grep data from a security enabled (with kerberos)
> HDFS in our Spark application. Per your answer, we have to switch to Spark on
> YARN to achieve this.
> We plan to deploy a different Hadoop cluster (with YARN) only to run Spark.
> Is it necessary to deploy YARN with security enabled? Or is it possible to
> access data within a secure HDFS from non-secure Spark on YARN?
>
>
> On Wed, Jul 9, 2014 at 4:19 AM, Sandy Ryza 
> wrote:
>
>> That's correct.  Only Spark on YARN supports Kerberos.
>>
>> -Sandy
>>
>>
>> On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Someone might be able to correct me if I'm wrong, but I don't believe
>>> standalone mode supports kerberos. You'd have to use Yarn for that.
>>>
>>> On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜  wrote:
>>> > Hi all,
>>> >
>>> >
>>> >
>>> > I encounter a strange issue when using spark 1.0 to access hdfs with
>>> > Kerberos
>>> >
>>> > I just have one spark test node for spark and HADOOP_CONF_DIR is set
>>> to the
>>> > location containing the hdfs configuration files(hdfs-site.xml and
>>> > core-site.xml)
>>> >
>>> > When I use spark-shell with local mode, the access to hdfs is
>>> > successful.
>>> >
>>> > However, If I use spark-shell which connects to the stand alone
>>> cluster (I
>>> > configured the spark as standalone cluster mode with only one node).
>>> >
>>> > The access to the hdfs fails with the following error: “Can't get
>>> Master
>>> > Kerberos principal for use as renewer”
>>> >
>>> >
>>> >
>>> > Anyone have any ideas on this ?
>>> >
>>> > Thanks a lot.
>>> >
>>> >
>>> >
>>> > Regards,
>>> > Xiaowei
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>


Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Akhil Das
Hi Bertrand,

We've updated the document
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0

This is our working Github repo
https://github.com/sigmoidanalytics/spork/tree/spork-0.9

Feel free to open issues over here
https://github.com/sigmoidanalytics/spork/issues

Thanks
Best Regards


On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux  wrote:

> @Mayur : I won't fight with the semantic of a fork but at the moment, no
> Spork does take the standard Pig as dependency. On that, we should agree.
>
> As for my use of Pig, I have no limitation. I am however interested to see
> the rise of a 'no-sql high level non programming language' for Spark.
>
> @Zhang : Could you elaborate your reference about Twitter?
>
>
> Bertrand Dechoux
>
>
> On Tue, Jul 8, 2014 at 4:04 AM, 张包峰  wrote:
>
>> Hi guys, previously I checked out the old "spork" and updated it to
>> Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1, see my github project
>> https://github.com/pelick/flare-spork
>>
>> It is also highly experimental, and just directly maps Pig physical
>> operations to Spark RDD transformations/actions. It works for simple
>> requests. :)
>>
>> I am also interested in the progress of Spork; is it ongoing at
>> Twitter in a non-open-source way?
>>
>> --
>> Thanks
>> Zhang Baofeng
>> Blog  | Github 
>> | Weibo  | LinkedIn
>> 
>>
>>
>>
>>
>> -- Original Message --
>> *From:* "Mayur Rustagi"
>> *Sent:* Monday, July 7, 2014, 11:55 PM
>> *To:* "user@spark.apache.org"
>> *Subject:* Re: Pig 0.13, Spark, Spork
>>
>> That version is old :).
>> We are not forking pig but cleanly separating out pig execution engine.
>> Let me know if you are willing to give it a go.
>>
>> Also would love to know what features of pig you are using ?
>>
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux 
>> wrote:
>>
>>> I saw a wiki page from your company but with an old version of Spark.
>>>
>>> http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
>>>
>>> I have no reason to use it yet but I am interested in the state of the
>>> initiative.
>>> What's your point of view (personal and/or professional) about the Pig
>>> 0.13 release?
>>> Is the pluggable execution engine flexible enough in order to avoid
>>> having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
>>>
>>> As a (for now) external observer, I am glad to see competition in that
>>> space. It can only be good for the community in the end.
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi 
>>> wrote:
>>>
 Hi,
 We have fixed many major issues around Spork & deploying it with some
 customers. Would be happy to provide a working version to you to try out.
 We are looking for more folks to try it out & submit bugs.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi 



 On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux 
 wrote:

> Hi,
>
> I was wondering what was the state of the Pig+Spark initiative now
> that the execution engine of Pig is pluggable? Granted, it was done in
> order to use Tez but could it be used by Spark? I know about a
> 'theoretical' project called Spork but I don't know any stable and
> maintained version of it.
>
> Regards
>
> Bertrand Dechoux
>


>>>
>>
>


Re: How to clear the list of Completed Applications in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.
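(For reference, that cap is the standalone master's retained-application limit;
a sketch of tuning it, assuming the spark.deploy.retainedApplications property
the standalone master reads, set via conf/spark-env.sh on the master:

  # keep only the 50 most recent completed applications in the UI
  export SPARK_MASTER_OPTS="-Dspark.deploy.retainedApplications=50"

This changes how many are kept, not how to clear them on demand.)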

On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang  wrote:
> Besides restarting the Master, is there any other way to clear the
> Completed Applications in Master web UI?


how to host the driver node

2014-07-09 Thread aminn_524
I have one master and two slave nodes. I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-host-the-drive-node-tp9149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to host spark driver

2014-07-09 Thread amin mohebbi


I have one master and two slave nodes. I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

H/P : +60 18 2040 017 



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of clusters (or using different versions of Spark).

It also unifies the way you bundle and submit an app for Yarn, Mesos,
etc... this was something that became very fragmented over time before
this was added.

Another feature is allowing users to set configuration values
dynamically rather than compile them inside of their program. That's
the one you mention here. You can choose to use this feature or not.
If you know your configs are not going to change, then you don't need
to set them with spark-submit.
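As a rough illustration of the difference (the app name, master URL and memory
value below are made up):

  import org.apache.spark.{SparkConf, SparkContext}

  // compiled-in configuration: changing the master or memory means rebuilding
  val conf = new SparkConf()
    .setAppName("MyApp")
    .setMaster("spark://master:7077")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)

  // with spark-submit you leave those out of the code and pass them at launch:
  //   ./bin/spark-submit --class com.example.MyApp \
  //     --master spark://master:7077 --executor-memory 4g myapp.jar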


On Wed, Jul 9, 2014 at 10:22 AM, Robert James  wrote:
> What is the purpose of spark-submit? Does it do anything outside of
> the standard val conf = new SparkConf ... val sc = new SparkContext
> ... ?


TaskContext stageId = 0

2014-07-09 Thread silvermast
Has anyone else seen this, at least in local mode? I haven't tried this in
the cluster, but I'm getting frustrated that I cannot identify activity
within the RDD's compute() method, whether by stageId or rddId (available on
ParallelCollectionPartition but not on ShuffledRDDPartition, and then only
through reflection). Anyone else solving this problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TaskContext-stageId-0-tp9152.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Why doesn't the driver node do any work?

2014-07-09 Thread aminn_524
I have one master and two slave nodes. I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node from which we submit the application with spark-submit?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-doesn-t-the-driver-node-do-any-work-tp3909p9153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Comparative study

2014-07-09 Thread Sean Owen
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons  wrote:

>  Impala is *not* built on map/reduce, though it was built to replace Hive,
> which is map/reduce based.  It has its own distributed query engine, though
> it does load data from HDFS, and is part of the hadoop ecosystem.  Impala
> really shines when your
>

(It was not built to replace Hive. It's purpose-built to make interactive
use with a BI tool feasible -- single-digit second queries on huge data
sets. It's very memory hungry. Hive's architecture choices and legacy code
have been throughput-oriented and can't really get below minutes at scale,
but it remains the right choice when you are in fact doing ETL!)


Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-09 Thread Juan Rodríguez Hortalá
Hi Jerry, it's all clear to me now, I will try with something like Apache
DBCP for the connection pool

Thanks a lot for your help!
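
For the archives, a minimal Scala sketch of the lazy-singleton-per-executor
pattern discussed above (the JDBC URL, table and dstream are placeholders, and
a pooling library such as DBCP would replace DriverManager and also handle
concurrent tasks safely):

  import java.sql.{Connection, DriverManager}

  object LazyConnection {
    // initialized at most once per executor JVM, on first use
    lazy val conn: Connection = DriverManager.getConnection("jdbc:mysql://db-host/mydb")
  }

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val stmt = LazyConnection.conn.prepareStatement("INSERT INTO events(value) VALUES (?)")
      records.foreach { r =>
        stmt.setString(1, r.toString)
        stmt.executeUpdate()
      }
      stmt.close() // keep the connection itself open for reuse by later tasks
    }
  }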


2014-07-09 3:08 GMT+02:00 Shao, Saisai :

>  Yes, that would be the Java equivalent of using a static class member, but
> you should program carefully to prevent resource leaks. A good choice is
> to use a third-party DB connection library which supports connection pooling;
> that will reduce your programming effort.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com]
> *Sent:* Tuesday, July 08, 2014 6:54 PM
> *To:* user@spark.apache.org
>
> *Subject:* Re: Which is the best way to get a connection to an external
> database per task in Spark Streaming?
>
>
>
> Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and
> I only have rudimentary knowledge about Scala, how could I recreate in Java
> the lazy creation of a singleton object that you propose for Scala? Maybe a
> static class member in Java for the connection would be the solution?
>
> Thanks again for your help,
>
> Best Regards,
>
> Juan
>
>
>
> 2014-07-08 11:44 GMT+02:00 Shao, Saisai :
>
> I think you can maintain a connection pool or keep the connection as a
> long-lived object on the executor side (like lazily creating a singleton object
> in object { } in Scala), so each task can get this connection when it
> executes instead of creating a new one. That would be good for your
> scenario, since creating a connection is quite expensive for each task.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Juan Rodríguez Hortalá [mailto:juan.rodriguez.hort...@gmail.com]
> *Sent:* Tuesday, July 08, 2014 5:19 PM
> *To:* Tobias Pfeiffer
> *Cc:* user@spark.apache.org
> *Subject:* Re: Which is the best way to get a connection to an external
> database per task in Spark Streaming?
>
>
>
> Hi Tobias, thanks for your help. I understand that with that code we
> obtain a database connection per partition, but I also suspect that with
> that code a new database connection is created per each execution of the
> function used as argument for mapPartitions(). That would be very
> inefficient because a new object and a new database connection would be
> created for each batch of the DStream. But my knowledge about the lifecycle
> of Functions in Spark Streaming is very limited, so maybe I'm wrong, what
> do you think?
>
> Greetings,
>
> Juan
>
>
>
> 2014-07-08 3:30 GMT+02:00 Tobias Pfeiffer :
>
> Juan,
>
>
>
> I am doing something similar, just not "insert into SQL database", but
> "issue some RPC call". I think mapPartitions() may be helpful to you. You
> could do something like
>
>
>
> dstream.mapPartitions(iter => {
>
>   val db = new DbConnection()
>
>   // maybe only do the above if !iter.isEmpty
>
>   iter.map(item => {
>
> db.call(...)
>
> // do some cleanup if !iter.hasNext here
>
> item
>
>   })
>
> }).count() // force output
>
>
>
> Keep in mind though that the whole idea about RDDs is that operations are
> idempotent and in theory could be run on multiple hosts (to take the result
> from the fastest server) or multiple times (to deal with failures/timeouts)
> etc., which is maybe something you want to deal with in your SQL.
>
>
>
> Tobias
>
>
>
>
>
> On Tue, Jul 8, 2014 at 3:40 AM, Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
> Hi list,
>
> I'm writing a Spark Streaming program that reads from a kafka topic,
> performs some transformations on the data, and then inserts each record in
> a database with foreachRDD. I was wondering which is the best way to handle
> the connection to the database so each worker, or even each task, uses a
> different connection to the database, and then database inserts/updates
> would be performed in parallel.
> - I understand that using a final variable in the driver code is not a
> good idea because then the communication with the database would be
> performed in the driver code, which leads to a bottleneck, according to
> http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
> - I think creating a new connection in the call() method of the Function
> passed to foreachRDD is also a bad idea, because then I wouldn't be reusing
> the connection to the database for each batch RDD in the DStream
> - I'm not sure that a broadcast variable with the connection handler is a
> good idea in case the target database is distributed, because if the same
> handler is used for all the nodes of the Spark cluster then than could have
> a negative effect in the data locality of the connection to the database.
> - From
> http://apache-spark-user-list.1001560.n3.nabble.com/Database-connection-per-worker-td1280.html
> I understand that by using an static variable and referencing it in the
> call() method of the Function passed to foreachRDD we get a different
> connection per Spark worker, I guess it's because there is a different JVM
> per worker. But then all the tasks in 

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-09 Thread Martin Gammelsæter
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem
to work. For now we've downgraded dropwizard to 0.6.2 which uses a
compatible version of jetty. Not optimal, but it works for now.

On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai  wrote:
> We're doing a similar thing to launch a Spark job in Tomcat, and I opened a
> JIRA for this. There are a couple of technical discussions there.
>
> https://issues.apache.org/jira/browse/SPARK-2100
>
> In the end, we realized that Spark uses Jetty not only for the Spark
> WebUI, but also for distributing the jars and tasks, so it's really hard
> to remove the web dependency in Spark.
>
> In the end, we launch our Spark job in yarn-cluster mode, and at
> runtime, the only dependency in our web application is spark-yarn,
> which doesn't contain any Spark web stuff.
>
> PS, upgrading Spark's Jetty from 8.x to 9.x may not be
> straightforward by just changing the version in the Spark build script.
> Jetty 9.x requires Java 7 since the servlet API (servlet 3.1) requires
> Java 7.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers  wrote:
>> do you control your cluster and spark deployment? if so, you can try to
>> rebuild with jetty 9.x
>>
>>
>> On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
>>  wrote:
>>>
>>> Digging a bit more I see that there is yet another jetty instance that
>>> is causing the problem, namely the BroadcastManager has one. I guess
>>> this one isn't very wise to disable... It might very well be that the
>>> WebUI is a problem as well, but I guess the code doesn't get far
>>> enough. Any ideas on how to solve this? Spark seems to use jetty
>>> 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
>>> of the problem. Any ideas?
>>>
>>> On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
>>>  wrote:
>>> > Hi!
>>> >
>>> > I am building a web frontend for a Spark app, allowing users to input
>>> > sql/hql and get results back. When starting a SparkContext from within
>>> > my server code (using jetty/dropwizard) I get the error
>>> >
>>> > java.lang.NoSuchMethodError:
>>> > org.eclipse.jetty.server.AbstractConnector: method <init>()V not found
>>> >
>>> > when Spark tries to fire up its own jetty server. This does not happen
>>> > when running the same code without my web server. This is probably
>>> > fixable somehow(?) but I'd like to disable the webUI as I don't need
>>> > it, and ideally I would like to access that information
>>> > programatically instead, allowing me to embed it in my own web
>>> > application.
>>> >
>>> > Is this possible?
>>> >
>>> > --
>>> > Best regards,
>>> > Martin Gammelsæter
>>>
>>>
>>>
>>> --
>>> Mvh.
>>> Martin Gammelsæter
>>> 92209139
>>
>>



-- 
Mvh.
Martin Gammelsæter
92209139


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail,

I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?

- Patrick

On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov  wrote:
> Thanks Andrew,
>  ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
> --executor-memory 20g --driver-memory 10g
> works well, just wanted to make sure that I'm not missing anything
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


"Initial job has not accepted any resources" means many things

2014-07-09 Thread Martin Gammelsæter
It seems like the "Initial job has not accepted any resources" message shows
up for a wide variety of different errors (for example the obvious one
where you've requested more memory than is available), but also for
example in the case where the worker nodes do not have the
appropriate code on their class path. Debugging from this error is
very hard, as errors do not show up in the logs on the workers. Is
this a known issue? I'm having issues with getting the code to the
workers without using addJar (my code is a fairly static application,
and I'd like to avoid using addJar every time the app starts up, and
instead manually add the jar to the classpath of every worker, but I
can't seem to find out how).

-- 
Best regards,
Martin Gammelsæter


Re: Kryo is slower, and the size saving is minimal

2014-07-09 Thread wxhsdp
I'm not familiar with Kryo and my opinion may not be right. In my case, Kryo
only saves about 5% of the original size when dealing with primitive types
such as Arrays. I'm not sure whether it is the common case.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-is-slower-and-the-size-saving-is-minimal-tp9131p9160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Kryo is slower, and the size saving is minimal

2014-07-09 Thread innowireless TaeYun Kim
Thank you for your response.

Maybe that applies to my case.
In my test case, the types of almost all of the data are either primitive
types, joda DateTime, or String.
But I'm somewhat disappointed with the speed.
At least it should not be slower than the Java default serializer...
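
One thing worth checking (a sketch, not a diagnosis): for classes that are not
registered, Kryo writes the fully qualified class name with every object, which
costs both time and space. Registration in this Spark version looks roughly
like this (the registered types are just examples; joda-time is assumed to be
on the classpath):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[Array[Double]])          // whatever types dominate the data
      kryo.register(classOf[org.joda.time.DateTime])
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "MyRegistrator")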

-Original Message-
From: wxhsdp [mailto:wxh...@gmail.com] 
Sent: Wednesday, July 09, 2014 5:47 PM
To: u...@spark.incubator.apache.org
Subject: Re: Kryo is slower, and the size saving is minimal

I'm not familiar with Kryo and my opinion may not be right. In my case,
Kryo only saves about 5% of the original size when dealing with primitive
types such as Arrays. I'm not sure whether it is the common case.



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-is-slower-and-the-size-saving-is-minimal-tp9131p9160.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: TaskContext stageId = 0

2014-07-09 Thread silvermast
Oh well, never mind. The problem is that ResultTask's stageId is immutable
and is used to construct the Task superclass. Anyway, my solution now is to
use this.id for the rddId and to gather all rddIds using a SparkListener on
stage completion to clean up any activity registered for those RDDs. I
could use TaskContext's hook but I'd have to add some more messaging so I
can clear state that may live on a different executor than the one my
partition is on, but since I don't know that the executor will succeed, this
is not safe.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TaskContext-stageId-0-tp9152p9162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to run a job on all workers?

2014-07-09 Thread silvermast
Is it possible to run a job that assigns work to every worker in the system?
My bootleg right now is to have a spark listener hear whenever a block
manager is added and to increase a split count by 1. It runs a spark job
with that split count and hopes that it will at least run on the newest
worker. There's some weirdness with block managers being removed sometimes
allowing my count to go negative, so I just keep monotonically increasing my
split count. Anyone have a way that doesn't suck?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-a-job-on-all-workers-tp9163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Error using MLlib-NaiveBayes : "Matrices are not aligned"

2014-07-09 Thread Rahul Bhojwani
I am using Naive Bayes in MLlib.
Below I have printed the log of *model.theta* after training on the training data.
You can check that it contains 9 features for 2-class classification.

>>print numpy.log(model.theta)

[[ 0.31618962  0.16636852  0.07200358  0.05411449  0.08542039  0.17620751
   0.03711986  0.07110912  0.02146691]
 [ 0.26598639  0.23809524  0.06258503  0.01904762  0.05714286  0.18231293
   0.08911565  0.03061224  0.05510204]]




 I am giving the same number of features for prediction but I am getting the
error: *matrices are not aligned.*

*ERROR:*

Traceback (most recent call last):
  File ".\naive_bayes_analyser.py", line 192, in 
prediction =
model.predict(array([float(pos_count),float(neg_count),float(need_count),float(goal_count),float(try_co
unt),float(means_count),float(persist_count),float(complete_count),float(fail_count)]))
  File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py",
line 101, in predict
return numpy.argmax(self.pi + dot(x, self.theta))
ValueError: matrices are not aligned



Can someone suggest the possible error/mistake?
Thanks in advance
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all,

I wondered if you could help me to clarify the next situation:
in the classic example

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))

As I understand it, the data is read into memory first, and after that the
filtering is applied. Is there any way to apply the filtering during the read
step, and not put all objects into memory?

Thank you,
Konstantin Kudryavtsev


Re: Filtering data during the read

2014-07-09 Thread Mayur Rustagi
Hi,
Spark does that out of the box for you :)
It compresses down the execution steps as much as possible.
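
In other words (a small sketch of what actually happens):

  val file = sc.textFile("hdfs://...")                      // lazy: nothing is read yet
  val errors = file.filter(line => line.contains("ERROR"))  // lazy: just records the step

  // only an action triggers execution; each partition is then streamed through
  // textFile -> filter line by line, so the unfiltered data is never held in
  // memory unless you explicitly cache() it
  errors.count()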
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Hi all,
>
> I wondered if you could help me to clarify the next situation:
> in the classic example
>
> val file = spark.textFile("hdfs://...")
> val errors = file.filter(line => line.contains("ERROR"))
>
> As I understand it, the data is read into memory first, and after that the
> filtering is applied. Is there any way to apply the filtering during the read
> step, and not put all objects into memory?
>
> Thank you,
> Konstantin Kudryavtsev
>


Docker Scripts

2014-07-09 Thread dmpour23
Hi,

Regarding the docker scripts, I know I can change the base image easily, but is
there any specific reason why the base image is hadoop_1.2.1? Why is this
preferred to Hadoop 2 (HDP2, CDH5) distributions?

Now that Amazon supports Docker, could this replace the ec2 scripts?

Kind regards
Dimitri





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Docker-Scripts-tp9167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


FW: memory question

2014-07-09 Thread michael.lewis
Hi,

Does anyone know if it is possible to call the MetadataCleaner on demand? I.e.,
rather than set spark.cleaner.ttl and have this run periodically, I'd like to
run it on demand. The problem with periodic cleaning is that it can remove RDDs
that we still require (some calcs are short, others very long).

We're using Spark 0.9.0 with Cloudera distribution.

I have a simple test calculation in a loop as follows:

val test = new TestCalc(sparkContext)
for (i <- 1 to 10) {
  val (x) = test.evaluate(rdd)
}

Where TestCalc is defined as:
class TestCalc(sparkContext: SparkContext) extends Serializable  {

  def aplus(a: Double, b:Double) :Double = a+b;

  def evaluate(rdd : RDD[Double]) = {
 /* do some dummy calc. */
  val x = rdd.groupBy(x => x /2.0)
  val y = x.fold((0.0,Seq[Double]()))((a,b)=>(aplus(a._1,b._1),Seq()))
  val z = y._1
/* try with/without this... */
  val e :SparkEnv = SparkEnv.getThreadLocal
  e.blockManager.master.removeRdd(x.id, true) // still see memory consumption go up...
  (z)
  }
}

What I can see on the cluster is that the memory usage on the node executing this
continually climbs. I'd expect it to level off and not jump up over 1G...
I thought that putting in the 'removeRdd' line might help, but it doesn't seem
to make a difference.


Regards,
Mike




Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread gil...@gmail.com
Hello,

While trying to run this example below I am getting errors. I have built
Spark using the following command:

$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly

Running the example using spark-shell:

$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client ./bin/spark-shell

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Error:

java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
    at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
    at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-java-lang-NoClassDefFoundError-Could-not-initialize-class-line10-read-tp9170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
Not sure I understand why unifying how you submit an app for different
platforms and dynamic configuration cannot be part of SparkConf and
SparkContext?

For the classpath, a simple script similar to "hadoop classpath" that shows what
needs to be added should be sufficient.

On Spark standalone I can launch a program just fine with just SparkConf
and SparkContext. Not on YARN, so the spark-submit script must be doing a
few extra things there that I am missing... which makes things more difficult
because I am not sure it's realistic to expect every application that needs
to run something on Spark to be launched using spark-submit.
 On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:

> It fulfills a few different functions. The main one is giving users a
> way to inject Spark as a runtime dependency separately from their
> program and make sure they get exactly the right version of Spark. So
> a user can bundle an application and then use spark-submit to send it
> to different types of clusters (or using different versions of Spark).
>
> It also unifies the way you bundle and submit an app for Yarn, Mesos,
> etc... this was something that became very fragmented over time before
> this was added.
>
> Another feature is allowing users to set configuration values
> dynamically rather than compile them inside of their program. That's
> the one you mention here. You can choose to use this feature or not.
> If you know your configs are not going to change, then you don't need
> to set them with spark-submit.
>
>
> On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
> wrote:
> > What is the purpose of spark-submit? Does it do anything outside of
> > the standard val conf = new SparkConf ... val sc = new SparkContext
> > ... ?
>


Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Mayur Rustagi
Also it's far from bug-free :)
Let me know if you need any help to try it out.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Jul 9, 2014 at 12:58 PM, Akhil Das 
wrote:

> Hi Bertrand,
>
> We've updated the document
> http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0
>
> This is our working Github repo
> https://github.com/sigmoidanalytics/spork/tree/spork-0.9
>
> Feel free to open issues over here
> https://github.com/sigmoidanalytics/spork/issues
>
> Thanks
> Best Regards
>
>
> On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux 
> wrote:
>
>> @Mayur : I won't fight with the semantic of a fork but at the moment, no
>> Spork does take the standard Pig as dependency. On that, we should agree.
>>
>> As for my use of Pig, I have no limitation. I am however interested to
>> see the rise of a 'no-sql high level non programming language' for Spark.
>>
>> @Zhang : Could you elaborate your reference about Twitter?
>>
>>
>> Bertrand Dechoux
>>
>>
>> On Tue, Jul 8, 2014 at 4:04 AM, 张包峰  wrote:
>>
>>> Hi guys, previously I checked out the old "spork" and updated it to
>>> Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1, see my github project
>>> https://github.com/pelick/flare-spork
>>>
>>> It is also highly experimental, and just directly maps Pig physical
>>> operations to Spark RDD transformations/actions. It works for simple
>>> requests. :)
>>>
>>> I am also interested in the progress of Spork; is it ongoing at
>>> Twitter in a non-open-source way?
>>>
>>> --
>>> Thanks
>>> Zhang Baofeng
>>> Blog  | Github 
>>> | Weibo  | LinkedIn
>>> 
>>>
>>>
>>>
>>>
>>> -- Original Message --
>>> *From:* "Mayur Rustagi"
>>> *Sent:* Monday, July 7, 2014, 11:55 PM
>>> *To:* "user@spark.apache.org"
>>> *Subject:* Re: Pig 0.13, Spark, Spork
>>>
>>> That version is old :).
>>> We are not forking pig but cleanly separating out pig execution engine.
>>> Let me know if you are willing to give it a go.
>>>
>>> Also would love to know what features of pig you are using ?
>>>
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi 
>>>
>>>
>>>
>>> On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux 
>>> wrote:
>>>
 I saw a wiki page from your company but with an old version of Spark.

 http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

 I have no reason to use it yet but I am interested in the state of the
 initiative.
 What's your point of view (personal and/or professional) about the Pig
 0.13 release?
 Is the pluggable execution engine flexible enough in order to avoid
 having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D

 As a (for now) external observer, I am glad to see competition in that
 space. It can only be good for the community in the end.

 Bertrand Dechoux


 On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi 
 wrote:

> Hi,
> We have fixed many major issues around Spork & deploying it with some
> customers. Would be happy to provide a working version to you to try out.
> We are looking for more folks to try it out & submit bugs.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux 
> wrote:
>
>> Hi,
>>
>> I was wondering what was the state of the Pig+Spark initiative now
>> that the execution engine of Pig is pluggable? Granted, it was done in
>> order to use Tez but could it be used by Spark? I know about a
>> 'theoretical' project called Spork but I don't know any stable and
>> maintained version of it.
>>
>> Regards
>>
>> Bertrand Dechoux
>>
>
>

>>>
>>
>


Re: Purpose of spark-submit?

2014-07-09 Thread Surendranauth Hiraman
Are there any gaps beyond convenience and code/config separation in using
spark-submit versus SparkConf/SparkContext if you are willing to set your
own config?

If there are any gaps, +1 on having parity within SparkConf/SparkContext
where possible. In my use case, we launch our jobs programmatically. In
theory, we could shell out to spark-submit but it's not the best option for
us.

So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
on the complexities of other modes, though.

-Suren



On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers  wrote:

> not sure I understand why unifying how you submit app for different
> platforms and dynamic configuration cannot be part of SparkConf and
> SparkContext?
>
> for classpath a simple script similar to "hadoop classpath" that shows
> what needs to be added should be sufficient.
>
> on spark standalone I can launch a program just fine with just SparkConf
> and SparkContext. not on yarn, so the spark-launch script must be doing a
> few things extra there I am missing... which makes things more difficult
> because I am not sure its realistic to expect every application that needs
> to run something on spark to be launched using spark-submit.
>  On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:
>
>> It fulfills a few different functions. The main one is giving users a
>> way to inject Spark as a runtime dependency separately from their
>> program and make sure they get exactly the right version of Spark. So
>> a user can bundle an application and then use spark-submit to send it
>> to different types of clusters (or using different versions of Spark).
>>
>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
>> etc... this was something that became very fragmented over time before
>> this was added.
>>
>> Another feature is allowing users to set configuration values
>> dynamically rather than compile them inside of their program. That's
>> the one you mention here. You can choose to use this feature or not.
>> If you know your configs are not going to change, then you don't need
>> to set them with spark-submit.
>>
>>
>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
>> wrote:
>> > What is the purpose of spark-submit? Does it do anything outside of
>> > the standard val conf = new SparkConf ... val sc = new SparkContext
>> > ... ?
>>
>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@v elos.io
W: www.velos.io


Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Mayur Rustagi
RDDs can only keep objects. How do you plan to encode these images so that
they can be loaded? Keeping the whole image as a single object in one RDD
would perhaps not be super optimized.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa 
wrote:

> Hi all,
>
> I need to run a Spark job that needs a set of images as input. I need
> something that loads these images as an RDD but I just don't know how to do
> that. Do any of you have an idea?
>
> Cheers,
>
>
> Jao
>


Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
The idea is to run a job that uses images as input so that each worker will
process a subset of the images.


On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi 
wrote:

> RDD can only keep objects. How do you plan to encode these images so that
> they can be loaded. Keeping the whole image as a single object in 1 rdd
> would perhaps not be super optimized.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Wed, Jul 9, 2014 at 12:17 PM, Jaonary Rabarisoa 
> wrote:
>
>> Hi all,
>>
>> I need to run a spark job that need a set of images as input. I need
>> something that load these images as RDD but I just don't know how to do
>> that. Do some of you have any idea ?
>>
>> Cheers,
>>
>>
>> Jao
>>
>
>


Re: Spark job tracker.

2014-07-09 Thread Mayur Rustagi
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

var sem = 0
sc.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart) {
    sem += 1
  }
})
// sc is the SparkContext



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Jul 9, 2014 at 4:34 AM, abhiguruvayya 
wrote:

> Hello Mayur,
>
> How can I implement these methods mentioned below. Do u you have any clue
> on
> this pls et me know.
>
>
> public void onJobStart(SparkListenerJobStart arg0) {
> }
>
> @Override
> public void onStageCompleted(SparkListenerStageCompleted arg0) {
> }
>
> @Override
> public void onStageSubmitted(SparkListenerStageSubmitted arg0) {
>
> }
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9104.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all,

I am currently trying to save to Cassandra after some Spark Streaming
computation.

I call a myDStream.foreachRDD so that I can collect each RDD in the driver
app runtime and inside I do something like this:
myDStream.foreachRDD(rdd => {

var someCol = Seq[MyType]()

foreach(kv =>{
  someCol :+ rdd._2 //I only want the RDD value and not the key
 }
val collectionRDD = sc.parallelize(someCol) //THIS IS WHY IT FAILS TRYING TO
RUN THE WORKER
collectionRDD.saveToCassandra(...)
}

I get the NotSerializableException while trying to run on the Node (I also tried
someCol as a shared variable).
I believe this happens because the myDStream doesn't exist yet when the code
is pushed to the Node, so the parallelize doesn't have any structure to
relate to it. Inside this foreachRDD I should only do RDD calls which are
only related to other RDDs. I guess this was just a desperate attempt

So I have a question
Using the Cassandra Spark driver - Can we only write to Cassandra from an
RDD? In my case I only want to write once all the computation is finished in
a single batch on the driver app.

tnks in advance.

Rod











--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


how to convert JavaDStream to JavaRDD

2014-07-09 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please help me to resolve below query.

My use case is :

I'm using JavaStreamingContext to read text files from Hadoop - HDFS
directory

JavaDStream<String> lines_2 =
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");

How do I convert the JavaDStream<String> result to JavaRDD<String>, if we can
convert it? Then I can use the collect() method on the JavaRDD and process my text file.

I'm not able to find a collect method on the JavaDStream.

Thank you very much in advance.

Regards,
Rajesh


Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Is MyType serializable? Everything inside the foreachRDD closure has to be
serializable.
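
For what it's worth, a hedged sketch of writing each batch straight from the
DStream, without collecting to the driver and re-parallelizing (this assumes
the saveToCassandra call RodrigoB is already using, with the connector's
implicits imported, and that MyType's fields map onto the target table;
keyspace and table names are placeholders):

  myDStream.foreachRDD { rdd =>
    rdd.map { case (_, value) => value }            // keep only the value, drop the key
       .saveToCassandra("my_keyspace", "my_table")
  }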


2014-07-09 14:24 GMT+01:00 RodrigoB :

> Hi all,
>
> I am currently trying to save to Cassandra after some Spark Streaming
> computation.
>
> I call a myDStream.foreachRDD so that I can collect each RDD in the driver
> app runtime and inside I do something like this:
> myDStream.foreachRDD(rdd => {
>
> var someCol = Seq[MyType]()
>
> foreach(kv =>{
>   someCol :+ rdd._2 //I only want the RDD value and not the key
>  }
> val collectionRDD = sc.parallelize(someCol) //THIS IS WHY IT FAILS TRYING
> TO
> RUN THE WORKER
> collectionRDD.saveToCassandra(...)
> }
>
> I get the NotSerializableException while trying to run the Node (also tried
> someCol as shared variable).
> I believe this happens because the myDStream doesn't exist yet when the
> code
> is pushed to the Node so the parallelize doesn't have any structure to
> relate to it. Inside this foreachRDD I should only do RDD calls which are
> only related to other RDDs. I guess this was just a desperate attempt
>
> So I have a question
> Using the Cassandra Spark driver - Can we only write to Cassandra from an
> RDD? In my case I only want to write once all the computation is finished
> in
> a single batch on the driver app.
>
> tnks in advance.
>
> Rod
>
>
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Purpose of spark-submit?

2014-07-09 Thread Robert James
+1 to be able to do anything via SparkConf/SparkContext.  Our app
worked fine in Spark 0.9, but, after several days of wrestling with
uber jars and spark-submit, and so far failing to get Spark 1.0
working, we'd like to go back to doing it ourself with SparkConf.

As the previous poster said, a few scripts should be able to give us
the classpath and any other params we need, and be a lot more
transparent and debuggable.

On 7/9/14, Surendranauth Hiraman  wrote:
> Are there any gaps beyond convenience and code/config separation in using
> spark-submit versus SparkConf/SparkContext if you are willing to set your
> own config?
>
> If there are any gaps, +1 on having parity within SparkConf/SparkContext
> where possible. In my use case, we launch our jobs programmatically. In
> theory, we could shell out to spark-submit but it's not the best option for
> us.
>
> So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
> on the complexities of other modes, though.
>
> -Suren
>
>
>
> On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers  wrote:
>
>> not sure I understand why unifying how you submit app for different
>> platforms and dynamic configuration cannot be part of SparkConf and
>> SparkContext?
>>
>> for classpath a simple script similar to "hadoop classpath" that shows
>> what needs to be added should be sufficient.
>>
>> on spark standalone I can launch a program just fine with just SparkConf
>> and SparkContext. not on yarn, so the spark-launch script must be doing a
>> few things extra there I am missing... which makes things more difficult
>> because I am not sure its realistic to expect every application that
>> needs
>> to run something on spark to be launched using spark-submit.
>>  On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:
>>
>>> It fulfills a few different functions. The main one is giving users a
>>> way to inject Spark as a runtime dependency separately from their
>>> program and make sure they get exactly the right version of Spark. So
>>> a user can bundle an application and then use spark-submit to send it
>>> to different types of clusters (or using different versions of Spark).
>>>
>>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
>>> etc... this was something that became very fragmented over time before
>>> this was added.
>>>
>>> Another feature is allowing users to set configuration values
>>> dynamically rather than compile them inside of their program. That's
>>> the one you mention here. You can choose to use this feature or not.
>>> If you know your configs are not going to change, then you don't need
>>> to set them with spark-submit.
>>>
>>>
>>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
>>> wrote:
>>> > What is the purpose of spark-submit? Does it do anything outside of
>>> > the standard val conf = new SparkConf ... val sc = new SparkContext
>>> > ... ?
>>>
>>
>
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@v elos.io
> W: www.velos.io
>


Re: controlling the time in spark-streaming

2014-07-09 Thread Laeeq Ahmed
Hi,

For QueueRDD, have a look here. 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
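
A rough Scala sketch of that approach (ssc is an existing StreamingContext and
the HDFS path is a placeholder):

  import scala.collection.mutable.SynchronizedQueue
  import org.apache.spark.rdd.RDD

  val rddQueue = new SynchronizedQueue[RDD[String]]()
  val inputStream = ssc.queueStream(rddQueue)  // DStream driven by whatever you enqueue

  // a feeder loop/thread pushes the log files in timestamp order
  rddQueue += ssc.sparkContext.textFile("hdfs:///logs/2014-07-09-00.log")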

  

Regards,
Laeeq



On Friday, May 23, 2014 10:33 AM, Mayur Rustagi  wrote:
 


Well, it's hard to use text data as the time of input.
But if you are adamant, here's what you would do.
Have a DStream object which works on a folder using fileStream/textFileStream.
Then have another process (Spark Streaming or cron) read through the files you
receive & push them into the folder in order of time. Mostly your data would be
produced at t, you would get it at t + say 5 sec, & you can push it in & get it
processed at t + say 10 sec. Then you can timeshift your calculations. If you
are okay with a broad enough time frame you should be fine.


Another way is to use queue processing.
Queue<JavaRDD<String>> rddQueue = new LinkedList<JavaRDD<String>>();
Create a DStream to consume from the queue of RDDs, keep looking into the folder of
files & create RDDs from them at a minute level & push them into the queue. This
would cause you to go through your data at least twice & just provide order
guarantees; processing time is still going to vary with how quickly you can
process your RDD.

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



On Thu, May 22, 2014 at 9:08 PM, Ian Holsman  wrote:

Hi.
>
>
>I'm writing a pilot project, and plan on using spark's streaming app for it.
>
>
>To start with I have a dump of some access logs with their own timestamps, and 
>am using the textFileStream and some old files to test it with.
>
>
>One of the issues I've come across is simulating the windows. I would like use 
>the timestamp from the access logs as the 'system time' instead of the real 
>clock time.
>
>
>I googled a bit and found the 'manual' clock which appears to be used for 
>testing the job scheduler.. but I'm not sure what my next steps should be.
>
>
>I'm guessing I'll need to do something like
>
>
>1. use the textFileStream to create a 'DStream'
>2. have some kind of DStream that runs on top of that that creates the RDDs 
>based on the timestamps Instead of the system time
>3. the rest of my mappers.
>
>
>Is this correct? or do I need to create my own 'textFileStream' to initially 
>create the RDDs and modify the system clock inside of it.
>
>
>I'm not too concerned about out-of-order messages, going backwards in time, or 
>being 100% in sync across workers.. as this is more for 
>'development'/prototyping.
>
>
>Are there better ways of achieving this? I would assume that controlling the 
>windows RDD buckets would be a common use case.
>
>
>TIA
>Ian
>
>-- 
>
>Ian Holsman
>i...@holsman.com.au 
>PH: + 61-3-9028 8133 / +1-(425) 998-7083

RDD Cleanup

2014-07-09 Thread premdass
Hi,

I'm using Spark 1.0.0, with the Ooyala Job Server, for a low-latency query
system. Basically a long-running context is created, which makes it possible to run
multiple jobs under the same context, and hence to share the data.

It was working fine in 0.9.1. However, in the Spark 1.0 release, the RDDs
created and cached by Job-1 get cleaned up by the BlockManager (I can see log
statements saying "cleaning up RDD"), and so the cached RDDs are not available
for Job-2, though both Job-1 and Job-2 are running under the same Spark context.

I tried using the spark.cleaner.referenceTracking = false setting; however,
this causes the issue that unpersisted RDDs are not cleaned up properly
and occupy Spark's memory.


Has anybody faced an issue like this before? If so, any advice would be greatly
appreciated.


Also, is there any way to mark an RDD as being in use under a context, even
though the job using it has finished (so subsequent jobs can use that
RDD)?


Thanks,
Prem



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: how to convert JavaDStream to JavaRDD

2014-07-09 Thread Laeeq Ahmed
Hi,

First use foreachRDD and then use collect, e.g.:

dstream.foreachRDD(rdd => {
  rdd.collect().foreach(item => { /* process each item */ })
})

Also, it's better to use Scala. Less verbose.


Regards,
Laeeq



On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar 
 wrote:
 


Hi Team,

Could you please help me to resolve below query.

My use case is :

I'm using JavaStreamingContext to read text files from Hadoop - HDFS directory

JavaDStream lines_2 = 
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");

How to convert JavaDStream result to JavaRDD? 
if we can convert. I can use collect() method on JavaRDD and process my 
textfile.

I'm not able to find collect method on JavaRDD.

Thank you very much in advance.

Regards,
Rajesh

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-07-09 Thread premdass
Michael,

Thanks for the response. Yes, Moving the Case class solved the issue.

Thanks,
Prem



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-registerAsTable-No-TypeTag-available-Error-tp7623p9183.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: window analysis with Spark and Spark streaming

2014-07-09 Thread Laeeq Ahmed
Hi,

For the queue-of-RDDs approach (QueueStream), have a look here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

Regards,
Laeeq,
PhD candidate,
KTH, Stockholm.
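
For completeness, a minimal sketch of the queueStream approach discussed in this
thread (the RDD contents are made up): because you decide when each RDD is pushed
into the queue, the order and spacing of the pushes stand in for "application time".

import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("QueueStreamSketch"), Seconds(1))
val rddQueue = new Queue[RDD[Int]]()

// By default one queued RDD is consumed per batch interval.
val events = ssc.queueStream(rddQueue)
events.count().print()

ssc.start()
for (i <- 1 to 10) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 100)
  }
  Thread.sleep(1000)
}
ssc.stop()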

 


On Sunday, July 6, 2014 10:20 AM, alessandro finamore 
 wrote:
 


On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] 
<[hidden email]> wrote: 
> Key idea is to simulate your app time as you enter data. So you can connect 
> Spark Streaming to a queue and insert data into it spaced by time. Easier said 
> than done :). 

I see. 
I'll also try to implement this solution so that I can compare it with 
my current Spark implementation. 
I'm interested in seeing if this is faster... as I assume it should be :) 

> What are the parallelism issues you are hitting with your 
> static approach. 

In my current Spark implementation, whenever I need to get the 
aggregated stats over the window, I'm re-mapping all the current bins 
to have the same key so that they can be reduced all together. 
This means that the data needs to be shipped to a single reducer. 
As a result, adding nodes/cores to the application does not really 
affect the total time :( 
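
To make that bottleneck concrete, here is a rough sketch of the pattern just
described (names and values are hypothetical, assuming the per-bin stats are
simple counts):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("WindowReduceSketch"))

// Hypothetical per-bin counts covering the current window.
val bins = sc.parallelize(Seq(("bin-1", 10L), ("bin-2", 7L), ("bin-3", 3L)))

// Collapsing every bin onto a single key means all values flow through one
// reduce task, which is why adding nodes/cores does not shorten this step.
val windowTotal = bins
  .map { case (_, count) => ("window", count) }
  .reduceByKey(_ + _)

windowTotal.collect().foreach(println)   // prints (window,20)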


> 
> 
> On Friday, July 4, 2014, alessandro finamore <[hidden email]> wrote: 
>> 
>> Thanks for the replies 
>> 
>> What is not completely clear to me is how time is managed. 
>> I can create a DStream from a file. 
>> But if I set the window property, that will be bound to the application 
>> time, right? 
>> 
>> If I got it right, with a receiver I can control the way DStreams are 
>> created. 
>> But how can I then apply the windowing already shipped with the framework 
>> if this is bound to the "application time"? 
>> I would like to define a window of N files, but the window() function 
>> requires a duration as input... 
>> 
>> 
>> 
>> 
>> -- 
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806p8824.html
>> 
>> Sent from the Apache Spark User List mailing list archive at Nabble.com. 
> 
> 
> 
> -- 
> Sent from Gmail Mobile 
> 
> 


-- 
-- 
Alessandro Finamore, PhD 
Politecnico di Torino 
-- 
Office:    +39 0115644127 
Mobile:   +39 3280251485 
SkypeId: alessandro.finamore 
--- 


 View this message in context: Re: window analysis with Spark and Spark 
streaming

Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a
shell script.

We also experience issues when submitting jobs programmatically (i.e., without
using spark-submit). In fact, even in the Hadoop world, I rarely used "hadoop jar"
to submit jobs from a shell.



On Wed, Jul 9, 2014 at 9:47 AM, Robert James  wrote:

> +1 to be able to do anything via SparkConf/SparkContext.  Our app
> worked fine in Spark 0.9, but, after several days of wrestling with
> uber jars and spark-submit, and so far failing to get Spark 1.0
> working, we'd like to go back to doing it ourself with SparkConf.
>
> As the previous poster said, a few scripts should be able to give us
> the classpath and any other params we need, and be a lot more
> transparent and debuggable.
>
> On 7/9/14, Surendranauth Hiraman  wrote:
> > Are there any gaps beyond convenience and code/config separation in using
> > spark-submit versus SparkConf/SparkContext if you are willing to set your
> > own config?
> >
> > If there are any gaps, +1 on having parity within SparkConf/SparkContext
> > where possible. In my use case, we launch our jobs programmatically. In
> > theory, we could shell out to spark-submit but it's not the best option
> for
> > us.
> >
> > So far, we are only using Standalone Cluster mode, so I'm not
> knowledgeable
> > on the complexities of other modes, though.
> >
> > -Suren
> >
> >
> >
> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers  wrote:
> >
> >> not sure I understand why unifying how you submit app for different
> >> platforms and dynamic configuration cannot be part of SparkConf and
> >> SparkContext?
> >>
> >> for classpath a simple script similar to "hadoop classpath" that shows
> >> what needs to be added should be sufficient.
> >>
> >> on spark standalone I can launch a program just fine with just SparkConf
> >> and SparkContext. not on yarn, so the spark-launch script must be doing
> a
> >> few things extra there I am missing... which makes things more difficult
> >> because I am not sure its realistic to expect every application that
> >> needs
> >> to run something on spark to be launched using spark-submit.
> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:
> >>
> >>> It fulfills a few different functions. The main one is giving users a
> >>> way to inject Spark as a runtime dependency separately from their
> >>> program and make sure they get exactly the right version of Spark. So
> >>> a user can bundle an application and then use spark-submit to send it
> >>> to different types of clusters (or using different versions of Spark).
> >>>
> >>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
> >>> etc... this was something that became very fragmented over time before
> >>> this was added.
> >>>
> >>> Another feature is allowing users to set configuration values
> >>> dynamically rather than compile them inside of their program. That's
> >>> the one you mention here. You can choose to use this feature or not.
> >>> If you know your configs are not going to change, then you don't need
> >>> to set them with spark-submit.
> >>>
> >>>
> >>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
> >>> wrote:
> >>> > What is the purpose of spark-submit? Does it do anything outside of
> >>> > the standard val conf = new SparkConf ... val sc = new SparkContext
> >>> > ... ?
> >>>
> >>
> >
> >
> > --
> >
> > SUREN HIRAMAN, VP TECHNOLOGY
> > Velos
> > Accelerating Machine Learning
> >
> > 440 NINTH AVENUE, 11TH FLOOR
> > NEW YORK, NY 10001
> > O: (917) 525-2466 ext. 105
> > F: 646.349.4063
> > E: suren.hiraman@v elos.io
> > W: www.velos.io
> >
>


Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis,

Yes, it's actually an output of the previous RDD. 
Have you ever used the Cassandra Spark Driver on the driver app? I believe
these limitations revolve around that - it's designed to save RDDs from the
nodes.

tnks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs
just fine within one context in spark 1.0.x. but we do not use the ooyala
job server...


On Wed, Jul 9, 2014 at 10:03 AM, premdass  wrote:

> Hi,
>
> I using spark 1.0.0  , using Ooyala Job Server, for a low latency query
> system. Basically a long running context is created, which enables to run
> multiple jobs under the same context, and hence sharing of the data.
>
> It was working fine in 0.9.1. However in spark 1.0 release, the RDD's
> created and cached by a Job-1 gets cleaned up by BlockManager (can see log
> statements saying cleaning up RDD) and so the cached RDD's are not
> available
> for Job-2, though Both Job-1 and Job-2 are running under same spark
> context.
>
> I tried using the spark.cleaner.referenceTracking = false setting, how-ever
> this causes the issue that unpersisted RDD's are not cleaned up properly,
> and occupying the Spark's memory..
>
>
> Had anybody faced issue like this before? If so, any advice would be
> greatly
> appreicated.
>
>
> Also is there any way, to mark an RDD as being used under a context, event
> though the job using that had been finished (so subsequent jobs can use
> that
> RDD).
>
>
> Thanks,
> Prem
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Yes, I'm using it to count concurrent users from a Kafka stream of events
without problems. I'm currently testing it in local mode, but any
serialization problem would have already appeared, so I don't expect any
serialization issue when I deploy it to my cluster.


2014-07-09 15:39 GMT+01:00 RodrigoB :

> Hi Luis,
>
> Yes it's actually an ouput of the previous RDD.
> Have you ever used the Cassandra Spark Driver on the driver app? I believe
> these limitations go around that - it's designed to save RDDs from the
> nodes.
>
> tnks,
> Rod
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-driver-Spark-question-tp9177p9187.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark Streaming and Storm

2014-07-09 Thread Dan H.
Xichen_tju,

I recently evaluated Storm for a period of months (using 2U servers, 2.4GHz CPU, 
24GB RAM, 3 servers) and was not able to achieve a realistic scale for my 
business domain needs.  Storm is really only a framework, which allows you to 
put in code to do whatever it is you need for a distributed system…so it’s 
completely flexible and distributable, but it comes at a price.  In Storm, one 
of the biggest performance hits came down to how the “acks” work within 
the tuple trees.  You can have the framework ack messages between 
spouts and/or bolts by default, but in the end you most likely want to manage acks 
yourself, depending on how much reliability your system will need (to replay 
messages…).  All this means is that if you don’t have massive amounts of data 
that you need to process within a few seconds (which I do), then Storm may work 
well for you, but your performance will diminish as you add in more and more 
business rules (unless of course you add in more servers for processing).  If 
you need to ingest at least 1GBps+, then you may want to reevaluate, since 
your server scale may not mesh with your overall processing needs.

I recently started using Spark Streaming with Kafka and have been quite 
impressed by the performance level that’s being achieved.  I particularly like 
the fact that Spark isn’t just a framework; it provides you with simple 
tools and convenient API methods.  Some of those features are reduceByKey 
(map/reduce), sliding and aggregated sub-time-windows, etc.  Also, in my 
environment, I believe it’s going to be a great fit since we use Hadoop already 
and Spark should fit into that environment well.
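
As a rough illustration of the reduceByKey and sliding-window features mentioned
above, here is a minimal Spark Streaming sketch over a Kafka source (the ZooKeeper
address, consumer group, and topic names are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaWindowSketch"), Seconds(5))

// Hypothetical ZooKeeper quorum, consumer group, and topic -> receiver-thread map.
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "sketch-group", Map("events" -> 2))

// Count events per key over a 60-second window that slides every 10 seconds.
val counts = messages
  .map { case (_, value) => (value.split(" ")(0), 1L) }
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()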

You should look into both Storm and Spark Streaming, but in the end it just 
depends on your needs.  If you’re not looking for streaming aspects, then Spark on 
Hadoop is a great option, since Spark will cache the dataset in memory for all 
queries, which will be much faster than running Hive/Pig on top of Hadoop.  I’m 
assuming you need some sort of streaming system for data flow, but if it 
doesn’t need to be real-time or near real-time, you may want to simply look at 
Hadoop, which you could always use Spark on top of for real-time queries.

Hope this helps…

Dan

 
On Jul 8, 2014, at 7:25 PM, Shao, Saisai  wrote:

> You may get the performance comparison results from Spark Streaming paper and 
> meetup ppt, just google it.
> Actually performance comparison is case by case and relies on your work load 
> design, hardware and software configurations. There is no actual winner for 
> the whole scenarios.
>  
> Thanks
> Jerry
>  
> From: xichen_tju@126 [mailto:xichen_...@126.com] 
> Sent: Wednesday, July 09, 2014 9:17 AM
> To: user@spark.apache.org
> Subject: Spark Streaming and Storm
>  
> hi all
> I am a newbie to Spark Streaming, and used Storm before. Have you tested the 
> performance of both of them, and which one is better?
>  
> xichen_tju@126



Re: RDD Cleanup

2014-07-09 Thread premdass
Hi,

Yes, I am caching the RDDs by calling the cache method.


May I ask how you are sharing RDDs across jobs in the same context? By the RDD
name? I tried printing the RDDs of the Spark context, and when
referenceTracking is enabled, I get an empty list after the cleanup.

Thanks,
Prem




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so
we have a single Map[String, RDD[X]] for cached RDDs for the application
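
A minimal sketch of that pattern (the names are made up; this is just the bare
idea of an application-level registry, not the Ooyala job server's mechanism):

import scala.collection.concurrent.TrieMap
import org.apache.spark.rdd.RDD

object RddRegistry {
  private val cached = TrieMap.empty[String, RDD[_]]

  // Cache the RDD and keep a strong reference to it so later jobs in the same
  // SparkContext can look it up by name instead of recomputing it.
  def put(name: String, rdd: RDD[_]): Unit = cached.put(name, rdd.cache())

  def get[T](name: String): Option[RDD[T]] =
    cached.get(name).map(_.asInstanceOf[RDD[T]])
}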


On Wed, Jul 9, 2014 at 11:00 AM, premdass  wrote:

> Hi,
>
> Yes . I am  caching the RDD's by calling cache method..
>
>
> May i ask, how you are sharing RDD's across jobs in same context? By the
> RDD
> name. I tried printing the RDD's of the Spark context, and when the
> referenceTracking is enabled, i get empty list after the clean up.
>
> Thanks,
> Prem
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Cleanup-tp9182p9191.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
I am trying to get my head around using Spark on YARN from the perspective of
a cluster. I can start a Spark shell in YARN with no issues. Works easily.  This
is done in yarn-client mode and it all works well.

In multiple examples, I see instances where people have set up Spark
clusters in standalone mode, and then in the examples they "connect" to
this cluster in standalone mode. This is often done using the
spark:// string for the connection.  Cool.
But what I don't understand is how do I set up a YARN instance that I can
"connect" to? I.e., I tried running the Spark shell in yarn-cluster mode and it
gave me an error telling me to use yarn-client.  I see information on
using spark-class or spark-submit.  But what I'd really like is an instance
I can connect a spark-shell to, and have the instance stay up. I'd like to
be able to run other things on that instance, etc. Is that possible with YARN?
I know there may be long-running-job challenges with YARN, but I am just
testing; I am just curious if I am looking at something completely bonkers
here, or just missing something simple.

Thanks!


Re: Purpose of spark-submit?

2014-07-09 Thread Andrei
Another +1. For me it's a question of embedding. With
SparkConf/SparkContext I can easily create larger projects with Spark as a
separate service (just like MySQL and JDBC, for example). With spark-submit
I'm bound to Spark as the main framework that defines what my application
should look like. In my humble opinion, using Spark as an embeddable library
rather than the main framework and runtime is much easier.




On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:

> +1 as well for being able to submit jobs programmatically without using
> shell script.
>
> we also experience issues of submitting jobs programmatically without
> using spark-submit. In fact, even in the Hadoop World, I rarely used
> "hadoop jar" to submit jobs in shell.
>
>
>
> On Wed, Jul 9, 2014 at 9:47 AM, Robert James 
> wrote:
>
>> +1 to be able to do anything via SparkConf/SparkContext.  Our app
>> worked fine in Spark 0.9, but, after several days of wrestling with
>> uber jars and spark-submit, and so far failing to get Spark 1.0
>> working, we'd like to go back to doing it ourself with SparkConf.
>>
>> As the previous poster said, a few scripts should be able to give us
>> the classpath and any other params we need, and be a lot more
>> transparent and debuggable.
>>
>> On 7/9/14, Surendranauth Hiraman  wrote:
>> > Are there any gaps beyond convenience and code/config separation in
>> using
>> > spark-submit versus SparkConf/SparkContext if you are willing to set
>> your
>> > own config?
>> >
>> > If there are any gaps, +1 on having parity within SparkConf/SparkContext
>> > where possible. In my use case, we launch our jobs programmatically. In
>> > theory, we could shell out to spark-submit but it's not the best option
>> for
>> > us.
>> >
>> > So far, we are only using Standalone Cluster mode, so I'm not
>> knowledgeable
>> > on the complexities of other modes, though.
>> >
>> > -Suren
>> >
>> >
>> >
>> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers 
>> wrote:
>> >
>> >> not sure I understand why unifying how you submit app for different
>> >> platforms and dynamic configuration cannot be part of SparkConf and
>> >> SparkContext?
>> >>
>> >> for classpath a simple script similar to "hadoop classpath" that shows
>> >> what needs to be added should be sufficient.
>> >>
>> >> on spark standalone I can launch a program just fine with just
>> SparkConf
>> >> and SparkContext. not on yarn, so the spark-launch script must be
>> doing a
>> >> few things extra there I am missing... which makes things more
>> difficult
>> >> because I am not sure its realistic to expect every application that
>> >> needs
>> >> to run something on spark to be launched using spark-submit.
>> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:
>> >>
>> >>> It fulfills a few different functions. The main one is giving users a
>> >>> way to inject Spark as a runtime dependency separately from their
>> >>> program and make sure they get exactly the right version of Spark. So
>> >>> a user can bundle an application and then use spark-submit to send it
>> >>> to different types of clusters (or using different versions of Spark).
>> >>>
>> >>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
>> >>> etc... this was something that became very fragmented over time before
>> >>> this was added.
>> >>>
>> >>> Another feature is allowing users to set configuration values
>> >>> dynamically rather than compile them inside of their program. That's
>> >>> the one you mention here. You can choose to use this feature or not.
>> >>> If you know your configs are not going to change, then you don't need
>> >>> to set them with spark-submit.
>> >>>
>> >>>
>> >>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James > >
>> >>> wrote:
>> >>> > What is the purpose of spark-submit? Does it do anything outside of
>> >>> > the standard val conf = new SparkConf ... val sc = new SparkContext
>> >>> > ... ?
>> >>>
>> >>
>> >
>> >
>> > --
>> >
>> > SUREN HIRAMAN, VP TECHNOLOGY
>> > Velos
>> > Accelerating Machine Learning
>> >
>> > 440 NINTH AVENUE, 11TH FLOOR
>> > NEW YORK, NY 10001
>> > O: (917) 525-2466 ext. 105
>> > F: 646.349.4063
>> > E: suren.hiraman@v elos.io
>> > W: www.velos.io
>> >
>>
>
>


Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Spark still supports the ability to submit jobs programmatically without
shell scripts.

Koert,
The main reason that the unification can't be a part of SparkContext is
that YARN and standalone support deploy modes where the driver runs in a
managed process on the cluster.  In this case, the SparkContext is created
on a remote node well after the application is launched.


On Wed, Jul 9, 2014 at 8:34 AM, Andrei  wrote:

> One another +1. For me it's a question of embedding. With
> SparkConf/SparkContext I can easily create larger projects with Spark as a
> separate service (just like MySQL and JDBC, for example). With spark-submit
> I'm bound to Spark as a main framework that defines how my application
> should look like. In my humble opinion, using Spark as embeddable library
> rather than main framework and runtime is much easier.
>
>
>
>
> On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:
>
>> +1 as well for being able to submit jobs programmatically without using
>> shell script.
>>
>> we also experience issues of submitting jobs programmatically without
>> using spark-submit. In fact, even in the Hadoop World, I rarely used
>> "hadoop jar" to submit jobs in shell.
>>
>>
>>
>> On Wed, Jul 9, 2014 at 9:47 AM, Robert James 
>> wrote:
>>
>>> +1 to be able to do anything via SparkConf/SparkContext.  Our app
>>> worked fine in Spark 0.9, but, after several days of wrestling with
>>> uber jars and spark-submit, and so far failing to get Spark 1.0
>>> working, we'd like to go back to doing it ourself with SparkConf.
>>>
>>> As the previous poster said, a few scripts should be able to give us
>>> the classpath and any other params we need, and be a lot more
>>> transparent and debuggable.
>>>
>>> On 7/9/14, Surendranauth Hiraman  wrote:
>>> > Are there any gaps beyond convenience and code/config separation in
>>> using
>>> > spark-submit versus SparkConf/SparkContext if you are willing to set
>>> your
>>> > own config?
>>> >
>>> > If there are any gaps, +1 on having parity within
>>> SparkConf/SparkContext
>>> > where possible. In my use case, we launch our jobs programmatically. In
>>> > theory, we could shell out to spark-submit but it's not the best
>>> option for
>>> > us.
>>> >
>>> > So far, we are only using Standalone Cluster mode, so I'm not
>>> knowledgeable
>>> > on the complexities of other modes, though.
>>> >
>>> > -Suren
>>> >
>>> >
>>> >
>>> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers 
>>> wrote:
>>> >
>>> >> not sure I understand why unifying how you submit app for different
>>> >> platforms and dynamic configuration cannot be part of SparkConf and
>>> >> SparkContext?
>>> >>
>>> >> for classpath a simple script similar to "hadoop classpath" that shows
>>> >> what needs to be added should be sufficient.
>>> >>
>>> >> on spark standalone I can launch a program just fine with just
>>> SparkConf
>>> >> and SparkContext. not on yarn, so the spark-launch script must be
>>> doing a
>>> >> few things extra there I am missing... which makes things more
>>> difficult
>>> >> because I am not sure its realistic to expect every application that
>>> >> needs
>>> >> to run something on spark to be launched using spark-submit.
>>> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell" 
>>> wrote:
>>> >>
>>> >>> It fulfills a few different functions. The main one is giving users a
>>> >>> way to inject Spark as a runtime dependency separately from their
>>> >>> program and make sure they get exactly the right version of Spark. So
>>> >>> a user can bundle an application and then use spark-submit to send it
>>> >>> to different types of clusters (or using different versions of
>>> Spark).
>>> >>>
>>> >>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
>>> >>> etc... this was something that became very fragmented over time
>>> before
>>> >>> this was added.
>>> >>>
>>> >>> Another feature is allowing users to set configuration values
>>> >>> dynamically rather than compile them inside of their program. That's
>>> >>> the one you mention here. You can choose to use this feature or not.
>>> >>> If you know your configs are not going to change, then you don't need
>>> >>> to set them with spark-submit.
>>> >>>
>>> >>>
>>> >>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James <
>>> srobertja...@gmail.com>
>>> >>> wrote:
>>> >>> > What is the purpose of spark-submit? Does it do anything outside of
>>> >>> > the standard val conf = new SparkConf ... val sc = new SparkContext
>>> >>> > ... ?
>>> >>>
>>> >>
>>> >
>>> >
>>> > --
>>> >
>>> > SUREN HIRAMAN, VP TECHNOLOGY
>>> > Velos
>>> > Accelerating Machine Learning
>>> >
>>> > 440 NINTH AVENUE, 11TH FLOOR
>>> > NEW YORK, NY 10001
>>> > O: (917) 525-2466 ext. 105
>>> > F: 646.349.4063
>>> > E: suren.hiraman@v elos.io
>>> > W: www.velos.io
>>> >
>>>
>>
>>
>


Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like 
MapReduce, Storm and Spark.

I would recommend that you build your Spark jobs in the main method without 
specifying how you deploy them. Then you can use spark-submit to tell Spark how 
you want to deploy them, using yarn-cluster as the master. The key point 
here is that once you have YARN set up, the Spark client connects to it using 
the $HADOOP_CONF_DIR that contains the resource manager address. In particular, 
this needs to be accessible from the classpath of the submitter, since it is 
implicitly used when a YarnConfiguration instance is created. If you 
want more details, read org.apache.spark.deploy.yarn.Client.scala.
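
A minimal sketch of what that looks like in practice (class, path, and jar names
are hypothetical): the code never calls setMaster(), so the same jar can be sent
to YARN with something like
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar hdfs:///some/input
as long as HADOOP_CONF_DIR points at the cluster configuration.

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]) {
    // No setMaster() here: the master and deploy mode come from spark-submit.
    val conf = new SparkConf().setAppName("MyJob")
    val sc = new SparkContext(conf)
    val total = sc.textFile(args(0)).count()
    println("Lines: " + total)
    sc.stop()
  }
}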

You should be able to download a YARN distribution from any of the Hadoop 
providers like Cloudera or Hortonworks. Once you have that, the Spark 
programming guide describes what I mention above in sufficient detail for you 
to proceed.

Thanks,
Ron

Sent from my iPad

> On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
> 
> I am trying to get my head around using Spark on Yarn from a perspective of a 
> cluster. I can start a Spark Shell no issues in Yarn. Works easily.  This is 
> done in yarn-client mode and it all works well. 
> 
> In multiple examples, I see instances where people have setup Spark Clusters 
> in Stand Alone mode, and then in the examples they "connect" to this cluster 
> in Stand Alone mode. This is done often times using the spark:// string for 
> connection.  Cool. s
> But what I don't understand is how do I setup a Yarn instance that I can 
> "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and it 
> gave me an error, telling me to use yarn-client.  I see information on using 
> spark-class or spark-submit.  But what I'd really like is a instance I can 
> connect a spark-shell too, and have the instance stay up. I'd like to be able 
> run other things on that instance etc. Is that possible with Yarn? I know 
> there may be long running job challenges with Yarn, but I am just testing, I 
> am just curious if I am looking at something completely bonkers here, or just 
> missing something simple. 
> 
> Thanks!
> 
> 


Mechanics of passing functions to Spark?

2014-07-09 Thread Seref Arikan
Greetings,
The documentation at
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
says:
"Note that while it is also possible to pass a reference to a method in a
class instance (as opposed to a singleton object), this requires sending
the object that contains that class along with the method"

First, could someone clarify what is meant by sending the object here? How
is the object sent to (presumably) the nodes of the cluster? Is it a one-time
operation per node? Why does this sound like (according to the docs) a less
preferred option compared to a singleton's function? Would the nodes not
require the singleton object anyway?

Some clarification would really help.

Regards
Seref

ps: the expression "the object that contains that class" sounds a bit
unusual; is the intended meaning "the object that is an instance of that
class"?


Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
sandy, that makes sense. however i had trouble doing programmatic execution
on yarn in client mode as well. the application-master in yarn came up but
then bombed because it was looking for jars that dont exist (it was looking
in the original file paths on the driver side, which are not available on
the yarn node). my guess is that spark-submit is changing some settings
(perhaps preparing the distributed cache and modifying settings
accordingly), which makes it harder to run things programmatically. i could
be wrong however. i gave up debugging and resorted to using spark-submit
for now.



On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza  wrote:

> Spark still supports the ability to submit jobs programmatically without
> shell scripts.
>
> Koert,
> The main reason that the unification can't be a part of SparkContext is
> that YARN and standalone support deploy modes where the driver runs in a
> managed process on the cluster.  In this case, the SparkContext is created
> on a remote node well after the application is launched.
>
>
> On Wed, Jul 9, 2014 at 8:34 AM, Andrei  wrote:
>
>> One another +1. For me it's a question of embedding. With
>> SparkConf/SparkContext I can easily create larger projects with Spark as a
>> separate service (just like MySQL and JDBC, for example). With spark-submit
>> I'm bound to Spark as a main framework that defines how my application
>> should look like. In my humble opinion, using Spark as embeddable library
>> rather than main framework and runtime is much easier.
>>
>>
>>
>>
>> On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:
>>
>>> +1 as well for being able to submit jobs programmatically without using
>>> shell script.
>>>
>>> we also experience issues of submitting jobs programmatically without
>>> using spark-submit. In fact, even in the Hadoop World, I rarely used
>>> "hadoop jar" to submit jobs in shell.
>>>
>>>
>>>
>>> On Wed, Jul 9, 2014 at 9:47 AM, Robert James 
>>> wrote:
>>>
 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman  wrote:
 > Are there any gaps beyond convenience and code/config separation in
 using
 > spark-submit versus SparkConf/SparkContext if you are willing to set
 your
 > own config?
 >
 > If there are any gaps, +1 on having parity within
 SparkConf/SparkContext
 > where possible. In my use case, we launch our jobs programmatically.
 In
 > theory, we could shell out to spark-submit but it's not the best
 option for
 > us.
 >
 > So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
 > on the complexities of other modes, though.
 >
 > -Suren
 >
 >
 >
 > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers 
 wrote:
 >
 >> not sure I understand why unifying how you submit app for different
 >> platforms and dynamic configuration cannot be part of SparkConf and
 >> SparkContext?
 >>
 >> for classpath a simple script similar to "hadoop classpath" that
 shows
 >> what needs to be added should be sufficient.
 >>
 >> on spark standalone I can launch a program just fine with just
 SparkConf
 >> and SparkContext. not on yarn, so the spark-launch script must be
 doing a
 >> few things extra there I am missing... which makes things more
 difficult
 >> because I am not sure its realistic to expect every application that
 >> needs
 >> to run something on spark to be launched using spark-submit.
 >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell" 
 wrote:
 >>
 >>> It fulfills a few different functions. The main one is giving users
 a
 >>> way to inject Spark as a runtime dependency separately from their
 >>> program and make sure they get exactly the right version of Spark.
 So
 >>> a user can bundle an application and then use spark-submit to send
 it
 >>> to different types of clusters (or using different versions of
 Spark).
 >>>
 >>> It also unifies the way you bundle and submit an app for Yarn,
 Mesos,
 >>> etc... this was something that became very fragmented over time
 before
 >>> this was added.
 >>>
 >>> Another feature is allowing users to set configuration values
 >>> dynamically rather than compile them inside of their program. That's
 >>> the one you mention here. You can choose to use this feature or not.
 >>> If you know your configs are not going to change, then you don't
 need
 >>> to set them with spark-submit.
 >>>
>>>

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
To add to Ron's answer, this post explains what it means to run Spark
against a YARN cluster, the difference between yarn-client and yarn-cluster
mode, and the reason spark-shell only works in yarn-client mode.
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

-Sandy


On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez  wrote:

> The idea behind YARN is that you can run different application types like
> MapReduce, Storm and Spark.
>
> I would recommend that you build your spark jobs in the main method
> without specifying how you deploy it. Then you can use spark-submit to tell
> Spark how you would want to deploy to it using yarn-cluster as the master.
> The key point here is that once you have YARN setup, the spark client
> connects to it using the $HADOOP_CONF_DIR that contains the resource
> manager address. In particular, this needs to be accessible from the
> classpath of the submitter since it implicitly uses this when it
> instantiates a YarnConfiguration instance. If you want more details, read
> org.apache.spark.deploy.yarn.Client.scala.
>
> You should be able to download a standalone YARN cluster from any of the
> Hadoop providers like Cloudera or Hortonworks. Once you have that, the
> spark programming guide describes what I mention above in sufficient detail
> for you to proceed.
>
> Thanks,
> Ron
>
> Sent from my iPad
>
> > On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
> >
> > I am trying to get my head around using Spark on Yarn from a perspective
> of a cluster. I can start a Spark Shell no issues in Yarn. Works easily.
>  This is done in yarn-client mode and it all works well.
> >
> > In multiple examples, I see instances where people have setup Spark
> Clusters in Stand Alone mode, and then in the examples they "connect" to
> this cluster in Stand Alone mode. This is done often times using the
> spark:// string for connection.  Cool. s
> > But what I don't understand is how do I setup a Yarn instance that I can
> "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and it
> gave me an error, telling me to use yarn-client.  I see information on
> using spark-class or spark-submit.  But what I'd really like is a instance
> I can connect a spark-shell too, and have the instance stay up. I'd like to
> be able run other things on that instance etc. Is that possible with Yarn?
> I know there may be long running job challenges with Yarn, but I am just
> testing, I am just curious if I am looking at something completely bonkers
> here, or just missing something simple.
> >
> > Thanks!
> >
> >
>


Error with Stream Kafka Kryo

2014-07-09 Thread richiesgr
Hi

My setup is local mode standalone, Spark 1.0.0 release version, Scala
2.10.4.

I made a job that receives serialized objects from a Kafka broker. The objects
are serialized using Kryo. 
The code:

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")

val ssc = new StreamingContext(sparkConf, Seconds(20))
ssc.checkpoint("checkpoint")

val topicMap = topic.split(",").map((_, partitions)).toMap

// Create a stream using my decoder EventKryoEncoder
val events = KafkaUtils.createStream[String, EventDetails, StringDecoder, EventKryoEncoder](
  ssc, kafkaMapParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

val data = events.map(e => (e.getPublisherId, 1L))
val counter = data.reduceByKey(_ + _)
counter.print()

ssc.start()
ssc.awaitTermination()

When I run this I get
java.lang.IllegalStateException: unread block data
at
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
~[na:1.7.0_60]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
~[na:1.7.0_60]
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
~[na:1.7.0_60]
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.scheduler.Task.run(Task.scala:51)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]

I've checked that my decoder is working; I can trace that the deserialization
is OK, so Spark must be getting ready-to-use objects.

My setup works if I use JSON instead of Kryo-serialized objects.
Thanks for the help, because I don't know what to do next.
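
For reference, a minimal sketch of a Kryo setup on the Spark side (class names are
assumptions based on the snippet above). Note that the posted configuration sets
spark.serializer to JavaSerializer while also setting spark.kryo.registrator; if
the EventDetails objects are meant to be handled by Kryo inside Spark as well,
spark.serializer would normally be set to KryoSerializer:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Assumes the EventDetails class from the snippet above is on the classpath.
class EventDetailsRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[EventDetails])
  }
}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("SparkTest")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[EventDetailsRegistrator].getName)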






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-with-Stream-Kafka-Kryo-tp9200.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
Sandy, I experienced the similar behavior as Koert just mentioned. I don't
understand why there is a difference between using spark-submit and
programmatic execution. Maybe there is something else we need to add to the
spark conf/spark context in order to launch spark jobs programmatically
that are not needed before?



On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers  wrote:

> sandy, that makes sense. however i had trouble doing programmatic
> execution on yarn in client mode as well. the application-master in yarn
> came up but then bombed because it was looking for jars that dont exist (it
> was looking in the original file paths on the driver side, which are not
> available on the yarn node). my guess is that spark-submit is changing some
> settings (perhaps preparing the distributed cache and modifying settings
> accordingly), which makes it harder to run things programmatically. i could
> be wrong however. i gave up debugging and resorted to using spark-submit
> for now.
>
>
>
> On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza 
> wrote:
>
>> Spark still supports the ability to submit jobs programmatically without
>> shell scripts.
>>
>> Koert,
>> The main reason that the unification can't be a part of SparkContext is
>> that YARN and standalone support deploy modes where the driver runs in a
>> managed process on the cluster.  In this case, the SparkContext is created
>> on a remote node well after the application is launched.
>>
>>
>> On Wed, Jul 9, 2014 at 8:34 AM, Andrei  wrote:
>>
>>> One another +1. For me it's a question of embedding. With
>>> SparkConf/SparkContext I can easily create larger projects with Spark as a
>>> separate service (just like MySQL and JDBC, for example). With spark-submit
>>> I'm bound to Spark as a main framework that defines how my application
>>> should look like. In my humble opinion, using Spark as embeddable library
>>> rather than main framework and runtime is much easier.
>>>
>>>
>>>
>>>
>>> On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:
>>>
 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 "hadoop jar" to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James 
 wrote:

> +1 to be able to do anything via SparkConf/SparkContext.  Our app
> worked fine in Spark 0.9, but, after several days of wrestling with
> uber jars and spark-submit, and so far failing to get Spark 1.0
> working, we'd like to go back to doing it ourself with SparkConf.
>
> As the previous poster said, a few scripts should be able to give us
> the classpath and any other params we need, and be a lot more
> transparent and debuggable.
>
> On 7/9/14, Surendranauth Hiraman  wrote:
> > Are there any gaps beyond convenience and code/config separation in
> using
> > spark-submit versus SparkConf/SparkContext if you are willing to set
> your
> > own config?
> >
> > If there are any gaps, +1 on having parity within
> SparkConf/SparkContext
> > where possible. In my use case, we launch our jobs programmatically.
> In
> > theory, we could shell out to spark-submit but it's not the best
> option for
> > us.
> >
> > So far, we are only using Standalone Cluster mode, so I'm not
> knowledgeable
> > on the complexities of other modes, though.
> >
> > -Suren
> >
> >
> >
> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers 
> wrote:
> >
> >> not sure I understand why unifying how you submit app for different
> >> platforms and dynamic configuration cannot be part of SparkConf and
> >> SparkContext?
> >>
> >> for classpath a simple script similar to "hadoop classpath" that
> shows
> >> what needs to be added should be sufficient.
> >>
> >> on spark standalone I can launch a program just fine with just
> SparkConf
> >> and SparkContext. not on yarn, so the spark-launch script must be
> doing a
> >> few things extra there I am missing... which makes things more
> difficult
> >> because I am not sure its realistic to expect every application that
> >> needs
> >> to run something on spark to be launched using spark-submit.
> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell" 
> wrote:
> >>
> >>> It fulfills a few different functions. The main one is giving
> users a
> >>> way to inject Spark as a runtime dependency separately from their
> >>> program and make sure they get exactly the right version of Spark.
> So
> >>> a user can bundle an application and then use spark-submit to send
> it
> >>> to different types of clusters (or using different versions of
> Spark).
> >>>
> >>> It also unifies the way you bundle and submit an ap

Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Are you able to share the error you're getting?


On Wed, Jul 9, 2014 at 9:25 AM, Jerry Lam  wrote:

> Sandy, I experienced the similar behavior as Koert just mentioned. I don't
> understand why there is a difference between using spark-submit and
> programmatic execution. Maybe there is something else we need to add to the
> spark conf/spark context in order to launch spark jobs programmatically
> that are not needed before?
>
>
>
> On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers  wrote:
>
>> sandy, that makes sense. however i had trouble doing programmatic
>> execution on yarn in client mode as well. the application-master in yarn
>> came up but then bombed because it was looking for jars that dont exist (it
>> was looking in the original file paths on the driver side, which are not
>> available on the yarn node). my guess is that spark-submit is changing some
>> settings (perhaps preparing the distributed cache and modifying settings
>> accordingly), which makes it harder to run things programmatically. i could
>> be wrong however. i gave up debugging and resorted to using spark-submit
>> for now.
>>
>>
>>
>> On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza 
>> wrote:
>>
>>> Spark still supports the ability to submit jobs programmatically without
>>> shell scripts.
>>>
>>> Koert,
>>> The main reason that the unification can't be a part of SparkContext is
>>> that YARN and standalone support deploy modes where the driver runs in a
>>> managed process on the cluster.  In this case, the SparkContext is created
>>> on a remote node well after the application is launched.
>>>
>>>
>>> On Wed, Jul 9, 2014 at 8:34 AM, Andrei 
>>> wrote:
>>>
 One another +1. For me it's a question of embedding. With
 SparkConf/SparkContext I can easily create larger projects with Spark as a
 separate service (just like MySQL and JDBC, for example). With spark-submit
 I'm bound to Spark as a main framework that defines how my application
 should look like. In my humble opinion, using Spark as embeddable library
 rather than main framework and runtime is much easier.




 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:

> +1 as well for being able to submit jobs programmatically without
> using shell script.
>
> we also experience issues of submitting jobs programmatically without
> using spark-submit. In fact, even in the Hadoop World, I rarely used
> "hadoop jar" to submit jobs in shell.
>
>
>
> On Wed, Jul 9, 2014 at 9:47 AM, Robert James 
> wrote:
>
>> +1 to be able to do anything via SparkConf/SparkContext.  Our app
>> worked fine in Spark 0.9, but, after several days of wrestling with
>> uber jars and spark-submit, and so far failing to get Spark 1.0
>> working, we'd like to go back to doing it ourself with SparkConf.
>>
>> As the previous poster said, a few scripts should be able to give us
>> the classpath and any other params we need, and be a lot more
>> transparent and debuggable.
>>
>> On 7/9/14, Surendranauth Hiraman  wrote:
>> > Are there any gaps beyond convenience and code/config separation in
>> using
>> > spark-submit versus SparkConf/SparkContext if you are willing to
>> set your
>> > own config?
>> >
>> > If there are any gaps, +1 on having parity within
>> SparkConf/SparkContext
>> > where possible. In my use case, we launch our jobs
>> programmatically. In
>> > theory, we could shell out to spark-submit but it's not the best
>> option for
>> > us.
>> >
>> > So far, we are only using Standalone Cluster mode, so I'm not
>> knowledgeable
>> > on the complexities of other modes, though.
>> >
>> > -Suren
>> >
>> >
>> >
>> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers 
>> wrote:
>> >
>> >> not sure I understand why unifying how you submit app for different
>> >> platforms and dynamic configuration cannot be part of SparkConf and
>> >> SparkContext?
>> >>
>> >> for classpath a simple script similar to "hadoop classpath" that
>> shows
>> >> what needs to be added should be sufficient.
>> >>
>> >> on spark standalone I can launch a program just fine with just
>> SparkConf
>> >> and SparkContext. not on yarn, so the spark-launch script must be
>> doing a
>> >> few things extra there I am missing... which makes things more
>> difficult
>> >> because I am not sure its realistic to expect every application
>> that
>> >> needs
>> >> to run something on spark to be launched using spark-submit.
>> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell" 
>> wrote:
>> >>
>> >>> It fulfills a few different functions. The main one is giving
>> users a
>> >>> way to inject Spark as a runtime dependency separately from their
>> >>> program and make sure they get exactly the right version of
>> Spark. So
>

Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Hello,

I am currently learning Apache Spark and I want to see how it integrates
with an existing Hadoop cluster.

My current Hadoop installation is version 2.2.0 without YARN. I have built
Apache Spark (v1.0.0) following the instructions in the README file, only
setting SPARK_HADOOP_VERSION=1.2.1. I also export HADOOP_CONF_DIR
to point to Hadoop's configuration directory.

My use case is the linear least squares MLlib example of Apache Spark
(link:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression).
The only difference in the code is that the input text file is an HDFS
file.
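
Roughly, the code in question looks like the sketch below (the HDFS path and the
comma/space data format are assumptions, following the linked guide):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val sc = new SparkContext(new SparkConf().setAppName("LinearRegressionOnHdfs"))

// Hypothetical HDFS path; each line is "label,feature1 feature2 ..."
val data = sc.textFile("hdfs://namenode:9000/user/nick/lpsa.data")
val parsed = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

val model = LinearRegressionWithSGD.train(parsed, 100)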

However, I get a "Runtime Exception: Error in configuring object."

So my question is the following:

Does Spark work with a Hadoop distribution without Yarn?
If yes, am I doing it right? If no, can I build Spark with
SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false?

Thank you,
Nick


Re: Spark Streaming using File Stream in Java

2014-07-09 Thread Aravind
Hi Akhil,

It didn't work. Here is the code...


package com.paypal;

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;

import com.google.common.collect.Lists;

import org.apache.spark.streaming.receiver.Receiver;
import scala.Tuple2;

import java.net.ConnectException;
import java.net.Socket;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.io.*;

/**
 * Hello world!
 *
 */
public class App3 {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {

        // Create the context with a 1 second batch size
        SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");

        // *** always give local[4] to execute and see the output
        JavaStreamingContext ssc = new JavaStreamingContext("local[4]",
                "JavaNetworkWordCount", new Duration(5000));

        // throws an error saying it requires JavaPairDStream and not JavaDStream
        JavaDStream<String> lines = ssc.fileStream("/Users/../Desktop/alarms.log");

        JavaDStream<String> words = lines.flatMap(
                new FlatMapFunction<String, String>() {
                    public Iterable<String> call(String s) {
                        return Arrays.asList(s.split(" "));
                    }
                }
        );

        JavaPairDStream<String, Integer> ones = words.map(
                new Function<String, Tuple2<String, Integer>>() {
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                }
        );

        JavaPairDStream<String, Integer> counts = ones.reduceByKey(
                new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer i1, Integer i2) {
                        return i1 + i2;
                    }
                }
        );

        System.out.println("Hello world");
        wordCounts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}

I am not able to figure out how to cast objects of type
JavaPairDStream to JavaDStream. Can you provide working code for the same?
Thanks in advance. 

Regards
Aravindan





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-File-Stream-in-Java-tp9115p9204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert,
Yeah, I had the same problems trying to do programmatic submission of Spark jobs 
to my YARN cluster. I was ultimately able to resolve it by reviewing the 
classpath, debugging through all the different things that the Spark YARN 
client (Client.scala) did for submitting to YARN (like env setup, local 
resources, etc.), and comparing that to what spark-submit did.
I have to admit, though, that it was far from trivial to get it working out of 
the box, and perhaps some work could be done in that regard. In my case, it 
boiled down to the launch environment not having HADOOP_CONF_DIR set, 
which prevented the app master from registering itself with the Resource 
Manager.

Thanks,
Ron

Sent from my iPad

> On Jul 9, 2014, at 9:25 AM, Jerry Lam  wrote:
> 
> Sandy, I experienced the similar behavior as Koert just mentioned. I don't 
> understand why there is a difference between using spark-submit and 
> programmatic execution. Maybe there is something else we need to add to the 
> spark conf/spark context in order to launch spark jobs programmatically that 
> are not needed before?
> 
> 
> 
>> On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers  wrote:
>> sandy, that makes sense. however i had trouble doing programmatic execution 
>> on yarn in client mode as well. the application-master in yarn came up but 
>> then bombed because it was looking for jars that dont exist (it was looking 
>> in the original file paths on the driver side, which are not available on 
>> the yarn node). my guess is that spark-submit is changing some settings 
>> (perhaps preparing the distributed cache and modifying settings 
>> accordingly), which makes it harder to run things programmatically. i could 
>> be wrong however. i gave up debugging and resorted to using spark-submit for 
>> now.
>> 
>> 
>> 
>>> On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza  wrote:
>>> Spark still supports the ability to submit jobs programmatically without 
>>> shell scripts.
>>> 
>>> Koert,
>>> The main reason that the unification can't be a part of SparkContext is 
>>> that YARN and standalone support deploy modes where the driver runs in a 
>>> managed process on the cluster.  In this case, the SparkContext is created 
>>> on a remote node well after the application is launched.
>>> 
>>> 
 On Wed, Jul 9, 2014 at 8:34 AM, Andrei  wrote:
 One another +1. For me it's a question of embedding. With 
 SparkConf/SparkContext I can easily create larger projects with Spark as a 
 separate service (just like MySQL and JDBC, for example). With 
 spark-submit I'm bound to Spark as a main framework that defines how my 
 application should look like. In my humble opinion, using Spark as 
 embeddable library rather than main framework and runtime is much easier. 
 
 
 
 
> On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam  wrote:
> +1 as well for being able to submit jobs programmatically without using 
> shell script.
> 
> we also experience issues of submitting jobs programmatically without 
> using spark-submit. In fact, even in the Hadoop World, I rarely used 
> "hadoop jar" to submit jobs in shell. 
> 
> 
> 
>> On Wed, Jul 9, 2014 at 9:47 AM, Robert James  
>> wrote:
>> +1 to be able to do anything via SparkConf/SparkContext.  Our app
>> worked fine in Spark 0.9, but, after several days of wrestling with
>> uber jars and spark-submit, and so far failing to get Spark 1.0
>> working, we'd like to go back to doing it ourself with SparkConf.
>> 
>> As the previous poster said, a few scripts should be able to give us
>> the classpath and any other params we need, and be a lot more
>> transparent and debuggable.
>> 
>> On 7/9/14, Surendranauth Hiraman  wrote:
>> > Are there any gaps beyond convenience and code/config separation in 
>> > using
>> > spark-submit versus SparkConf/SparkContext if you are willing to set 
>> > your
>> > own config?
>> >
>> > If there are any gaps, +1 on having parity within 
>> > SparkConf/SparkContext
>> > where possible. In my use case, we launch our jobs programmatically. In
>> > theory, we could shell out to spark-submit but it's not the best 
>> > option for
>> > us.
>> >
>> > So far, we are only using Standalone Cluster mode, so I'm not 
>> > knowledgeable
>> > on the complexities of other modes, though.
>> >
>> > -Suren
>> >
>> >
>> >
>> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers  
>> > wrote:
>> >
>> >> not sure I understand why unifying how you submit app for different
>> >> platforms and dynamic configuration cannot be part of SparkConf and
>> >> SparkContext?
>> >>
>> >> for classpath a simple script similar to "hadoop classpath" that shows
>> >> what needs to be added should be sufficient.
>> >>
>> >> on spark standalone I can launch a program just fine with jus

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic 
entry point for Yarn.

Thanks,
Ron

Sent from my iPad

> On Jul 9, 2014, at 7:14 AM, Jerry Lam  wrote:
> 
> +1 as well for being able to submit jobs programmatically without using shell 
> script.
> 
> we also experience issues of submitting jobs programmatically without using 
> spark-submit. In fact, even in the Hadoop World, I rarely used "hadoop jar" 
> to submit jobs in shell. 
> 
> 
> 
>> On Wed, Jul 9, 2014 at 9:47 AM, Robert James  wrote:
>> +1 to be able to do anything via SparkConf/SparkContext.  Our app
>> worked fine in Spark 0.9, but, after several days of wrestling with
>> uber jars and spark-submit, and so far failing to get Spark 1.0
>> working, we'd like to go back to doing it ourself with SparkConf.
>> 
>> As the previous poster said, a few scripts should be able to give us
>> the classpath and any other params we need, and be a lot more
>> transparent and debuggable.
>> 
>> On 7/9/14, Surendranauth Hiraman  wrote:
>> > Are there any gaps beyond convenience and code/config separation in using
>> > spark-submit versus SparkConf/SparkContext if you are willing to set your
>> > own config?
>> >
>> > If there are any gaps, +1 on having parity within SparkConf/SparkContext
>> > where possible. In my use case, we launch our jobs programmatically. In
>> > theory, we could shell out to spark-submit but it's not the best option for
>> > us.
>> >
>> > So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
>> > on the complexities of other modes, though.
>> >
>> > -Suren
>> >
>> >
>> >
>> > On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers  wrote:
>> >
>> >> not sure I understand why unifying how you submit app for different
>> >> platforms and dynamic configuration cannot be part of SparkConf and
>> >> SparkContext?
>> >>
>> >> for classpath a simple script similar to "hadoop classpath" that shows
>> >> what needs to be added should be sufficient.
>> >>
>> >> on spark standalone I can launch a program just fine with just SparkConf
>> >> and SparkContext. not on yarn, so the spark-launch script must be doing a
>> >> few things extra there I am missing... which makes things more difficult
>> >> because I am not sure its realistic to expect every application that
>> >> needs
>> >> to run something on spark to be launched using spark-submit.
>> >>  On Jul 9, 2014 3:45 AM, "Patrick Wendell"  wrote:
>> >>
>> >>> It fulfills a few different functions. The main one is giving users a
>> >>> way to inject Spark as a runtime dependency separately from their
>> >>> program and make sure they get exactly the right version of Spark. So
>> >>> a user can bundle an application and then use spark-submit to send it
>> >>> to different types of clusters (or using different versions of Spark).
>> >>>
>> >>> It also unifies the way you bundle and submit an app for Yarn, Mesos,
>> >>> etc... this was something that became very fragmented over time before
>> >>> this was added.
>> >>>
>> >>> Another feature is allowing users to set configuration values
>> >>> dynamically rather than compile them inside of their program. That's
>> >>> the one you mention here. You can choose to use this feature or not.
>> >>> If you know your configs are not going to change, then you don't need
>> >>> to set them with spark-submit.
>> >>>
>> >>>
>> >>> On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
>> >>> wrote:
>> >>> > What is the purpose of spark-submit? Does it do anything outside of
>> >>> > the standard val conf = new SparkConf ... val sc = new SparkContext
>> >>> > ... ?
>> >>>
>> >>
>> >
>> >
>> > --
>> >
>> > SUREN HIRAMAN, VP TECHNOLOGY
>> > Velos
>> > Accelerating Machine Learning
>> >
>> > 440 NINTH AVENUE, 11TH FLOOR
>> > NEW YORK, NY 10001
>> > O: (917) 525-2466 ext. 105
>> > F: 646.349.4063
>> > E: suren.hiraman@velos.io
>> > W: www.velos.io
>> >
> 


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Mikhail Strebkov
Hi Patrick,

I used 1.0 branch, but it was not an official release, I just git pulled
whatever was there and compiled.

Thanks,
Mikhail



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Terminal freeze during SVM

2014-07-09 Thread AlexanderRiggers
By latest branch, do you mean Apache Spark 1.0.0? And what do you mean by
master? Because I am using v1.0.0. - Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.

2014-07-09 Thread Rahul Bhojwani
I believe there is a bug in the MLlib Naive Bayes implementation in Spark
0.9.1.
Whom should I report this to, or with whom should I discuss it? I can discuss
this over a call as well.

My Skype ID : rahul.bhijwani
Phone no:  +91-9945197359

Thanks,
-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka


Re: How to clear the list of Completed Appliations in Spark web UI?

2014-07-09 Thread Marcelo Vanzin
And if you think that's too many, you can control the number by
setting "spark.deploy.retainedApplications".

On Wed, Jul 9, 2014 at 12:38 AM, Patrick Wendell  wrote:
> There isn't currently a way to do this, but it will start dropping
> older applications once more than 200 are stored.
>
> On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang  wrote:
>> Besides restarting the Master, is there any other way to clear the
>> Completed Applications in Master web UI?



-- 
Marcelo


Re: Spark 0.9.1 implementation of MLlib-NaiveBayes is having bug.

2014-07-09 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Make a JIRA with enough detail to reproduce the error ideally:
https://issues.apache.org/jira/browse/SPARK

and then even more ideally open a PR with a fix:
https://github.com/apache/spark


On Wed, Jul 9, 2014 at 5:57 PM, Rahul Bhojwani 
wrote:

> According to me there is BUG in MLlib Naive Bayes implementation in spark
> 0.9.1.
> Whom should I report this to or with whom should I discuss? I can discuss
> this over call as well.
>
> My Skype ID : rahul.bhijwani
> Phone no:  +91-9945197359
>
> Thanks,
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka
>


Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from latest development branch from git
repository.
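
Roughly, assuming git and the sbt build described in the project README:

git clone https://github.com/apache/spark.git
cd spark
sbt/sbt assembly
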
On Jul 9, 2014 9:45 AM, "AlexanderRiggers" 
wrote:

> By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by
> master? Because I am using v 1.0.0 - Alex
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Hi everyone,

I am new to Spark and I'm having problems making my code compile. I have
the feeling I might be misunderstanding the functions, so I would be very
glad to get some insight into what could be wrong.

The problematic code is the following:

JavaRDD<Body> bodies = lines.map(l -> {Body b = new Body(); b.parse(l);} );

JavaPairRDD<Partition, Iterable<Body>> partitions =
bodies.mapToPair(b ->
b.computePartitions(maxDistance)).groupByKey();

Partition and Body are defined inside the driver class. Body contains the
following definition:

protected Iterable<Tuple2<Partition, Body>> computePartitions(int
maxDistance)

The idea is to reproduce the following schema:

The first map results in: body1, body2, ...
The mapToPair should output several of these: (partition_i, body1),
(partition_i, body2), ...
Which are gathered by key as follows: (partition_i, (body1, body_n)),
(partition_i', (body2, body_n')), ...

Thanks in advance.
Regards,
Silvina


Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point.  Shows how personal use cases color how we interpret products.


On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen  wrote:

> On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons  wrote:
>
>>  Impala is *not* built on map/reduce, though it was built to replace
>> Hive, which is map/reduce based.  It has its own distributed query engine,
>> though it does load data from HDFS, and is part of the hadoop ecosystem.
>>  Impala really shines when your
>>
>
> (It was not built to replace Hive. It's purpose-built to make interactive
> use with a BI tool feasible -- single-digit second queries on huge data
> sets. It's very memory hungry. Hive's architecture choices and legacy code
> have been throughput-oriented, and can't really get below minutes at scale,
> but, remains a right choice when you are in fact doing ETL!)
>


Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
Thank you for the link.  In that link the following is written:

For those familiar with the Spark API, an application corresponds to an
instance of the SparkContext class. An application can be used for a single
batch job, an interactive session with multiple jobs spaced apart, or a
long-lived server continually satisfying requests

So, if I wanted to use "a long-lived server continually satisfying
requests" and then start a shell that connected to that context, how would
I do that in Yarn? That's the problem I am having right now, I just want
there to be that long lived service that I can utilize.

Thanks!


On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza  wrote:

> To add to Ron's answer, this post explains what it means to run Spark
> against a YARN cluster, the difference between yarn-client and yarn-cluster
> mode, and the reason spark-shell only works in yarn-client mode.
>
> http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
>
> -Sandy
>
>
> On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez  wrote:
>
>> The idea behind YARN is that you can run different application types like
>> MapReduce, Storm and Spark.
>>
>> I would recommend that you build your spark jobs in the main method
>> without specifying how you deploy it. Then you can use spark-submit to tell
>> Spark how you would want to deploy to it using yarn-cluster as the master.
>> The key point here is that once you have YARN setup, the spark client
>> connects to it using the $HADOOP_CONF_DIR that contains the resource
>> manager address. In particular, this needs to be accessible from the
>> classpath of the submitter since it implicitly uses this when it
>> instantiates a YarnConfiguration instance. If you want more details, read
>> org.apache.spark.deploy.yarn.Client.scala.
>>
>> You should be able to download a standalone YARN cluster from any of the
>> Hadoop providers like Cloudera or Hortonworks. Once you have that, the
>> spark programming guide describes what I mention above in sufficient detail
>> for you to proceed.
>>
>> Thanks,
>> Ron
>>
>> Sent from my iPad
>>
>> > On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
>> >
>> > I am trying to get my head around using Spark on Yarn from a
>> perspective of a cluster. I can start a Spark Shell no issues in Yarn.
>> Works easily.  This is done in yarn-client mode and it all works well.
>> >
>> > In multiple examples, I see instances where people have setup Spark
>> Clusters in Stand Alone mode, and then in the examples they "connect" to
>> this cluster in Stand Alone mode. This is done often times using the
>> spark:// string for connection.  Cool. s
>> > But what I don't understand is how do I setup a Yarn instance that I
>> can "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and
>> it gave me an error, telling me to use yarn-client.  I see information on
>> using spark-class or spark-submit.  But what I'd really like is a instance
>> I can connect a spark-shell too, and have the instance stay up. I'd like to
>> be able run other things on that instance etc. Is that possible with Yarn?
>> I know there may be long running job challenges with Yarn, but I am just
>> testing, I am just curious if I am looking at something completely bonkers
>> here, or just missing something simple.
>> >
>> > Thanks!
>> >
>> >
>>
>
>


Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So basically, I have Spark on Yarn running (spark shell) how do I connect
to it with another tool I am trying to test using the spark://IP:7077  URL
it's expecting? If that won't work with spark shell, or yarn-client mode,
how do I setup Spark on Yarn to be able to handle that?

Thanks!




On Wed, Jul 9, 2014 at 12:41 PM, John Omernik  wrote:

> Thank you for the link.  In that link the following is written:
>
> For those familiar with the Spark API, an application corresponds to an
> instance of the SparkContext class. An application can be used for a
> single batch job, an interactive session with multiple jobs spaced apart,
> or a long-lived server continually satisfying requests
>
> So, if I wanted to use "a long-lived server continually satisfying
> requests" and then start a shell that connected to that context, how would
> I do that in Yarn? That's the problem I am having right now, I just want
> there to be that long lived service that I can utilize.
>
> Thanks!
>
>
> On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza 
> wrote:
>
>> To add to Ron's answer, this post explains what it means to run Spark
>> against a YARN cluster, the difference between yarn-client and yarn-cluster
>> mode, and the reason spark-shell only works in yarn-client mode.
>>
>> http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
>>
>> -Sandy
>>
>>
>> On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez 
>> wrote:
>>
>>> The idea behind YARN is that you can run different application types
>>> like MapReduce, Storm and Spark.
>>>
>>> I would recommend that you build your spark jobs in the main method
>>> without specifying how you deploy it. Then you can use spark-submit to tell
>>> Spark how you would want to deploy to it using yarn-cluster as the master.
>>> The key point here is that once you have YARN setup, the spark client
>>> connects to it using the $HADOOP_CONF_DIR that contains the resource
>>> manager address. In particular, this needs to be accessible from the
>>> classpath of the submitter since it implicitly uses this when it
>>> instantiates a YarnConfiguration instance. If you want more details, read
>>> org.apache.spark.deploy.yarn.Client.scala.
>>>
>>> You should be able to download a standalone YARN cluster from any of the
>>> Hadoop providers like Cloudera or Hortonworks. Once you have that, the
>>> spark programming guide describes what I mention above in sufficient detail
>>> for you to proceed.
>>>
>>> Thanks,
>>> Ron
>>>
>>> Sent from my iPad
>>>
>>> > On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
>>> >
>>> > I am trying to get my head around using Spark on Yarn from a
>>> perspective of a cluster. I can start a Spark Shell no issues in Yarn.
>>> Works easily.  This is done in yarn-client mode and it all works well.
>>> >
>>> > In multiple examples, I see instances where people have setup Spark
>>> Clusters in Stand Alone mode, and then in the examples they "connect" to
>>> this cluster in Stand Alone mode. This is done often times using the
>>> spark:// string for connection.  Cool. s
>>> > But what I don't understand is how do I setup a Yarn instance that I
>>> can "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and
>>> it gave me an error, telling me to use yarn-client.  I see information on
>>> using spark-class or spark-submit.  But what I'd really like is a instance
>>> I can connect a spark-shell too, and have the instance stay up. I'd like to
>>> be able run other things on that instance etc. Is that possible with Yarn?
>>> I know there may be long running job challenges with Yarn, but I am just
>>> testing, I am just curious if I am looking at something completely bonkers
>>> here, or just missing something simple.
>>> >
>>> > Thanks!
>>> >
>>> >
>>>
>>
>>
>


Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
Spark doesn't currently offer you anything special to do this.  I.e. if you
want to write a Spark application that fires off jobs on behalf of remote
processes, you would need to implement the communication between those
remote processes and your Spark application code yourself.
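
A bare-bones sketch of that pattern is below: the long-lived driver holds one
SparkContext and a cached dataset, and serves queries in a loop. Stdin stands
in for whatever transport (REST, a queue, etc.) you would actually use, and
the input path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object LongLivedDriver {
  def main(args: Array[String]) {
    // One context for the lifetime of the server; the master (e.g. yarn-client)
    // comes from however you launch this.
    val sc = new SparkContext(new SparkConf().setAppName("LongLivedDriver"))
    val data = sc.textFile(args(0)).cache()  // shared, cached dataset

    // Stand-in request loop; replace stdin with your own RPC layer.
    for (query <- Source.fromInputStream(System.in).getLines()) {
      val hits = data.filter(_.contains(query)).count()
      println(s"'$query' matched $hits lines")
    }
    sc.stop()
  }
}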


On Wed, Jul 9, 2014 at 10:41 AM, John Omernik  wrote:

> Thank you for the link.  In that link the following is written:
>
> For those familiar with the Spark API, an application corresponds to an
> instance of the SparkContext class. An application can be used for a
> single batch job, an interactive session with multiple jobs spaced apart,
> or a long-lived server continually satisfying requests
>
> So, if I wanted to use "a long-lived server continually satisfying
> requests" and then start a shell that connected to that context, how would
> I do that in Yarn? That's the problem I am having right now, I just want
> there to be that long lived service that I can utilize.
>
> Thanks!
>
>
> On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza 
> wrote:
>
>> To add to Ron's answer, this post explains what it means to run Spark
>> against a YARN cluster, the difference between yarn-client and yarn-cluster
>> mode, and the reason spark-shell only works in yarn-client mode.
>>
>> http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
>>
>> -Sandy
>>
>>
>> On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez 
>> wrote:
>>
>>> The idea behind YARN is that you can run different application types
>>> like MapReduce, Storm and Spark.
>>>
>>> I would recommend that you build your spark jobs in the main method
>>> without specifying how you deploy it. Then you can use spark-submit to tell
>>> Spark how you would want to deploy to it using yarn-cluster as the master.
>>> The key point here is that once you have YARN setup, the spark client
>>> connects to it using the $HADOOP_CONF_DIR that contains the resource
>>> manager address. In particular, this needs to be accessible from the
>>> classpath of the submitter since it implicitly uses this when it
>>> instantiates a YarnConfiguration instance. If you want more details, read
>>> org.apache.spark.deploy.yarn.Client.scala.
>>>
>>> You should be able to download a standalone YARN cluster from any of the
>>> Hadoop providers like Cloudera or Hortonworks. Once you have that, the
>>> spark programming guide describes what I mention above in sufficient detail
>>> for you to proceed.
>>>
>>> Thanks,
>>> Ron
>>>
>>> Sent from my iPad
>>>
>>> > On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
>>> >
>>> > I am trying to get my head around using Spark on Yarn from a
>>> perspective of a cluster. I can start a Spark Shell no issues in Yarn.
>>> Works easily.  This is done in yarn-client mode and it all works well.
>>> >
>>> > In multiple examples, I see instances where people have setup Spark
>>> Clusters in Stand Alone mode, and then in the examples they "connect" to
>>> this cluster in Stand Alone mode. This is done often times using the
>>> spark:// string for connection.  Cool. s
>>> > But what I don't understand is how do I setup a Yarn instance that I
>>> can "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and
>>> it gave me an error, telling me to use yarn-client.  I see information on
>>> using spark-class or spark-submit.  But what I'd really like is a instance
>>> I can connect a spark-shell too, and have the instance stay up. I'd like to
>>> be able run other things on that instance etc. Is that possible with Yarn?
>>> I know there may be long running job challenges with Yarn, but I am just
>>> testing, I am just curious if I am looking at something completely bonkers
>>> here, or just missing something simple.
>>> >
>>> > Thanks!
>>> >
>>> >
>>>
>>
>>
>


Re: Execution stalls in LogisticRegressionWithSGD

2014-07-09 Thread Xiangrui Meng
We have maven-enforcer-plugin defined in the pom. I don't know why it
didn't work for you. Could you try rebuild with maven2 and confirm
that there is no error message? If that is the case, please create a
JIRA for it. Thanks! -Xiangrui

On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar  wrote:
> Xiangrui,
>
> Thanks for all the help in resolving this issue. The cause turned out to
> be the build environment rather than runtime configuration. The build process
> had picked up maven2 while building spark. Using binaries that were rebuilt
> using m3, the entire processing went through fine. While I'm aware that the
> build instruction page specifies m3 as the min requirement, declaratively
> preventing accidental m2 usage (e.g. through something like the maven
> enforcer plugin?) might help other developers avoid such issues.
>
> -Bharath
>
>
>
> On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng  wrote:
>>
>> It seems to me a setup issue. I just tested news20.binary (1355191
>> features) on a 2-node EC2 cluster and it worked well. I added one line
>> to conf/spark-env.sh:
>>
>> export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "
>>
>> and launched spark-shell with "--driver-memory 20g". Could you re-try
>> with an EC2 setup? If it still doesn't work, please attach all your
>> code and logs.
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar 
>> wrote:
>> > Hi Xiangrui,
>> >
>> > 1) Yes, I used the same build (compiled locally from source) to the host
>> > that has (master, slave1) and the second host with slave2.
>> >
>> > 2) The execution was successful when run in local mode with reduced
>> > number
>> > of partitions. Does this imply issues communicating/coordinating across
>> > processes (i.e. driver, master and workers)?
>> >
>> > Thanks,
>> > Bharath
>> >
>> >
>> >
>> > On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng  wrote:
>> >>
>> >> Hi Bharath,
>> >>
>> >> 1) Did you sync the spark jar and conf to the worker nodes after build?
>> >> 2) Since the dataset is not large, could you try local mode first
>> >> using `spark-summit --driver-memory 12g --master local[*]`?
>> >> 3) Try to use less number of partitions, say 5.
>> >>
>> >> If the problem is still there, please attach the full master/worker log
>> >> files.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar
>> >> 
>> >> wrote:
>> >> > Xiangrui,
>> >> >
>> >> > Leaving the frameSize unspecified led to an error message (and
>> >> > failure)
>> >> > stating that the task size (~11M) was larger. I hence set it to an
>> >> > arbitrarily large value ( I realize 500 was unrealistic & unnecessary
>> >> > in
>> >> > this case). I've now set the size to 20M and repeated the runs. The
>> >> > earlier
>> >> > runs were on an uncached RDD. Caching the RDD (and setting
>> >> > spark.storage.memoryFraction=0.5) resulted in marginal speed up of
>> >> > execution, but the end result remained the same. The cached RDD size
>> >> > is
>> >> > as
>> >> > follows:
>> >> >
> >> >> > RDD Name: 1084
> >> >> > Storage Level: Memory Deserialized 1x Replicated
> >> >> > Cached Partitions: 80
> >> >> > Fraction Cached: 100%
> >> >> > Size in Memory: 165.9 MB
> >> >> > Size in Tachyon: 0.0 B
> >> >> > Size on Disk: 0.0 B
>> >> >
>> >> >
>> >> >
>> >> > The corresponding master logs were:
>> >> >
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor
>> >> > app-20140704062238-0033/1
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor
>> >> > app-20140704062238-0033/2
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor
>> >> > app-20140704062238-0033/0
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor
>> >> > app-20140704062238-0033/3
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor
>> >> > app-20140704062238-0033/2
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor
>> >> > app-20140704062238-0033/4
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor
>> >> > app-20140704062238-0033/3
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor
>> >> > app-20140704062238-0033/5
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
>> >> > 14/07/04 06:29:39 INFO LocalActorRef: Message
>> >> > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>> >> > from
>> >> > Actor[akka://sparkMaster/deadLetters] to
>> >> >
>> >> >
>> >> > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-1

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So how do I do the "long-lived server continually satisfying requests" in
the Cloudera application? I am very confused by that at this point.


On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza  wrote:

> Spark doesn't currently offer you anything special to do this.  I.e. if
> you want to write a Spark application that fires off jobs on behalf of
> remote processes, you would need to implement the communication between
> those remote processes and your Spark application code yourself.
>
>
> On Wed, Jul 9, 2014 at 10:41 AM, John Omernik  wrote:
>
>> Thank you for the link.  In that link the following is written:
>>
>> For those familiar with the Spark API, an application corresponds to an
>> instance of the SparkContext class. An application can be used for a
>> single batch job, an interactive session with multiple jobs spaced apart,
>> or a long-lived server continually satisfying requests
>>
>> So, if I wanted to use "a long-lived server continually satisfying
>> requests" and then start a shell that connected to that context, how would
>> I do that in Yarn? That's the problem I am having right now, I just want
>> there to be that long lived service that I can utilize.
>>
>> Thanks!
>>
>>
>> On Wed, Jul 9, 2014 at 11:14 AM, Sandy Ryza 
>> wrote:
>>
>>> To add to Ron's answer, this post explains what it means to run Spark
>>> against a YARN cluster, the difference between yarn-client and yarn-cluster
>>> mode, and the reason spark-shell only works in yarn-client mode.
>>>
>>> http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
>>>
>>> -Sandy
>>>
>>>
>>> On Wed, Jul 9, 2014 at 9:09 AM, Ron Gonzalez 
>>> wrote:
>>>
 The idea behind YARN is that you can run different application types
 like MapReduce, Storm and Spark.

 I would recommend that you build your spark jobs in the main method
 without specifying how you deploy it. Then you can use spark-submit to tell
 Spark how you would want to deploy to it using yarn-cluster as the master.
 The key point here is that once you have YARN setup, the spark client
 connects to it using the $HADOOP_CONF_DIR that contains the resource
 manager address. In particular, this needs to be accessible from the
 classpath of the submitter since it implicitly uses this when it
 instantiates a YarnConfiguration instance. If you want more details, read
 org.apache.spark.deploy.yarn.Client.scala.

 You should be able to download a standalone YARN cluster from any of
 the Hadoop providers like Cloudera or Hortonworks. Once you have that, the
 spark programming guide describes what I mention above in sufficient detail
 for you to proceed.

 Thanks,
 Ron

 Sent from my iPad

 > On Jul 9, 2014, at 8:31 AM, John Omernik  wrote:
 >
 > I am trying to get my head around using Spark on Yarn from a
 perspective of a cluster. I can start a Spark Shell no issues in Yarn.
 Works easily.  This is done in yarn-client mode and it all works well.
 >
 > In multiple examples, I see instances where people have setup Spark
 Clusters in Stand Alone mode, and then in the examples they "connect" to
 this cluster in Stand Alone mode. This is done often times using the
 spark:// string for connection.  Cool. s
 > But what I don't understand is how do I setup a Yarn instance that I
 can "connect" to? I.e. I tried running Spark Shell in yarn-cluster mode and
 it gave me an error, telling me to use yarn-client.  I see information on
 using spark-class or spark-submit.  But what I'd really like is a instance
 I can connect a spark-shell too, and have the instance stay up. I'd like to
 be able run other things on that instance etc. Is that possible with Yarn?
 I know there may be long running job challenges with Yarn, but I am just
 testing, I am just curious if I am looking at something completely bonkers
 here, or just missing something simple.
 >
 > Thanks!
 >
 >

>>>
>>>
>>
>


Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the
spark shell.  (i.e. the line that you type into the spark shell are
compiled into classes and then shipped to the executors where they run).
 Are you able to run other spark examples with closures in the same shell?

Michael


On Wed, Jul 9, 2014 at 4:28 AM, gil...@gmail.com  wrote:

> Hello, while trying to run the example below I am getting errors. I have
> built Spark using the following command:
>
> $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly
>
> --- Running the example using Spark-shell ---
>
> $ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.4.0.jar
> HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop MASTER=yarn-client
> ./bin/spark-shell
>
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> case class Person(name: String, age: Int)
> val people = sc.textFile("hdfs://myd-vm05698.hpswlabs.adapps.hp.com:9000/user/spark/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
>
> --- error ---
> java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
> at $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) at
> $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
> scala.collection.Iterator$$anon$1.next(Iterator.scala:853) at
> scala.collection.Iterator$$anon$1.head(Iterator.scala:840) at
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:181)
> at
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:176)
> at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) at
> org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
> org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:112) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) at
> org.apache.spark.scheduler.Task.run(Task.scala:51) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> --
> View this message in context: Spark SQL - java.lang.NoClassDefFoundError:
> Could not initialize class $line10.$read$
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
I am using the Spark Streaming and have the following two questions:

1. If more than one output operation is put in the same StreamingContext
(basically, I mean, I put all the output operations in the same class), are
they processed one by one in the order they appear in the class? Or are they
actually processed in parallel?

2. If one DStream batch takes longer than the interval time, does a new batch
wait in the queue until the previous one is fully processed? Is there
any parallelism that may process the two batches at the same time?

Thank you.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like
to add in a jar I believe I need to access Twitter data live, twitter4j.
I’m confused over where and how to add this jar in.

SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS.
SparkContext also has an addJar method and a jars property, the latter of
which does not have an associated doc.

What’s the difference between all these jar-related things, and what do I
need to do to add this Twitter jar in correctly?

Nick
​




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined.
That is because by default each one output operation is processed at a
time. This *can* be parallelized using an undocumented config parameter
"spark.streaming.concurrentJobs" which is by default set to 1.

2. Yes, the output operations (and the spark jobs that are involved with
them) gets queued up.

TD


On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang  wrote:

> I am using the Spark Streaming and have the following two questions:
>
> 1. If more than one output operations are put in the same StreamingContext
> (basically, I mean, I put all the output operations in the same class), are
> they processed one by one as the order they appear in the class? Or they
> are actually processes parallely?
>
> 2. If one DStream takes longer than the interval time, does a new DStream
> wait in the queue until the previous DStream is fully processed? Is there
> any parallelism that may process the two DStream at the same time?
>
> Thank you.
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>


Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
Great. Thank you!

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das 
wrote:

> 1. Multiple output operations are processed in the order they are defined.
> That is because by default each one output operation is processed at a
> time. This *can* be parallelized using an undocumented config parameter
> "spark.streaming.concurrentJobs" which is by default set to 1.
>
> 2. Yes, the output operations (and the spark jobs that are involved with
> them) gets queued up.
>
> TD
>
>
> On Wed, Jul 9, 2014 at 11:22 AM, Yan Fang  wrote:
>
>> I am using the Spark Streaming and have the following two questions:
>>
>> 1. If more than one output operations are put in the same
>> StreamingContext (basically, I mean, I put all the output operations in the
>> same class), are they processed one by one as the order they appear in the
>> class? Or they are actually processes parallely?
>>
>> 2. If one DStream takes longer than the interval time, does a new DStream
>> wait in the queue until the previous DStream is fully processed? Is there
>> any parallelism that may process the two DStream at the same time?
>>
>> Thank you.
>>
>> Cheers,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>> +1 (206) 849-4108
>>
>
>


Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks:


I am working on an application which uses spark streaming (version 1.1.0 
snapshot on a standalone cluster) to process text file and save counters in 
cassandra based on fields in each row.  I am testing the application in two 
modes:  

* Process each row and save the counter in cassandra.  In this scenario 
after the text file has been consumed, there are no tasks/stages seen in the 
spark UI.

* If instead I use reduce by key before saving to cassandra, the spark 
UI shows continuous generation of tasks/stages even after processing the file 
has been completed.

I believe this is because the reduce by key requires merging of data from 
different partitions.  But I was wondering if anyone has any insights/pointers 
for understanding this difference in behavior and how to avoid generating 
tasks/stages when there is no data (new file) available.

Thanks

Mans

RE: How should I add a jar?

2014-07-09 Thread Sameer Tilak
Hi Nicholas,
I am using Spark 1.0 and I use this method to specify the additional jars. 
First jar is the dependency and the second one is my application. Hope this 
will work for you. 
 ./spark-shell --jars 
/apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
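
If you would rather do it from code once the shell (or your own SparkContext) is 
up, SparkContext.addJar works too, e.g.:

sc.addJar("/apps/software/scala-approsstrmatch/approxstrmatch.jar")

Note that addJar only ships the jar to the executors; it does not put the classes 
on the driver's own classpath, which is why the --jars flag is usually the easier 
route for the shell.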


Date: Wed, 9 Jul 2014 11:44:27 -0700
From: nicholas.cham...@gmail.com
To: u...@spark.incubator.apache.org
Subject: How should I add a jar?

I’m just starting to use the Scala version of Spark’s shell, and I’d like to 
add in a jar I believe I need to access Twitter data live, twitter4j. I’m 
confused over where and how to add this jar in.


SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and ADD_JARS. 
SparkContext also has an addJar method and a jars property, the latter of which 
does not have an associated doc.


What’s the difference between all these jar-related things, and what do I need 
to do to add this Twitter jar in correctly?
Nick
​







View this message in context: How should I add a jar?

Sent from the Apache Spark User List mailing list archive at Nabble.com.
  

Some question about SQL and streaming

2014-07-09 Thread hsy...@gmail.com
Hi guys,

I'm a new user to spark. I would like to know if there is an example of how to
use spark SQL and spark streaming together? My use case is that I want to do
some SQL on the input stream from kafka.
Thanks!

Best,
Siyuan


Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Awww ye. That worked! Thank you Sameer.

Is this documented somewhere? I feel there's a slight doc deficiency
here.

Nick


On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak  wrote:

> Hi Nicholas,
>
> I am using Spark 1.0 and I use this method to specify the additional jars.
> First jar is the dependency and the second one is my application. Hope this
> will work for you.
>
>  ./spark-shell --jars
> /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
>
>
>
> --
> Date: Wed, 9 Jul 2014 11:44:27 -0700
> From: nicholas.cham...@gmail.com
> To: u...@spark.incubator.apache.org
> Subject: How should I add a jar?
>
>
> I’m just starting to use the Scala version of Spark’s shell, and I’d like
> to add in a jar I believe I need to access Twitter data live, twitter4j
> . I’m confused over where and how to
> add this jar in.
>
> SPARK-1089  mentions
> two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext
> also has an addJar method and a jars property, the latter of which does
> not have an associated doc
> 
> .
>
> What’s the difference between all these jar-related things, and what do I
> need to do to add this Twitter jar in correctly?
>
> Nick
>  ​
>
> --
> View this message in context: How should I add a jar?
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Right, the compile error is a casting issue telling me I cannot assign
a JavaPairRDD of one generic type to a JavaPairRDD of another. It happens in
the mapToPair() method.




On 9 July 2014 19:52, Sean Owen  wrote:

> You forgot the compile error!
>
>
> On Wed, Jul 9, 2014 at 6:14 PM, Silvina Caíno Lores  > wrote:
>
>> Hi everyone,
>>
>>  I am new to Spark and I'm having problems to make my code compile. I
>> have the feeling I might be misunderstanding the functions so I would be
>> very glad to get some insight in what could be wrong.
>>
>> The problematic code is the following:
>>
>> JavaRDD bodies = lines.map(l -> {Body b = new Body(); b.parse(l);}
>> );
>>
>> JavaPairRDD> partitions =
>> bodies.mapToPair(b ->
>> b.computePartitions(maxDistance)).groupByKey();
>>
>>  Partition and Body are defined inside the driver class. Body contains
>> the following definition:
>>
>> protected Iterable> computePartitions (int
>> maxDistance)
>>
>> The idea is to reproduce the following schema:
>>
>> The first map results in: *body1, body2, ... *
>> The mapToPair should output several of these:* (partition_i, body1),
>> (partition_i, body2)...*
>> Which are gathered by key as follows: *(partition_i, (body1,
>> body_n), (partition_i', (body2, body_n') ...*
>>
>> Thanks in advance.
>> Regards,
>> Silvina
>>
>
>


SPARK_CLASSPATH Warning

2014-07-09 Thread Nick R. Katsipoulakis
Hello,

I have installed Apache Spark v1.0.0 in a machine with a proprietary Hadoop
Distribution installed (v2.2.0 without yarn). Because the Hadoop
distribution I am using relies on a list of jars, I make the
following changes to conf/spark-env.sh:

#!/usr/bin/env bash

export HADOOP_CONF_DIR=/path-to-hadoop-conf/hadoop-conf
export SPARK_LOCAL_IP=impl41
export
SPARK_CLASSPATH="/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*"
...

Also, to make sure that I have everything working I execute the Spark shell
as follows:

[biadmin@impl41 spark]$ ./bin/spark-shell --jars
/path-to-proprietary-hadoop-lib/lib/*.jar

14/07/09 13:37:28 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:28 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(biadmin)
14/07/09 13:37:28 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:29 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:29 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44292
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
  /_/

Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to
'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

14/07/09 13:37:36 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*'
as a work-around.
14/07/09 13:37:36 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/path-to-proprietary-hadoop-lib/lib/*:/path-to-proprietary-hadoop-lib/*'
as a work-around.
14/07/09 13:37:36 INFO spark.SecurityManager: Changing view acls to: biadmin
14/07/09 13:37:36 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(biadmin)
14/07/09 13:37:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/09 13:37:37 INFO Remoting: Starting remoting
14/07/09 13:37:37 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@impl41:46081]
14/07/09 13:37:37 INFO spark.SparkEnv: Registering MapOutputTracker
14/07/09 13:37:37 INFO spark.SparkEnv: Registering BlockManagerMaster
14/07/09 13:37:37 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140709133737-798b
14/07/09 13:37:37 INFO storage.MemoryStore: MemoryStore started with
capacity 307.2 MB.
14/07/09 13:37:38 INFO network.ConnectionManager: Bound socket to port
16685 with id = ConnectionManagerId(impl41,16685)
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/07/09 13:37:38 INFO storage.BlockManagerInfo: Registering block manager
impl41:16685 with 307.2 MB RAM
14/07/09 13:37:38 INFO storage.BlockManagerMaster: Registered BlockManager
14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:21938
14/07/09 13:37:38 INFO broadcast.HttpBroadcast: Broadcast server started at
http://impl41:21938
14/07/09 13:37:38 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-91e8e040-f2ca-43dd-b574-805033f476c7
14/07/09 13:37:38 INFO spark.HttpServer: Starting HTTP Server
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:52678
14/07/09 13:37:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/07/09 13:37:38 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/07/09 13:37:38 INFO ui.SparkUI: Started SparkUI at http://impl41:4040
14/07/09 13:37:39 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/09 13:37:39 INFO spark.SparkContext: Added JAR
file:/opt/ibm/biginsights/IHC/lib/adaptive-mr.jar at
http://impl41:52678/jars/adaptive-mr.jar with timestamp 1404938259526
14/07/09 13:37:39 INFO executor.Executor: Using REPL class URI:
http://impl41:44292
14/07/09 13:37:39 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.

scala>

So, my question is the following:

Am I including my libraries correctly? Why do I get the message that the
SPARK_CLASSPATH method is deprecated?

Also, when I execute the following example:

scala> val file = sc.textFile("hdfs://lpsa.dat")
14/07/09 13:41:43 WARN util.SizeEstimator: Failed t

Understanding how to install in HDP

2014-07-09 Thread Abel Coronado Iruegas
Hi everybody

We have a Hortonworks cluster with many nodes, and we want to test a deployment
of Spark. What's the recommended path to follow?

I mean, we can compile the sources on the Name Node, but I don't really
understand how to pass the executable jar and configuration to the rest of
the nodes.

Thanks! !!

Abel


Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias,

I was using Spark 0.9 before and the master I used was yarn-standalone. In
Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not
sure whether it is the reason why more machines do not provide better
scalability. What is the difference between these two modes in terms of
efficiency? Thanks!


On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer  wrote:

> Bill,
>
> do the additional 100 nodes receive any tasks at all? (I don't know which
> cluster you use, but with Mesos you could check client logs in the web
> interface.) You might want to try something like repartition(N) or
> repartition(N*2) (with N the number of your nodes) after you receive your
> data.
>
> Tobias
>
>
> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay 
> wrote:
>
>> Hi Tobias,
>>
>> Thanks for the suggestion. I have tried to add more nodes from 300 to
>> 400. It seems the running time did not get improved.
>>
>>
>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer  wrote:
>>
>>> Bill,
>>>
>>> can't you just add more nodes in order to speed up the processing?
>>>
>>> Tobias
>>>
>>>
>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay 
>>> wrote:
>>>
 Hi all,

 I have a problem of using Spark Streaming to accept input data and
 update a result.

 The input of the data is from Kafka and the output is to report a map
 which is updated by historical data in every minute. My current method is
 to set batch size as 1 minute and use foreachRDD to update this map and
 output the map at the end of the foreachRDD function. However, the current
 issue is the processing cannot be finished within one minute.

 I am thinking of updating the map whenever the new data come instead of
 doing the update when the whoe RDD comes. Is there any idea on how to
 achieve this in a better running time? Thanks!

 Bill

>>>
>>>
>>
>


Re: Understanding how to install in HDP

2014-07-09 Thread Krishna Sankar
Abel,
   I rsync the spark-1.0.1 directory to all the nodes. Then whenever the
configuration changes, rsync the conf directory.
Cheers



On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas <
acoronadoirue...@gmail.com> wrote:

> Hi everybody
>
> We have hortonworks cluster with many nodes, we want to test a deployment
> of Spark. Whats the recomended path to follow?
>
> I mean we can compile the sources in the Name Node. But i don't really
> understand how to pass the executable jar and configuration to the rest of
> the nodes.
>
> Thanks! !!
>
> Abel
>


Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Krishna Sankar
Nick,
   AFAIK, you can compile with yarn=true and still run spark in stand alone
cluster mode.
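
e.g. a build along these lines (same pattern as the sbt build instructions;
adjust the Hadoop version to whatever your cluster runs):

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt clean assembly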
Cheers



On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis 
wrote:

> Hello,
>
> I am currently learning Apache Spark and I want to see how it integrates
> with an existing Hadoop Cluster.
>
> My current Hadoop configuration is version 2.2.0 without Yarn. I have
> build Apache Spark (v1.0.0) following the instructions in the README file.
> Only setting the SPARK_HADOOP_VERSION=1.2.1. Also, I export the
> HADOOP_CONF_DIR to point to the configuration directory of Hadoop
> configuration.
>
> My use-case is the Linear Least Regression MLlib example of Apache Spark
> (link:
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression).
> The only difference in the code is that I give the text file to be an HDFS
> file.
>
> However, I get a "Runtime Exception: Error in configuring object."
>
> So my question is the following:
>
> Does Spark work with a Hadoop distribution without Yarn?
> If yes, am I doing it right? If no, can I build Spark with
> SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false?
>
> Thank you,
> Nick
>


Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Krishna,

Ok, thank you. I just wanted to make sure that this can be done.

Cheers,
Nick


On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar  wrote:

> Nick,
>AFAIK, you can compile with yarn=true and still run spark in stand
> alone cluster mode.
> Cheers
> 
>
>
> On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis 
> wrote:
>
>> Hello,
>>
>> I am currently learning Apache Spark and I want to see how it integrates
>> with an existing Hadoop Cluster.
>>
>> My current Hadoop configuration is version 2.2.0 without Yarn. I have
>> build Apache Spark (v1.0.0) following the instructions in the README file.
>> Only setting the SPARK_HADOOP_VERSION=1.2.1. Also, I export the
>> HADOOP_CONF_DIR to point to the configuration directory of Hadoop
>> configuration.
>>
>> My use-case is the Linear Least Regression MLlib example of Apache Spark
>> (link:
>> http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression).
>> The only difference in the code is that I give the text file to be an HDFS
>> file.
>>
>> However, I get a "Runtime Exception: Error in configuring object."
>>
>> So my question is the following:
>>
>> Does Spark work with a Hadoop distribution without Yarn?
>> If yes, am I doing it right? If no, can I build Spark with
>> SPARK_HADOOP_VERSION=2.2.0 and with SPARK_YARN=false?
>>
>> Thank you,
>> Nick
>>
>
>


Re: Understanding how to install in HDP

2014-07-09 Thread Andrew Or
Hi Abel and Krishna,

You shouldn't have to do any manual rsync'ing. If you're using HDP, then
you can just change the configs through Ambari. As for passing the assembly
jar to all executor nodes, the Spark on YARN code automatically uploads the
jar to a distributed cache (HDFS) and all executors pull the jar from
there. That is, all you need to do is install Spark on one of the nodes and
launch your application from there. (Alternatively you can launch your
application from outside of the cluster, in which case you should use
yarn-cluster mode)
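
So a launch from that one node can be as simple as something along these
lines (class name, jar path and sizing are placeholders):

./bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  /path/to/my-app.jar arg1 arg2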

Andrew


2014-07-09 15:24 GMT-07:00 Krishna Sankar :

> Abel,
>I rsync the spark-1.0.1 directory to all the nodes. Then whenever the
> configuration changes, rsync the conf directory.
> Cheers
> 
>
>
> On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas <
> acoronadoirue...@gmail.com> wrote:
>
>> Hi everybody
>>
>> We have hortonworks cluster with many nodes, we want to test a deployment
>> of Spark. Whats the recomended path to follow?
>>
>> I mean we can compile the sources in the Name Node. But i don't really
>> understand how to pass the executable jar and configuration to the rest of
>> the nodes.
>>
>> Thanks! !!
>>
>> Abel
>>
>
>


Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in
all the nodes irrespective of Hadoop/YARN.
Cheers



On Tue, Jul 8, 2014 at 6:24 PM, Robert James  wrote:

> I have a Spark app which runs well on local master.  I'm now ready to
> put it on a cluster.  What needs to be installed on the master? What
> needs to be installed on the workers?
>
> If the cluster already has Hadoop or YARN or Cloudera, does it still
> need an install of Spark?
>


Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all,

I have a Spark streaming job running on yarn. It consumes data from Kafka
and groups the data by a certain field. The data size is 480k lines per
minute, and the batch size is 1 minute.

For some batches, the program sometimes takes more than 3 minutes to finish
the groupBy operation, which seems slow to me. I allocated 300 workers and
specified 300 as the partition number for groupBy. When I checked the slow
stage "combineByKey at ShuffledDStream.scala:42", there were sometimes only 2
executors allocated for this stage. However, during other batches, the
executors can number several hundred for the same stage, which means the number
of executors for the same operations changes.

Does anyone know how Spark allocates the number of executors for different
stages and how to increase the efficiency of this task? Thanks!

Bill


Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-09 Thread Aris Vlasakakis
Hello everybody,

I am trying to figure out how to submit a Spark application from one
separate physical machine to a Spark stand alone cluster. I have an
application that I wrote in Python that works if I am on the 1-Node Spark
server itself, and from that spark installation I run bin/spark-submit with
1) MASTER=local[*] or 2) MASTER=spark://localhost:7077.

However, I want to be on a separate machine that submits a job to Spark. Am
I doing something wrong here? I think something is wrong because I am
working from two different spark "installations" -- as in, on the big
server I have one spark installation and I am running sbin/start-all.sh to
run the standalone server (and that works), and then on a separate laptop I
have a different installation of spark-1.0.0, but I am using the laptop's
bin/spark-submit script to submit to the remote Spark server (using
MASTER=spark://:7077

This "submit-to-remote cluster" does not work, even for the Scala examples
like SparkPi.

Concrete Example: I want to do submit the example SparkPi to the cluster,
from my laptop.

Server is 10.20.10.152, running master and slave, I can look at the Master
web UI at http://10.20.10.152:8080. Great.

From laptop (10.20.10.154), I try the following, using bin/run-example from
a locally built version of spark 1.0.0 (so that I have the script
spark-submit!):

bin/spark-submit --verbose --class org.apache.spark.examples.SparkPi
--master spark://10.20.10.152:7077
examples/target/scala-2.10/spark-examples-1.0.0-hadoop1.0.4.jar


This fails, with the errors at the bottom of this email.

Am I doing something wrong? How can I submit to  a remote cluster? I get
the same problem with bin/spark-submit.


 bin/spark-submit --verbose --class org.apache.spark.examples.SparkPi
--master spark://10.20.10.152:7077
examples/target/scala-2.10/spark-examples-1.0.0-hadoop1.0.4.jar
Using properties file: null
Using properties file: null
Parsed arguments:
  master  spark://10.20.10.152:7077
  deployMode  null
  executorMemory  null
  executorCores   null
  totalExecutorCores  null
  propertiesFile  null
  driverMemorynull
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   org.apache.spark.examples.SparkPi
  primaryResource
file:/Users/aris.vlasakakis/Documents/spark-1.0.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop1.0.4.jar
  nameorg.apache.spark.examples.SparkPi
  childArgs   []
  jarsnull
  verbose true

Default properties from null:



Using properties file: null
Main class:
org.apache.spark.examples.SparkPi
Arguments:

System properties:
SPARK_SUBMIT -> true
spark.app.name -> org.apache.spark.examples.SparkPi
spark.jars ->
file:/Users/aris.vlasakakis/Documents/spark-1.0.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop1.0.4.jar
spark.master -> spark://10.20.10.152:7077
Classpath elements:
file:/Users/aris.vlasakakis/Documents/spark-1.0.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop1.0.4.jar


14/07/09 16:16:08 INFO SecurityManager: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/07/09 16:16:08 INFO SecurityManager: Changing view acls to:
aris.vlasakakis
14/07/09 16:16:08 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions:
Set(aris.vlasakakis)
14/07/09 16:16:08 INFO Slf4jLogger: Slf4jLogger started
14/07/09 16:16:08 INFO Remoting: Starting remoting
14/07/09 16:16:08 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@10.20.10.154:50478]
14/07/09 16:16:08 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@10.20.10.154:50478]
14/07/09 16:16:08 INFO SparkEnv: Registering MapOutputTracker
14/07/09 16:16:08 INFO SparkEnv: Registering BlockManagerMaster
14/07/09 16:16:08 INFO DiskBlockManager: Created local directory at
/var/folders/ch/yfyhs7px5h90505g4n21n8d5k3svt3/T/spark-local-20140709161608-0531
14/07/09 16:16:08 INFO MemoryStore: MemoryStore started with capacity 5.8
GB.
14/07/09 16:16:08 INFO ConnectionManager: Bound socket to port 50479 with
id = ConnectionManagerId(10.20.10.154,50479)
14/07/09 16:16:08 INFO BlockManagerMaster: Trying to register BlockManager
14/07/09 16:16:08 INFO BlockManagerInfo: Registering block manager
10.20.10.154:50479 with 5.8 GB RAM
14/07/09 16:16:08 INFO BlockManagerMaster: Registered BlockManager
14/07/09 16:16:08 INFO HttpServer: Starting HTTP Server
14/07/09 16:16:09 INFO HttpBroadcast: Broadcast server started at
http://10.20.10.154:50480
14/07/09 16:16:09 INFO HttpFileServer: HTTP File server directory is
/var/folders/ch/yfyhs7px5h90505g4n21n8

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-09 Thread Sameer Tilak



Hi,

This time, instead of manually starting the worker node using
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

I used the start-slaves.sh script on the master node. I also enabled -v (verbose flag)
in ssh. Here is the output that I see. The log file for the worker node was not
created. I will switch back to the manual process for starting the cluster.
bash-4.1$ ./start-slaves.sh
172.16.48.44: OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010
172.16.48.44: debug1: Reading configuration data /users/userid/.ssh/config
172.16.48.44: debug1: Reading configuration data /etc/ssh/ssh_config
172.16.48.44: debug1: Applying options for *
172.16.48.44: debug1: Connecting to 172.16.48.44 [172.16.48.44] port 22.
172.16.48.44: debug1: Connection established.
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/identity-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa type 1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_rsa-cert type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa type -1
172.16.48.44: debug1: identity file /users/p529444/.ssh/id_dsa-cert type -1
172.16.48.44: debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2p1_q17.gM-hpn13v6
172.16.48.44: debug1: match: OpenSSH_5.2p1_q17.gM-hpn13v6 pat OpenSSH*
172.16.48.44: debug1: Enabling compatibility mode for protocol 2.0
172.16.48.44: debug1: Local version string SSH-2.0-OpenSSH_5.3
172.16.48.44: debug1: SSH2_MSG_KEXINIT sent
172.16.48.44: debug1: SSH2_MSG_KEXINIT received
172.16.48.44: debug1: kex: server->client aes128-ctr hmac-md5 none
172.16.48.44: debug1: kex: client->server aes128-ctr hmac-md5 none
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
172.16.48.44: debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
172.16.48.44: debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
172.16.48.44: debug1: Host '172.16.48.44' is known and matches the RSA host key.
172.16.48.44: debug1: Found key in /users/p529444/.ssh/known_hosts:6
172.16.48.44: debug1: ssh_rsa_verify: signature correct
172.16.48.44: debug1: SSH2_MSG_NEWKEYS sent
172.16.48.44: debug1: expecting SSH2_MSG_NEWKEYS
172.16.48.44: debug1: SSH2_MSG_NEWKEYS received
172.16.48.44: debug1: SSH2_MSG_SERVICE_REQUEST sent
172.16.48.44: debug1: SSH2_MSG_SERVICE_ACCEPT received
172.16.48.44:
172.16.48.44: This is a private computer system. Access to and use requires
172.16.48.44: explicit current authorization and is limited to business use.
172.16.48.44: All users express consent to monitoring by system personnel to
172.16.48.44: detect improper use of or access to the system, system personnel
172.16.48.44: may provide evidence of such conduct to law enforcement
172.16.48.44: officials and/or company management.
172.16.48.44:
172.16.48.44: UAM R2 account support: http://ussweb.crdc.kp.org/UAM/
172.16.48.44:
172.16.48.44: For password resets, please call the Helpdesk 888-457-4872
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: gssapi-keyex
172.16.48.44: debug1: No valid Key exchange context
172.16.48.44: debug1: Next authentication method: gssapi-with-mic
172.16.48.44: debug1: Unspecified GSS failure.  Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure.  Minor code may provide more information
172.16.48.44: Cannot find KDC for requested realm
172.16.48.44:
172.16.48.44: debug1: Unspecified GSS failure.  Minor code may provide more information
172.16.48.44:
172.16.48.44:
172.16.48.44: debug1: Authentications that can continue: gssapi-keyex,gssapi-with-mic,publickey,password,keyboard-interactive
172.16.48.44: debug1: Next authentication method: publickey
172.16.48.44: debug1: Trying private key: /users/userid/.ssh/identity
172.16.48.44: debug1: Offering public key: /users/userid/.ssh/id_rsa
172.16.48.44: debug1: Server accepts key: pkalg ssh-rsa blen 277
172.16.48.44: debug1: read PEM private key done: type RSA
172.16.48.44: debug1: Authentication succeeded (publickey).
172.16.48.44: debug1: channel 0: new [client-session]
172.16.48.44: debug1: Requesting no-more-sessions@openssh.com
172.16.48.44: debug1: Entering interactive session.
172.16.48.44: debug1: Sending environment.
172.16.48.44: debug1: Sending env LANG = en_US.UTF-8
172.16.48.44: debug1: Sending command: cd /apps/software/spark-1.0.0-bin-hadoop1/sbin/.. ; /apps/software/spark-1.0.0-bin-hadoop1/sbin/start-slave.sh 1 spark://172.16.48.41:7077

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-09 Thread vs
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not
require Spark to be installed on all nodes manually. When you submit the
Spark assembly jar, it will have all of its dependencies. YARN will instantiate
the Spark App Master and containers based on this jar.
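
For example, a submission along these lines should work -- a sketch only (the jar path,
resource sizes, and the Hadoop config location are placeholders for your environment):

# HADOOP_CONF_DIR (or YARN_CONF_DIR) must point at the cluster's client configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --executor-memory 1g \
  lib/spark-examples-1.0.0-hadoop2.2.0.jar 10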



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p9246.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Pyspark, references to different rdds being overwritten to point to the same rdd, different results when using .cache()

2014-07-09 Thread nimbus
I discovered this in an IPython notebook (ipynb) and haven't yet checked whether it
happens elsewhere.

Here's a simple example:


This produces the output:


Which is not what I wanted.

Alarmingly, if I call .cache() on these RDDs, it changes the result and I
get what I wanted.



which produces:


It's very unexpected for .cache() to actually change the results here. There
is also additional weirdness when doing more interesting things that
.cache() still doesn't fix, but I don't yet have a simple example.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-references-to-different-rdds-being-overwritten-to-point-to-the-same-rdd-different-results-wh-tp9248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


spark1.0 principal component analysis

2014-07-09 Thread fintis
Hi,

Can anyone please shed more light on the PCA implementation in Spark? The
documentation is a bit lacking, as I am not sure I understand the output.
According to the docs, the output is a local matrix with the columns as
principal components, with the columns sorted in descending order of covariance.
This is a bit confusing for me, as I need to compute other statistics, like the
standard deviation of the principal components. How do I match the principal
components to the actual features, since there is some sorting? What about
eigenvectors and eigenvalues?

Any help shedding light on the output, how to use it further, and the Spark PCA
implementation in general is appreciated.
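
For reference, a minimal sketch of the 1.0 RowMatrix API I am looking at (rows is a
placeholder RDD[Vector]; the comments reflect my current understanding, which is exactly
what I would like confirmed):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// rows: RDD[Vector], one row per observation, columns in the original feature order
val mat = new RowMatrix(rows)
// d x k local matrix: column i is the i-th principal component; each row still
// corresponds to the original feature in its original position -- only the
// components themselves are sorted
val pc = mat.computePrincipalComponents(2)
// project the data onto the components ...
val projected = mat.multiply(pc)
// ... and take column statistics of the projection, e.g. standard deviations
val stats = projected.computeColumnSummaryStatistics()
val stddevs = stats.variance.toArray.map(math.sqrt)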

Thank you in earnest



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-principal-component-analysis-tp9249.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Public service announcement:

If you're trying to do some stream processing on Twitter data, you'll need
version 3.0.6 of twitter4j. That should
work with the Spark Streaming 1.0.0 Twitter library.

The latest version of twitter4j, 4.0.2, appears to have breaking changes in
its API for us.
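
For anyone following along, a minimal sketch of the usage in question, assuming the
twitter4j 3.0.6 jars and spark-streaming-twitter_2.10 (1.0.0) are on the classpath
(e.g. via --jars as below) and the twitter4j.oauth.* system properties are set:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(10))   // sc is the shell's SparkContext
// passing None makes twitter4j pick up credentials from its default configuration
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()
ssc.start()
ssc.awaitTermination()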

Nick


On Wed, Jul 9, 2014 at 3:34 PM, Nicholas Chammas  wrote:

> Awww ye. That worked! Thank you Sameer.
>
> Is this documented somewhere? I feel there's a slight doc deficiency
> here.
>
> Nick
>
>
> On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak  wrote:
>
>> Hi Nicholas,
>>
>> I am using Spark 1.0 and I use this method to specify the additional
>> jars. First jar is the dependency and the second one is my application.
>> Hope this will work for you.
>>
>>  ./spark-shell --jars
>> /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software/scala-approsstrmatch/approxstrmatch.jar
>>
>>
>>
>> --
>> Date: Wed, 9 Jul 2014 11:44:27 -0700
>> From: nicholas.cham...@gmail.com
>> To: u...@spark.incubator.apache.org
>> Subject: How should I add a jar?
>>
>>
>> I’m just starting to use the Scala version of Spark’s shell, and I’d like
>> to add in a jar I believe I need to access Twitter data live, twitter4j.
>> I'm confused over where and how to
>> add this jar in.
>>
>> SPARK-1089 mentions
>> two environment variables, SPARK_CLASSPATH and ADD_JARS. SparkContext
>> also has an addJar method and a jars property, the latter of which does
>> not have an associated doc.
>>
>> What’s the difference between all these jar-related things, and what do I
>> need to do to add this Twitter jar in correctly?
>>
>> Nick
>>
>> --
>> View this message in context: How should I add a jar?
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>

