Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Suhas Shekar
I made both versions 1.1.1 and I got the same error. I then tried making both 1.1.0 as that is the version of my Spark Core, but I got the same error. I noticed my Kafka dependency is for Scala 2.9.2, while my spark-streaming-kafka dependency is 2.10.x...I will try changing that next, but don't th

Re: Spark 1.2.0 Yarn not published

2014-12-28 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5vd61V1/Spark-yarn+1.2.0&subj=Re+spark+yarn_2+10+1+2+0+artifacts Cheers On Dec 28, 2014, at 11:13 PM, Aniket Bhatnagar wrote: > Hi all > > I just realized that spark-yarn artifact hasn't been published for 1.2.0 > release. Any particular reaso

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Akhil Das
Just looked at the pom file that you are using, why do you have different versions in it? org.apache.spark spark-streaming-kafka_2.10 *1.1.1* org.apache.spark spark-streaming_2.10 *1.0.2* Can you make both versions the same? Thanks Best Regards On Mon, Dec 29, 2014 at 12:44 PM, Su

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Suhas Shekar
1) Could you please clarify on what you mean by checking the Scala version is correct? In my pom.xml file it is 2.10.4 (which is the same as when I start spark-shell). 2) The spark master URL is definitely correct as I have run other apps with the same script that use Spark (like a word count with

Spark 1.2.0 Yarn not published

2014-12-28 Thread Aniket Bhatnagar
Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0 release. Any particular reason for that? I was using it in my yet another spark-job-server project to submit jobs to a YARN cluster through convenient REST APIs (with some success). The job server was creating SparkCon

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Akhil Das
Make sure you verify the following: - Scala version : I think the correct version would be 2.10.x - SparkMasterURL: Be sure that you copied the one displayed on the webui's top left corner (running on port 8080) Thanks Best Regards On Mon, Dec 29, 2014 at 12:26 PM, suhshekar52 wrote: > Hello E

Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread suhshekar52
Hello Everyone, Thank you for the time and the help :). My goal here is to get this program working: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java The only lines I do not have from the example are lines 6
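For reference, a minimal Scala sketch of the same pipeline that the linked Java example implements; the ZooKeeper quorum, consumer group, and topic name below are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCountSketch")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val topicMap = Map("test" -> 1) // topic -> number of receiver threads
    val lines = KafkaUtils.createStream(
      ssc, "localhost:2181", "my-consumer-group", topicMap,
      StorageLevel.MEMORY_AND_DISK).map(_._2)

    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}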

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
Hi Patrick, I don't mean to be glib, but the fact that it works at all on my cluster (600 nodes) and data is a novel experience. This is the first release that I haven't had to struggle with and then give up entirely. I can, for example, finally use HiveContext from PySpark on CDH, at least to

recent join/iterator fix

2014-12-28 Thread Stephen Haberman
Hey, I saw this commit go by, and find it fairly fascinating: https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305 For two reasons: 1) we have a report that is bogging down exactly in a .join with lots of elements, so I'm glad to see the fix, but, more interesting I think

Re: TF-IDF in Spark 1.1.0

2014-12-28 Thread Yao
Can you show how to do IDF transform on tfWithId? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TF-IDF-in-Spark-1-1-0-tp16389p20877.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao
I found the TF-IDF feature extraction and all the MLlib code that works with pure Vector RDDs very difficult to work with due to the lack of ability to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint? -- View this message in context: http://apache-spark-use
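A common workaround, sketched below under the assumption that each document carries a numeric id: keep the id next to the term-frequency vector, fit and transform IDF on the values only, then zip the results back to the ids. This works because the IDF transform is a plain map over the RDD, so element order and partitioning are preserved.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def tfIdfWithIds(sc: SparkContext): RDD[(Long, Vector)] = {
  // (id, tokens) pairs; the sample documents are placeholders
  val docs: RDD[(Long, Seq[String])] = sc.parallelize(Seq(
    (1L, Seq("spark", "streaming")),
    (2L, Seq("spark", "mllib", "tfidf"))))

  val hashingTF = new HashingTF()
  val tf: RDD[(Long, Vector)] = docs.mapValues(tokens => hashingTF.transform(tokens))
  tf.cache()

  val idfModel = new IDF().fit(tf.values)
  // zip the transformed vectors back to their ids
  tf.keys.zip(idfModel.transform(tf.values))
}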

Re: sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Sean Owen
The method you're referring to is a method of RDD, not DStream. If you want to do something with a sample of each RDD in the DStream, then call streamtoread.foreachRDD { rdd => val sampled = rdd.sample(...) ... } On Sun, Dec 28, 2014 at 10:44 PM, Josh J wrote: > Hi, > > I'm trying to using s
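A minimal sketch of both options, reusing the `streamtoread` name from the thread (the 10% sampling fraction is arbitrary). `transform` returns a DStream of the sampled elements, while `foreachRDD` lets you act on each sample directly.

import org.apache.spark.streaming.dstream.DStream

def sampleStream(streamtoread: DStream[String]): DStream[String] = {
  // Option 1: keep it as a stream of sampled RDDs
  val sampledStream = streamtoread.transform(rdd =>
    rdd.sample(withReplacement = false, fraction = 0.1))

  // Option 2: do something with each sample as it arrives
  streamtoread.foreachRDD { rdd =>
    val sampled = rdd.sample(withReplacement = false, fraction = 0.1)
    println(s"sampled ${sampled.count()} of ${rdd.count()} elements")
  }

  sampledStream
}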

sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
Hi, I'm trying to use sampling with Spark Streaming. I imported the following import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ I then call sample val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap, StorageLevel.MEMORY_AND_DISK).m

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman wrote: > Hi Josh, > > Thanks for the inf

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-28 Thread Harihar Nahak
Yes, I had tried that too. I took the pre-built Spark 1.1 release. If there are changes to the GraphX library in an upcoming release, just let me know, or I can try it on Spark 1.2. --Harihar - --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nab

Re: MLlib + Streaming

2014-12-28 Thread Jeremy Freeman
Hi Fernando, There’s currently no streaming ALS in Spark. I’m exploring a streaming singular value decomposition (JIRA) based on this paper (http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf), which might be one way to think about it. There has also been some cool recent work explicitly on str

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-28 Thread Jeremy Freeman
Along with Xiangrui’s suggestion, we will soon be adding an implementation of Streaming Logistic Regression, which will be similar to the current version of Streaming Linear Regression, and continually update the model as new data arrive (JIRA). Hopefully this will be in v1.3. — Jeremy -

Re: unable to do group by with 1st column

2014-12-28 Thread Sean Owen
One value is at least 12 + 4 + 4 + 12 + 4 = 36 bytes if you factor in object overhead, if my math is right. 60M of them is about 2.1GB for a single key. I could imagine that blowing up an executor that's trying to have one in memory and deserialize another. You won't want to use groupByKey if the n
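A sketch of the usual alternative when the per-key result can be built incrementally, so no single key's values ever need to be held in memory at once; the (count, sum) aggregation below is only a placeholder for whatever the report actually computes.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def perKeyStats(pairs: RDD[(String, Double)]): RDD[(String, (Long, Double))] =
  pairs.aggregateByKey((0L, 0.0))(
    (acc, v) => (acc._1 + 1, acc._2 + v),    // fold one value into the accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2))  // merge two partial accumulators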

Re: unable to do group by with 1st column

2014-12-28 Thread Michael Albert
Greetings! Thanks for the comment. I have tried several variants of this, as indicated. The code works on small sets, but fails on larger sets. However, I don't get memory errors. I see "java.nio.channels.CancelledKeyException" and things about "lost task" and then things like "Resubmitting state 1"

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining of

Re: Spark core maven error

2014-12-28 Thread Sean Owen
http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar That checksum is correct for this file, and is what Maven Central reports. I wonder if your repo is corrupted. Try deleting everything under ~/.m2/repository that's related to Spark and let it download agai

Re: init / shutdown for complex map job?

2014-12-28 Thread Sean Owen
(Still pending, but believe it's in progress and being written by a colleague here.) On Sun, Dec 28, 2014 at 2:41 PM, Ray Melton wrote: > A follow-up to the blog cited below was hinted at, per "But Wait, > There's More ... To keep this post brief, the remainder will be left to > a follow-up post.

Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Hi there, I have a cluster with CDH 5.1 running on top of RedHat 6.5, where the default Python version is 2.6. I am trying to set up a proper iPython notebook environment to develop Spark applications using PySpark. Here

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say "the overhead of spark shuffles start to accumulate"? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenc
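A minimal sketch of that pattern for an iterative, long-running job, assuming a hypothetical per-iteration reduce: drop the reference to the previous RDD (and unpersist it if it was cached) so the driver can garbage-collect it and the cleaner can remove the associated shuffle data.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def iterate(initial: RDD[(String, Long)], iterations: Int): RDD[(String, Long)] = {
  var current = initial
  for (_ <- 1 to iterations) {
    val next = current.reduceByKey(_ + _).cache()
    next.count()                           // materialize before dropping the old reference
    current.unpersist(blocking = false)    // release cached blocks of the previous step
    current = next                         // the previous RDD is now unreferenced and GC-able
  }
  current
}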

Re: Playing along at home: recommendations as to system requirements?

2014-12-28 Thread Amy Brown
Thanks Cody. I reported the core dump as an issue on the Spark JIRA and a developer diagnosed it as an openJDK issue. So I switched over to Oracle Java 8 and... no more core dumps on the examples. I reported the openJDK issue at the icedtea Bugzilla. Looks like I'm off and running with Spark on

Re: Spark core maven error

2014-12-28 Thread Lalit Agarwal
Hi, Please find the complete error - |Downloading: org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom Error | Resolve error obtaining

Re: init / shutdown for complex map job?

2014-12-28 Thread Ray Melton
A follow-up to the blog cited below was hinted at, per "But Wait, There's More ... To keep this post brief, the remainder will be left to a follow-up post." Is this follow-up pending? Is it sort of pending? Did the follow-up happen, but I just couldn't find it on the web? Regards, Ray. On Sun

Strange results of running Spark GenSort.scala

2014-12-28 Thread Sam Liu
Hi Experts, I am confused about the input parameters of GenSort.scala and encountered strange issues. It requires 3 parameters: "[num-parts] [records-per-part] [output-path]". Like Hadoop, I think the size of any one row (or record) of the sort file equals 100 bytes. So if I want to genera
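Under that 100-bytes-per-record assumption, the sizing arithmetic would look like the following (a hypothetical 100 GB target, worth verifying against GenSort's actual record format), e.g. pasted into the Scala REPL:

val targetBytes    = 100L * 1000 * 1000 * 1000  // 100 GB of sort data
val recordSize     = 100L                       // bytes per record, as assumed above
val numParts       = 100
val totalRecords   = targetBytes / recordSize   // 1,000,000,000 records
val recordsPerPart = totalRecords / numParts    // 10,000,000 records per partition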

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
Hi Josh, Thanks for the informative answer. Sounds like one should await your changes in 1.3. As information, I found the following set of options for doing the visual in a notebook. http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/n

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Sean and Nicholas Thank you very much, the *exists* method works here :) On Sun, Dec 28, 2014 at 2:27 PM, Sean Owen wrote: > Try instead i.exists(_ == target) > On Dec 28, 2014 8:46 AM, "Amit Behera" wrote: > >> Hi Nicholas, >> >> I am getting >> error: value contains is not a member of Iterab
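For readers of the archive, a sketch of what the working check presumably looks like for an RDD[Iterable[Int]] and a target Int (variable names follow the thread):

import org.apache.spark.rdd.RDD

def containsTarget(theItems: RDD[Iterable[Int]], target: Int): Boolean = {
  // Either check the collected array (fine when the RDD is tiny, as here)...
  val items: Array[Iterable[Int]] = theItems.collect()
  val viaCollect = items.exists(_.exists(_ == target))

  // ...or filter distributedly and see whether anything matched.
  val viaFilter = theItems.filter(_.exists(_ == target)).count() > 0

  viaCollect || viaFilter
}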

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Sean Owen
Try instead i.exists(_ == target) On Dec 28, 2014 8:46 AM, "Amit Behera" wrote: > Hi Nicholas, > > I am getting > error: value contains is not a member of Iterable[Int] > > On Sun, Dec 28, 2014 at 2:06 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> take(1) will just give you a s

Re: init / shutdown for complex map job?

2014-12-28 Thread Sean Owen
You can't quite do cleanup in mapPartitions in that way. Here is a bit more explanation (farther down): http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ On Dec 28, 2014 8:18 AM, "Akhil Das" wrote: > Something like? > > val a = myRDD.mapPartitions(p => { > > >
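The subtlety the blog post explains: the function passed to mapPartitions returns a lazy iterator, so shutdown code placed after it can run before any element is processed. A hedged sketch of one workaround, closing the resource once the wrapped iterator is exhausted (ExpensiveClient is a hypothetical stand-in for a connection or similar):

import org.apache.spark.rdd.RDD

class ExpensiveClient {                  // placeholder for a connection, client, etc.
  def lookup(x: String): String = x.toUpperCase
  def close(): Unit = ()
}

def enrich(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { iter =>
    val client = new ExpensiveClient()   // per-partition init
    iter.map { x =>
      val result = client.lookup(x)      // per-element work using the resource
      if (!iter.hasNext) client.close()  // cleanup once the last element is reached
      result                             // note: close() never runs if the iterator isn't fully consumed
    }
  }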

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Nicholas, I am getting error: value contains is not a member of Iterable[Int] On Sun, Dec 28, 2014 at 2:06 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > take(1) will just give you a single item from the RDD. RDDs are not ideal > for point lookups like you are doing, but you can

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Sean, I have an RDD like *theItems: org.apache.spark.rdd.RDD[Iterable[Int]]* I did *val items = theItems.collect* // to get it as an array items: Array[Iterable[Int]] *val check = items.contains(item)* Thanks Amit On Sun, Dec 28, 2014 at 1:58 PM, Amit Behera wrote: > Hi Nicholas, > >

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Nicholas Chammas
take(1) will just give you a single item from the RDD. RDDs are not ideal for point lookups like you are doing, but you can find the element you want by doing something like: rdd.filter(i => i.contains(target)).collect() Where target is the Int you are looking for. Nick ​ On Sun Dec 28 2014 at

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Nicholas, The RDD contains only one Iterable[Int]. Pankaj, I used *collect* and I am getting *items: Array[Iterable[Int]].* Then I did: *val check = items.take(1).contains(item)* I am getting *check: Boolean = false*, but the item is present. Thanks Amit On Sun, Dec 28, 2014 a

Re: Compile error from Spark 1.2.0

2014-12-28 Thread Akhil Das
Hi Zigen, Looks like they missed it. Thanks Best Regards On Sat, Dec 27, 2014 at 12:43 PM, Zigen Zigen wrote: > Hello, I am zigen. > > I am using the Spark SQL 1.1.0. > > I want to use the Spark SQL 1.2.0. > > > but my Spark application gets a compile error. > > Spark 1.1.0 had a DataType.Decim

Re: init / shutdown for complex map job?

2014-12-28 Thread Akhil Das
Something like? val a = myRDD.mapPartitions(p => { //Do the init //Perform some operations //Shut it down? }) Thanks Best Regards On Sun, Dec 28, 2014 at 1:53 AM, Kevin Burton wrote: > I have a job where I want to map over all data in a cass

Re: Problem with StreamingContext - getting SPARK-2243

2014-12-28 Thread Akhil Das
In the shell you could do: val ssc = new StreamingContext(*sc*, Seconds(1)) as *sc* is the SparkContext, which is already instantiated. Thanks Best Regards On Sun, Dec 28, 2014 at 6:55 AM, Thomas Frisk wrote: > Yes you are right - thanks for that :) > > On 27 December 2014 at 23:18, Ilya Ganelin
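In other words, in the spark-shell the existing SparkContext is reused rather than constructing a second one (which is what SPARK-2243 guards against). A minimal sketch:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// `sc` is the SparkContext the shell already created for you
val ssc = new StreamingContext(sc, Seconds(1))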