Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-10 Thread Dilip
Hi Akhil, Can you please guide me through this? Because the code I am running already has this in it: [java] SparkContext sc = new SparkContext(); sc.addJar("/usr/local/spark/external/kafka/target/scala-2.10/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar"); Is there something I am mis

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-10 Thread Akhil Das
Sorry, the command is nc -lk 12345 Thanks Best Regards On Fri, Jul 11, 2014 at 6:46 AM, Akhil Das wrote: > You simply use the *nc* command to do this. like: > > nc -p 12345 > > will open the 12345 port and from the terminal you can provide whatever > input you require for your StreamingCode.

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
Hi Praveen, I did not change the number of total executors. I specified 300 as the number of executors when I submitted the jobs. However, for some stages, the number of executors is very small, leading to long calculation times even for a small data set. That means not all executors were used for so

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-10 Thread Akhil Das
You simply use the *nc* command to do this, like: nc -p 12345 will open port 12345, and from the terminal you can provide whatever input you require for your StreamingCode. Thanks Best Regards On Fri, Jul 11, 2014 at 2:41 AM, kytay wrote: > Hi > > I am learning spark streaming, and is try

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-10 Thread Akhil Das
Easiest fix would be adding the kafka jars to the SparkContext while creating it. Thanks Best Regards On Fri, Jul 11, 2014 at 4:39 AM, Dilip wrote: > Hi, > > I am trying to run a program with spark streaming using Kafka on a stand > alone system. These are my details: > > Spark 1.0.0 hadoop2 >
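A minimal sketch of the fix Akhil describes, assuming the jar paths below are placeholders for wherever the Kafka artifacts actually live on your machine:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("KafkaStreamingApp")  // hypothetical app name
      .setJars(Seq(
        "/path/to/spark-streaming-kafka_2.10-1.0.0.jar",  // placeholder paths
        "/path/to/kafka_2.10-0.8.0.jar"))
    val sc = new SparkContext(conf)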

Re: Number of executors change during job running

2014-07-10 Thread Praveen Seluka
If I understand correctly, you cannot change the number of executors at runtime, right? (Correct me if I am wrong.) It is defined when we start the application and fixed. Do you mean the number of tasks? On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das wrote: > Can you try setting the number-of-partitio

Re: Wanna know more about Pyspark Internals

2014-07-10 Thread Davies Liu
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals Hope it's useful for you. Davies On Thu, Jul 10, 2014 at 8:49 PM, Baofeng Zhang wrote: > In addition to wiki on Confluence > or > reading source code, whe

Spark summit 2014 videos ?

2014-07-10 Thread Ajay Srivastava
Hi, I did not find any videos on the Apache Spark channel on YouTube yet. Any idea when these will be made available? Regards, Ajay

Spark Streaming with Kafka NoClassDefFoundError

2014-07-10 Thread Dilip
Hi, I am trying to run a program with Spark Streaming using Kafka on a standalone system. These are my details: Spark 1.0.0 hadoop2, Scala 2.10.3. I am trying a simple program using my custom sbt project, but this is the error I am getting: Exception in thread "main" java.lang.NoClassDefFoun
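With an sbt project like this, the usual cause is that the Kafka integration jar (and its transitive Kafka dependency) never makes it onto the runtime classpath. A hedged build.sbt sketch; versions are assumptions matching the Spark 1.0.0 mentioned above:

    // build.sbt (sketch): bundle the Kafka integration into your assembly,
    // leave core/streaming as "provided" since spark-submit supplies them
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
    )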

Wanna know more about Pyspark Internals

2014-07-10 Thread Baofeng Zhang
In addition to the wiki on Confluence or reading the source code, where/how can I get more information about PySpark internals, for I am so familiar with Python :( -- View this message in context: http://apache-spark-user-list.1

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream? var uniqueValuesRDD: RDD[Int] = ... dstreamOfIntegers.transform(newDataRDD => { val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct uniqueValuesRDD = newUniqueValuesRDD //
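A fuller sketch of the pattern TD describes, assuming sc and dstreamOfIntegers already exist elsewhere; the cache() call is an addition to keep the growing lineage from being recomputed every batch:

    import org.apache.spark.rdd.RDD

    var uniqueValuesRDD: RDD[Int] = sc.parallelize(Seq.empty[Int])

    val uniqueValues = dstreamOfIntegers.transform { newDataRDD =>
      // fold each batch into the running set of distinct integers
      uniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
      uniqueValuesRDD.cache()
      uniqueValuesRDD
    }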

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
Right, this uses NextIterator, which is elsewhere in the repo. It just makes it cleaner to implement a custom iterator. But I guess you got the high-level point, so it's okay. TD On Thu, Jul 10, 2014 at 7:21 PM, kytay wrote: > Hi TD > > Thanks. > > I have problem understanding the codes in githu

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
No specific plans to do so, since there would be some functional loss, like the time-based windowing function, which is important for streaming SQL. Also, keeping compatible with the fast-growing Spark SQL is quite hard. So no clear plans to submit upstream. -Jerry From: Tobias Pfeiffer [mailto:t...@preferred.

Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi, On Fri, Jul 11, 2014 at 11:38 AM, Shao, Saisai wrote: > Actually we have a POC project which shows the power of combining Spark > Streaming and Catalyst, it can manipulate SQL on top of Spark Streaming and > get SchemaDStream. You can take a look at it: > https://github.com/thunderain-proje

Streaming. Cannot get socketTextStream to receive anything.

2014-07-10 Thread kytay
Hi, I am learning Spark Streaming and am trying out the JavaNetworkCount example. #1 - This is the code I wrote: JavaStreamingContext sctx = new JavaStreamingContext("local", appName, new Duration(5000)); JavaReceiverInputDStream lines = sctx.socketTextStream("127.0.0.1", ); JavaDStr
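For reference, a minimal Scala sketch of the same wiring (the port below is an assumption, since it was elided in the message; feed it from another terminal with nc as suggested in the replies). Note that "local" gives the app a single thread, which the receiver occupies, leaving none for processing; "local[2]" avoids that:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // "local[2]": one core for the socket receiver, one for processing
    val ssc = new StreamingContext("local[2]", "NetworkCount", Seconds(5))
    val lines = ssc.socketTextStream("127.0.0.1", 12345)  // port 12345 is assumed
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()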

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
Actually we have a POC project which shows the power of combining Spark Streaming and Catalyst, it can manipulate SQL on top of Spark Streaming and get SchemaDStream. You can take a look at it: https://github.com/thunderain-project/StreamSQL Thanks Jerry From: Tathagata Das [mailto:tathagata.d

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread kytay
Hi TD, Thanks. I have trouble understanding the code on GitHub, object SocketReceiver.bytesToLines(...) private[streaming] obje

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the SchemaRDDs generated by it can be stored. Something I really would like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer wrote: > Hi, > > I think it would be great if

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Andrew, Thanks for replying. I did the following and the result was still the same. 1. Added "spark.home /root/spark-1.0.0" to local conf/spark-defaults.conf, where "/root" was the place in the cluster where I put Spark. 2. Ran "bin/spark-shell --master spark://sjc1-eng-float01.carrieriq.co

Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi, I think it would be great if we could do the string parsing only once and then just apply the transformation for each interval (reducing the processing overhead for short intervals). Also, one issue with the approach above is that transform() has the following signature: def transform(tran

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hmm, yeah, it looks like the table name is not getting applied to the attributes of m. You can work around this by rewriting your query as: hql("select s.id from (SELECT * FROM m) m join s on (s.id=m.id) order by s.id") This explicitly gives the alias m to the attributes of that table. You can also op

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Can you try setting the number of partitions in all the shuffle-based DStream operations explicitly? It may be that the default parallelism (that is, spark.default.parallelism) is not being respected. Regarding the unusual delay, I would look at the task details of that stage in
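A sketch of what that looks like, assuming pairStream is a DStream of key-value pairs (with org.apache.spark.streaming.StreamingContext._ imported) and 300 matches the executor count discussed in this thread:

    // Pass the partition count explicitly to every shuffle-based operation
    val counts   = pairStream.reduceByKey(_ + _, 300)
    val grouped  = pairStream.groupByKey(300)
    val windowed = pairStream.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10), 300)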

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at databricks to investigate. https://issues.apache.org/jira/browse/SPARK-2443 On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam wrote: > Hi Michael, >

Re: incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread Yana Kadiyska
I do not believe the order of points in a distributed RDD is in any way guaranteed. For a simple test, you can always add a last column which is an id (make it a double and throw it in the feature vector). Printing the RDD back will not give you the points in file order. If you don't want to go that

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Michael, I got the log you asked for. Note that I manually edited the table name and the field names to hide some sensitive information. == Logical Plan == Project ['s.id] Join Inner, Some((id#106 = 'm.id)) Project [id#96 AS id#62] MetastoreRelation test, m, None MetastoreRelation test

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael, Yes, the table is partitioned on 1 column. There are 11 columns in the table and they are all String type. I understand that SerDes contribute some overhead, but using pure Hive we could run the query about 5 times faster than in SparkSQL. Given that Hive also has the same SerDes ove

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
Hi Tathagata, I set default parallelism as 300 in my configuration file. Sometimes there are more executors in a job. However, it is still slow. And I further observed that most executors take less than 20 seconds but two of them take much longer such as 2 minutes. The data size is very small (les

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with continuously updating files, as one of the main design goals of Spark is immutability (to guarantee fault tolerance by recomputation), and files that are being appended to (mutating) defeat that. It is rather designed to pick up new files added atomically (using
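A sketch of the intended usage, with hypothetical paths: have the producer write the file somewhere else in the same filesystem and then rename it into the watched directory, since a rename is atomic:

    // Driver side: watch a directory for complete, atomically-added files
    val lines = ssc.textFileStream("hdfs:///data/incoming")

    // Producer side (shell, paths hypothetical): write elsewhere, then move in
    //   hdfs dfs -put part-0001 /data/tmp/part-0001
    //   hdfs dfs -mv /data/tmp/part-0001 /data/incoming/part-0001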

Re: "NoSuchElementException: key not found" when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr wrote: > Hi > > I get exactly the same problem here, do you've found the problem ? > Thanks >

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream ByKey operations? If the number of reducers is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay wrote: > Hi all, > > I have a Spark streaming job run

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-10 Thread Tathagata Das
How are you supplying the text file? On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote: > Hi Folks: > > I am working on an application which uses spark streaming (version 1.1.0 > snapshot on a standalone cluster) to process text file and save counters in > cassandra based on fields in each row. I

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3. 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with stack traces that have the error? TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas wrote: >

Re: EC2 Cluster script. Shark install fails

2014-07-10 Thread Michael Armbrust
There is no version of Shark that is compatible with Spark 1.0, however, Spark SQL does come included automatically. More information here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html http://spark.apache.org/docs/latest/sql-programming-g

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Michael Armbrust
I'll add that the SQL parser is very limited right now, and that you'll get much wider coverage using hql inside of HiveContext. We are working on bringing sql() much closer to SQL-92 in the future, though. On Thu, Jul 10, 2014 at 7:28 AM, premdass wrote: > Thanks Takuya . works like a Charm >

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam wrote: > Hi Spark developers, > > I have the followin

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote: > > For the curious mind, the dataset is about 200-300GB and we are using 10 > machines for this benchmark. Given the env is equal between the two > experiments, why pure spark is faster than SparkSQL? > There is going to be some overhead to parsi

Submitting to a cluster behind a VPN, configuring different IP address

2014-07-10 Thread Aris Vlasakakis
Hi Spark folks, Our production Spark cluster lives in the data center, and I need to attach to a VPN from my laptop so that I can submit a Spark application job to the Spark Master (behind the VPN). However, the problem arises that I have a local IP address on the laptop which is o
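One workaround worth trying (hedged: whether it helps depends on how the VPN routes traffic) is to tell the executors which address and port to call the driver back on, since by default the driver advertises its non-VPN local IP. All values below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("VpnSubmit")                       // hypothetical app name
      .setMaster("spark://master-behind-vpn:7077")   // hypothetical master URL
      .set("spark.driver.host", "10.8.0.42")         // the laptop's VPN-side address
      .set("spark.driver.port", "51000")             // pin the port if ranges are filtered
    val sc = new SparkContext(conf)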

SparkR failed to connect to the master

2014-07-10 Thread cjwang
I have a cluster running. I was able to run Spark Shell and submit programs. But when I tried to use SparkR, I got these errors: wifi-orcus:sparkR cwang$ MASTER=spark://wifi-orcus.dhcp.carrieriq.com:7077 sparkR R version 3.1.0 (2014-04-10) -- "Spring Dance" Copyright (C) 2014 The R Foundation f

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Aris Vlasakakis
Andrew, thank you so much! That worked! I had to manually set the spark.home configuration in the SparkConf object using .set("spark.home","/cluster/path/to/spark/"), and then I was able to submit from my laptop to the cluster! Aris On Thu, Jul 10, 2014 at 11:41 AM, Andrew Or wrote: > Setting

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Andrei
I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in the presence of Spark. Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can a

incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread SK
Hi, I have a csv data file, which I have organized in the following format to be read as a LabeledPoint(following the example in mllib/data/sample_tree_data.csv): 1,5.1,3.5,1.4,0.2 1,4.9,3,1.4,0.2 1,4.7,3.2,1.3,0.2 1,4.6,3.1,1.5,0.2 The first column is the binary label (1 or 0) and the remainin
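One way to sidestep any assumptions about the loader's expected line format is to parse the CSV by hand into LabeledPoints; a minimal sketch (the file path is a placeholder):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val points = sc.textFile("data.csv").map { line =>
      val cols = line.split(',').map(_.toDouble)
      LabeledPoint(cols.head, Vectors.dense(cols.tail))  // first column is the label
    }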

Multiple SparkContexts with different configurations in same JVM

2014-07-10 Thread Philip Ogren
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have a "local" SparkContext and a SparkContext connected to a cluster via either a Spark Cluster (i.e. using the Spark resource manager) or a YARN cluster.

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote: > By the way, I also try hql("select * from m").count. It is terribly slow > too. > > >

What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Nick Chammas
Looks like twitter4j 2.2.6 is what works, but I don’t believe it’s documented anywhere. Using 3.0.6 works for a while, but then causes the following error: 14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing threa

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Zongheng, Thanks a lot for your reply. I was editing my code in my group project and I forgot to remove the package declaration... How silly! Regards, Haoming > Date: Thu, 10 Jul 2014 12:00:40 -0700 > Subject: Re: SPARKSQL problem with implementing Scala's Product interface > From: zonghen..

Re: RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Thomas Robert
Hi, I'm quite a Spark newbie so I might be wrong, but I think that registerAsTable works either on case classes or on classes extending Product. You can find this info in an example on the Spark SQL doc page: http://spark.apache.org/docs/latest/sql-programming-guide.html // Define the schema using
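For reference, a sketch of the case-class route from that guide page (the file path and schema are the guide's example, not the poster's data):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._  // implicit conversion of RDDs of case classes to SchemaRDDs

    case class Person(name: String, age: Int)
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")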

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also try hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote: > Hi Spark users and developers, > > I'm doing some simple benchmarks with my team and we found out a potential > performance issue using Hive via SparkSQL. It is very

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via SparkSQL. It is very bothersome. So your help in understanding why it is terribly slow is very very important. First, we have some text files in HDFS which ar

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread Andrew Or
Hi C.J., The PR Yana pointed out seems to fix this. However, it is not merged in master yet, so for now I would recommend that you try the following workaround: set "spark.home" to the executor's /path/to/spark. I provided more detail here: http://mail-archives.apache.org/mod_mbox/spark-user/20140

Re: Potential bugs in SparkSQL

2014-07-10 Thread Stephen Boesch
Hi Jerry, To add to your question: Following does work (from master)- notice the registerAsTable is commented : (I took a liberty to add the "order by" clause) val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql("USE test") // hql("select id from m").register

writing FLume data to HDFS

2014-07-10 Thread Sundaram, Muthu X.
I am new to spark. I am trying to do the following. Netcat-->Flume-->Spark streaming(process Flume Data)-->HDFS. My flume config file has following set up. Source = netcat Sink=avrosink. Spark Streaming code: I am able to print data from flume to the monitor. But I am struggling to create a fil
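A hedged sketch of the missing last step: decoding the Flume event bodies and writing each batch to HDFS with saveAsTextFiles (host, port, and paths are assumptions):

    import org.apache.spark.streaming.flume.FlumeUtils

    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414)
    // Each SparkFlumeEvent wraps an Avro event; its body is a ByteBuffer
    val bodies = flumeStream.map(e => new String(e.event.getBody.array()))
    bodies.saveAsTextFiles("hdfs:///flume/out/events")  // one directory per batch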

Re: How to RDD.take(middle 10 elements)

2014-07-10 Thread Xiangrui Meng
This is expensive but doable: rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect() -Xiangrui On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas wrote: > Interesting question on Stack Overflow: > http://stackoverflow.com/q/24677180/877069 > > Basically, is there a way to ta

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, a line) can be read, then return it; otherwise block and wait for data t
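A minimal sketch of that idea without the NextIterator helper: hasNext blocks on readLine until a line arrives or the stream closes:

    import java.io.{BufferedReader, InputStream, InputStreamReader}

    // Turn an InputStream into an Iterator[String]: hasNext blocks until a line
    // is available and returns false once the stream ends (readLine == null).
    def toLineIterator(in: InputStream): Iterator[String] = new Iterator[String] {
      private val reader = new BufferedReader(new InputStreamReader(in))
      private var nextLine: String = null
      def hasNext: Boolean = {
        if (nextLine == null) nextLine = reader.readLine() // blocks for data or EOF
        nextLine != null
      }
      def next(): String = {
        if (!hasNext) throw new NoSuchElementException("end of stream")
        val line = nextLine
        nextLine = null
        line
      }
    }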

How to RDD.take(middle 10 elements)

2014-07-10 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is there a way to take() elements of an RDD at an arbitrary index? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-RDD-take-middle-10-elements-tp93

Use of the SparkContext.hadoopRDD function in Scala code

2014-07-10 Thread Nick R. Katsipoulakis
Hello, I want to run an MLlib task in Scala API, that creates a hadoopRDD from a CustomInputFormat. According to Spark API def hadoopRDD[K, V](conf: JobConf, inputFormatClass: Class[_ <: org.apache.hadoop.mapred.InputFormat[K,V]], keyClass: Class[K], valueClass: Class[V], minSplits: Int): RDD
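A hedged usage sketch; CustomInputFormat stands in for the user's own class (assumed to be an org.apache.hadoop.mapred.InputFormat[LongWritable, Text]), and the input-path key is the old-API Hadoop convention:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.JobConf

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("mapred.input.dir", "hdfs:///data/custom")  // assumed input path
    val rdd = sc.hadoopRDD(jobConf, classOf[CustomInputFormat],
                           classOf[LongWritable], classOf[Text], 4)  // 4 = minSplits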

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI wrote: > > - how to "build the latest version of Spark from the master branch, which > contains a fix"? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under "Development Version" on th

Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming, For your spark-submit question: can you try using an assembly jar ("sbt/sbt assembly" will build it for you)? Another thing to check is if there is any package structure that contains your SimpleApp; if so you should include the hierarchical name. Zongheng On Thu, Jul 10, 2014 at 11:33
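Sketched out, those two suggestions look like this (the class and jar names are hypothetical, and "sbt assembly" assumes the sbt-assembly plugin is configured in the project):

    # build a fat jar with your classes and non-provided dependencies
    sbt assembly
    # use the fully-qualified name if SimpleApp lives inside a package
    bin/spark-submit --class com.example.SimpleApp \
      target/scala-2.10/myapp-assembly-0.1.jar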

How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-10 Thread Yan Fang
Hi all, I am working to improve the parallelism of the Spark Streaming application. But I have trouble understanding how the executors are used and how the application is distributed. 1. In YARN, is one executor equal to one container? 2. I saw the statement that a streaming receiver runs on one wor

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Andrew Or
Yes, there are a few bugs in the UI in the event of a node failure. The duplicated stages in both the active and completed tables should be fixed by this PR: https://github.com/apache/spark/pull/1262 The fact that the progress bar on the stages page displays an overflow (e.g. 5/4) is still an open

sparkStaging

2014-07-10 Thread Koert Kuipers
In Spark 1.0.0 using yarn-client mode I am seeing that the sparkStaging directories do not get cleaned up. For example I run: $ spark-submit --class org.apache.spark.examples.SparkPi spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10 after which I have this directory left behind with one file in it

Re: Difference between SparkSQL and shark

2014-07-10 Thread Du Li
On the Spark 1.0-jdbc branch, there is a Thrift server and a Beeline CLI that roughly keep the Shark style, corresponding to the Shark server and Shark CLI, respectively. Check out https://github.com/apache/spark/blob/branch-1.0-jdbc/docs/sql-programming-guide.md for more information. Du Fro

Generic Interface between RDD and DStream

2014-07-10 Thread mshah
I wanted to get a perspective on how to share code between Spark batch processing and Spark Streaming. For example, I want to get unique tweets stored in an HDFS file and then use them in both Spark batch and Spark Streaming. Currently I will have to do the following: Tweet { String tweetText; String use
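A common pattern for this: write the core logic once as a function from RDD to RDD, call it directly in batch, and pass it to transform in streaming. A sketch, with the Tweet fields assumed from the post:

    import org.apache.spark.SparkContext._  // pair-RDD implicits
    import org.apache.spark.rdd.RDD

    case class Tweet(tweetText: String, userId: String)

    // Core logic written once against RDDs: keep one tweet per distinct text
    def uniqueTweets(tweets: RDD[Tweet]): RDD[Tweet] =
      tweets.map(t => (t.tweetText, t)).reduceByKey((a, _) => a).values

    // Batch:     val result = uniqueTweets(batchTweets)
    // Streaming: val resultStream = tweetStream.transform(uniqueTweets _)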

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Andrew Or
Setting SPARK_HOME is not super effective, because it is overridden very quickly by bin/spark-submit. Instead you should set the config "spark.home". Here's why: Each of your executors inherit

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with Java 1.8: java version "1.8.0_05", Java(TM) SE Runtime Environment (build 1.8.0_05-b13), Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode). which spark-shell: ~/bench/spark-1.0.0/bin/spark-shell; which scala: ~/bench/scala-2.10.4/bin/scala. On Thursday,

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Yadid, I have the same problem as you, so I implemented the Product interface as well; my code is even similar to yours. But now I face another problem: I don't know how to run the code... My whole program is like this: object SimpleApp { class Record(val x1: String, va

Re: Difference between SparkSQL and shark

2014-07-10 Thread Nicholas Chammas
In short, Spark SQL is the future, built from the ground up. Shark was built as a drop-in replacement for Hive, will be retired, and will perhaps be replaced by a future initiative to run Hive on Spark. More info: - http://databricks.com/bl

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001" I get this response: "Finished iteration (delta = 0.0) Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6005, 2.0) " The start point is not random. It uses the first K points from the given set O

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us | Multi

Difference between SparkSQL and shark

2014-07-10 Thread hsy...@gmail.com
I have a newbie question. What is the difference between SparkSQL and Shark? Best, Siyuan

Re: Some question about SQL and streaming

2014-07-10 Thread hsy...@gmail.com
Yes, this is what I tried, but thanks! On Wed, Jul 9, 2014 at 6:02 PM, Tobias Pfeiffer wrote: > Siyuan, > > I do it like this: > > // get data from Kafka > val ssc = new StreamingContext(...) > val kvPairs = KafkaUtils.createStream(...) > // we need to wrap the data in a case class for register

Stateful RDDs?

2014-07-10 Thread Sargun Dhillon
So, one portion of our Spark streaming application requires some state. Our application takes a bunch of application events (i.e. user_session_started, user_session_ended, etc..), and calculates out metrics from these, and writes them to a serving layer (see: Lambda Architecture). Two related event

Running Spark on Yarn vs Mesos

2014-07-10 Thread k.tham
What do people usually do for this? It looks like Yarn might be the simplest since the Cloudera distribution already installs this for you when you install hadoop. Any advantages of using Mesos instead? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Recommended pipeline automation tool? Oozie?

2014-07-10 Thread k.tham
I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some advice/suggest

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Aris Vlasakakis
Thank you very much, Yana, for replying! So right now the setup is a single-node machine which is my "cluster", and YES, you are right, my submitting laptop has a different path to the spark-1.0.0 installation than the "cluster" machine. I tried to set SPARK_HOME on my submitter laptop using the act

Re: Spark job tracker.

2014-07-10 Thread Marcelo Vanzin
That output means you're running in yarn-cluster mode. So your code is running inside the ApplicationMaster and has no access to the local terminal. If you want to see the output: - try yarn-client mode, then your code will run inside the launcher process - check the RM web ui and look at the logs

Re: Terminal freeze during SVM

2014-07-10 Thread Xiangrui Meng
news20.binary's feature dimension is 1.35M, so the serialized task size is above the default limit of 10M. You need to set spark.akka.frameSize to, e.g., 20. Due to a bug, SPARK-1112, this parameter is not passed to executors automatically, which causes Spark to freeze. This was fixed in the latest master
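A sketch of the setting (the value is in MB; 20 is the figure suggested above, and the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SVMOnNews20")            // hypothetical app name
      .set("spark.akka.frameSize", "20")    // default is 10 (MB)
    val sc = new SparkContext(conf)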

Re: Spark job tracker.

2014-07-10 Thread abhiguruvayya
Hi Mayur, Thanks so much for the explanation. It did help me. Is there a way I can print these details to the console rather than logging them? As of now, once I start my application I see this: 14/07/10 00:48:20 INFO yarn.Client: Application report from ASM: application identifier: ap

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das wrote: > I ran the SparkKMeans example (not the mllib KMeans that Sean ran) w
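A minimal sketch of the MLlib version (the file name and parameters mirror the SparkKMeans invocation quoted in this thread; maxIterations = 20 is an assumption):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("file.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    val model = KMeans.train(data, 2, 20)  // k = 2, maxIterations = 20
    model.clusterCenters.foreach(println)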

Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
Tobias, Your help on the problems I have met has been very valuable. Thanks a lot! Bill On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer wrote: > Bill, > > good to know you found your bottleneck. Unfortunately, I don't know how to > solve this; until know, I have used Spark only with embarassi

Re: Does MLlib Naive Bayes implementation incorporates Laplase smoothing?

2014-07-10 Thread Rahul Bhojwani
I have created the issue: "In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug" Have a look at it. Thanks, On Thu, Jul 10, 2014 at 8:37 PM, Bertrand Dechoux wrote: > A patch proposal on the apache JIRA for Spark? > https://issues.apache.org/jira/browse/SPARK

SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread yadid
Hi All, I have a class with too many variables to be implemented as a case class, therefore I am using a regular class that implements Scala's Product interface, like so: class Info () extends Product with Serializable { var param1 : String = "" var param2 : String = "" ... var param
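For reference, a minimal sketch of what such a hand-rolled Product implementation looks like (field names follow the truncated snippet above; only two fields shown):

    class Info extends Product with Serializable {
      var param1: String = ""
      var param2: String = ""

      def canEqual(that: Any): Boolean = that.isInstanceOf[Info]
      def productArity: Int = 2  // must equal the number of fields
      def productElement(n: Int): Any = n match {
        case 0 => param1
        case 1 => param2
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }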

How to read saved model?

2014-07-10 Thread rohitspujari
Hello Folks: I attended the session Aron D did at Hadoop Summit. He mentioned training the model and saving it on HDFS; when you start scoring, you can read the saved model back. So, I can save the model using sc.makeRDD(model.clusterCenters).saveAsObjectFile(model) But when I try to read the mo

GraphX: how to specify partition strategy?

2014-07-10 Thread Yifan LI
Hi, I am doing graph computation using GraphX, but there seems to be an error with the graph partition strategy specification. As the "GraphX programming guide" says: The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark 1.0

Re: Yay for 1.0.0! EC2 Still has problems.

2014-07-10 Thread nit
I am also running into the "modules/mod_authn_alias.so" issue on r3.8xlarge when launching a cluster with ./spark-ec2, so Ganglia is not accessible. From the posts it seems that Patrick suggested using Ubuntu 12.04. Can you please provide the name of an AMI that can be used with the -a flag that will not have this

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Spark developers, I have the following hqls that spark will throw exceptions of this kind: 14/07/10 15:07:55 INFO TaskSetManager: Loss was due to org.apache.spark.TaskKilledException [duplicate 17] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:736 failed 4 times, m

Re: Does MLlib Naive Bayes implementation incorporates Laplase smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/ Bertrand On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani wrote: > And also that there is a small bug in implementation. As I mentioned this > earlier also. > > This is my first time I am reporting some

Re: Terminal freeze during SVM

2014-07-10 Thread AlexanderRiggers
Tried the newest branch, but still get stuck on the same task: (kill) runJob at SlidingRDD.scala:74 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9304.html Sent from the Apache Spark User List mailing list ar

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Thanks Takuya. Works like a charm. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Language-Integrated-query-OR-clause-and-IN-clause-tp9298p9303.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Daniel Siegmann
One thing to keep in mind is that the progress bar doesn't take into account tasks which are rerun. If you see 4/4 but the stage is still active, click the stage name and look at the task list. That will show you if any are actually running. When rerun tasks complete, it can result in the number of

Re: Restarting a Streaming Context

2014-07-10 Thread Nicholas Chammas
Okie doke. Thanks for the confirmation, Burak and Tathagata. On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das wrote: > I confirm that is indeed the case. It is designed to be so because it > keeps things simpler - less chances of issues related to cleanup when > stop() is called. Also it keeps t

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Takuya UESHIN
Hi Prem, You can write them like this: people.where('age >= 10 && 'month === 2).select('name) people.where('age >= 10 || 'month === 2).select('name) people.where(In('month, Seq(2, 6))).select('name) The last one needs to import the catalyst.expressions package. Thanks. 2014-07-10 22:15 GMT+09:00 pre

SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Hi, any suggestions on how to implement the OR clause and IN clause in SparkSQL language-integrated queries? For example: 'SELECT name FROM people WHERE age >= 10 AND month = 2' can be written as val teenagers = people.where('age >= 10).where('month === 2).select('name) How do we write 'SELE

Re: Purpose of spark-submit?

2014-07-10 Thread Koert Kuipers
When I write a general Spark application I use SparkConf/SparkContext, not Client.scala for YARN. On Wed, Jul 9, 2014 at 9:39 PM, Andrew Or wrote: > I don't see why using SparkSubmit.scala as your entry point would be any > different, because all that does is invoke the main class of Client.scal

Difference between collect() and take(n)

2014-07-10 Thread MEETHU MATHEW
Hi all, I want to know how collect() works and how it is different from take(). I am just reading a 330 MB file which has 43 lakh (4.3 million) rows with 13 columns, and calling take(430) to save to a variable. But the same is not working with collect(). So is there any difference in the operation of the two?

EC2 Cluster script. Shark install fails

2014-07-10 Thread Jason H
Hi, just going through the process of installing Spark 1.0.0 on EC2, and I noticed that the script throws an error when installing Shark. Unpacking Spark ~/spark-ec2 Initializing shark ~ ~/spark-ec2 ERROR: Unknown Shark version The install completes in the end but Shark is completely missed. Lo

RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Kefah Issa
Hi, SQL on Spark 1.0 is an interesting feature. It works fine when the "record" is a case class. The issue I have is that I have around 50 attributes per record, and Scala case classes cannot handle that (the hard-coded limit is 22 fields). So I created a regular class and defined the att

Re: Does MLlib Naive Bayes implementation incorporates Laplase smoothing?

2014-07-10 Thread Rahul Bhojwani
And also that there is a small bug in the implementation, as I mentioned earlier. This is the first time I am reporting a bug, so I just wanted to ask: does your name appear somewhere, or do you get some credit after correcting/reporting a bug, so that I can mention it in my profile/res

Re: Does MLlib Naive Bayes implementation incorporates Laplase smoothing?

2014-07-10 Thread Rahul Bhojwani
Ya thanks. I can see that lambda is used as the parameter. On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen wrote: > Have a look at the code maybe? > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala > > Yes there is a smoothing
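For reference, a one-line sketch of where that smoothing parameter goes (trainingData: RDD[LabeledPoint] is assumed to exist):

    import org.apache.spark.mllib.classification.NaiveBayes

    // lambda is the additive (Laplace) smoothing parameter
    val model = NaiveBayes.train(trainingData, lambda = 1.0)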

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Surendranauth Hiraman
History Server is also very helpful. On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang wrote: > I didn't keep the driver's log. It's a lesson. > > I will try to run it again to see if it happens again. > > > -- > > *From:* Tathagata Das [mailto:tathagata.das1...@gmail.c
