Shark over Spark-Streaming

2014-06-10 Thread praveshjain1991
Is it possible to use Shark over streaming data? I did not find any mention of that on the website. When you run Shark it gives you a shell to run queries against stored data. Is there any way to do the same over streaming data? -- Thanks

RE: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-10 Thread Cheng, Hao
And if you want to use the SQL CLI (based on Catalyst) as it works in Shark, you can also check out https://github.com/amplab/shark/pull/337 :) This preview version doesn't require Hive to be set up in the cluster. (Don't forget to put the hive-site.xml under SHARK_HOME/conf as well.) Cheng Hao

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-10 Thread toivoa
Thanks for the hint. I removed the signature info from the jar and the JVM is happy now. But a problem remains: several copies of the same jar in different versions, which is not good. Spark itself is very, very promising; I am very excited. Thank you all. toivo
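
For reference, if the assembly is built with the maven-shade-plugin, a filter along these lines is the usual way to avoid the signed-jar clash rather than editing the jar by hand (a sketch; adjust to your own build):

  <!-- inside the maven-shade-plugin configuration: strip signature files
       from shaded dependencies so the merged jar does not fail verification -->
  <filters>
    <filter>
      <artifact>*:*</artifact>
      <excludes>
        <exclude>META-INF/*.SF</exclude>
        <exclude>META-INF/*.DSA</exclude>
        <exclude>META-INF/*.RSA</exclude>
      </excludes>
    </filter>
  </filters>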

Problem in Spark Streaming

2014-06-10 Thread nilmish
I am running a Spark Streaming job to count the top 10 hashtags over the last 5-minute window, querying every 1 sec. It takes approx <1.4 sec (end-to-end delay) to answer most of the queries, but there are a few instances in between when it takes considerably more time (around 15 sec) due to
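
For reference, a windowed top-k count along these lines might look like the following (a minimal Scala sketch, assuming a DStream[String] of hashtags; the names are illustrative):

  import org.apache.spark.streaming.{Minutes, Seconds}

  // hashtags: DStream[String], one hashtag per record (illustrative)
  val counts = hashtags
    .map(tag => (tag, 1))
    .reduceByKeyAndWindow(_ + _, Minutes(5), Seconds(1)) // 5-min window, 1-sec slide

  counts.foreachRDD { rdd =>
    // swap to (count, tag), sort descending, keep the top 10 for this window
    val top10 = rdd.map(_.swap).sortByKey(ascending = false).take(10)
    println(top10.mkString(", "))
  }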

Re: Problem in Spark Streaming

2014-06-10 Thread Yingjun Wu
Hi Nilmish, I am facing the same problem. I am wondering: how do you measure the latency? Regards, Yingjun

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
You can measure the latency from the logs. Search for phrases like "Total delay" in the logs. This denotes the total end-to-end delay for a particular query.

pmml with augustus

2014-06-10 Thread filipus
Hello guys, does anybody have experience with the Augustus library as a serializer for scoring models? It looks very promising, and I even found a hint about the connection between Augustus and Spark. All the best

Re: pmml with augustus

2014-06-10 Thread Sean Owen
It's worth mentioning that Augustus is a Python-based library. On a related note, in Java-land, I have had good experiences with jpmml's projects:

Re: pmml with augustus

2014-06-10 Thread Sean Owen
Following up on the jpmml pointer: https://github.com/jpmml, in particular https://github.com/jpmml/jpmml-model and https://github.c

abnormal latency when running Spark Streaming

2014-06-10 Thread Yingjun Wu
Dear all, I have implemented a simple Spark Streaming application to perform a windowed wordcount job. However, the latency seems extremely high compared with running exactly the same job in Storm. The source code is attached as follows: public final class MyKafkaWordcountMain {

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hello! Spark Streaming supports HDFS as an input source, as well as Akka actor receivers and TCP socket receivers. For my use case I think it's probably more convenient to read the data directly from actors, because I already need to set up a multi-node Akka cluster (on the same nodes that Spark runs
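
For reference, the actor-receiver route looks roughly like this (a sketch assuming Spark 1.0's ActorHelper trait; the forwarding actor and all names are made up):

  import akka.actor.{Actor, Props}
  import org.apache.spark.streaming.receiver.ActorHelper

  // Hypothetical forwarding actor: other actors in the Akka cluster send it
  // String messages, and it hands each one to Spark via store().
  class LineForwarder extends Actor with ActorHelper {
    def receive = {
      case line: String => store(line)
    }
  }

  // In the streaming program:
  val lines = ssc.actorStream[String](Props[LineForwarder], "LineReceiver")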

Calling JavaPairRDD.first after calling JavaPairRDD.groupByKey results in NullPointerException

2014-06-10 Thread Gaurav Jain
I am getting a strange null pointer exception when trying to list the first entry of a JavaPairRDD after calling groupByKey on it. The following is my code: JavaPairRDD<..., List<...>> KeyToAppList = KeyToApp.distinct().groupByKey(); // System.out.println("First

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
Hey Nilesh, Great to hear you're using Spark Streaming. In my opinion the crux of your question comes down to what you want to do with the data in the future, and/or whether there is utility in using it from more than one Spark/Streaming job. 1) *One-time-use, fire and forget*: as you rightly point out,

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hey Michael, Thanks for the great reply! That clears things up a lot. The idea about Apache Kafka sounds very interesting; I'll look into it. The multiple consumers and fault tolerance sound awesome. That's probably what I need. Cheers, Nilesh

Re: Problem in Spark Streaming

2014-06-10 Thread Boduo Li
Hi Nilmish, What's the data rate/node when you see the high latency? (It seems the latency keeps increasing.) Do you still see it if you lower the data rate or the frequency of the windowed query?

Can't find pyspark when using PySpark on YARN

2014-06-10 Thread 李奇平
Dear all, When I submit a pyspark application using this command: ./bin/spark-submit --master yarn-client examples/src/main/python/wordcount.py "hdfs://..." I get the following exception: Error from python worker: Traceback (most recent call last): File "/usr/ali/lib/python2.5/runpy.py", line 85,

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
How can I measure the data rate/node? I am feeding the data through the Kafka API. I only know the total inflow data rate, which remains almost constant. How can I figure out how much data is distributed to the nodes in my cluster? Latency does not keep increasing infinitely. It goes up for so

Re: abnormal latency when running Spark Streaming

2014-06-10 Thread Boduo Li
Hi Yingjun, Do you see a stable latency, or does the latency keep increasing? And could you provide some details about the input data rate/node, batch interval, windowDuration and slideDuration when you see the high latency?

Spark Streaming socketTextStream

2014-06-10 Thread fredwolfinger
Good morning, I have taken the socketTextStream example and instead of running on a local Spark instance, I have pushed it to my Spark cluster in AWS (1 master with 5 slave nodes). I am getting the following error that appears to indicate that all the slaves are trying to read from localhost:

Re: Problem in Spark Streaming

2014-06-10 Thread Boduo Li
Oh, I mean the average data rate/node. But in case I want to know the input activity on each node (I use a custom receiver instead of Kafka), I usually search for these records in the logs to get a sense: "BlockManagerInfo: Added input ... on [hostname:port] (size: xxx KB)". I also see some spikes in la

Re: Spark Streaming socketTextStream

2014-06-10 Thread Akhil Das
You can use the master's IP address (or whichever machine you chose to run the nc command on) instead of localhost.
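
In other words (hostname and port are illustrative):

  // Point the receiver at the machine actually running `nc -lk 9999`,
  // not at localhost on each worker:
  val lines = ssc.socketTextStream("ec2-master-hostname", 9999)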

Re: Problem in Spark Streaming

2014-06-10 Thread Yingjun Wu
Hi, as I search for the keyword "Total delay" in the console log, the delay keeps increasing. I am not sure what this "total delay" means. For example, if I perform a windowed wordcount with windowSize=1ms and slidingStep=2000ms, is the delay measured from the 10th second? A sample

Re: Spark Streaming socketTextStream

2014-06-10 Thread fredwolfinger
Worked! Thanks so much! Fred

Re: FileNotFoundException when using persist(DISK_ONLY)

2014-06-10 Thread Surendranauth Hiraman
Can anyone point me to configuration options that allow me to reduce the max buffer size when the BlockManager calls doGetRemote()? I'm assuming that is my problem, based on the stack trace below. Any help thinking this through (especially if you have dealt with large datasets (greater than RA

Re: pmml with augustus

2014-06-10 Thread filipus
Thank you very much. I didn't know about the Cascading project at all until now; it is very interesting. I also now see the point of using Scala as the language for Spark: you can integrate JVM-based libraries very easily/naturally, if I got that right. Hmm... but I could also use spa

Re: pmml with augustus

2014-06-10 Thread Evan R. Sparks
I should point out that if you don't want to take a polyglot approach to languages and want to reside solely in the JVM, you can just use plain old Java serialization on the model objects that come out of MLlib's APIs from Java or Scala, load them up in another process, and call the relevant .predic
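
Something along these lines, for example (a sketch; it assumes the model type is java.io.Serializable, which holds for the simple linear models but needs more care for models that carry RDDs, such as MatrixFactorizationModel):

  import java.io._

  // Write a trained model out with plain Java serialization...
  def saveModel(model: Serializable, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  // ...and load it back in another JVM process before calling .predict()
  def loadModel[T](path: String): T = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[T] finally in.close()
  }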

HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-10 Thread bijoy deb
Hi all, I have built Shark-0.9.1 with sbt using the command below: *SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly* My Hadoop cluster is also on version 2.0.0-mr1-cdh4.6.0. But when I try to execute the command below from the Spark shell, which reads a file from HDFS, I get the "IPC v

Spark Logging

2014-06-10 Thread Robert James
How can I write to Spark's logs from my client code? What are the options for viewing those logs? Besides the web console, is there a way to read and grep the log file?

Re: pmml with augustus

2014-06-10 Thread Paco Nathan
That's a good point about polyglot. Given that Spark is incorporating a range of languages (Scala, Java, Py, R, SQL) it becomes a trade-off whether or not to centralize support or integrate with native options. Going with the latter implies more standardization and less tech debt. The big win with

Re: Spark Logging

2014-06-10 Thread Andrew Or
You can import org.apache.spark.Logging and use logInfo, logWarning, etc. Besides viewing them from the web console, the logs can be found under $SPARK_HOME/logs on both the driver and executor machines. (If you are on YARN, these logs are located elsewhere, however.)
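
A minimal sketch of what that looks like (class name is illustrative):

  import org.apache.spark.Logging

  class MyJob extends Logging {
    def run(): Unit = {
      logInfo("starting job")            // goes to Spark's log4j output
      logWarning("something looks off")  // same, at WARN level
    }
  }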

Re: Spark Logging

2014-06-10 Thread coderxiang
By default, the event logs are available at `/tmp/spark-events`. You can specify the log directory via spark.eventLog.dir; see the configuration page.

Re: Can't find pyspark when using PySpark on YARN

2014-06-10 Thread Andrew Or
Hi Qi Ping, You don't have to distribute these files; they are automatically packaged in the assembly jar, which is already shipped to the worker nodes. Other people have run into the same issue. See if the instructions here are of any help: http://mail-archives.apache.org/mod_mbox/spark-user/201

Re: Spark Logging

2014-06-10 Thread Surendranauth Hiraman
Event logs are different from writing with a logger like log4j. The event logs are the kind of data that shows up in the history server. For my team, we use com.typesafe.scalalogging.slf4j.Logging. Our logs show up in /etc/spark/work///stderr and stdout. All of our logging seems to show up in stde

getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
A question on the input and output of ALS.train() and MatrixFactorizationModel.predict(). My input is a list of Ratings(user_id, product_id, rating), and my ratings are on a scale of 1-5 (inclusive). When I compute predictions over the superset of all (user_id, product_id) pairs, the ratings produced
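
For context, the API in question looks roughly like this (a Scala sketch with made-up data; the rank/iterations/lambda values are placeholders to tune):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.parallelize(Seq(
    Rating(1, 10, 5.0), Rating(1, 20, 3.0), Rating(2, 10, 1.0)))

  // rank = 10, iterations = 20, lambda = 0.01 (placeholders)
  val model = ALS.train(ratings, 10, 20, 0.01)

  // Predict over (user, product) pairs; note the output is not
  // clamped to the 1-5 scale of the input
  val pairs = sc.parallelize(Seq((1, 20), (2, 20)))
  model.predict(pairs).collect().foreach(println)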

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread mpieck
Hi, I have the same problem when running a Kafka-to-Spark-Streaming pipeline from Java with explicitly specified message decoders. I had thought it was related to the Eclipse environment, as suggested here, but that's not the case. I have coded an example based on the class: https://github.com/apache/
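
For reference, the explicit-decoder variant of the call looks like this from Scala (a sketch; the ZooKeeper address, group id and topic are placeholders); the NoSuchMethodError reportedly shows up when the equivalent call is made from Java:

  import kafka.serializer.StringDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("zookeeper.connect" -> "zk-host:2181", "group.id" -> "my-group"),
    Map("my-topic" -> 1),
    StorageLevel.MEMORY_AND_DISK_SER_2)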

How to specify executor memory in EC2 ?

2014-06-10 Thread Aliaksei Litouka
I am testing my application on an EC2 cluster of m3.medium machines. By default, only 512 MB of memory on each machine is used. I want to increase this amount, and I'm trying to do so by passing the --executor-memory 2G option to the spark-submit script, but it doesn't seem to work: each machine uses only

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread Michael Chang
I had this same problem as well. I ended up just adding the necessary code to KafkaUtils and compiling my own Spark jar. Something like this for the "raw" stream: def createRawStream( jssc: JavaStreamingContext, kafkaParams: JMap[String, String], topics: JMap[String, JInt]

Re: getting started with mllib.recommendation.ALS

2014-06-10 Thread Sean Owen
For trainImplicit(), the output is an approximation of a matrix of 0s and 1s, so the values are generally (though not always) in [0,1]. But for train(), you should be predicting the original input matrix as-is, as I understand it. You should get output in about the same range as the input, but again not neces

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread Sean Owen
I added https://issues.apache.org/jira/browse/SPARK-2103 to track this. I also ran into it. I don't have a fix, but, somehow I think someone with more understanding of Scala and Manifest objects might see the easy fix.

Re: pmml with augustus

2014-06-10 Thread filipus
@Paco: Do I understand correctly that the most promising directions for me to invest effort in, for deploying models in the Spark environment, would be Augustus and Zementis? Actually, as you mention, I would have both directions of deployment: I already have models which I could transform into PMML, and I also t

Re: Information on Spark UI

2014-06-10 Thread coderxiang
The executors showing "CANNOT FIND ADDRESS" are not listed in the Executors tab at the top of the Spark UI.

spark streaming, kafka, SPARK_CLASSPATH

2014-06-10 Thread lannyripple
I am using Spark 1.0.0 compiled with Hadoop 1.2.1. I have a toy spark-streaming-kafka program. It reads from a Kafka queue and does stream.map { case (k, v) => (v, 1) }.reduceByKey(_ + _).print() using a 1-second interval on the stream. The docs say to make Spark and Had

groupBy question

2014-06-10 Thread SK
After doing a groupBy operation, I have the following result:

val res =
("ID1",ArrayBuffer((145804601,"ID1","japan")))
("ID3",ArrayBuffer((145865080,"ID3","canada"), (145899640,"ID3","china")))
("ID2",ArrayBuffer((145752760,"ID2","usa"), (145934200,"ID2","usa")))

Now I need

Re: groupBy question

2014-06-10 Thread Shuo Xiang
res.map(group => (group._2.size, group._2.map(_._1).max))

Monitoring spark dis-associated workers

2014-06-10 Thread Allen Chang
We're running into an issue where the master periodically loses connectivity with workers in the Spark cluster. We believe this issue tends to manifest when the cluster is under heavy load, but we're not entirely sure when it happens. I've seen one or two other messages on this list about this issu

Re: groupBy question

2014-06-10 Thread SK
Great, thanks!

problem starting the history server on EC2

2014-06-10 Thread zhen
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I do not seem to be able to start the history server on the master node. I used the following command: ./start-history-server.sh /root/spark_log The error message says that the logging directory /root/spark_log does not ex

Re: problem starting the history server on EC2

2014-06-10 Thread bc Wong
What's the permission on /root itself?

output tuples in CSV format

2014-06-10 Thread SK
My output is a set of tuples, and when I save it using saveAsTextFile, my file looks as follows: (field1_tup1, field2_tup1, field3_tup1,...) (field1_tup2, field2_tup2, field3_tup2,...) In Spark, is there some way I can simply have it output in CSV format as follows (i.e. without the parentheses)

Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification. What is the proper way to configure RDDs when your aggregate data size exceeds your available working memory size? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles. I see that the default storage level for RDD
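
One knob worth sketching here: persist the large RDDs with a storage level that is allowed to spill to disk (the level choice below is a judgment call, not a prescription):

  import org.apache.spark.storage.StorageLevel

  // Serialized in memory, spilling to disk when a partition doesn't fit
  bigRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)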

Re: Information on Spark UI

2014-06-10 Thread Neville Li
We are seeing this issue as well. We run on YARN and see logs about lost executors. It looks like some stages had to be re-run to compute the RDD partitions lost with the executor. We were able to complete 20 iterations with a 20% full matrix, but not beyond that (total > 100GB).

Re: output tuples in CSV format

2014-06-10 Thread Mikhail Strebkov
You can just use something like this: myRdd.map(_.productIterator.mkString(",")).saveAsTextFile(...)

RE: output tuples in CSV format

2014-06-10 Thread Shao, Saisai
It would be better to add one more transformation step before saveAsTextFile, like: rdd.map(tuple => "%s,%s,%s".format(tuple._1, tuple._2, tuple._3)).saveAsTextFile(...) This manually converts the tuples to the format you want and then writes to HDFS. Thanks, Jerry

Re: How to process multiple classification with SVM in MLlib

2014-06-10 Thread littlebird
Thanks. Now I know how to broadcast the dataset, but I still wonder: after broadcasting the dataset, how can I apply my algorithm to train the model on the workers? To describe my question in detail: the following code is used to train an LDA (Latent Dirichlet Allocation) model with JGibbLDA on a single
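
A sketch of the broadcast pattern being asked about (trainLocalLda is hypothetical, standing in for the single-machine JGibbLDA call; localDocs and numChunks are made-up names):

  // Ship the (smallish) dataset to every executor once
  val bcDocs = sc.broadcast(localDocs)

  // Run the single-machine trainer on the workers, one task per chunk
  val models = sc.parallelize(0 until numChunks, numChunks).map { chunk =>
    trainLocalLda(bcDocs.value, chunk)  // hypothetical local JGibbLDA wrapper
  }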

Re: problem starting the history server on EC2

2014-06-10 Thread zhen
I checked the permissions on root, and they are the following: drwxr-xr-x 20 root root 4096 Jun 11 01:05 root So anyway, I changed to /tmp/spark_log instead, and this time I made sure that all permissions are granted on /tmp and /tmp/spark_log, as below. But it still does not work: drwxrwxrwt 8 r

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
Can you try file:/root/spark_log?

Re: How to process multiple classification with SVM in MLlib

2014-06-10 Thread littlebird
Someone suggested I use Mahout, but I'm not familiar with it, and in that case using Mahout would add difficulties to my program. I'd like to run the algorithm in Spark. I'm a beginner; can you give me some suggestions?

Re: problem starting the history server on EC2

2014-06-10 Thread zhen
Sure, here it is: drwxrwxrwx 2 1000 root 4096 Jun 11 01:05 spark_logs Zhen

Question about RDD cache, unpersist, materialization

2014-06-10 Thread innowireless TaeYun Kim
Hi, What I (seem to) know about the RDD persistence API is as follows: - cache() and persist() are not actions; they only do a marking. - unpersist() is also not an action; it only removes the marking, but if the RDD is already in memory, it is unloaded. And there seems to be no API to forcefully materializ

Re: getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
Thanks Sean. I realized that I was supplying train() with a very low rank so I will retry with something higher and then play with lambda as-needed.

RE: Question about RDD cache, unpersist, materialization

2014-06-10 Thread innowireless TaeYun Kim
BTW, it is possible that rdd.first() does not compute all of the partitions. So first() cannot be used for the situation below.

Re: Problem in Spark Streaming

2014-06-10 Thread Ashish Rangole
Have you considered the garbage collection impact, and whether it coincides with your latency spikes? You can enable GC logging by changing the Spark configuration for your job.

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
No, I meant pass the path to the history server start script.
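
That is, something like (path as in the original post):

  ./start-history-server.sh file:/root/spark_log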

Re: How to specify executor memory in EC2 ?

2014-06-10 Thread Matei Zaharia
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application's settings. Take a look in there and delete that line if possible. Matei
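
If editing spark-env.sh doesn't pan out, another option worth trying is setting the property in conf/spark-defaults.conf, which spark-submit reads (a sketch; 2g matches the value from the original post):

  # conf/spark-defaults.conf
  spark.executor.memory   2g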

Re: problem starting the history server on EC2

2014-06-10 Thread Krishna Sankar
Yep, it gives tons of errors. I was able to make it work with sudo; it looks like an ownership issue. Cheers

RE: Question about RDD cache, unpersist, materialization

2014-06-10 Thread Nick Pentreath
If you want to force materialization, use .count(). Also, if you can, simply don't unpersist anything unless you really need to free the memory.
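
A sketch of the pattern (expensiveParse is a stand-in for whatever work is being cached):

  val rdd = input.map(expensiveParse).cache()  // marking only; nothing computed yet
  rdd.count()                                  // action: materializes every partition into the cache
  // ... reuse rdd across several jobs ...
  rdd.unpersist()                              // only once all uses are done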

Re: Spark Streaming not processing file with particular number of entries

2014-06-10 Thread praveshjain1991
Well, I was able to get it to work by running Spark over Mesos. But it looks like a bug when running Spark standalone.

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-10 Thread elyast
Hi, I'm facing a similar problem. According to http://tachyon-project.org/Running-Spark-on-Tachyon.html, in order to allow the Tachyon client to connect to the Tachyon master in HA mode, you need to pass two system properties: -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181 -Dtachyon.use
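
One Spark 1.0 replacement for SPARK_JAVA_OPTS is the extraJavaOptions properties in spark-defaults.conf; a sketch carrying the first Tachyon property (the second property name is truncated in the post above, so it is omitted here):

  # conf/spark-defaults.conf
  spark.driver.extraJavaOptions    -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181
  spark.executor.extraJavaOptions  -Dtachyon.zookeeper.address=zookeeperHost1:2181,zookeeperHost2:2181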