Re: Many Receiver vs. Many threads per Receiver

2015-02-23 Thread Akhil Das
I believe when you go with 1, it will distribute the consumers across your cluster (possibly on 6 machines), but still I don't see a way to tell from which partition it will consume etc. If you are looking to have a consumer where you can specify the partition details and all, then you are bette

Re: spark streaming window operations on a large window size

2015-02-23 Thread Tathagata Das
Yes. On Mon, Feb 23, 2015 at 11:16 PM, Avi Levi wrote: > @Tathagata Das so basically you are saying it is supported out of the > box, but we should expect a significant performance hit - is that right? > > > > On Tue, Feb 24, 2015 at 5:37 AM, Tathagata Das > wrote: > >> The default persistence

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Tathagata Das
I think it will not affect it. We ignore the offsets stored anywhere outside Spark Streaming. It is the fact that progress information was being stored in two different places (SS and Kafka/ZK) that was causing inconsistencies and duplicates. TD On Mon, Feb 23, 2015 at 11:27 PM, Felix C wrote:

Re: How to start spark-shell with YARN?

2015-02-23 Thread Arush Kharbanda
Hi, are you sure that you built Spark for YARN? If standalone works, that does not necessarily mean it was built for YARN. Thanks Arush On Tue, Feb 24, 2015 at 12:06 PM, Xi Shen wrote: > Hi, > > I followed this guide, > http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to > start spark-shell with yar

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Felix C
Kafka 0.8.2 has built-in offset management, how would that affect direct stream in spark? Please see KAFKA-1012 --- Original Message --- From: "Tathagata Das" Sent: February 23, 2015 9:53 PM To: "V Dineshkumar" Cc: "user" Subject: Re: Write ahead Logs and checkpoint Exactly, that is the reas

Spark sql issue

2015-02-23 Thread Udit Mehta
Hi, I am using Spark SQL to create/alter Hive tables. I have a highly nested JSON and I am using the SchemaRDD to infer the schema. The JSON has 6 columns and 1 of the columns (which is a struct) has around 60 fields (key value pairs). When I run the Spark SQL query for the above table, it just hang

Re: spark streaming window operations on a large window size

2015-02-23 Thread Avi Levi
@Tathagata Das so basically you are saying it is supported out of the box, but we should expect a significant performance hit - is that right? On Tue, Feb 24, 2015 at 5:37 AM, Tathagata Das wrote: > The default persistence level is MEMORY_AND_DISK, so the LRU policy would > discard the blocks

Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread fanooos
I have resolved this issue. Actually there were two problems. The first problem in the application was the HDFS port. It was configured (in core-site.xml) to 9000 but in the application I was using 50070, which (I think) is the default port. The second problem, I forgot to put the file into

Re: issue Running Spark Job on Yarn Cluster

2015-02-23 Thread sachin Singh
I am using CDH5.3.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21779p21780.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

issue Running Spark Job on Yarn Cluster

2015-02-23 Thread sachin Singh
Hi, I want to run my Spark job in Hadoop YARN cluster mode. I am using the below command - spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 --class com.dc.analysis.jobs.AggregationJob sparkanalitic.jar param1 param2 param3 I am getting an error as below, kindly

Fwd: How to start spark-shell with YARN?

2015-02-23 Thread Xi Shen
Hi, I followed this guide, http://spark.apache.org/docs/1.2.1/running-on-yarn.html, and tried to start spark-shell with yarn-client ./bin/spark-shell --master yarn-client But I got WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@10.0.2.15:38171] has fail

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Try to set lambda to 0.1. -Xiangrui On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar wrote: > The RMSE varies a little bit between the versions. > Partitioned the training, validation, test set like so: > > training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) > validation = ratings_rdd_01.fil
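
For reference, a minimal sketch (for spark-shell) of where the regularization parameter goes in MLlib's ALS; the rank and iteration counts below are illustrative placeholders, not values from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // training: RDD[Rating] built from the MovieLens data, as in the tutorial
    val model = ALS.train(training, 8, 10, 0.1)   // rank = 8, iterations = 10, lambda = 0.1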

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Xiangrui Meng
FYI, in 1.3 we support save/load tree models in Scala and Java. We will add save/load support to Python soon. -Xiangrui On Mon, Feb 23, 2015 at 2:57 PM, Sebastián Ramírez < sebastian.rami...@senseta.com> wrote: > In your log it says: > > pickle.PicklingError: Can't pickle : it's not found as > th
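
A rough sketch of the Scala side mentioned above, assuming the Spark 1.3 MLlib API; the training data, parameters, and paths are placeholders:

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.model.DecisionTreeModel

    // trainingData: RDD[LabeledPoint]
    val model = DecisionTree.trainClassifier(trainingData, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini", maxDepth = 5, maxBins = 32)

    model.save(sc, "hdfs:///models/myDecisionTree")                         // new in 1.3
    val sameModel = DecisionTreeModel.load(sc, "hdfs:///models/myDecisionTree")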

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
The behavior is exactly what I expected. Thanks Akhil and Tathagata! bit1...@163.com From: Akhil Das Date: 2015-02-24 13:32 To: bit1129 CC: Tathagata Das; user Subject: Re: Re: About FlumeUtils.createStream That depends on how many machines you have in your cluster. Say you have 6 workers and

Re: Task not serializable exception

2015-02-23 Thread Kartheek.R
I could trace where the problem is. If I run without any threads, it works fine. When I allocate threads, I run into the "not serializable" problem. But, I need to have threads in my code. Any help please!!! This is my code: object SparkKart { def parseVector(line: String): Vector[Double] = { Dens
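
One way out, sketched under the assumption that the closure is dragging in a non-serializable enclosing object: keep the helper in a standalone serializable object so the task closure only references that object. The object and file names below are made up for illustration; the same rule applies to closures created inside each Runnable.

    import breeze.linalg.{DenseVector, Vector}

    // A top-level object: referencing it from a closure does not capture any outer class.
    object VectorParser extends Serializable {
      def parse(line: String): Vector[Double] = DenseVector(line.split(' ').map(_.toDouble))
    }

    val data = sc.textFile("points.txt")
    val parsed = data.map(VectorParser.parse)   // closure only references the serializable object
    parsed.count()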

Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread أنس الليثي
Hadoop version : 2.6.0 Spark Version : 1.2.1 here is also the pom.xml: xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" 4.0.0 TestSpark TestSpark

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Tathagata Das
Exactly, that is the reason. To avoid that, in Spark 1.3 (to be released), we have added a new Kafka API (called direct stream) which does not use Zookeeper at all to keep track of progress, and maintains offsets within Spark Streaming. That can guarantee all records being received exactly once. Its
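
For reference, a rough sketch of the direct Kafka API added in Spark 1.3; the broker addresses and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("myTopic")

    // No receivers and no ZooKeeper offset tracking; offsets are tracked by Spark Streaming itself.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()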

Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread necro351 .
Sorry Ted, that was me clumsily trying to redact my organization's name from the computer output (in my e-mail editor). I can assure you that basically defend7 and rick are the same thing in this case so the class is present in the jar. On Mon Feb 23 2015 at 9:39:09 PM Ted Yu wrote: > The classn

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-23 Thread Tathagata Das
In case this mystery has not been solved, DStream.print() essentially does an RDD.take(10) on each RDD, which computes only a subset of the partitions in the RDD. But collect() forces the evaluation of all the RDDs. Since you are writing to json in the mapI() function, this could be the reason. TD
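
A small sketch of the difference being described; the stream name is made up for illustration:

    // print() only does a take(10) under the hood, so partitions beyond the first
    // few may never be computed and side effects in the map may not run everywhere.
    mappedStream.print()

    // Forcing every partition of every batch to be evaluated:
    mappedStream.foreachRDD { rdd =>
      rdd.foreach(_ => ())   // or rdd.count(); either touches all partitions
    }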

Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread Ted Yu
bq. have installed hadoop on a local virtual machine Can you tell us the release of hadoop you installed? What Spark release are you using? Or to be more specific, what hadoop release was Spark built against? Cheers On Mon, Feb 23, 2015 at 9:37 PM, fanooos wrote: > Hi > > I have installed

Write ahead Logs and checkpoint

2015-02-23 Thread V Dineshkumar
Hi, My Spark Streaming application is pulling data from Kafka. To prevent data loss I have implemented WAL and enabled checkpointing. On killing my application and restarting it I am able to prevent data loss now, but I am getting duplicate messages. Is it because the application got killed b

Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread Ted Yu
The classname given in stack trace was com.rick.reports.Reports In the output from jar command the class is com.defend7.reports.Reports. FYI On Mon, Feb 23, 2015 at 9:33 PM, necro351 . wrote: > Hi Ted, > > Yes it appears to be: > rick@ubuntu:~/go/src/rick/sparksprint/containers/tests/Streaming

InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread fanooos
Hi I have installed hadoop on a local virtual machine using the steps from this URL https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 In the local machine I wrote a little Spark application in Java to read a file from the hadoop instance installed in the vir

Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread necro351 .
Hi Ted, Yes it appears to be: rick@ubuntu:~/go/src/rick/sparksprint/containers/tests/StreamingReports$ jar tvf ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar|grep SensorReports 1128 Mon Feb 23 17:34:46 PST 2015 com/defend7/reports/Reports$SensorReports$1.class 13507

How to get yarn logs to display in the spark or yarn history-server?

2015-02-23 Thread Colin Kincaid Williams
Hi, I have been trying to get my yarn logs to display in the spark history-server or yarn history-server. I can see the log information yarn logs -applicationId application_1424740955620_0009 15/02/23 22:15:14 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to us3sm2hbqa04r07-comp-pr

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread Akhil Das
That depends on how many machines you have in your cluster. Say you have 6 workers; most likely the receivers will be distributed across all workers (assuming your topic has 6 partitions). Now when you have more than 6 partitions, say 12, then these 6 receivers will start to consume from 2 partitions at

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread Tathagata Das
Distributed among cluster nodes. On Mon, Feb 23, 2015 at 8:45 PM, bit1...@163.com wrote: > Hi, Akhil,Tathagata, > > This leads me to another question ,For the Spark Streaming and Kafka > Integration, If there are more than one Receiver in the cluster, such as > val streams = (1 to 6).map ( _ =

Task not serializable exception

2015-02-23 Thread Kartheek.R
Hi, I have a file containing data in the following way: 0.0 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.2 9.0 9.0 9.0 9.1 9.1 9.1 9.2 9.2 9.2 Now I do the following: val kPoints = data.takeSample(withReplacement = false, 4, 42).toArray val thread1= new Thread(new Runnable { def run() { v

Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread Ted Yu
bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$ SensorReports Is Reports$SensorReports class in rick-processors-assembly-1.0.jar ? Thanks On Mon, Feb 23, 2015 at 8:43 PM, necro351 . wrote: > Hello, > > I am trying to deserialize some data encoded using proto buff fro

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
Hi Akhil, Tathagata, This leads me to another question: for the Spark Streaming and Kafka integration, if there is more than one Receiver in the cluster, such as val streams = (1 to 6).map ( _ => KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2) ), then these Receivers will

Getting to proto buff classes in Spark Context

2015-02-23 Thread necro351 .
Hello, I am trying to deserialize some data encoded using proto buff from within Spark and am getting class-not-found exceptions. I have narrowed the program down to something very simple that shows the problem exactly (see 'The Program' below) and hopefully someone can tell me the easy fix :) So

Re: Re: About FlumeUtils.createStream

2015-02-23 Thread bit1...@163.com
Thanks both of you guys on this! bit1...@163.com From: Akhil Das Date: 2015-02-24 12:58 To: Tathagata Das CC: user; bit1129 Subject: Re: About FlumeUtils.createStream I see, thanks for the clarification TD. On 24 Feb 2015 09:56, "Tathagata Das" wrote: Akhil, that is incorrect. Spark will li

Re: About FlumeUtils.createStream

2015-02-23 Thread Akhil Das
I see, thanks for the clarification TD. On 24 Feb 2015 09:56, "Tathagata Das" wrote: > Akhil, that is incorrect. > > Spark will listen on the given port for Flume to push data into it. > When in local mode, it will listen on localhost: When in some kind of cluster, instead of localhost you wi

Re: About FlumeUtils.createStream

2015-02-23 Thread Tathagata Das
Akhil, that is incorrect. Spark will listen on the given port for Flume to push data into it. When in local mode, it will listen on localhost: When in some kind of cluster, instead of localhost you will have to give the hostname of the cluster node where you want Flume to forward the data. Spark
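
A minimal sketch of the push-based Flume receiver being described; the hostname and port are placeholders, and the host must be a worker node that the Flume agent's Avro sink points at:

    import org.apache.spark.streaming.flume.FlumeUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Spark listens on worker-host:41414 and Flume pushes events to that address.
    val flumeStream = FlumeUtils.createStream(ssc, "worker-host", 41414)
    flumeStream.count().map(c => "Received " + c + " flume events").print()

    ssc.start()
    ssc.awaitTermination()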

Re: Re: Re_ Re_ Does Spark Streaming depend on Hadoop_(4)

2015-02-23 Thread bit1...@163.com
Thanks Tathagata! You are right, I have packaged the contents of the Spark shipped example jar into my jar, which contains several HDFS configuration files like hdfs-default.xml etc. Thanks! bit1...@163.com From: Tathagata Das Date: 2015-02-24 12:04 To: bit1...@163.com CC: yuzhihong; silv

Re: Accumulator in SparkUI for streaming

2015-02-23 Thread Tathagata Das
Unless I am unaware of some recent changes, the Spark UI shows stages and jobs, not accumulator results. And the UI is not designed to be pluggable for showing user-defined stuff. TD On Fri, Feb 20, 2015 at 12:25 AM, Tim Smith wrote: > On Spark 1.2: > > I am trying to capture # records read from a ka

Re: Executor size and checkpoints

2015-02-23 Thread Tathagata Das
Hey Yana, I think you posted screenshots, but they are not visible in the email. Probably better to upload them and post links. Are you using StreamingContext.getOrCreate? If that is being used, then it will recreate the SparkContext with SparkConf having whatever configuration is present in the

Many Receiver vs. Many threads per Receiver

2015-02-23 Thread bit1...@163.com
Hi, I am experimenting with Spark Streaming and Kafka integration. To read messages from Kafka in parallel, basically there are two ways: 1. Create many Receivers like (1 to 6).map(_ => KafkaUtils.createStream). 2. Specify many threads when calling KafkaUtils.createStream like val topicMap("myTopic"
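
Roughly, the two approaches look like this; the topic, group, and ZooKeeper quorum below are illustrative:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val zkQuorum = "zk-host:2181"

    // 1. Many receivers, one thread each: the receivers get spread across worker nodes.
    val streams = (1 to 6).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, "myGroup", Map("myTopic" -> 1)).map(_._2)
    }
    val unified = ssc.union(streams)

    // 2. One receiver with several consumer threads: the parallelism stays inside one executor.
    val single = KafkaUtils.createStream(ssc, zkQuorum, "myGroup", Map("myTopic" -> 6)).map(_._2)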

Re: Re_ Re_ Does Spark Streaming depend on Hadoop_(4)

2015-02-23 Thread Tathagata Das
You could have HDFS configuration files in the classpath of the program. The HDFS libraries that Spark uses automatically pick those up and start using them. TD On Mon, Feb 23, 2015 at 7:47 PM, bit1...@163.com wrote: > I am crazy for frequent mail rejection so I create a new thread > SMTP error

Re_ Re_ Does Spark Streaming depend on Hadoop_(4)

2015-02-23 Thread bit1...@163.com
I am going crazy from frequent mail rejections, so I am creating a new thread SMTP error, DOT: 552 spam score (5.7) exceeded threshold (FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_REPLY,HTML_FONT_FACE_BAD,HTML_MESSAGE,RCVD_IN_BL_SPAMCOP_NET,SPF_PASS Hi Silvio and Ted I know there is a configuration parameter to cont

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
Is your laptop behind a NAT ? I got bitten by a similar issue and (I think) it was because I was behind a NAT that did not forward the public ip back to my private ip unless the connection originated from my private ip cheers On Tue, Feb 24, 2015 at 5:20 AM, Oleg Shirokikh wrote: > Dear Patric

Re: spark streaming window operations on a large window size

2015-02-23 Thread Tathagata Das
The default persistence level is MEMORY_AND_DISK, so the LRU policy would discard the blocks to disk, so the streaming app will not fail. However, since things will get constantly read in and out of disk as windows are processed, the performance won't be great. So it is best to have sufficient memor
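
If memory pressure is the concern, the storage level of the windowed stream can also be overridden explicitly; a sketch with placeholder window sizes and an assumed input stream named lines:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Minutes, Seconds}

    val windowed = lines.window(Minutes(60), Seconds(30))

    // Storing the window serialized trades some CPU for a smaller memory footprint.
    windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)
    windowed.count().print()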

Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread Ted Yu
This thread could be related: http://search-hadoop.com/m/JW1q592kqi&subj=Re+spark+shell+working+in+scala+2+11+breaking+change+ On Mon, Feb 23, 2015 at 7:08 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Looks like your Spark config may be trying to log to an HDFS path. Can > you re

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Here, I wanted to ask a different thing though. Let me put it this way. What is the relationship between the performance of a Spark job and the number of cores in a standalone Spark single-node cluster? Thank You On Tue, Feb 24, 2015 at 8:39 AM, Deep Pradhan wrote: > You mean SPARK_WORKER_COR

Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread Silvio Fiorito
Looks like your Spark config may be trying to log to an HDFS path. Can you review your config settings? From: bit1...@163.com Sent: Monday, February 23, 2015 9:54 PM To: yuzhihong Cc: user@spark.apache.org

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
You mean SPARK_WORKER_CORES in /conf/spark-env.sh? On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui wrote: > In Standalone mode, a Worker JVM starts an Executor. Inside the Exec there > are slots for task threads. The slot count is configured by the num_cores > setting. Generally over subscribe

Re: Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread bit1...@163.com
[hadoop@hadoop bin]$ sh submit.log.streaming.kafka.complicated.sh Spark assembly has been built with Hive, including Datanucleus jars on classpath Start to run MyKafkaWordCount Exception in thread "main" java.net.ConnectException: Call From hadoop.master/192.168.26.137 to hadoop.master:9000 fa

Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

2015-02-23 Thread Cheng Lian
(Move to user list.) Hi Kannan, You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this line of code, which overrides spark.default.parallelism. Also,

Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread Ted Yu
Can you pastebin the whole stack trace ? Thanks > On Feb 23, 2015, at 6:14 PM, "bit1...@163.com" wrote: > > Hi, > > When I submit a spark streaming application with following script, > > ./spark-submit --name MyKafkaWordCount --master local[20] --executor-memory > 512M --total-executor-cor

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I *think* this may have been related to the default memory overhead setting being too low. I raised the value to 1G and tried my job again but I had to leave the office before it finished. It did get further but I'm not exactly sure if that's just because I raised the memory. I'll see tomorrow-

Does Spark Streaming depend on Hadoop?

2015-02-23 Thread bit1...@163.com
Hi, When I submit a spark streaming application with following script, ./spark-submit --name MyKafkaWordCount --master local[20] --executor-memory 512M --total-executor-cores 2 --class spark.examples.streaming.MyKafkaWordCount my.kakfa.wordcountjar An exception occurs: Exception in thread "ma

Re: Spark SQL Where IN support

2015-02-23 Thread Michael Armbrust
Yes. On Mon, Feb 23, 2015 at 1:45 AM, Paolo Platter wrote: > I was speaking about the 1.2 version of Spark > > Paolo > > *From:* Paolo Platter > *Sent:* Monday, 23 February 2015 10:41 > *To:* user@spark.apache.org > > Hi guys, > > Is the “IN” operator supported in Spark SQL over

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-23 Thread Michael Armbrust
This is not currently supported. Right now you can only get RDD[Row] as Ted suggested. On Sun, Feb 22, 2015 at 2:52 PM, Ted Yu wrote: > Haven't found the method in > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD > > The new DataFrame has this method: >

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tobias Pfeiffer
Hi, On Tue, Feb 24, 2015 at 4:34 AM, Tathagata Das wrote: > There are different kinds of checkpointing going on. updateStateByKey > requires RDD checkpointing which can be enabled only by called > sparkContext.setCheckpointDirectory. But that does not enable Spark > Streaming driver checkpoints,

Performance Instrumentation for Spark Jobs

2015-02-23 Thread Neil Ferguson
Hi all I wanted to share some details about something I've been working on with the folks on the ADAM project: performance instrumentation for Spark jobs. We've added a module to the bdg-utils project ( https://github.com/bigdatagenomics/bdg-utils) to enable Spark users to instrument RDD operatio

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RMSE varies a little bit between the versions. 2. Partitioned the training, validation, test set like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test = ratin

Re: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Imran Rashid
I think you're getting tripped up by lazy evaluation and the way stage boundaries work (admittedly it's pretty confusing in this case). It is true that up until recently, if you unioned two RDDs with the same partitioner, the result did not have the same partitioner. But that was just fixed here: htt

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-23 Thread Sebastián Ramírez
In your log it says: pickle.PicklingError: Can't pickle : it's not found as thread.lock As far as I know, you can't pickle Spark models. If you go to the documentation for Pickle you can see that you can pickle only simple Python structures and code (written in Python), at least as I understand:

Re: Spark configuration

2015-02-23 Thread Shlomi Babluki
I guess you downloaded the source code. You can build it with the following command: mvn -DskipTests clean package Or just download a compiled version. Shlomi > On 24 Feb 2015, at 00:40, King sami wrote: > > Hi Experts, > I am new to Spark, so I want to manipulate it locally on my machine wi

Re: Spark configuration

2015-02-23 Thread Sean Owen
It sounds like you downloaded the source distribution perhaps, but have not built it. That's what the message is telling you. See http://spark.apache.org/docs/latest/building-spark.html Or maybe you intended to get a binary distribution. On Mon, Feb 23, 2015 at 10:40 PM, King sami wrote: > Hi Ex

Spark configuration

2015-02-23 Thread King sami
Hi Experts, I am new to Spark, so I want to manipulate it locally on my machine with Ubuntu as OS. I downloaded the latest version of Spark. I ran this command to start it: ./sbin/start-master.sh but an error occurred: *starting org.apache.spark.deploy.master.Master, logging to /home/spuser/Bur

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I've got the opposite problem with regards to partitioning. I've got over 6000 partitions for some of these RDDs which immediately blows the heap somehow- I'm still not exactly sure how. If I coalesce them down to about 600-800 partitions, I get the problems where the executors are dying without an

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I've no context of this book. AFAIK union will not trigger a shuffle, as it just puts the partitions together; the operator reduceByKey() will actually trigger the shuffle. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 12:26 PM To: Shao, Saisai Cc: use

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I think some RDD APIs like zipPartitions or others can do this as you wanted. I might check the docs. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 1:35 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger s

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
This also triggers an interesting question: how can I do this locally in code if I want to. For example: I have RDD A and B, which have some partitions; then if I want to join A to B, I might just want to do a map-side join (although B itself might be big, B's local partition is known to be small enoug

Re: Missing shuffle files

2015-02-23 Thread Anders Arpteg
Sounds very similar to what I experienced, Corey. Something that seems to at least help with my problems is to have more partitions. I'm already fighting between ending up with too many partitions in the end and having too few in the beginning. By coalescing as late as possible and avoiding too few i

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Just to close the loop in case anyone runs into the same problem I had. By setting --hadoop-major-version=2 when using the ec2 scripts, everything worked fine. Darin. - Original Message - From: Darin McBeath To: Mingyu Kim ; Aaron Davidson Cc: "user@spark.apache.org" Sent: Monday, F

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
I'm looking @ my yarn container logs for some of the executors which appear to be failing (with the missing shuffle files). I see exceptions that say "client.TransportClientFactory: Found inactive connection to host/ip:port, closing it." Right after that I see "shuffle.RetryingBlockFetcher: Excepti

On app upgrade, restore sliding window data.

2015-02-23 Thread Matus Faro
Hi, Our application is being designed to operate at all times on a large sliding window (day+) of data. The operations performed on the window of data will change fairly frequently and I need a way to save and restore the sliding window after an app upgrade without having to wait the duration of t

Re: Movie Recommendation tutorial

2015-02-23 Thread Xiangrui Meng
Which Spark version did you use? Btw, there are three datasets from MovieLens. The tutorial used the medium one (1 million). -Xiangrui On Mon, Feb 23, 2015 at 8:36 AM, poiuytrez wrote: > What do you mean? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabbl

Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Xiangrui Meng
Yes, we are going to expose the developer API. There was a long discussion in the PR: https://github.com/apache/spark/pull/3637. So we marked them package private and look for feedback on how to improve it. Please implement your classes under `spark.ml` for now and let us know your feedback. Thanks

Re: shuffle data taking immense disk space during ALS

2015-02-23 Thread Xiangrui Meng
Did you try to use a smaller number of partitions (user/product blocks)? Did you use implicit feedback? In the current implementation, we only do checkpointing with implicit feedback. We should adopt the checkpoint strategy implemented in LDA: https://github.com/apache/spark/blob/master/mllib/src/main/s

Re: Efficient way of scoring all items and users in an ALS model

2015-02-23 Thread Xiangrui Meng
You can use rdd.cartesian then find top-k by key to distribute the work to executors. There is a trick to boost the performance: you need to blockify user/product features and then use native matrix-matrix multiplication. There is a relevant PR from Deb: https://github.com/apache/spark/pull/3098 .
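
A naive version of the cartesian approach, without the blockified matrix-matrix multiplication suggested above, assuming a trained MatrixFactorizationModel named model:

    val k = 10

    // Score every (user, product) pair, then keep the top-k products per user.
    // This shuffles heavily; blockifying the factors as suggested is much faster.
    val scores = model.userFeatures.cartesian(model.productFeatures).map {
      case ((user, uf), (product, pf)) =>
        val score = uf.zip(pf).map { case (a, b) => a * b }.sum
        (user, (product, score))
    }

    val topK = scores.groupByKey().mapValues { ps =>
      ps.toSeq.sortBy(-_._2).take(k)
    }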

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
In the book Learning Spark: So does it mean that no shuffle happens across the network, but a shuffle is still done locally? Even if that is the case, why would union trigger a shuffle? I think union would just append the RDDs together. From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent:

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions. I removed the "persist" call from the program. Doing so I started it with: spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/ This takes all the defaults and only runs 2 execut

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
If you call reduceByKey(), internally Spark will introduce a shuffle operation, no matter whether the data is already partitioned locally; Spark itself does not know the data is already well partitioned. So if you want to avoid a shuffle, you have to write the code explicitly to avoid this, from my unde
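
A sketch of making the partitioning explicit so Spark can see it; the key/value types and partition count are illustrative:

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(8)

    // partitionBy attaches a partitioner object that later operations can inspect.
    val pairs = rawPairs.partitionBy(part).cache()   // rawPairs: RDD[(String, Int)]

    // Because `pairs` already carries `part`, reduceByKey with the same partitioner
    // can combine values for this input without another shuffle.
    val reduced = pairs.reduceByKey(part, _ + _)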

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Thanks. I think my problem might actually be the other way around. I'm compiling with hadoop 2, but when I startup Spark, using the ec2 scripts, I don't specify a -hadoop-major-version and the default is 1. I'm guessing that if I make that a 2 that it might work correctly. I'll try it and

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Mingyu Kim
Cool, we will start from there. Thanks Aaron and Josh! Darin, it's likely because the DirectOutputCommitter is compiled with Hadoop 1 classes and you're running it with Hadoop 2. org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it became an interface in Hadoop 2. Mingyu

Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
Hi All, I am running a simple page rank program, but it is slow. And I dug out that part of the reason is that a shuffle happens when I call a union action even though both RDDs share the same partitioner: Below is my test code in spark shell: import org.apache.spark.HashPartitioner sc.getConf.set("

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Aaron. Thanks for the class. Since I'm currently writing Java based Spark applications, I tried converting your class to Java (it seemed pretty straightforward). I set up the use of the class as follows: SparkConf conf = new SparkConf() .set("spark.hadoop.mapred.output.committer.class", "com

Re: Query data in Spark RRD

2015-02-23 Thread Tathagata Das
You could build a REST API, but you may have issues if you want to return arbitrary binary data. A more complex but robust alternative is to use some RPC libraries like Akka, Thrift, etc. TD On Mon, Feb 23, 2015 at 12:45 AM, Nikhil Bafna wrote: > > Tathagata - Yes, I'm thinking on that line

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
Yes, if you're willing to add an explicit foreach(println), then that is the simplest solution. Else changing maxPrintString should modify the default output of the Scala/Spark REPL. On Mon, Feb 23, 2015 at 11:25 AM, Sean Owen wrote: > I'd imagine that myRDD.take(10).foreach(println) is the mos

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-02-23 Thread Tathagata Das
There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing which can be enabled only by called sparkContext.setCheckpointDirectory. But that does not enable Spark Streaming driver checkpoints, which is necessary for recovering from driver failures. That is enab
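
A condensed sketch of how the two kinds of checkpointing are usually wired together; the checkpoint directory and app name are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myApp"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("StatefulApp")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)   // checkpoint dir used for RDD and DStream checkpointing
      // ... define the DStream graph here, including updateStateByKey ...
      ssc
    }

    // Driver recovery only happens when the context is (re)built through getOrCreate.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()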

Re: Access time to an elemnt in cached RDD

2015-02-23 Thread Sean Owen
It may involve accessing an element of an RDD on a remote machine and copying it back to the driver. That and the small overhead of job scheduling could be a millisecond. You're comparing to just reading an entry from memory, which is of course faster. I don't think you should think of an RDD as s

Re: How to print more lines in spark-shell

2015-02-23 Thread Sean Owen
I'd imagine that myRDD.take(10).foreach(println) is the most straightforward thing but yeah you can probably change shell default behavior too. On Mon, Feb 23, 2015 at 7:15 PM, Mark Hamstra wrote: > That will produce very different output than just the 10 items that Manas > wants. > > This is ess
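
The suggestion in one line, with a made-up RDD name and row count:

    // Bring n rows to the driver and print them one per line, instead of
    // relying on the REPL's truncated toString rendering of the Array.
    myRDD.take(100).foreach(println)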

Re: How to print more lines in spark-shell

2015-02-23 Thread Mark Hamstra
That will produce very different output than just the 10 items that Manas wants. This is essentially a Scala shell issue, so this should apply: http://stackoverflow.com/questions/9516567/settings-maxprintstring-for-scala-2-9-repl On Mon, Feb 23, 2015 at 10:25 AM, Akhil Das wrote: > You can do i

Re: RDD groupBy

2015-02-23 Thread Vijayasarathy Kannan
You are right. I was looking at the wrong logs. I ran it on my local machine and saw that the println actually wrote the vertexIds. I was then able to find the same in the executors' logs in the remote machine. Thanks for the clarification. On Mon, Feb 23, 2015 at 2:00 PM, Sean Owen wrote: > He

Re: RDD groupBy

2015-02-23 Thread Sean Owen
Here, println isn't happening on the driver. Are you sure you are looking at the right machine's logs? Yes this may be parallelized over many machines. On Mon, Feb 23, 2015 at 6:37 PM, kvvt wrote: > In the snippet below, > > graph.edges.groupBy[VertexId](f1).foreach { > edgesBySrc => { > f

Re: Requested array size exceeds VM limit

2015-02-23 Thread Sean Owen
It doesn't mean 'out of memory'; it means 'you can't allocate a byte[] over 2GB in the JVM'. Something is serializing a huge block somewhere. there are a number of related JIRAs and discussions on JIRA and this mailing list; have a browse of those first for back story. On Mon, Feb 23, 2015 at 6:44

Requested array size exceeds VM limit

2015-02-23 Thread insperatum
Hi, I'm using MLlib to train a random forest. It's working fine to depth 15, but if I use depth 20 I get a java.lang.OutOfMemoryError: Requested array size exceeds VM limit on the driver, from the collectAsMap operation in DecisionTree.scala, around line 642. It doesn't happen until a good hour into

RDD groupBy

2015-02-23 Thread kvvt
In the snippet below, graph.edges.groupBy[VertexId](f1).foreach { edgesBySrc => { f2(edgesBySrc).foreach { vertexId => { println(vertexId) } } } } "f1" is a function that determines how to group the edges (in my case it groups by source vertex) "f2" is another fu

Re: How to print more lines in spark-shell

2015-02-23 Thread Akhil Das
You can do it like myRDD.foreach(println(_)) to print everything. Thanks Best Regards On Mon, Feb 23, 2015 at 11:49 PM, Manas Kar wrote: > Hi experts, > I am using Spark 1.2 from CDH5.3. > When I issue commands like > myRDD.take(10) the result gets truncated after 4-5 records. > > Is there a

Re: How to integrate HBASE on Spark

2015-02-23 Thread sandeep vura
Hi Deepak, Thanks for posting the link. Looks like it supports only Cloudera distributions, as stated on GitHub. We are using an Apache Hadoop multinode cluster, not a Cloudera distribution. Please confirm whether I can use it on an Apache Hadoop cluster. Regards, Sandeep.v On Mon, Feb 23, 2015 a

RE: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Oleg Shirokikh
Dear Patrick, Thanks a lot again for your help. > What happens if you submit from the master node itself on ec2 (in client > mode), does that work? What about in cluster mode? If I SSH to the machine with Spark master, then everything works - shell, and regular submit in both client and cluste

How to print more lines in spark-shell

2015-02-23 Thread Manas Kar
Hi experts, I am using Spark 1.2 from CDH5.3. When I issue commands like myRDD.take(10) the result gets truncated after 4-5 records. Is there a way to configure the same to show more items? ..Manas

Access time to an elemnt in cached RDD

2015-02-23 Thread shahab
Hi, I just wonder what the access time would be to "take" one element from a cached RDD? If I have understood correctly, access to RDD elements is not as fast as accessing e.g. a HashMap, and it could take up to milliseconds compared to nanoseconds in a HashMap, which is quite a significant difference i

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there are slots for task threads. The slot count is configured by the num_cores setting. Generally, oversubscribe this. So if you have 10 free CPU cores, set num_cores to 20. On Monday, February 23, 2015, Deep Pradhan wrote: >

Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
How is a task slot different from the # of Workers? >> so don't read into any performance metrics you've collected to extrapolate what may happen at scale. I did not get what you mean by this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui wrote: > In general you should first figure out how many

Re: Running Example Spark Program

2015-02-23 Thread Deepak Vohra
The Spark cluster has no memory allocated. Memory: 0.0 B Total, 0.0 B Used From: Surendran Duraisamy <2013ht12...@wilp.bits-pilani.ac.in> To: user@spark.apache.org Sent: Sunday, February 22, 2015 6:00 AM Subject: Running Example Spark Program Hello All, I am new to Apache Spark,

Re: Force RDD evaluation

2015-02-23 Thread Nicholas Pritchard
Thanks, Sean! Yes, I agree that this logging would still have some cost and so would not be used in production. On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen wrote: > I think the cheapest possible way to force materialization is something > like > > rdd.foreachPartition(i => None) > > I get the use
