MLlib ALS -- Errors communicating with MapOutputTracker

2014-05-21 Thread Sue Cai
Hello, I am currently using MLlib ALS to process a large volume of data, about 1.2 billion Rating(userId, productId, rates) triples. The dataset was separated into 4000 partitions for parallelized computation on our YARN clusters. I encountered this error "Errors communicating with MapOutputTracker"

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are "received". If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Apologies for the premature send. Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are "received". If you are using any of the standard receivers (Flume, Kafka, etc.), I recomm
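
A minimal sketch of the approach TD describes, assuming the Spark 1.0 receiver API (org.apache.spark.streaming.receiver.Receiver); the rate limit and fetchOne() source below are hypothetical stand-ins for your own ingestion logic:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical throttled receiver; maxPerSecond and fetchOne() are
// placeholders for your own source and rate limit.
class ThrottledReceiver(maxPerSecond: Int)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart() {
    new Thread("Throttled Receiver") {
      override def run() {
        val intervalMs = 1000L / maxPerSecond
        while (!isStopped) {
          store(fetchOne())        // hand one record to Spark
          Thread.sleep(intervalMs) // crude rate limit
        }
      }
    }.start()
  }

  def onStop() {}

  // Stand-in for reading a single record from the real source.
  private def fetchOne(): String = "record"
}
```

You would then plug it in with something like ssc.receiverStream(new ThrottledReceiver(100)).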

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread sagi
If you saw an exception message like the one mentioned in the JIRA https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827 On Wed, May 21, 2014 at 11:21 AM, Josh Marcus wrote: > Aaron: > > I see this in the Master's l

ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Hi, I have set up a cluster with Mesos (backed by Zookeeper) with three master and three slave instances. I set up Spark (git HEAD) for use with Mesos according to this manual: http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html Using the spark-shell, I can connect to this clus

Re: Ignoring S3 0 files exception

2014-05-21 Thread Laurent T
No one has any idea? It's really troublesome; it seems like I have no way to catch errors while an action is being processed and just ignore them. Here's a bit more detail on what I'm doing: JavaRDD a = sc.textFile("s3n://" + missingFilenamePattern) JavaRDD b = sc.textFile("s3n://" + existingFilenamePat

RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi, I want to do union of all RDDs in each window of DStream. I found Dstream.union and haven't seen anything like DStream.windowRDDUnion. Is there any way around it? I want to find mean and SD of all values which comes under each sliding window for which I need to union all the RDDs in each w

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, I was curious about this issue and tried to run your example on my local Mesos. I was able to reproduce your issue using your current config: [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 1.0:4 failed 4 times (most recent failure: Exception failure: java.lang.

Log analysis

2014-05-21 Thread Shubhabrata
I am new to Spark and we are developing a data science pipeline based on Spark on EC2. So far we have been using Python on a Spark standalone cluster. However, being a newbie, I would like to know more about how I can do debugging (program level) from the Spark logs (is it stderr?). I find it a bit diffi

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Madhu
Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a "bad header" error with gzip, I had a non-gzip file with the wrong extension for whatever reason. - Madhu https://www.linkedin.com/in/msiddalingaia

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Gerard, thanks very much for your investigation! After hours of trial and error, I am kind of happy to hear it is not just a broken setup on my side that's causing the error. Could you explain briefly how you created that simple jar file? Thanks, Tobias On Wed, May 21, 2014 at 9:47 PM, Gerard M

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, For your simple example, I just used sbt package, but for more complex jobs that have external dependencies, either: - you should use sbt assembly [1] or mvn shade plugin [2] to build a "fat jar" (aka jar-with-dependencies) - or provide a list of jars including your job jar along wit
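
A sketch of wiring a fat jar into the job itself, as Gerard's second option suggests; the jar path, app name, and ZooKeeper URI below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical assembly path produced by `sbt assembly`; adjust to your build.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos") // placeholder ZK URI
  .setAppName("MyJob")
  .setJars(Seq("target/scala-2.10/myjob-assembly-0.1.jar"))   // shipped to executors

val sc = new SparkContext(conf)
```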

pyspark.rdd.ResultIterable?

2014-05-21 Thread T.J. Alumbaugh
Hi, I'm noticing a difference between two installations of Spark. I'm pretty sure both are version 0.9.1. One is able to import pyspark.rdd.ResultIterable and the other isn't. Is this an environment problem or do we actually have two different versions of Spark? To be clear, on one box, one can d

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Michael Armbrust
You can already extract fields from json data using Hive UDFs. We have an intern working on on better native support this summer. We will be sure to post updates once there is a working prototype. Michael On Tue, May 20, 2014 at 6:46 PM, Nick Chammas wrote: > The Apache Drill

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, Regarding my comment on closure serialization: I was discussing it with my fellow Sparkers here and I totally overlooked the fact that you need the class files to de-serialize the closures (or whatever) on the workers, so you always need the jar file delivered to the workers in order f

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Tobias Pfeiffer
Hi Gerard, first, thanks for your explanations regarding the jar files! On Thu, May 22, 2014 at 12:32 AM, Gerard Maas wrote: > I was discussing it with my fellow Sparkers here and I totally overlooked > the fact that you need the class files to de-serialize the closures (or > whatever) on the wo

Re: Python, Spark and HBase

2014-05-21 Thread twizansk
Thanks Nick and Matei. I'll take a look at the patch and keep you updated. Tommer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6176.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Han JU
I've also seen worker loss, and that's why I asked a question about worker re-spawn. My typical case is a job that got an OOM exception. Then on the master UI some worker's state becomes DEAD. In the master's log, there's an error like: ``` 14/05/21 15:38:02 ERROR remote.EndpointWriter: Associatio

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Mark Hamstra
After the several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shutdown cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to fi

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
Hi Nick, Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4, all my Gzip files were valid and tested yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail h

Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks this really helps. As long as I stick to HDFS paths, and files I’m good. I do know that code a bit but have never used it to say take input from one cluster via “hdfs://server:port/path” and output to another via “hdfs://another-server:another-port/path”. This seems to be supported by S

Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Does anyone know if: ./bin/spark-shell --master yarn is running yarn-cluster or yarn-client by default? Based on the source code: ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala if (args.deployMode == "cluster" && args.master.startsWith("yarn")) { args.master = "yarn-cl

Re: RDD union of a window in Dstream

2014-05-21 Thread Sean Owen
Are you not just looking for the window() function that creates the sliding-window RDDs in the first place? That DStreams' RDDs give you all elements in the sliding window, and you can compute a mean or variance as you like. You should be able to do this quite efficiently without recomputing each
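
A sketch of what Sean suggests, assuming a DStream of numeric values; the host, port, and window sizes are illustrative:

```scala
import org.apache.spark.SparkContext._ // implicits for RDD[Double].stats()
import org.apache.spark.streaming.Seconds

// `ssc`, host, port, and window sizes are placeholders.
val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

// Each RDD of the windowed DStream already holds the union of all
// elements in the last 30 seconds, sliding every 10 seconds.
values.window(Seconds(30), Seconds(10)).foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val s = rdd.stats() // StatCounter: count, mean, stdev, ...
    println("mean=" + s.mean + " stdev=" + s.stdev)
  }
}
```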

Re: How to run the SVM and LogisticRegression

2014-05-21 Thread yxzhao
Thanks, Debasish, Could you let me know the full path of BinaryClassification.scala and how to run the SVM or LR? I did not find this file in Spark 0.9.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-the-SVM-and-LogisticRegression-tp5720p6182

Re: Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Or
The answer is actually yarn-client. A quick way to find out: $ bin/spark-shell --master yarn --verbose From the system properties you can see spark.master is set to "yarn-client." From the code, this is because args.deployMode is null, and so it's not equal to "cluster" and so it falls into the

Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
I ran the pagerank example processing a large data set, 5GB in size, using 48 machines. The job got stuck at the time point 14/05/20 21:32:17, as the attached log shows. It was stuck there for more than 10 hours, and then I finally killed it. But I did not find any information explaining why it was

ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Mohit Jaggi
Hi, I changed my application to use Joda time instead of java.util.Date and I started getting this: WARN ExternalAppendOnlyMap: Spilling in-memory map of 484 MB to disk (1 time so far) What does this mean? How can I fix this? Because of this, a small job takes forever. Mohit. P.S.: I am using Kryo

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui On Wed, May 21, 2014 at 11:23 AM, yxzhao wrote: > I run the pagerank example processing a large data set, 5GB in size, using 48 > machines. The job got stuck at the time point: 14/05/20 21:32:17, as the > attached log s

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Tobias, On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer wrote: >first, thanks for your explanations regarding the jar files! No prob :-) > On Thu, May 22, 2014 at 12:32 AM, Gerard Maas > wrote: > > I was discussing it with my fellow Sparkers here and I totally overlooked > > the fact that

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread yxzhao
Thanks Xiangrui. How do I check and make sure the data is distributed evenly? Thanks again. On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User List] wrote: > Many OutOfMemoryErrors in the log. Is your data distributed evenly? > -Xiangrui > > On Wed, May 21, 2014 at 11:23 AM, yxzha

Re: Job Processing Large Data Set Got Stuck

2014-05-21 Thread Xiangrui Meng
If the RDD is cached, you can check its storage information in the Storage tab of the Web UI. On Wed, May 21, 2014 at 12:31 PM, yxzhao wrote: > Thanks Xiangrui, How to check and make sure the data is distributed > evenly? Thanks again. > On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache
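
If the RDD isn't cached, one hedged alternative is to count records per partition directly; `rdd` below stands in for the job's input RDD:

```scala
// Count records per partition to spot skew; `rdd` is a placeholder.
val sizes = rdd.mapPartitionsWithIndex { (i, it) =>
  Iterator((i, it.size))
}.collect()

// Print the ten largest partitions.
sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
  println("partition " + i + ": " + n + " records")
}
```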

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Andrew Ash
Here's the 1.0.0rc9 version of the docs: https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html I refreshed them with the goal of steering users more towards prebuilt packages than relying on compiling from source plus improving overall formatting and clarity, but not otherw

Re: unsubscribe

2014-05-21 Thread Shangyu Luo
Does anyone know how to configure the digest mailing list? For example, I want to receive a daily digest, not one every 10 messages. Thanks! On Mon, May 19, 2014 at 4:29 PM, Shangyu Luo wrote: > Hi Andrew and Madhu, > Thank you for your help here! Will unsubscribe through another address and > may s

Re: ClassNotFoundException with Spark/Mesos (spark-shell works fine)

2014-05-21 Thread Gerard Maas
Hi Andrew, Thanks for the current doc. > I'd almost gotten to the point where I thought that my custom code needed to be included in the SPARK_EXECUTOR_URI but that can't possibly be correct. The Spark workers that are launched on Mesos slaves should start with the Spark core jars and then t

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
Looking forward to that update! Given a table of JSON objects like this one: {"name": "Nick", "location": {"x": 241.6, "y": -22.5}, "likes": ["ice cream", "dogs", "Vanilla Ice"]} It would be SUPER COOL if we could query that table in a way that is as natural as follows

tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine but when I commit to repo and run the tests automatically with bamboo the test cases fail with these errors. How to fix? 21-May-2014 16:33:09 [info] StreamingZigZagSp

Re: RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi, It seems that previously I understood reduceByWindow wrongly. But now to me reduceByWindow means that after this operation all the elements in each window reduce to one RDD. In that case, the code will be as follows: val individualpoints = ssc.socketTextStream(args(1), args(2).toInt,

Inconsistent RDD Sample size

2014-05-21 Thread glxc
I have a graph and am trying to take a random sample of vertices without replacement, using the RDD.sample() method. verts are the vertices in the graph: val verts = graph.vertices. Executing this multiple times in a row: verts.sample(false, 1.toDouble/v1.count.toDouble, System.cur

RE: tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
Just found this at the top of the log: 17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN o.e.j.u.component.AbstractLifeCycle - FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use build 21-May-2014 17:14:41 java.net.BindException

Re: Inconsistent RDD Sample size

2014-05-21 Thread Xiangrui Meng
It doesn't guarantee the exact sample size. If you fix the random seed, it would return the same result every time. -Xiangrui On Wed, May 21, 2014 at 2:05 PM, glxc wrote: > I have a graph and am trying to take a random sample of vertices without > replacement, using the RDD.sample() method > > ve
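
To make that concrete, a sketch against the thread's verts RDD, assuming the Spark 1.0 signatures; the fraction, sample size, and seed are arbitrary:

```scala
// sample() only approximates the requested fraction; fixing the seed
// makes the result repeatable across runs.
val approx = verts.sample(false, 0.01, 42L)

// takeSample() returns exactly `num` elements, as a local array,
// so prefer it when the exact sample size matters.
val exact = verts.takeSample(false, 100, 42L)
```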

Re: Problem with loading files: Loss was due to java.io.EOFException java.io.EOFException

2014-05-21 Thread hakanilter
The problem is solved after hadoop-core dependency added. But I think there is a misunderstanding about local files. I found this one: "Note that if you've connected to a Spark master, it's possible that it will attempt to load the file on one of the different machines in the cluster, so make sure
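
For small side files, one common workaround (a sketch; the file path is a placeholder) is to ship the file explicitly rather than relying on the same local path existing on every machine:

```scala
import org.apache.spark.SparkFiles

// Ship a driver-local side file to every node; the path is a placeholder.
sc.addFile("/local/path/lookup.txt")

val rdd = sc.parallelize(1 to 10).map { i =>
  // Each worker resolves its own shipped copy of the file.
  val path = SparkFiles.get("lookup.txt")
  (i, path)
}
```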

I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
Hi. Firstly, I'm a newb (to both Scala & Spark). I have a stream that contains multiple types of records, and I would like to create multiple streams based on that. Currently I have it set up as: class ALL; class Orange extends ALL; class Apple extends ALL. Now I can easily add a filter, a la: val reco

RE: Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode?

2014-05-21 Thread Andrew Lee
Ah, forgot the --verbose option. Thanks Andrew. That is very helpful. Date: Wed, 21 May 2014 11:07:55 -0700 Subject: Re: Is spark 1.0.0 "spark-shell --master=yarn" running in yarn-cluster mode or yarn-client mode? From: and...@databricks.com To: user@spark.apache.org The answer is actually yarn-

Re: tests that run locally fail when run through bamboo

2014-05-21 Thread Tathagata Das
This does happen sometimes, but it is only a warning because Spark is designed to try successive ports until it succeeds. So unless a crazy number of successive ports is blocked (runaway processes?? insufficient clearing of ports by the OS??), these errors should not prevent tests from passing. On Wed
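
If the colliding port is the web UI's 4040, as in Adrian's log, one mitigation worth trying in CI is a sketch like the following, assuming the UI honors port 0 as "pick any free port":

```scala
import org.apache.spark.SparkConf

// Pin the web UI to an OS-assigned port so parallel builds
// don't collide on 4040.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("streaming-tests")
  .set("spark.ui.port", "0")
```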

Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Kevin Markey
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2. The application successfully ran to conclusion, but it ultimately failed. There were 2 anomalies... 1. ASM reported onl

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tathagata Das
You could do records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] } On Wed, May 21, 2014 at 3:28 PM, Ian Holsman wrote: > Hi. > Firstly I'm a newb (to both Scala & Spark). > > I have a stream, that contains multiple types of records, and I would like > to create multiple st

Run Apache Spark on Mini Cluster

2014-05-21 Thread Upender Nimbekar
Hi, I would like to set up the Apache Spark platform on a mini cluster. Is there any recommendation for the hardware that I can buy to set it up? I am thinking about processing a significant amount of data, in the range of a few terabytes. Thanks Upender

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google compute engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar wrote: > Hi, >

Re: A new resource for getting examples of Spark RDD API calls

2014-05-21 Thread zhen
Great, thanks for that tip. I will update the documents! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529p6210.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tobias Pfeiffer
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das wrote: > records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] } I think a Scala-ish way would be records.flatMap(_ match { case i: Int => Some(i) case _ => None })
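
Adapted to the Orange/Apple classes from the original question (a sketch; assumes records is a DStream[ALL]):

```scala
// Keep only the subtype of interest; Option flatMaps away the rest.
val oranges = records.flatMap {
  case o: Orange => Some(o)
  case _         => None
}
val apples = records.flatMap {
  case a: Apple => Some(a)
  case _        => None
}
```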

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
Thanks Tobias & Tathagata. these are great. On Wed, May 21, 2014 at 8:02 PM, Tobias Pfeiffer wrote: > On Thu, May 22, 2014 at 8:07 AM, Tathagata Das > wrote: > > records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] } > > I think a Scala-ish way would be > > records.flatMap(_

yarn-client mode question

2014-05-21 Thread Sophia
In yarn-client mode, will Spark be deployed on the YARN nodes? If it is deployed only on the client, can Spark submit the job to YARN? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html Sent from the Apache Spark User L

Re: yarn-client mode question

2014-05-21 Thread Andrew Or
Hi Sophia, In yarn-client mode, the node that submits the application can either be inside or outside of the cluster. This node also hosts the driver (SparkContext) of the application. All the executors, however, will be launched on nodes inside the YARN cluster. Andrew 2014-05-21 18:17 GMT-07:

Re: yarn-client mode question

2014-05-21 Thread Sophia
But I don't understand this point: is it necessary to deploy Spark slave nodes on the YARN nodes? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: RDD union of a window in Dstream

2014-05-21 Thread Tobias Pfeiffer
Hi, On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed wrote: > I want to do union of all RDDs in each window of DStream. A window *is* a union of all RDDs in the respective time interval. The documentation says "a DStream is represented as a sequence of RDDs". However, data from a certain time inter

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Tobias Pfeiffer
Hi, as far as I understand, if you create an RDD with a relational structure from your JSON, you should be able to do much of that already today. For example, take lift-json's deserializer and do something like val json_table: RDD[MyCaseClass] = json_data.flatMap(json => json.extractOpt[MyCaseC
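
Filling that sketch out against Nick's example JSON, assuming lift-json on the classpath; json_data is a placeholder RDD[String] with one object per line:

```scala
import net.liftweb.json._

case class Location(x: Double, y: Double)
case class Person(name: String, location: Location, likes: List[String])

// `json_data` is a placeholder RDD[String], one JSON object per line.
val people = json_data.flatMap { line =>
  implicit val formats = DefaultFormats // kept inside the closure: not serializable
  parse(line).extractOpt[Person]        // None for malformed records
}

people.filter(_.location.x > 200).map(_.name).collect()
```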

RE: yarn-client mode question

2014-05-21 Thread Liu, Raymond
Seems you are asking whether the Spark-related jars need to be deployed to the YARN cluster manually before you launch the application? Then no, you don't, just like any other YARN application. And it doesn't matter whether it is yarn-client or yarn-cluster mode. Best Regards, Raymond Liu -Original Message-

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tom Graves
It sounds like something is closing the HDFS filesystem before everyone is really done with it. The filesystem gets cached and is shared, so if someone closes it while other threads are still using it, you run into this error. Is your application closing the filesystem? Are you using the eve

Re: ExternalAppendOnlyMap: Spilling in-memory map

2014-05-21 Thread Andrew Ash
Hi Mohit, The log line about the ExternalAppendOnlyMap is more of a symptom of slowness than causing slowness itself. The ExternalAppendOnlyMap is used when a shuffle is causing too much data to be held in memory. Rather than OOM'ing, Spark writes the data out to disk in a sorted order and reads
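
If the spilling itself is the bottleneck, the relevant knobs in this era of Spark are spark.shuffle.memoryFraction and spark.shuffle.spill; a sketch, where 0.3 is an illustrative value, not a recommendation:

```scala
import org.apache.spark.SparkConf

// The right fraction depends on how much heap your caching already claims.
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.3") // heap share before spilling
  .set("spark.shuffle.spill", "true")         // keep spilling on to avoid OOMs
```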

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Andrew Ash
One thing you can try is to pull each file out of S3 and decompress with "gzip -d" to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob. Andrew On Wed, May 21, 2014 at 12:40 PM, Michael Cutler wrote: > Hi Nick, > > Which version of Hadoop are you using wit

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Nicholas Chammas
That's a good idea. So you're saying create a SchemaRDD by applying a function that deserializes the JSON and transforms it into a relational structure, right? The end goal for my team would be to expose some JDBC endpoint for analysts to query from, so once Shark is updated to use Spark SQL that

RE: yarn-client mode question

2014-05-21 Thread Sophia
Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6224.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Nicholas Chammas
Thanks for the suggestions, people. I will try to hone in on which specific gzipped files, if any, are actually corrupt. Michael, I’m using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Krishna Sankar
It depends on what stack you want to run. A quick cut:
- Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes)
  - Dual 6-core CPU
  - 64 to 128 GB RAM
  - 3 x 3 TB disk (JBOD)
- Master Node (NameNode, HBase Master, Spark Master)
  - Dual 6-core CPU
  - 64 t

Best way to deploy a jar to spark cluster?

2014-05-21 Thread Min Li
Hi, I'm quite new and recently started to try Spark. I've set up a single-node Spark "cluster" and followed the tutorials in the Quick Start, but I've come across some issues. What I was trying to do is use the Java API and run it on the single-node "cluster". I followed the Quick Start/A Stand

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tathagata Das
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP(?) ? We may be able to reproduce this in that case. TD On Wed, May 21, 2014 at 8:35 PM, Tom Graves wrote: > It sounds like something is closing the hdfs filesystem before everyone is > really done with it. The filesyst

java.io.IOException: Failed to save output of task

2014-05-21 Thread Grega Kešpret
Hello, my last reduce task in the job always fails with "java.io.IOException: Failed to save output of task" when using saveAsTextFile with s3 endpoint (all others are successful). Has anyone had similar problems? https://gist.github.com/gregakespret/813b540faca678413ad4 - 14/05/21

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Michael Cutler
Hi Nick, Here is an illustrated example which extracts certain fields from Facebook messages, each one is a JSON object and they are serialised into files with one complete JSON object per line. Example of one such message: CandyCrush.json You