Re: hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode. On Saturday, May 31, 2014, Aaron Davidson wrote: > First issue was because your cluster was configured incorrectly. You could > probably read 1 file because that was done on the driver n

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Aaron Davidson
There is no fundamental issue if you're running on data that is larger than cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many
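The streaming-vs-partition distinction above can be illustrated in plain Python (this is a sketch of the concept, not the Spark API): a map-like operation can process one record at a time, while a sort-like operation must hold the whole partition in memory.

```python
# Illustrative sketch (plain Python, not Spark): map-like operations
# stream records one at a time, so memory use is independent of input
# size; sort-like operations must materialize the entire partition.

def map_partition(records, f):
    # Streams: only one record is resident at a time.
    for r in records:
        yield f(r)

def sort_partition(records):
    # Materializes: the whole partition must fit in memory.
    return sorted(records)

partition = iter(range(5))
streamed = map_partition(partition, lambda x: x * 2)
print(next(streamed))  # first result produced without materializing the rest
```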

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the "official" quick-start linux releases and I'm none the wiser: Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit) Provisi

Spark on EC2

2014-05-31 Thread superback
Hi, I am trying to run an example on AMAZON EC2 and have successfully set up one cluster with two nodes on EC2. However, when I was testing an example using the following command, ./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077 I got the following warnings

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Yadid Ayzenberg
Yep, I just issued a pull request. Yadid On 5/31/14, 1:25 PM, Patrick Wendell wrote: 1. ctx is an instance of JavaSQLContext but the textFile method is called as a member of ctx. According to the API, JavaSQLContext does not have such a member, so I'm guessing this should be sc instead. Yeah, I

spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all, I tried a couple ways, but couldn't get it to work.. The following seems to be what the online document ( http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf
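The shape of the invocation the linked docs describe, as a fragment for reference (the assembly-jar path and conf dir are copied from the message above; the submit line itself is an assumption, with a placeholder class and jar, and will vary per application):

```shell
export SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
export YARN_CONF_DIR=/opt/hadoop/conf
./bin/spark-submit --master yarn-cluster --class your.app.Main your-app.jar
```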

can not access app details on ec2

2014-05-31 Thread wxhsdp
Hi all, I launched a spark cluster on EC2 with spark version v1.0.0-rc3. Everything goes well except that I cannot access application details on the web UI: I just click on the application name, but there's no response. Has anyone met this before? Is this a bug? Thanks!

hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
I'm running the following code to load an entire directory of Avros using hadoopRDD. val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*" // Setup the path for the job via a Hadoop JobConf val jobConf = new JobConf(sc.hadoopConfiguration) jobConf.setJobName("Test Scala Job") F

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Nicholas Chammas
That's a neat idea. I'll try that out. On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell wrote: > I think there are a few ways to do this... the simplest one might be to > manually build a set of comma-separated paths that excludes the bad file, > and pass that to textFile(). > > When you call t
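The "incorrect header check" message in the subject line comes from zlib when it is handed bytes that are not actually compressed. A minimal reproduction in Python (an assumption that Hadoop's codec surfaces the same underlying zlib message; the helper name is made up):

```python
import zlib

def decompress_or_error(data: bytes):
    """Return (payload, None) on success, or (None, error message) on failure."""
    try:
        return zlib.decompress(data), None
    except zlib.error as e:
        return None, str(e)

# Plain text mislabeled as compressed data fails the header check,
# which is what a corrupt or mis-extensioned .gz file triggers.
payload, err = decompress_or_error(b"this is plain text, not zlib data")
print(err)
```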

Re: Trouble with EC2

2014-05-31 Thread Matei Zaharia
What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it’s complaining about from the conf/slaves file. Matei On May 30, 2014, at 11:18 AM, PJ$ wrote: > Hey Folks, > > I'm really having quite a bit of trouble gettin

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile(). When you call textFile() under the hood it is going to pass your filename string to hadoopFile() which calls setInputPaths(
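The comma-separated-paths workaround can be sketched in plain Python (the file names are made up for illustration): filter out the known-bad file, join the rest with commas, and pass the resulting string to textFile().

```python
# Sketch of the suggestion above: build one comma-separated path string
# that skips the corrupt file. Spark's textFile() accepts such a string
# since it is forwarded to Hadoop's setInputPaths().

def paths_excluding(files, bad):
    """Join all paths except the known-bad ones into a comma-separated string."""
    return ",".join(f for f in files if f not in bad)

files = [
    "hdfs:///logs/part-0000.gz",
    "hdfs:///logs/part-0001.gz",
    "hdfs:///logs/part-0002.gz",
]
good = paths_excluding(files, {"hdfs:///logs/part-0001.gz"})
print(good)
# The result would then be passed as: sc.textFile(good)
```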

Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
> 1) Is there a guarantee that a partition will only be processed on a node > which is in the "getPreferredLocations" set of nodes returned by the RDD ? No there isn't, by default Spark may schedule in a "non preferred" location after `spark.locality.wait` has expired. http://spark.apache.org/doc
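For reference, the fallback behavior described above is tuned through the property Patrick names; a spark-defaults.conf fragment (the value is an example in milliseconds, per the 1.0-era configuration docs):

```
# conf/spark-defaults.conf fragment (example value): how long the
# scheduler waits for a preferred node before relaxing locality.
spark.locality.wait   3000
```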

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
> 1. ctx is an instance of JavaSQLContext but the textFile method is called as > a member of ctx. > According to the API, JavaSQLContext does not have such a member, so I'm > guessing this should be sc instead. Yeah, I think you are correct. > 2. In that same code example the object sqlCtx is refer

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor is always run in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:39 PM, ans
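The static-initialization idea can be sketched in Python (names and the `cat` helper process are invented for illustration): a module-level singleton lazily launches one external process per worker and reuses it across tasks, which mirrors one-per-JVM static state on an executor.

```python
# Sketch of per-worker static initialization: the helper subprocess is
# launched once, on first use, and every later call reuses the same
# bridge process.
import atexit
import subprocess

_helper = None

def get_helper():
    """Launch the external process on first call; later calls reuse it."""
    global _helper
    if _helper is None:
        _helper = subprocess.Popen(
            ["cat"],  # stand-in for the real external program
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        atexit.register(_helper.terminate)  # clean up when the worker exits
    return _helper

p1 = get_helper()
p2 = get_helper()
print(p1 is p2)  # the same bridge process is reused
```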

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
Hey There, You can remove an accumulator by just letting it go out of scope and it will be garbage collected. For broadcast variables we actually store extra information for it, so we provide hooks for users to remove the associated state. There is no such need for accumulators, though. - Patrick

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the "user@" list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k wrote: > Hi, > >

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Patrick Wendell
I've removed my docs from my site to avoid confusion... somehow that link propagated all over the place! On Sat, May 31, 2014 at 1:58 AM, Xiangrui Meng wrote: > The documentation you looked at is not official, though it is from > @pwendell's website. It was for the Spark SQL release. Please find

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
Oh, sorry, I forgot to add: here are the extra lines in my spark_ec2.py @205 "r3.large":"hvm", "r3.xlarge": "hvm", "r3.2xlarge": "hvm", "r3.4xlarge": "hvm", "r3.8xlarge": "hvm" Clearly a masterpiece of hacking. :-) I haven't tested all of them. The r3 set seems to act

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
Hi there, Patrick. Thanks for the reply... It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like that. :-) Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat, SUSE) So, users such

Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread prabeesh k
Hi, scenario: Read data from HDFS, apply a Hive query on it, and write the result back to HDFS. Schema creation, querying, and saveAsTextFile are working fine with the following modes:
 - local mode
 - mesos cluster with single node
 - spark cluster with multi node
Schema creation and q

Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-05-31 Thread Xiangrui Meng
Hi Tobias, One hack you can try is:

rdd.mapPartitions(iter => {
  val x = new X()
  iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
})

Best, Xiangrui On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer wrote: > Hi, > > I want to use an object x in my RDD processing as
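The trick above works because the shutdown runs only after the partition's iterator is fully drained. A plain-Python analog of the same pattern (X and its methods are stand-ins from the thread, not a real API):

```python
# Analog of the mapPartitions cleanup hack: create the resource once
# per partition, map each row through it, and shut it down after the
# last row has been consumed.

class X:
    def __init__(self):
        self.closed = False
    def do_something_with(self, row):
        return row * 2
    def shutdown(self):
        self.closed = True

def process_partition(rows):
    x = X()
    for row in rows:
        yield x.do_something_with(row)
    # Reached only after the iterator is fully drained, mirroring the
    # `++ { x.shutdown(); Iterator.empty }` trick in the Scala version.
    x.shutdown()

print(list(process_partition(iter([1, 2, 3]))))  # [2, 4, 6]
```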

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Xiangrui Meng
The documentation you looked at is not official, though it is from @pwendell's website. It was for the Spark SQL release. Please find the official documentation here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm It contains a working example show

Re: Failed to remove RDD error

2014-05-31 Thread Mayur Rustagi
You can increase your akka timeout; that should give you some more life... Are you running out of memory by any chance? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sat, May 31, 2014 at 6:52 AM, Michael Chang wrote: > I'm
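A spark-defaults.conf fragment for the timeout suggestion above (property names from the 0.9/1.0-era configuration docs; the values in seconds are examples, not recommendations):

```
# conf/spark-defaults.conf fragment: raise the Akka timeouts so slow
# control-plane messages are not treated as failures.
spark.akka.timeout       300
spark.akka.askTimeout    300
```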

Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga wrote: > Some