Mongo-Hadoop Connector with Spark

2014-04-07 Thread Pavan Kumar
Hi everyone, I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that GridFS collection data using Java Spark MapReduce. Previously I successfully processed normal MongoDB collections (not GridFS) with Apache Spark using the Mongo-Hadoop connector. Now I'm unable to handle input
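For context, GridFS does not store the PDF as a single document: it splits it into chunk documents in `fs.chunks` (with `files_id`, `n`, and `data` fields) plus metadata in `fs.files`. A connector that reads `fs.chunks` as a normal collection therefore sees raw chunks, so one plausible approach is to group chunks by `files_id`, sort by `n`, and reassemble. A minimal sketch of that reassembly logic, using plain Python dicts in place of BSON documents (hypothetical data, no MongoDB required):

```python
from collections import defaultdict

def reassemble_gridfs_chunks(chunks):
    """Group GridFS-style chunk documents by files_id, order them by
    their sequence number n, and concatenate the binary payloads."""
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["files_id"]].append(chunk)
    files = {}
    for files_id, parts in by_file.items():
        parts.sort(key=lambda c: c["n"])  # chunks may arrive out of order
        files[files_id] = b"".join(c["data"] for c in parts)
    return files

# Hypothetical chunk documents, shaped like what a connector would surface.
chunks = [
    {"files_id": "pdf1", "n": 1, "data": b"world"},
    {"files_id": "pdf1", "n": 0, "data": b"hello "},
    {"files_id": "pdf2", "n": 0, "data": b"spark"},
]
print(reassemble_gridfs_chunks(chunks))
```

In Spark terms this grouping would be a key-by on `files_id` followed by a sort within each group; for a single 2 GB file, note that all chunks of one file end up on one worker.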

[BLOG] For Beginners

2014-04-07 Thread prabeesh k
Hi all, here I am sharing a blog post for beginners about creating a standalone Spark Streaming application and bundling the app as a single runnable jar. Take a look and drop your comments on the blog page. http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html http://prabstec

Re: trouble with "join" on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote: > I am running the latest version of PySpark branch-0.9 and having some > trouble with join. > > One RDD is about 100G (25GB compressed and serialized in memory) with > 130K records, the other RDD is about 10G (2.5G compressed and > serialized in

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-07 Thread Francis . Hu
Great! When I built it on another disk formatted as ext4, it works now.

hadoop@ubuntu-1:~$ df -Th
Filesystem  Type      Size  Used  Avail  Use%  Mounted on
/dev/sdb6   ext4      135G  8.6G  119G   7%    /
udev        devtmpfs  7.7G  4.0K  7.7G   1%    /dev
tmpfs

RE: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
1. I will paste the full content of the Environment page of the example application running against the cluster at the end of this message. 2. and 3.: Following #2, I was able to see that the count was incorrectly 0 when running against the cluster, and following #3 I was able to get the messa

Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too... On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or wrote: > Hi Koert, > > Other users have expressed interest for us to expose similar classes too > (i.e. StageInfo, TaskInfo). In the newest release, they will be available > as part of the developer API

Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Andrew Or
Hi Koert, Other users have expressed interest for us to expose similar classes too (i.e. StageInfo, TaskInfo). In the newest release, they will be available as part of the developer API. The particular PR that will change this is: https://github.com/apache/spark/pull/274. Cheers, Andrew On Mon,

RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
Any reason why RDDInfo suddenly became private in SPARK-1132? We are using it to show users the status of RDDs.

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
A few things would be helpful. 1. Environment settings - you can find them on the Environment tab in the Spark application UI. 2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write an HDFS file from a Spark program (say, spark-shell) to your HDFS ins

job offering

2014-04-07 Thread Rault, Severan
Hi, I am looking for users of Spark to join my teams here at Amazon. If you are reading this, you probably qualify. I am looking for developers of ANY level, but with an interest in Spark. My teams are leveraging Spark to solve real business scenarios. If you are interested, just shoot me a note a

Re: Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Thanks Shivaram! Will give it a try and let you know. Regards, Pawan Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > You can create standalone jobs in SparkR as just R files that are run > using the sparkR script. These commands will be sen

CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get t

Re: Creating a SparkR standalone job

2014-04-07 Thread Shivaram Venkataraman
You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent to a Spark cluster and the examples on the SparkR repository ( https://github.com/amplab-extras/SparkR-pkg#examples-unit-tests) are in fact standalone jobs. However I don't th

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch the instances ? The typical workflow is to use the spark-ec2 wrapper script using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html Shivaram On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini < silvio.co

Driver Out of Memory

2014-04-07 Thread Eduardo Costa Alfaia
Hi guys, I would like to understand why the driver's RAM goes down. Does the processing occur only in the workers? Thanks

# Start Tests
computer1 (Worker/Source Stream)
23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44
             total       used       free     shared    buffers

Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Hi, is it possible to create a standalone job in Scala using SparkR? If possible, can you provide me with information on the setup process? (Like the dependencies in SBT and where to include the JAR files.) This is my use case: 1. I have a Spark Streaming standalone job running in local machin

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram, OK, so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but this worked last Friday, and today (Monday) it no longer works. :D Why? ... ...It seems that NOW, whe

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Right now the spark-ec2 scripts assume that you have root access, and a lot of internal scripts assume the user's home directory is hard-coded as /root. However, all the Spark AMIs we build should have root ssh access -- do you find this not to be the case? You can also enable root ssh access i

Re: ui broken in latest 1.0.0

2014-04-07 Thread Koert Kuipers
got it thanks On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote: > This is fixed in https://github.com/apache/spark/pull/281. Please try > again with the latest master. -Xiangrui > > On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote: > > i noticed that for spark 1.0.0-SNAPSHOT which i chec

Re: ui broken in latest 1.0.0

2014-04-07 Thread Xiangrui Meng
This is fixed in https://github.com/apache/spark/pull/281. Please try again with the latest master. -Xiangrui On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers wrote: > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days ago > (apr 5) that the "application detail ui" no longer show

ui broken in latest 1.0.0

2014-04-07 Thread Koert Kuipers
I noticed that for spark 1.0.0-SNAPSHOT, which I checked out a few days ago (Apr 5), the "application detail ui" no longer shows any RDDs on the storage tab, despite the fact that they are definitely cached. I am running Spark in standalone mode.

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part of the reason there's not a lot of examples is that sql2rdd returns an RDD (TableRDD that is https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala). I haven't done anything too complicated yet but m

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made available if/when the paper is published. In terms of algorithms for hyperparameter tuning, we consider Grid Search, Random Search, a couple of older derivative-free optimization methods, and a few newer methods - TPE (aka Hy

SparkContext.addFile() and FileNotFoundException

2014-04-07 Thread Thierry Herrmann
Hi, I'm trying to use SparkContext.addFile() to propagate a file to worker nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the master). I don't have HDFS or any distributed file system. Just playing with basic stuff. Here's the code in my driver (actually spark-shell runnin
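For reference, the contract of SparkContext.addFile is: the driver registers a local file, Spark copies it into each executor's working directory, and tasks resolve it by base name via SparkFiles.get. Without a shared filesystem, failures usually mean the path is being resolved on the wrong machine. A toy stand-in for that mechanism (hypothetical class and names, no Spark needed):

```python
import os
import shutil
import tempfile

class MiniFileServer:
    """Toy stand-in for SparkContext.addFile / SparkFiles.get: the
    driver registers a file, each 'worker' receives a private copy,
    and tasks look the file up by base name only."""
    def __init__(self):
        self.registered = {}  # base name -> driver-side path

    def add_file(self, path):
        self.registered[os.path.basename(path)] = path

    def ship_to_worker(self, worker_dir):
        for name, path in self.registered.items():
            shutil.copy(path, os.path.join(worker_dir, name))

    @staticmethod
    def get(worker_dir, name):
        # Tasks never see the driver's path, only their local copy.
        return os.path.join(worker_dir, name)

driver_dir = tempfile.mkdtemp()
worker_dir = tempfile.mkdtemp()
src = os.path.join(driver_dir, "lookup.txt")
with open(src, "w") as f:
    f.write("hello")

server = MiniFileServer()
server.add_file(src)
server.ship_to_worker(worker_dir)
with open(MiniFileServer.get(worker_dir, "lookup.txt")) as f:
    print(f.read())
```

The key point the toy makes: the worker-side path differs from the driver-side path, so opening the original driver path inside a task fails unless both happen to share a filesystem.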

AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the Spark-EC2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the Spark-EC2 script allows you to specify which user to l

Re: reduceByKeyAndWindow Java

2014-04-07 Thread Eduardo Costa Alfaia
Hi TD, could you explain this part of the code to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
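For what it's worth, the two functions are the associative "add" and its inverse "subtract": instead of recomputing each window from scratch, this form of reduceByKeyAndWindow adds the counts from the batch entering the window and subtracts the counts from the batch sliding out. The bookkeeping can be illustrated without Spark (made-up batch data, not the actual API):

```python
from collections import Counter

def windowed_counts(batches, window):
    """Incremental windowed word counts: add the entering batch and
    subtract the batch that slides out, mimicking the two Function2
    arguments (i1 + i2 and i1 - i2) of reduceByKeyAndWindow."""
    running = Counter()
    windows = []
    for i, batch in enumerate(batches):
        for word in batch:
            running[word] += 1            # the "add" function: i1 + i2
        if i >= window:
            for word in batches[i - window]:
                running[word] -= 1        # the "inverse" function: i1 - i2
        windows.append({w: c for w, c in running.items() if c > 0})
    return windows

print(windowed_counts([["a", "b"], ["a"], ["c"]], window=2))
```

The invariant is that after step i the running counts equal the counts over the last `window` batches, without ever re-reading the whole window.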

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-) 1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transformation's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're tr
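On the GroupBy question, the general technique being asked about is an external (disk-backed) grouping: values are buffered in memory and spilled to temporary files past a threshold, then merged on read. The sketch below only illustrates that spill idea in the abstract, with made-up thresholds; it is not a claim about what Spark 0.9's groupBy actually does internally:

```python
import json
import os
import tempfile
from collections import defaultdict

class SpillingGrouper:
    """Illustrative external group-by: buffer (key, value) pairs in
    memory and spill them to a temp file past a (made-up) threshold."""
    def __init__(self, max_in_memory=4):
        self.max_in_memory = max_in_memory
        self.buffer = []
        self.spill_files = []

    def add(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            json.dump(self.buffer, f)
        self.spill_files.append(path)
        self.buffer = []

    def result(self):
        grouped = defaultdict(list)
        for path in self.spill_files:       # merge the spilled runs...
            with open(path) as f:
                for k, v in json.load(f):
                    grouped[k].append(v)
            os.remove(path)
        for k, v in self.buffer:            # ...plus what stayed in memory
            grouped[k].append(v)
        return dict(grouped)

g = SpillingGrouper(max_in_memory=2)
for k, v in [("x", 1), ("x", 2), ("y", 3), ("x", 4), ("y", 5)]:
    g.add(k, v)
res = g.result()
print(res)
```

If the grouped Seq is instead built fully in memory, a single very large group can exhaust the heap even when the RDD as a whole is persisted to disk, which is why this distinction matters.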

Re: non-lazy execution of sortByKey?

2014-04-07 Thread Matei Zaharia
Yeah, the reason it happens is that sortByKey tries to sample the data to figure out the right range partitions for it. But we could do this later, as the suggestion in there says. Matei On Apr 7, 2014, at 10:06 AM, Diana Carroll wrote: > Aha! Well I'm not crazy then, thanks. > > > On Mon,
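The sampling pass Matei describes can be sketched: to build a range partitioner, the keys are sampled, ordered cut points are chosen, and each key goes to the partition whose range covers it. Collecting that sample is itself a job, which is why sortByKey is not fully lazy. A rough, simplified illustration (not Spark's exact sampling algorithm):

```python
import bisect
import random

def range_bounds(keys, num_partitions, sample_size=20, seed=0):
    """Pick (num_partitions - 1) ordered cut points from a sample of
    the keys, the way a range partitioner does conceptually. Drawing
    the sample is the eager pass that triggers a job in Spark."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(key, bounds):
    # Keys less than or equal to a bound land in that bound's partition.
    return bisect.bisect_left(bounds, key)

keys = list(range(100))
bounds = range_bounds(keys, num_partitions=4)
parts = [partition_of(k, bounds) for k in keys]
print(bounds, max(parts))
```

Deferring this sampling until an action is called, as the thread suggests, would restore laziness at the cost of slightly more complex scheduling.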

PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi, We have a situation where a Pyspark script works fine as a local process ("local" url) on the Master and the Worker nodes, which would indicate that all python dependencies are set up properly on each machine. But when we try to run the script at the cluster level (using the master's url), if

Re: Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Thanks Rahul, let me try that. On Apr 7, 2014 7:33 PM, "Rahul Singhal" wrote: > Hi Sai, > > I recently also ran into this problem on 0.9.1. The problem is that > spark tries to read yarn's class path but when it finds it be empty does > not fallback to it's default value. To resolve this, eith

Re: How to create a RPM package

2014-04-07 Thread Will Benton
> For issue #2 I was concerned that the build & packaging had to be > internal. So I am using the already packaged make-distribution.sh > (modified to use a maven build) to create a tar ball which I then package > it using a RPM spec file. Hi Rahul, so the issue for downstream operating system dis

Re: Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Rahul Singhal
Hi Sai, I recently also ran into this problem on 0.9.1. The problem is that Spark tries to read YARN's classpath, but when it finds it to be empty it does not fall back to its default value. To resolve this, either set yarn.application.classpath in yarn-site.xml to its default value or put in a bug f
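For anyone hitting this, the fix described above is to copy the stock default into yarn-site.xml. The authoritative default lives in yarn-default.xml for your Hadoop version; for Hadoop 2.x it looks roughly like the following (verify the exact list against your distribution before using it):

```xml
<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>
```

With this set, Spark's YARN client reads a non-empty classpath and no longer needs the missing fallback.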

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi all, I wanted to get Spark on YARN up and running. I did "SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly". Then I ran "SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1

Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf The PageRank example int
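For readers following along, the PageRank example in those slides iterates as follows: each page splits its rank evenly among its outgoing links, and the new rank of a page is 0.15 + 0.85 times the contributions it received; pre-partitioning the links RDD is what lets the per-iteration join avoid a shuffle. The numeric core can be sketched without Spark (tiny made-up graph):

```python
def pagerank(links, iterations=10):
    """links: page -> list of outgoing neighbours. Mirrors the Spark
    example: ranks start at 1.0 and each iteration sends
    rank / len(out) contributions along the edges."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, out in links.items():
            for neighbour in out:
                contribs[neighbour] += ranks[page] / len(out)
        # Damping: 0.15 baseline plus 0.85 of the received mass.
        ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))
```

In the Spark version, `links` is the RDD that stays partitioned across iterations while the small `ranks` RDD is rebuilt each round, which is exactly where the partitioning questions in the slides come in.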

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, should I assume that Shark users should not use the Shark APIs, since there is no documentation for them? If there is documentation, can you point me to it? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam wrote: > Hello everyone, > > I have successfully installed Shark

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, Any thoughts on this? Thanks. -Suren On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > Hi, > > I know if we call persist with the right options, we can have Spark > persist an RDD's data on disk. > > I am wondering what happens in intermediate operat

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all, we have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are familiar with Python, but some of the features are developed in Java. I am looking for a way to integrate Java and Python on Spark. I notice that the initialization of PySpark does not include a field to distribute ja

hang on sorting operation

2014-04-07 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a certain memory threshold, but I cannot detect how to configure memory to avoid this. I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this allocated, but it does not help. The reduce is by key to get th
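One thing worth checking here: SPARK_DAEMON_MEMORY only sizes the master and worker daemon JVMs, not the executors that actually run the sort, so raising it would not be expected to help a reduce that runs out of memory. In a 0.9-era standalone deployment, executor memory comes from spark.executor.memory (or the older SPARK_MEM), for example in conf/spark-env.sh (illustrative values, adjust to your machines):

```shell
# conf/spark-env.sh (illustrative values only)
export SPARK_DAEMON_MEMORY=1g     # master/worker daemon JVMs only
export SPARK_WORKER_MEMORY=8g     # total memory a worker may grant to executors
# Per-application executor heap; can also be set via SparkConf in the driver.
export SPARK_JAVA_OPTS="-Dspark.executor.memory=4g"
```

If the hang persists with adequately sized executors, the next thing to inspect is skew in the sort keys, since one oversized partition can stall the whole stage.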

Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?