Re: persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Ok Thanks! On Mon, May 19, 2014 at 10:09 PM, Matei Zaharia wrote: > This is the patch for it: https://github.com/apache/spark/pull/50/. It > might be possible to backport it to 0.8. > > Matei > > On May 19, 2014, at 2:04 AM, Sai Prasanna wrote: > > Matei, I am using 0.8

Re: persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
lso be > in 0.9.0). > > Matei > > On May 19, 2014, at 12:41 AM, Sai Prasanna > wrote: > > > Hi all, > > > > When i gave the persist level as DISK_ONLY, still Spark tries to use > memory and caches. > > Any reason ? > > Do i need to override some parameter elsewhere ? > > > > Thanks ! > >

persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Hi all, When I set the persistence level to DISK_ONLY, Spark still tries to use memory and caches. Any reason? Do I need to override some parameter elsewhere? Thanks!
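
A minimal sketch of requesting disk-only persistence in a 0.8/0.9-era shell session (the input path is hypothetical); note that even with DISK_ONLY, Spark still uses memory while computing partitions and reading blocks back, only the persisted blocks themselves go to local disk:

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
    lines.persist(StorageLevel.DISK_ONLY)               // store blocks on local disk, not in block-manager memory
    lines.count()                                       // the first action materializes the persisted blocks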

File present but file not found exception

2014-05-15 Thread Sai Prasanna
Hi Everyone, I think all are pretty busy; the response time in this group has increased slightly. Anyway, this is a pretty silly problem, but I could not get past it. I have a file in my local FS, but when I try to create an RDD out of it, the task fails with a file-not-found exception thrown at th
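
One common cause of this (not confirmed in this truncated thread) is that a path on the driver's local filesystem is not visible to the executors on other machines. A sketch of the usual workarounds, with hypothetical paths:

    // Either ensure the file exists at the same local path on every worker
    // and reference it with an explicit file: URI ...
    val localRdd = sc.textFile("file:///home/user/data.txt")

    // ... or copy it into HDFS so every worker can read it
    // (e.g. hadoop fs -put /home/user/data.txt /data/data.txt)
    val hdfsRdd = sc.textFile("hdfs://namenode:9000/data/data.txt")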

Preferred RDD Size

2014-05-15 Thread Sai Prasanna
Hi, Is there any lower bound on the RDD size needed to optimally utilize Spark's in-memory framework? Say creating an RDD for a very small data set of some 64 MB is not as efficient as one of some 256 MB; then the application can be tuned accordingly. So is there a soft lower bound related to hadoop-blo

saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi, Can we override the default file-replication factor when using saveAsTextFile() to HDFS? My default replication factor is >1, but intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks!
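
One commonly suggested approach (a sketch, not confirmed as this thread's resolution) is to lower dfs.replication on the Hadoop configuration the job writes with, since it is a client-side setting that only affects newly created files; the paths are hypothetical:

    // Ask HDFS for replication factor 1 for files written through this SparkContext
    sc.hadoopConfiguration.set("dfs.replication", "1")

    intermediateRdd.saveAsTextFile("hdfs:///tmp/intermediate-output")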

Spark on Yarn - A small issue !

2014-05-12 Thread Sai Prasanna
Hi All, I wanted to launch Spark on YARN interactively, in yarn-client mode. With the default settings of yarn-site.xml and spark-env.sh, I followed the given link http://spark.apache.org/docs/0.8.1/running-on-yarn.html I get the pi value correctly when I run without launching the shell. When I launch t

Re: File present but file not found exception

2014-05-12 Thread Sai Prasanna
, create an RDD out of it operate * Is there any way out ?? Thanks in advance ! On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna wrote: > Hi Everyone, > > I think all are pretty busy, the response time in this group has slightly > increased. > > But anyways, this is a pretty silly

Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-05-05 Thread Sai Prasanna
I executed the following commands to launch a Spark app in yarn-client mode. I have Hadoop 2.3.0, Spark 0.8.1 and Scala 2.9.3. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly SPARK_YARN_MODE=true \ SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi, Any suggestions on the following issue? I have replication factor 3 in my HDFS. With 3 datanodes, I ran my experiments. Now I just added another node with no data on it. When I ran, Spark launches non-local tasks on it and the time taken is more than what it took for the 3-node cluster. He

Delayed Scheduling - Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi All, I have replication factor 3 in my HDFS. With 3 datanodes, I ran my experiments. Now I just added another node with no data on it. When I ran, Spark launches non-local tasks on it and the time taken is more than what it took for the 3-node cluster. Here delayed scheduling fails, I think, b
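
For the 0.8/0.9-era shell, locality wait is tuned through Java system properties that must be in place before the SparkContext is created; a hedged sketch (the property name follows that era's configuration docs, the values are illustrative):

    // For the interactive shell, pass the property at launch time, e.g.:
    //   SPARK_JAVA_OPTS="-Dspark.locality.wait.node=10000" ./spark-shell
    // In a standalone application, set it before constructing the context:
    System.setProperty("spark.locality.wait.node", "10000")   // wait up to 10 s for a node-local slot
    val sc = new org.apache.spark.SparkContext("spark://master:7077", "locality-test")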

Spark with Parquet

2014-04-27 Thread Sai Prasanna
Hi All, I want to store a CSV text file in Parquet format in HDFS and then do some processing in Spark. Somehow my search for a way to do this was futile; more help was available for Parquet with Impala. Any guidance here? Thanks!!
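
Core Spark of that era had no built-in Parquet support; a hedged sketch of how this became straightforward with Spark SQL (Spark 1.0+), assuming a simple case-class schema and hypothetical paths:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion RDD -> SchemaRDD

    case class Record(id: Int, name: String, value: Double)

    // Parse the CSV text, write it out as Parquet, then read it back for processing
    val records = sc.textFile("hdfs:///data/input.csv")
      .map(_.split(","))
      .map(r => Record(r(0).toInt, r(1), r(2).toDouble))
    records.saveAsParquetFile("hdfs:///data/input.parquet")

    val parquetData = sqlContext.parquetFile("hdfs:///data/input.parquet")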

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
astOptionis > used instead of > last to deal with empty file. > > > On Thu, Apr 24, 2014 at 7:38 PM, Sai Prasanna wrote: > >> Hi All, Finally i wrote the following code, which is felt does optimally >> if not the most optimum one. >> Using file pointers, seeking

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
=new String(bytes); /*bdd contains the last line*/ On Thu, Apr 24, 2014 at 11:42 AM, Sai Prasanna wrote: > Thanks Guys ! > > > On Thu, Apr 24, 2014 at 11:29 AM, Sourav Chandra < > sourav.chan...@livestream.com> wrote: > >> Also same thing can be done using rdd.to

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
> except for the last element in your iterator. This should leave one >>> element, which is your last element. >>> >>> Frank Austin Nothaft >>> fnoth...@berkeley.edu >>> fnoth...@eecs.berkeley.edu >>> 202-340-0466 >>> >>> On

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
ple: > > RDD.take(RDD.count()).last > > > On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna wrote: > >> Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. >> >> I want only to access the last element. >> >> >> On Thu, Apr 2

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want only to access the last element. On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna wrote: > Oh ya, Thanks Adnan. > > > On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob wrote: > >> You c

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Oh ya, Thanks Adnan. On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob wrote: > You can use following code: > > RDD.take(RDD.count()) > > > On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna wrote: > >> Hi All, Some help ! >> RDD.first or RDD.take(1) gives th

Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Hi All, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.
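
A hedged sketch of one way to do this without collecting the whole RDD: collect() preserves partition order, so keeping only the final element of each partition and taking the last of those yields the last element overall (the input path is hypothetical):

    val rdd = sc.textFile("hdfs:///data/input.txt")
    val lastLine = rdd.mapPartitions { it =>
      var last: Option[String] = None
      while (it.hasNext) last = Some(it.next())   // walk to the end of this partition
      last.iterator                               // emit at most one element per partition
    }.collect().lastOption                        // last non-empty partition's element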

Efficient Aggregation over DB data

2014-04-22 Thread Sai Prasanna
Hi All, I want to access a particular column of a DB table stored in CSV format and perform some aggregate queries over it. I wrote the following query in Scala as a first step. *var add=(x:String)=>x.split("\\s+")(2).toInt* *var result=List[Int]()* *input.split("\\n").foreach(x=>result::=add(x
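
For reference, a hedged sketch of the same aggregation expressed on an RDD instead of a collected string; the whitespace split and column index 2 are carried over from the snippet above, and the path is hypothetical:

    // Sum the third whitespace-separated column of the file, computed in parallel
    val table = sc.textFile("hdfs:///data/table.csv")
    val total = table.map(_.split("\\s+")(2).toInt).reduce(_ + _)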

SPARK Shell RDD reuse

2014-04-18 Thread Sai Prasanna
Hi All, In the interactive shell the Spark context remains the same. So if I run a query multiple times, the RDDs created by previous runs will be reused in the subsequent runs and not recomputed until I exit and restart the shell, right? Or is there a way to force a reuse/recompute in the presen
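
A hedged sketch of controlling this explicitly: only RDDs you cache() (or persist()) are kept across actions within the same shell/SparkContext, and unpersist() drops the cached blocks so the next action recomputes them (the path is hypothetical):

    val logs   = sc.textFile("hdfs:///data/logs")
    val errors = logs.filter(_.contains("ERROR")).cache()

    errors.count()     // first action computes and caches the partitions
    errors.count()     // served from the cache in the same shell session

    errors.unpersist() // drop the cached blocks; the next action recomputes from the file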

[no subject]

2014-04-18 Thread Sai Prasanna
Hi All, In the interactive shell the Spark context remains the same. So if I run a query multiple times, the RDDs created by previous runs will be reused in the subsequent runs and not recomputed until I exit and restart the shell, right? Or is there a way to force a reuse/recompute in the presen

Re: Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml > > Thanks, > Rahul Singhal > > From: Sai Prasanna > Reply-To: "user@spark.apache.org" > Date: Monday 7 April 2014 6:56 PM > To: "user@spark.apache.org" > Subje

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi All, I wanted to get Spark on YARN up and running. I did "*SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly*" Then I ran "*SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar SPARK_YARN_APP_JAR=examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.1

SSH problem

2014-04-01 Thread Sai Prasanna
Hi All, I have a five-node Spark cluster: Master, s1, s2, s3, s4. I have passwordless SSH to all slaves from the master and vice versa. But on just one machine, s2, after 2-3 minutes of my connection from master to slave, the write pipe is broken. So if I try to connect again from the master I ge

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
Oh sorry, that was a mistake; the default level is MEMORY_ONLY!! My doubt was: between two different experiments, do the RDDs cached in memory need to be unpersisted, or does it not matter?

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
ing? I am guessing it is MEMORY_ONLY. In > large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better. > > You can call unpersist on an RDD to remove it from Cache though. > > > On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna wrote: > >> No i am running on 0.8.1. &

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
e a few memory issues like these, some are resolved >> by changing the StorageLevel strategy and employing things like Kryo, some >> are solved by specifying the number of tasks to break down a given >> operation into etc. >> >> Ognen >> >> >> On 3/27/14, 10:
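
A hedged sketch of the two suggestions quoted above, in the 0.8/0.9 system-property style of configuration (class names follow that era's tuning guide; MyRegistrator and the path are hypothetical):

    // Kryo serialization: set before the SparkContext is created
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")   // optional custom registrator

    // Serialized storage that spills to disk, instead of deserialized objects in memory
    import org.apache.spark.storage.StorageLevel
    val data = sc.textFile("hdfs:///data/big-input")
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)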

GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
an someone throw some light on it ?? -- *Sai Prasanna. AN* *II M.Tech (CS), SSSIHL* *Entire water in the ocean can never sink a ship, Unless it gets inside.All the pressures of life can never hurt you, Unless you let them in.*

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
Thanks Chen, it's a bit clearer now and it's up and running... 1) In the WebUI, only the memory used per node is given. I can find it in the logs, but is there a port over which I can monitor memory usage, GC memory overhead, and RDD creation in the UI?

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
master URL > > if the later case, also yes, you can observe the distributed task in the > Spark UI > > -- > Nan Zhu > > On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote: > > Is it possible to run across cluster using Spark Interactive Shell ? > > To

Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
similar ??? -- *Sai Prasanna. AN* *II M.Tech (CS), SSSIHL* *Entire water in the ocean can never sink a ship, Unless it gets inside.All the pressures of life can never hurt you, Unless you let them in.*

Spark executor memory & relationship with worker threads

2014-03-25 Thread Sai Prasanna
Hi All, Does the number of worker threads bear any relationship to the executor memory setting? I have 16 GB RAM with an 8-core processor. I had set SPARK_MEM to 12g and was running locally with the default 1 thread. So this means there can be at most one executor on one node scheduled at any point of tim
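
For context, a hedged sketch of how thread count and executor memory were set in that era: local mode runs a single executor inside the driver JVM, the thread count comes from the master URL, and the heap from SPARK_MEM / spark.executor.memory:

    // Use 8 local worker threads instead of the default 1 (still one executor, one JVM)
    val sc = new org.apache.spark.SparkContext("local[8]", "local-test")

    // Heap for that JVM: export SPARK_MEM=12g before launching, or set
    // the spark.executor.memory system property before creating the context.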

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
f you want to have 8 GB executors then, yes, only two can run on each > 16 GB node. (In fact, you should also keep a significant amount of memory > free for the OS to use for buffer caching and such.) > An executor may use many cores, though, so this shouldn't be an issue. > > > On Mo

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
EAMON" - its DAEMON. Thanks Latin. > On Mar 24, 2014 7:25 AM, "Sai Prasanna" wrote: > >> Hi All !! I am getting the following error in interactive spark-shell >> [0.8.1] >> >> >> *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed mo

GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
-env.sh export SPARK_DEAMON_MEMORY=8g export SPARK_WORKER_MEMORY=8g export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g" export SPARK_JAVA_OPTS="-Xms8g -Xmx8g" export HADOOP_HEAPSIZE=4000 Any suggestions ?? -- *Sai Prasanna. AN* *II M.Tech (CS), SSSIHL*

Re: Connect Exception Error in spark interactive shell...

2014-03-19 Thread Sai Prasanna
node > if not & data in hdfs is not critical > hadoop namenode -format > & restart hdfs > > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > On Tue, Mar 18, 20

Connect Exception Error in spark interactive shell...

2014-03-18 Thread Sai Prasanna
Connection.setupConnection(Client.java:434)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
at org.apache.hadoop.ipc.Client.call(Client.java:10

Re: Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
Solved... but I don't know what the difference is... just running ./spark-shell fixes it all... but I don't know why!! On Mon, Mar 17, 2014 at 1:32 PM, Sai Prasanna wrote: > Hi everyone !! > > I installed scala 2.9.3, spark 0.8.1, oracle java 7... > > I launched master and logged on to

Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
need to set somewhere a timeout ??? Thank you !! -- *Sai Prasanna. AN* *II M.Tech (CS), SSSIHL*