Re: Spark on Yarn - A small issue !

2014-05-13 Thread Tom Graves
You need to look at the log files for YARN. Generally this can be done with "yarn logs -applicationId <applicationId>", but that only works if you have log aggregation enabled. You should be able to see at least the application master logs through the YARN ResourceManager web UI; I would try that first.

Re: streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-13 Thread zzzzzqf12345
Thanks for the reply. I have solved the problem and found the reason: I was using the Master node to upload files to HDFS, and this can take up a lot of the Master's network resources. When I switched to another machine that is not part of the cluster to upload the files, I got the correct result. QingF

1.0.0 Release Date?

2014-05-13 Thread bhusted
Can anyone comment on the anticipated date, or worst-case timeframe, for when Spark 1.0.0 will be released? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-0-0-Release-Date-tp5664.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to use Mahout VectorWritable in Spark.

2014-05-13 Thread Stuti Awasthi
Hi All, I am very new to Spark and trying to play around with MLlib, hence apologies for the basic question. I am trying to run the KMeans algorithm using both Mahout and Spark MLlib to compare their performance. The initial data size was 10 GB. Mahout converts the data into a Sequence File, which is used for KMean
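A rough sketch of one way to bridge the two, assuming Mahout sequence files with Text keys and VectorWritable values and the MLlib KMeans API (the path, cluster count, and iteration count are illustrative only):

    import org.apache.hadoop.io.Text
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Read the Mahout-generated sequence file (key/value classes assumed).
    val raw = sc.sequenceFile("/path/to/mahout/vectors",
                              classOf[Text], classOf[VectorWritable])

    // Hadoop reuses Writable instances, so copy each value into an
    // MLlib vector before caching. Dense conversion is for illustration.
    val vectors = raw.map { case (_, vw) =>
      val v = vw.get()
      Vectors.dense(Array.tabulate(v.size())(i => v.get(i)))
    }.cache()

    // Run MLlib KMeans: 10 clusters, 20 iterations (both made up here).
    val model = KMeans.train(vectors, 10, 20)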

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia wrote: > That API is something the HDFS administrator uses outside of any application > to tell HDFS to cache certain files or directories.

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
Count causes the overall performance to drop drastically. In fact, beyond 50 files it starts to hang if I force materialization. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, May 13, 2014 at 9:34 PM, X

Re: Is their a way to Create SparkContext object?

2014-05-13 Thread Andrew Ash
SparkContext is not serializable, so you can't send it across the cluster the way rdd.map(t => compute(sc, t._2)) would. There is likely a way to express what you're trying to do with an algorithm that doesn't require serializing SparkContext. Can you tell us more about your goals? Andrew
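A minimal sketch of the usual restructuring, with made-up data: broadcast whatever reference data the computation needs instead of the SparkContext itself.

    // The closure only captures the broadcast handle, never the SparkContext.
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)   // stand-in reference data
    val lookupBc = sc.broadcast(lookup)

    val rdd = sc.parallelize(Seq(("k1", "a"), ("k2", "b")))
    val result = rdd.map { case (_, v) => lookupBc.value.getOrElse(v, 0) }
    result.collect()   // Array(1, 2)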

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
Thanks Xiangrui. After some debugging efforts, it turns out that the problem results from a bug in my code. But it's good to know that a long lineage could also lead to this problem. I will also try checkpointing to see whether the performance can be improved. Best regards, - Guanhua On 5/13/14 1

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia wrote: > That API is something the HDFS administrator uses outside of any application > to tell HDFS to cache certain files or directories. But once you’ve done > that, any existing HDFS client accesses them directly from the cache. Ah, yeah, sure

Dead lock running multiple Spark Jobs on Mesos

2014-05-13 Thread Martin Weindel
I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0. If I run a single Spark job, it runs fine on Mesos. Running multiple Spark jobs also works if I'm using the coarse-grained mode ("spark.mesos.coarse" = true). But if I run two Spark jobs in parallel using the fine-gr
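For reference, a minimal sketch of enabling the coarse-grained mode mentioned above (the Mesos master URL and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Coarse-grained mode holds executors for the lifetime of the job,
    // the configuration reported here as working for parallel jobs.
    val conf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")     // placeholder URL
      .setAppName("my-job")
      .set("spark.mesos.coarse", "true")
    val sc = new SparkContext(conf)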

Turn BLAS on MacOSX

2014-05-13 Thread Debasish Das
Hi, How do I load native BLAS libraries on a Mac? I am getting the following errors while running LR and SVM with SGD: 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.g

Re: 1.0.0 Release Date?

2014-05-13 Thread Anurag Tangri
Hi All, We are also waiting for this. Does anyone know a tentative date for this release? We are on Spark 0.8.0 right now. Should we wait for Spark 1.0 or upgrade to Spark 0.9.1? Thanks, Anurag Tangri On Tue, May 13, 2014 at 9:40 AM, bhusted wrote: > Can anyone comment on the anticipate

Re: same log4j slf4j error in spark 9.1

2014-05-13 Thread Patrick Wendell
Hey Adrian, If you are including log4j-over-slf4j.jar in your application, you'll still need to manually exclude slf4j-log4j12.jar from Spark. However, it should work once you do that. Before 0.9.1 you couldn't make it work, even if you added an exclude. - Patrick On Thu, May 8, 2014 at 1:52 PM,
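A hedged sketch of such an exclusion for an sbt build (the versions and coordinates below are typical values, not taken from this thread):

    // build.sbt: bring in log4j-over-slf4j and drop Spark's slf4j-log4j12
    // binding so only one logging backend ends up on the classpath.
    libraryDependencies ++= Seq(
      "org.slf4j" % "log4j-over-slf4j" % "1.7.5",
      ("org.apache.spark" %% "spark-core" % "0.9.1")
        .exclude("org.slf4j", "slf4j-log4j12")
    )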

Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash wrote: > Hi all, > > Is anyone reading and writing to .bz2 files stored in HDFS from Spark with

Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp, See https://github.com/scalanlp/breeze/issues/142 and https://github.com/fommil/netlib-java/issues/60 for details. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, May 13,

Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
Scala's for-loop is not just looping; it is not a native loop at the bytecode level. It creates a couple of objects at runtime and performs a truckload of method calls on them. As a result, if you refer to variables outside the for-loop, the whole for-loop object and any variable inside th

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks! On May 13, 2014 3:16 AM, "zhen" wrote: > Hi Everyone, > > I found it quite difficult to find good examples for Spark RDD API calls. So > my student and I decided to go through the entire API and write examples > for > the vast majority of API calls (basically examples for any

Caching in graphX

2014-05-13 Thread Franco Avi
Hi, I'm writing this post because I would like to know a good caching approach for iterative algorithms in GraphX. So far I have not been able to keep the execution time of each iteration stable. Is it possible to achieve this? The code I used is this: var g = ... // my graph var prevG: Graph[VD, ED] = null v
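For context, a rough sketch of the commonly suggested pattern (the graph and the update step are placeholders, and as the reply below notes, unpersisting correctly in GraphX is subtle):

    import org.apache.spark.graphx._

    // Build a tiny placeholder graph.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))
    var g: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 0.0).cache()
    var prevG: Graph[Double, Double] = null

    for (i <- 1 to 10) {
      prevG = g
      g = g.mapVertices((_, attr) => attr + 1.0).cache()  // placeholder update step
      g.vertices.count()                                   // materialize before unpersisting
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
    }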

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Gerard Maas
Hi Zhen, Thanks a lot for sharing. I'm sure it will be useful for new users. A small note on the 'checkpoint' explanation: for sc.setCheckpointDir("my_directory_name"), it would be useful to specify that 'my_directory_name' should exist on all slaves. As an alternative you could use an HDFS directory
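For example, a one-line sketch using a hypothetical HDFS path:

    // Using HDFS (or another shared filesystem) makes the checkpoint
    // directory visible to every worker, not just the driver.
    sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")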

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
After checkpoint(), call count directly to materialize it. -Xiangrui On Tue, May 13, 2014 at 4:20 AM, Mayur Rustagi wrote: > We are running into the same issue. After 700 or so files the stack overflows, > cache, persist & checkpointing don't help. > Basically checkpointing only saves the RDD when it is

Re: How to use spark-submit

2014-05-13 Thread Sonal Goyal
Hi Stephen, Sorry I just use plain mvn. Best Regards, Sonal Nube Technologies On Mon, May 12, 2014 at 12:29 PM, Stephen Boesch wrote: > @Sonal - makes sense. Is the maven shade plugin runnable within sbt ? If > so would you c

Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS. Since a 'this' reference to the enclosing class is often what's causing the problem, a general workaround is to move the mapPartitions call to a static method where
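A hedged sketch of that second workaround (class and method names are illustrative):

    import org.apache.spark.SparkContext

    // The partition function lives in a top-level object, so the closure
    // passed to mapPartitions captures no reference to the enclosing class.
    object PartitionFunctions {
      def doubleAll(iter: Iterator[Int]): Iterator[Int] = iter.map(_ * 2)
    }

    class MyJob(sc: SparkContext) {
      def run(): Array[Int] = {
        val rdd = sc.parallelize(1 to 100, 4)
        rdd.mapPartitions(PartitionFunctions.doubleAll).collect()
      }
    }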

Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
I think so; there have been fewer questions and answers these past three days. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5522.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Dead lock running multiple Spark Jobs on Mesos

2014-05-13 Thread Eugen Cepoi
I have a similar issue (but with Spark 0.9.1) when a shell is active. Multiple jobs run fine, but when the shell is active (even if it is not using any CPU at the moment) I encounter exactly the same behaviour. At the moment I don't know what happens or how to solve it, but I was planning to have a lo

user@spark.apache.org

2014-05-13 Thread Herman, Matt (CORP)
unsubscribe

Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-13 Thread Chester At Yahoo
We are using the Spray + Akka + Spark stack at Alpine Data Labs. Chester Sent from my iPhone > On May 4, 2014, at 8:37 PM, ZhangYi wrote: > > Hi all, > > Currently, our project is planning to adopt Spark as its big data platform. > For the client side, we decided to expose a REST API based on Spray. O

Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to the complicated internal dependency structure that it creates. It's necessary to know exactly what operations you're doing on the graph in order to unpersist correctly (i.e., in a way that avoids recomputation). I have a p

Re: Doubts regarding Shark

2014-05-13 Thread Mayur Rustagi
The table will be cached, but 10GB (most likely more) would be on disk. You can check that in the storage tab of the Shark application. The Java out-of-memory error could be because your worker memory is too low or the memory allocated to Shark is too low. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
We are running into the same issue. After 700 or so files the stack overflows; cache, persist & checkpointing don't help. Basically checkpointing only saves the RDD when it is materialized, and it only materializes at the end, so it runs out of stack. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 h

Re: How to read a multipart s3 file?

2014-05-13 Thread kamatsuoka
Thanks Nicholas! I looked at those docs several times without noticing that critical part you highlighted. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5494.html Sent from the Apache Spark User List mailing list arc

Re: Is their a way to Create SparkContext object?

2014-05-13 Thread yh18190
Thanks, Matei Zaharia. Can I pass it as a parameter as part of a closure? For example, in RDD.map(t => compute(sc, t._2)), can I use sc inside the map function? Please let me know. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-their-a-way-to-Create-SparkContext-object-tp56

Re: details about event log

2014-05-13 Thread wxhsdp
Thank you very much, Andrew. From the definition of "Fetch Wait Time", can I conclude that a task pipelines block fetching with its processing? Andrew Or-2 wrote > Hi wxhsdp, > > These times are computed from Java's System.currentTimeMillis(), which is > "the > difference, measured in milliseconds,

something about pipeline

2014-05-13 Thread wxhsdp
Dear all, the definition of fetch wait time is: * Time the task spent waiting for remote shuffle blocks. This only includes the time * blocking on shuffle input data. For instance, if block B is being fetched while the task is * still not finished processing block A, it is not considered to be

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Xiangrui Meng
You have a long lineage that causes the StackOverflowError. Try rdd.checkpoint() followed by rdd.count() every 20~30 iterations; checkpoint() can cut the lineage. -Xiangrui On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote: > Dear Sparkers: > > I am using the Python API of Spark version 0.9.0 to implement so
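A minimal sketch of that suggestion, with a made-up iteration step, assuming sc.setCheckpointDir(...) has already been called:

    var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 200) {
      rdd = rdd.map(_ * 1.0000001)   // placeholder for the real per-iteration transform
      if (i % 25 == 0) {
        rdd.cache()                  // avoid recomputing when the checkpoint job runs
        rdd.checkpoint()             // mark for checkpointing...
        rdd.count()                  // ...and force materialization, cutting the lineage
      }
    }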