Reading sequencefile

2014-03-11 Thread Jaonary Rabarisoa
Hi all, I'm trying to read a SequenceFile that represents a set of JPEG images generated using this tool: http://stuartsierra.com/2008/04/24/a-million-little-files . According to the documentation: "Each key is the name of a file (a Hadoop “Text”), the value is the binary contents of the file (a B

Re: Reading sequencefile

2014-03-11 Thread Shixiong Zhu
Hi Jaonary, You can use "sc.sequenceFile" to load your file. E.g., scala> import org.apache.hadoop.io._ import org.apache.hadoop.io._ scala> val rdd = sc.sequenceFile("path_to_file", classOf[Text], classOf[BytesWritable]) rdd: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoo
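
For reference, a fuller sketch of that load (assuming the sc from spark-shell, an illustrative path, and that the values should end up as plain byte arrays; the copy guards against Hadoop's reuse of Writable objects):

    import org.apache.hadoop.io.{BytesWritable, Text}

    // "hdfs:///path/to/images.seq" is a placeholder path
    val images = sc.sequenceFile("hdfs:///path/to/images.seq", classOf[Text], classOf[BytesWritable])
      .map { case (name, bytes) =>
        // copy out of the reused Writable buffers before caching or collecting
        (name.toString, bytes.getBytes.take(bytes.getLength))
      }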

Powered By Spark Page -- Companies & Organizations

2014-03-11 Thread Christoph Böhm
Dear Spark team, thanks for the great work and congrats on becoming an Apache top-level project! You could add us to your Powered-by page, because we are using Spark (and Shark) to perform interactive exploration of large datasets. Find us on: www.bakdata.com Best, Christoph --

[no subject]

2014-03-11 Thread Gino Mathews
Hi, I am new to Spark. I would like to run jobs in Spark standalone cluster mode. No cluster managers other than Spark are used. I have tried wordcount from the Spark shell and a standalone Scala app. The code reads input from HDFS and writes the results to HDFS. It uses 2 worker nodes. In shell the w

Re: [BLOG] Spark on Cassandra w/ Calliope

2014-03-11 Thread Rohit Rai
Take a look at https://github.com/tuplejump/cash We will release an update soon to go with Hive 0.11 and Shark 0.9 Founder & CEO, Tuplejump, Inc. www.tuplejump.com The Data Engineering Platform On Tue, Mar 11, 2014 at 7:11 AM, abhinav chowdary < abhinav.chowd.

Re: Reading sequencefile

2014-03-11 Thread Jaonary Rabarisoa
Thank you. I forgot the classOf[*] arguments. On Tue, Mar 11, 2014 at 10:46 AM, Shixiong Zhu wrote: > Hi Jaonary, > > You can use "sc.sequenceFile" to load your file. E.g., > > scala> import org.apache.hadoop.io._ > import org.apache.hadoop.io._ > > scala> val rdd = sc.sequenceFile("path_to_fil

OpenCV + Spark : Where to put System.loadLibrary ?

2014-03-11 Thread Jaonary Rabarisoa
Hi all, I'm trying to build a standalone Scala Spark application that uses OpenCV for image processing. To get OpenCV working with Scala one needs to call System.loadLibrary(Core.NATIVE_LIBRARY_NAME) once per JVM process. How do I call it inside a Spark application distributed across several nodes? Bes

Spark Application (Stages) UI does not recognize line number

2014-03-11 Thread orly.lampert
Hi, I'm running my Spark application in standalone server mode. When trying to check performance on the Spark stages UI (spark-server:4040) I can see all stages, but the line number in the description field is always -1. For example: Stage ID: 0 Description: count at null:-1 Stage ID: 3 Descrip

How to set task number in a container

2014-03-11 Thread hequn cheng
When I increase my input data size, the executor fails and is lost. See below: 14/03/11 20:44:18 INFO AppClient$ClientActor: Executor updated: app-20140311204343-0008/8 is now FAILED (Command exited with code 134) 14/03/11 20:44:18 INFO SparkDeploySchedulerBackend: Executor app-2014031120434

Spark stand alone cluster mode

2014-03-11 Thread Gino Mathews
Hi, I am new to Spark. I would like to run jobs in Spark standalone cluster mode. No cluster managers other than Spark are used. (https://spark.apache.org/docs/0.9.0/spark-standalone.html) I have tried wordcount from the Spark shell and a standalone Scala app. The code reads input from HDFS and writ

Re: Spark stand alone cluster mode

2014-03-11 Thread Yana Kadiyska
does sbt "show full-classpath" show spark-core on the classpath? I am still pretty new to scala but it seems like you have val sparkCore = "org.apache.spark" %% "spark-core"% V.spark % "provided" -- I believe the "provided" part means it's in your classpath. Spark-shell script

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-03-11 Thread Debasish Das
Look at jblas operations inside mllib...jblas calls jniloader internally which loads up native code when available On Mar 11, 2014 4:07 AM, "Jaonary Rabarisoa" wrote: > Hi all, > > I'm trying to build a stand alone scala spark application that uses opencv > for image processing. > To get ope

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei, most tasks have GC times of 200ms or less, and then a few tasks take many seconds. example GC activity for a slow one: [GC [PSYoungGen: 1051814K->262624K(1398144K)] 3789259K->3524429K(5592448K), 0.0986800 secs] [Times: user=1.53 sys=0.01, real=0.10 secs] [GC [PSYoungGen: 786935K->524512

Computation time increasing every super-step

2014-03-11 Thread Alessandro Lulli
Hi All, I'm facing a performance degradation running an iterative algorithm built using Spark 0.9 and GraphX. I'm using org.apache.spark.graphx.Pregel to run the iterative algorithm. My graph has 2395 vertices and 7462 edges. Every superstep the computation time increases significantly. The steps 1-5

NO SUCH METHOD EXCEPTION

2014-03-11 Thread Jeyaraj, Arockia R (Arockia)
Hi, Can anyone help me to resolve this issue? Why am I getting a NoSuchMethod exception? 14/03/11 09:56:11 ERROR executor.Executor: Exception in task ID 0 java.lang.NoSuchMethodError: scala.Predef$.augmentString(Ljava/lang/String;)Lscala/collection/immutable/StringOps; at kafka.utils.VerifiableP

Re: Out of memory on large RDDs

2014-03-11 Thread Domen Grabec
Hi, I have a Spark cluster with 4 workers, each with 13GB of RAM. I would like to process a large data set (it does not fit in memory) that consists of JSON entries. These are the transformations applied: SparkContext.textFile(s3url). // read files from s3 keyBy(_.parseJson.id) // key by id that is locat

Re: Out of memory on large RDDs

2014-03-11 Thread Mayur Rustagi
Shuffle data is not kept in memory. Did you try additional memory configurations( https://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence ) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Pig on Spark

2014-03-11 Thread Mayur Rustagi
Hi Lin, We are working on getting Pig on Spark functional with 0.8.0; have you got it working on any Spark version? Also, what functionality works on it? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On M

RDD.saveAs...

2014-03-11 Thread Koert Kuipers
I find the current design to write RDDs to disk (or a database, etc) kind of ugly. It will lead to a proliferation of saveAs methods. A better abstraction would be nice (perhaps a Sink trait to write to)
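
A rough sketch of the kind of abstraction being proposed (purely illustrative; none of these names exist in Spark):

    import org.apache.spark.rdd.RDD

    // hypothetical: one pluggable destination type instead of a saveAsXxx method per backend
    trait Sink[T] {
      def write(rdd: RDD[T]): Unit
    }

    // hypothetical example backend built on an existing save method
    class TextFileSink[T](path: String) extends Sink[T] {
      def write(rdd: RDD[T]): Unit = rdd.saveAsTextFile(path)
    }

    // usage: new TextFileSink[String]("hdfs:///out").write(myRdd)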

Re: Out of memory on large RDDs

2014-03-11 Thread sparrow
I don't understand how exactly that will help. There are no persisted RDDs in storage. Our input data is ~100GB, but the output of the flatMap is ~40MB. The small RDD is then persisted. Memory configuration should not affect shuffle data, if I understand you correctly? On Tue, Mar 11, 2014 at 4:5

Pyspark Memory Woes

2014-03-11 Thread Aaron Olson
Dear Sparkians, We are working on a system to do relational modeling on top of Spark, all done in pyspark. While we've been learning a lot about Spark internals so far, we're currently running into memory issues and wondering how best to profile to fix them. Here are our symptoms: - We're oper

Re: Out of memory on large RDDs

2014-03-11 Thread Mayur Rustagi
Shuffle data is always stored on disk, its unlikely to cause OOM. Your input data read as RDD may be causing OOM, so thats where you can use different memory configuration. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On

Spark usage patterns and questions

2014-03-11 Thread Sourav Chandra
Hi, I have some questions regarding usage patterns and debugging in Spark/Spark Streaming. 1. What are some common design patterns for using broadcast variables? In my application I created some, and also created a scheduled task which periodically refreshes the variables. I want to know how efficientl
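
For context, a minimal sketch of the basic broadcast pattern being asked about (assuming an existing SparkContext sc; the table contents and the userIds RDD are placeholders):

    // build a read-only lookup table on the driver and ship it to the executors once
    val table: Map[String, String] = Map("u1" -> "gold", "u2" -> "silver")   // placeholder data
    val lookup = sc.broadcast(table)

    // placeholder RDD; each task reads lookup.value locally instead of re-shipping the map
    val userIds = sc.parallelize(Seq("u1", "u2", "u3"))
    val labels  = userIds.map(id => (id, lookup.value.getOrElse(id, "unknown")))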

Re: NO SUCH METHOD EXCEPTION

2014-03-11 Thread Matei Zaharia
Since it’s from Scala, it might mean you’re running with a different version of Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while Spark 0.9 uses Scala 2.10. Matei On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia) wrote: > Hi, > > Can anyone help me to r

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-03-11 Thread Matei Zaharia
In short you should add it to a static initializer or singleton object that you call before accessing your library. Also add your library to SPARK_LIBRARY_PATH so it can find the .so / .dll. Matei On Mar 11, 2014, at 7:05 AM, Debasish Das wrote: > Look at jblas operations inside mllib...jblas
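
A minimal sketch of that singleton approach (assuming OpenCV's Java bindings are on the classpath and the native library is reachable via SPARK_LIBRARY_PATH / java.library.path; process() is a hypothetical user function):

    import org.opencv.core.Core

    object OpenCVLoader {
      // the object body runs once per JVM, the first time the object is referenced
      System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
      def ensureLoaded(): Unit = ()   // no-op; calling it just forces initialization
    }

    // inside a transformation, before touching any OpenCV code:
    // imagesRdd.map { img => OpenCVLoader.ensureLoaded(); process(img) }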

Re: Powered By Spark Page -- Companies & Organizations

2014-03-11 Thread Matei Zaharia
Thanks, added you. On Mar 11, 2014, at 2:47 AM, Christoph Böhm wrote: > Dear Spark team, > > thanks for the great work and congrats on becoming an Apache top-level > project! > > You could add us to your Powered-by-page, because we are using Spark (and > Shark) to perform interactive explora

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Hi Aaron, When you say "Java heap space is 1.5G per worker, 24 or 32 cores across 46 nodes. It seems like we should have more than enough to do this comfortably.", how are you configuring this? -Sandy On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson wrote: > Dear Sparkians, > > We are working on

unsubscribe

2014-03-11 Thread Abhishek Pratap

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Matei Zaharia
Right, that’s it. I think what happened is the following: all the nodes generated some garbage that put them very close to the threshold for a full GC in the first few runs of the program (when you cached the RDDs), but on the subsequent queries, only a few nodes are hitting full GC per query, s

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread Matei Zaharia
Yeah, we could make it take Iterable too if that helped. What data structure did you have here? Matei On Mar 10, 2014, at 6:29 PM, wallacemann wrote: > I was right ... I was missing something obvious. The answer to my question > is to use JavaSparkContext.parallelize which works with List or
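
For the archive, the call in question looks roughly like this (Scala shown; the Java API's JavaSparkContext.parallelize accepts a java.util.List in the same spirit, and the data here is made up):

    // turn an in-memory collection on the driver into an RDD
    val data = List(("a", 1), ("b", 2), ("c", 3))
    val rdd  = sc.parallelize(data)   // optionally sc.parallelize(data, numSlices)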

Re: unsubscribe

2014-03-11 Thread Matei Zaharia
To unsubscribe from this list, please send a message to user-unsubscr...@spark.apache.org and it will automatically unsubscribe you. Matei On Mar 11, 2014, at 12:15 PM, Abhishek Pratap wrote: >

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Andrew Ash
Note that calling System.gc() is just a suggestion to the JVM that it should run a garbage collection and doesn't force it right then 100% of the time. http://stackoverflow.com/questions/1481178/forcing-garbage-collection-in-java On Tue, Mar 11, 2014 at 12:17 PM, Matei Zaharia wrote: > Right, t

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei, ha i will definitely try that one! looks like a total hack... i might just schedule it after the precaching of rdds defensively. also trying java 7 with g1 On Tue, Mar 11, 2014 at 3:17 PM, Matei Zaharia wrote: > Right, that's it. I think what happened is the following: all the nodes > ge

Re: Pyspark Memory Woes

2014-03-11 Thread Aaron Olson
Hi Sandy, We're configuring that with the JAVA_OPTS environment variable in $SPARK_HOME/spark-worker-env.sh like this: # JAVA OPTS export SPARK_JAVA_OPTS="-Dspark.ui.port=0 -Dspark.default.parallelism=1024 -Dspark.cores.max=256 -Dspark.executor.memory=1500m -Dspark.worker.timeout=500 -Dspark.akka

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Matei Zaharia
Yeah, System.gc() is a suggestion but in practice it does invoke full GCs on the Sun JVM. Matei On Mar 11, 2014, at 12:35 PM, Koert Kuipers wrote: > hey matei, > ha i will definitely that one! looks like a total hack... i might just > schedule it after the precaching of rdds defensively. > >
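
For completeness, the kind of hack being discussed, run from the driver (a sketch only: the executor count is an assumption, and Spark gives no guarantee that these tasks land one per JVM):

    // ask each executor JVM that picks up a task to attempt a full collection
    val numExecutors = 10   // assumption: set to roughly your number of executors
    sc.parallelize(1 to numExecutors, numExecutors).foreach(_ => System.gc())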

Re: "Too many open files" exception on reduceByKey

2014-03-11 Thread Matthew Cheah
Thanks. Just curious, is there a default number of reducers that are used? -Matt Cheah On Mon, Mar 10, 2014 at 7:22 PM, Patrick Wendell wrote: > Hey Matt, > > The best way is definitely just to increase the ulimit if possible, > this is sort of an assumption we make in Spark that clusters will

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2014-03-11 Thread Jaonary Rabarisoa
Do you have a snippet showing how to do this? I'm relatively new to Spark and Scala and for now, my code is just a single file inspired by the Spark examples: object SparkOpencv { def main(args: Array[String]) { val conf = new SparkConf() .setMaster("local[8]") .set

Re: "Too many open files" exception on reduceByKey

2014-03-11 Thread Matthew Cheah
Sorry, I also have some follow-up questions. "In general if a node in your cluster has C assigned cores and you run a job with X reducers then Spark will open C*X files in parallel and start writing." Some questions came to mind just now: 1) It would be nice to have a brief overview as to what th

is spark.cleaner.ttl safe?

2014-03-11 Thread Michael Allman
Hello, I've been trying to run an iterative spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there's enough space if spark would clean up unused data from previous iterations, but as it stands the number of iterations I can run is limited by ava
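
For reference, a sketch of how that setting is applied (the TTL value in seconds is illustrative; as the replies below note, this cleaner is not considered safe, since metadata older than the TTL can be dropped even if it is still needed):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("iterative-job")          // placeholder app name
      .set("spark.cleaner.ttl", "3600")     // seconds; anything older may be cleaned up
    val sc = new SparkContext(conf)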

Re: is spark.cleaner.ttl safe?

2014-03-11 Thread Mark Hamstra
Actually, TD's work-in-progress is probably more what you want: https://github.com/apache/spark/pull/126 On Tue, Mar 11, 2014 at 1:58 PM, Michael Allman wrote: > Hello, > > I've been trying to run an iterative spark job that spills 1+ GB to disk > per iteration on a system with limited disk spa

Re: is spark.cleaner.ttl safe?

2014-03-11 Thread Aaron Davidson
And to answer your original question, spark.cleaner.ttl is not safe for the exact reason you brought up. The PR Mark linked intends to provide a much cleaner (and safer) solution. On Tue, Mar 11, 2014 at 2:01 PM, Mark Hamstra wrote: > Actually, TD's work-in-progress is probably more what you wan

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread wallacemann
The question would be whether or not Iterable would save memory. It's trivial for me to build a list out of my iterable. If I understood the code correctly, Spark takes that List and converts it to an array, so I built an ArrayList out of the iterable in the hopes that Spark would use the under

RE: unsubscribe

2014-03-11 Thread Kapil Malik
Ohh! I thought you were unsubscribing :) Kapil Malik | kma...@adobe.com | 33430 / 8800836581 -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: 12 March 2014 00:51 To: user@spark.apache.org Subject: Re: unsubscribe To unsubscribe from this list, please se

Re: Block

2014-03-11 Thread Patrick Wendell
A block is an internal construct that isn't directly exposed to users. Internally though, each partition of an RDD is mapped to one block. - Patrick On Mon, Mar 10, 2014 at 11:06 PM, David Thomas wrote: > What is the concept of Block and BlockManager in Spark? How is a Block > related to a Parti

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Are you aware that you get an executor (and the 1.5GB) per machine, not per core? On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson wrote: > Hi Sandy, > > We're configuring that with the JAVA_OPTS environment variable in > $SPARK_HOME/spark-worker-env.sh like this: > > # JAVA OPTS > export SPARK_JA

Re: Out of memory on large RDDs

2014-03-11 Thread Grega Kespret
> Your input data read as RDD may be causing OOM, so thats where you can use > different memory configuration. We are not getting any OOM exceptions, just akka future timeouts in mapoutputtracker and unsuccessful get of shuffle outputs, therefore refetching them. What is the industry practic

Applications for Spark on HDFS

2014-03-11 Thread Paul Schooss
Hello Folks, I was wondering if anyone had experience placing application jars for Spark onto HDFS. Currently I have been distributing the jars manually and would love to source the jars via HDFS, a la distributed caching with MR. Any ideas? Regards, Paul

Re: computation slows down 10x because of cached RDDs

2014-03-11 Thread Koert Kuipers
hey matei, ok when i switch to java 7 with G1 the GC time for all the "quick" tasks goes from 150ms to 10ms, but the slow ones stay just as slow. all i did was add -XX:+UseG1GC so maybe thats wrong, i still have to read up on G1. an example of GC in a slow task is below. best, koert [GC pause (y

possible bug in Spark's ALS implementation...

2014-03-11 Thread Michael Allman
Hi, I'm implementing a recommender based on the algorithm described in http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the basis for Spark's ALS implementation for data sets with implicit features. The data set I'm working with is proprietary and I cannot share it, howe

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Hi Michael, I can help check the current implementation. Would you please go to https://spark-project.atlassian.net/browse/SPARK and create a ticket about this issue with component "MLlib"? Thanks! Best, Xiangrui On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman wrote: > Hi, > > I'm implementing

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread wallacemann
Ah! Thank you. That'll work for now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-RDD-from-Java-in-memory-data-tp2486p2570.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Applications for Spark on HDFS

2014-03-11 Thread Sandy Ryza
Hi Paul, What do you mean by distributing the jars manually? If you register jars that are local to the client with SparkContext.addJars, Spark should handle distributing them to the workers. Are you taking advantage of this? -Sandy On Tue, Mar 11, 2014 at 3:09 PM, Paul Schooss wrote: > Hell
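
A minimal sketch of that mechanism (assuming sc is the application's SparkContext; the path is illustrative):

    // ship a dependency jar to the executors; an hdfs:// URI also works,
    // so the jar can be sourced from HDFS rather than from the client machine
    sc.addJar("hdfs:///apps/spark/my-app-deps.jar")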

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread wallacemann
In a similar vein, it would be helpful to have an Iterable way to access the data inside an RDD. The collect method takes everything in the RDD and puts in a list, but this blows up memory. Since everything I want is already inside the RDD, it could be easy to iterate over the content without rep

Re: How to create RDD from Java in-memory data?

2014-03-11 Thread Mark Hamstra
https://github.com/apache/incubator-spark/pull/421 Works pretty good, but really needs to be enhanced to work with AsyncRDDActions. On Tue, Mar 11, 2014 at 4:50 PM, wallacemann wrote: > In a similar vein, it would be helpful to have an Iterable way to access > the > data inside an RDD. The co

Re: RDD.saveAs...

2014-03-11 Thread Matei Zaharia
I agree that we can’t keep adding these to the core API, partly because it will get unwieldy to maintain and partly just because each storage system will bring in lots of dependencies. We can simply have helper classes in different modules for each storage system. There’s some discussion on this

Re: Block

2014-03-11 Thread dachuan
In my opinion, BlockManager manages many types of Block, RDD's partition, a.k.a. RDDBlock, is one type of them. Other types of Blocks are ShuffleBlock, IndirectBlock (if the task's return status is too large), etc. So, BlockManager is a layer that is independent of RDD concept. On Mar 11, 2014 2:0

Re: Applications for Spark on HDFS

2014-03-11 Thread Paul Schooss
Thanks Sandy, I have not taken advantage of that yet but will research how to invoke that option when submitting the application to the spark master. Currently I am running a standalone spark master and using the run-class script to invoke the application we crafted as a test. On Tue, Mar 11, 20

Re: TriangleCount & Shortest Path under Spark

2014-03-11 Thread moxiecui
1. There is a driver program named Analytics with a main method in graphx/.../lib; you should start the TriangleCount through this driver. 2. Good luck. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TriangleCount-Shortest-Path-under-Spark-tp2439p2577.html Se

Re: SPARK_JAVA_OPTS not picked up by the application

2014-03-11 Thread Robin Cjc
I haven't tried it, but I think you can still use System.setProperty to set the property. Or if you run the application with sbt, I think you can also set javaOptions in sbt. Is that working for you? Thanks Best Regards, Chen Jingci On Tue, Mar 11, 2014 at 1:15 PM, Linlin wrote: > Tha
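
A quick sketch of the two suggestions (the property name and values are illustrative, and the sbt route only applies when sbt forks a JVM for run):

    import org.apache.spark.SparkContext

    // option 1: set the property on the driver before the SparkContext is created
    System.setProperty("spark.executor.memory", "2g")
    val sc = new SparkContext("spark://master:7077", "my-app")   // placeholder master URL / app name

    // option 2 (build.sbt): forward JVM options when running through sbt
    //   fork := true
    //   javaOptions += "-Dspark.executor.memory=2g"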

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Sean Owen
On Tue, Mar 11, 2014 at 10:18 PM, Michael Allman wrote: > I'm seeing counterintuitive, sometimes nonsensical recommendations. For > comparison, I've run the training data through Oryx's in-VM implementation > of implicit ALS with the same parameters. Oryx uses the same algorithm. > (Source in this

Re: possible bug in Spark's ALS implementation...

2014-03-11 Thread Xiangrui Meng
Line 376 should be correct as it is computing \sum_i (c_i - 1) x_i x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing some metrics to tell which recommendation is better? -Xiangrui On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng wrote: > Hi Michael, > > I can help check the current impleme

how to config worker HA

2014-03-11 Thread qingyang li
I have one table in memory; when one worker becomes dead, I can not query data from that table. Here is its storage status:
RDD Name | Storage Level | Cached Partitions | Fraction Cached | Size in Memory | Size on Disk
table01 | Memory Deserialized 1x Replicated | 11

Are all transformations lazy?

2014-03-11 Thread David Thomas
For example, is the distinct() transformation lazy? When I look at the Spark source code, distinct applies a map -> reduceByKey -> map chain to the RDD elements. Why is this lazy? Won't the function be applied immediately to the elements of the RDD when I call someRDD.distinct? /** * Return a new RDD

Re: Are all transformations lazy?

2014-03-11 Thread Ewen Cheslack-Postava
You should probably be asking the opposite question: why do you think it *should* be applied immediately? Since the driver program hasn't requested any data back (distinct generates a new RDD, it doesn't return any data), there's no need to actually compute anything yet. As the documentation d

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
I think you misunderstood my question - I should have stated it better. I'm not saying it should be applied immediately; I'm trying to understand how Spark achieves this lazy computation of transformations. Maybe this is due to my ignorance of how Scala works, but when I see the code, I see that

Re: Are all transformations lazy?

2014-03-11 Thread Mayur Rustagi
The only point where some *actual* computation happens is when data is requested by the driver (using collect()) or materialized in external storage (e.g. saveAsHadoopFile). The rest of the time, operations are merely stored & saved. Once you actually ask for the data, the operations are compiled into a DAG
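
A tiny example of the behavior being described (assuming an existing sc and a placeholder input path):

    val words = sc.textFile("hdfs:///data/words.txt")   // nothing is read yet
    val uniq  = words.map(_.toLowerCase).distinct()     // still nothing: only lineage is recorded
    val n     = uniq.count()                            // an action: now the DAG is built and the job runs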

Re: Are all transformations lazy?

2014-03-11 Thread Sandy Ryza
distinct is lazy because the map and reduceByKey functions it calls are lazy as well. When they're called, the only thing that happens is that state is built up on the client side. distinct will return an RDD for the map operation that points to the RDD that it depends on, that in turn point to t

Re: Are all transformations lazy?

2014-03-11 Thread Ewen Cheslack-Postava
Ah, I see. You need to follow those other calls through to their implementations to see what ultimately happens. For example, the map() calls are to RDD.map, not one of Scala's built-in map methods for collections. The implementation looks like this: /** * Return a new RDD by applying a functio
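
To make the mechanism concrete, an illustrative custom RDD in the same spirit (not the actual Spark source, just the pattern: the constructor only records the parent, and the work is deferred to compute(), which runs inside tasks):

    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // illustrative "lazy map" that doubles every element of an RDD[Int]
    class TimesTwoRDD(parent: RDD[Int]) extends RDD[Int](parent) {
      override protected def getPartitions: Array[Partition] = firstParent[Int].partitions
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        firstParent[Int].iterator(split, context).map(_ * 2)   // executed only when a task runs
    }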

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
Perfect! That answers my question. I was under the impression that map and reduceByKey were Scala collection functions, but they weren't. Now it makes sense. On Tue, Mar 11, 2014 at 10:38 PM, Ewen Cheslack-Postava wrote: > Ah, I see. You need to follow those other calls through to their > imp

Re: is spark.cleaner.ttl safe?

2014-03-11 Thread Sourav Chandra
Yes, we are also facing the same problem. The workaround we came up with is: - store the broadcast variable id when it was first created - then create a scheduled job which runs every (spark.cleaner.ttl - 1 minute) interval and creates the same broadcast variable using the same id. This way spark is happy