Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
Yes, this is a useful trick we found that made our algorithm implementation noticeably faster (btw, we'll send a pull request for this GLMNET implementation, so interested people could look at it). It would be nice if Spark supported something akin to this natively, as I believe that many efficien

Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi, I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes; each runs 4 workers, and each worker owns 10G memory and 16 cores, for a total of 96 cores and 240G memory. (Well, it was also previously configured as 1 worker with 40G memory on each node.)

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sean Owen
On Mon, Apr 28, 2014 at 8:22 AM, Sung Hwan Chung wrote: > > e.g. something like > > rdd.mapPartition((rows : Iterator[String]) => { > var idx = 0 > rows.map((row: String) => { > val valueMap = SparkWorker.getMemoryContent("valMap") > val prevVal = valueMap(idx) > idx += 1 > ...

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
Actually, I do not know how to do something like this or whether this is possible - thus my suggestive statement. Can you already declare persistent memory objects per worker? I tried something like constructing a singleton object within map functions, but that didn't work as it seemed to actually

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sean Owen
On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung wrote: > Actually, I do not know how to do something like this or whether this is > possible - thus my suggestive statement. > > Can you already declare persistent memory objects per worker? I tried > something like constructing a singleton object w

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
In our case, we'd like to keep memory content from one iteration to the next, and not just during a single mapPartition call because then we can do more efficient computations using the values from the previous iteration. So essentially, we need to declare objects outside the scope of the map/redu

Spark 1.0 run job fail

2014-04-28 Thread Shihaoliang (Shihaoliang)
Hi all, I got the Spark 1.0 snapshot code from git and compiled it using the command: mvn -Pbigtop-dist -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests package -e. On the cluster, I added [export SPARK_YARN_MODE=true] to spark-env.sh and ran the HdfsTest examples, and I got an error. Anyone got similar

getting an error

2014-04-28 Thread Joe L
Hi, while I was testing an example, I encountered a problem running Scala on a cluster. I searched it on Google but couldn't solve it, and I posted about it on the Spark mailing list but that couldn't help me solve the problem either. The problem is that I could run Spark successfully in local mode,

what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
Hi guys, when I read in a file on the Spark shell, the console shows that broadcast_0 is stored to memory. I guess it's related to the file, but broadcast_0 is not the file itself, because they have different sizes. What does broadcast_0 stand for? Logs: 14/04/28 18:02:50 INFO MemoryStore: ensureFreeS

Re: what does broadcast_0 stand for

2014-04-28 Thread Sourav Chandra
In case HttpBroadcast is used, Spark creates a jetty server and uses the HTTP protocol for transporting the broadcast variables to all workers. To do so it writes the serialized broadcast variables into their corresponding files, and the file names denote the broadcast id of the variable. So broadcast_0 is t

Re: what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
Thank you for your help, Sourav. I found the broadcast_0 binary file in the /tmp directory. Its size is 33.4 kB, not equal to the estimated size of 135.6 KB. I opened it and found its content has no relation to the file I read in. I guess broadcast_0 is a config file about Spark, is that right? -- View this m

NoSuchMethodError from Spark Java

2014-04-28 Thread Jared Rodriguez
I am seeing the following exception from a very basic test project when it runs on spark local. java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Lscala/Tuple2; The project is built with Java 1.6, Scala 2.10.3 and spark 0.9.1

Re: what does broadcast_0 stand for

2014-04-28 Thread Sourav Chandra
Apart from user defined broadcast variables, there are others which are created by Spark. This could be one of those. As I had mentioned, you can do a small program where you create a broadcast variable. Check the broadcast variable id (say it's x). Then go to /tmp to open the broadcast_x file.
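
A minimal sketch of the check Sourav describes, assuming a spark-shell session where sc is the usual SparkContext (the map contents and the path comment are illustrative, not exact):

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    println(lookup.id)   // the N in the broadcast_N file you should find on disk
    // With HttpBroadcast, a file named broadcast_<id> is written under the
    // broadcast directory (location depends on spark.local.dir, often /tmp).
    sc.parallelize(Seq("a", "b")).map(k => lookup.value(k)).collect()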

Re: what does broadcast_0 stand for

2014-04-28 Thread wxhsdp
Yes, you're right, thanks for your patience :) Sourav Chandra wrote > Apart from user defined broadcast variable, there are others which is > being > created by spark. This could be one of those. > > > > As I had mentioned you can do a small program where you create a broadcast > variable. Check

Cannot compile SIMR with Spark 9.1

2014-04-28 Thread lukas nalezenec
Hi, I am trying to recompile SIMR with Spark 9.1 but it fails on incompatible method: [error] /home/lukas/src/simr/src/main/scala/org/apache/spark/simr/RelayServer.scala:213: not enough arguments for method createActorSystem: (name: String, host: String, port: Int, indestructible: Boolean, conf: o

Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Kulkarni, Vikram
Hi Spark-users, Within my Spark Streaming program, I am able to ingest data sent by my Flume Avro Client. I configured a 'spooling directory source' to write data to a Flume Avro Sink (the Spark Streaming Driver program in this case). The default deserializer i.e. LINE is used to parse the fil

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Daniel Darabos
That is quite mysterious, and I do not think we have enough information to answer. JavaPairRDD.lookup() works fine on a remote Spark cluster: $ MASTER=spark://localhost:7077 bin/spark-shell scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 10, 3).map(x => ((x%3).toS

Re: questions about debugging a spark application

2014-04-28 Thread Daniel Darabos
Good question! I am also new to the JVM and would appreciate some tips. On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wrote: > Hi, all > i have some questions about debug in spark: > 1) when application finished, application UI is shut down, i can not see > the details about the app, like >

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
Thanks for your answer. I tried running on a single machine: master and worker on one host. I get exactly the same results. Very little CPU activity on the machine in question. The web UI shows a single task and its state is RUNNING. It will remain so indefinitely. I have a single partition, a

Running parallel jobs in the same driver with Futures?

2014-04-28 Thread Ian Ferreira
I recall asking about this, and I think Matei suggested it was, but is the scheduler thread safe? I am running mllib libraries as futures in the same driver, using the same dataset as input, and getting this error: 14/04/28 08:29:48 ERROR TaskSchedulerImpl: Exception in statusUpdate java.util.concurrent.Reje

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for then, I believe. You just reference the object as an object in your closure, so it won't be swept up when your closure is serialized, and you can then reference variables of the object on the remote host. e.g.: object MyObject { val mmap =
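
A self-contained sketch of the singleton-object pattern Ian describes (the object, variable names, and the update itself are made up for illustration):

    import scala.collection.mutable
    import org.apache.spark.SparkContext

    // Hypothetical per-JVM singleton: each executor loads this object once and
    // keeps it for the lifetime of the JVM, so its state survives across tasks.
    object WorkerCache {
      val mmap = mutable.Map.empty[String, Double]
    }

    object ReuseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "reuse-sketch")
        val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
        // Referencing WorkerCache inside the closure does not serialize it;
        // each executor resolves the object locally and mutates its own copy.
        val counts = rdd.map { row =>
          val prev = WorkerCache.mmap.getOrElse(row, 0.0)
          WorkerCache.mmap(row) = prev + 1.0
          (row, prev)
        }.collect()
        counts.foreach(println)
        sc.stop()
      }
    }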

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is large and doesn't change. I think this is a very good solution to what is being asked for. On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell wrote: > A mutable map in an object should do what your looking for then I beli

MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Shubham Chopra
I am trying to use Spark/MLLib on Yarn and do not have libgfortran installed on my cluster. Is there any way I can set LD_LIBRARY_PATH so Spark/MLLib/jblas pick up the library from a non-standard location (like say the distributed cache or /tmp instead of /usr/lib). Thanks, ~Shubham.

K-means with large K

2014-04-28 Thread Buttler, David
Hi, I am trying to run the K-means code in mllib, and it works very nicely with small K (less than 1000). However, when I try for a larger K (I am looking for 2000-4000 clusters), it seems like the code gets part way through (perhaps just the initialization step) and freezes. The compute nodes

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Yadid Ayzenberg
Could this be related to the size of the lookup result ? I tried to recreate a similar scenario on the spark shell which causes an exception: scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 4, 3).map(x => ( ( 0,"52fb9b1a3004f07d1a87c8f3" ), Seq.fill(40

Re: K-means with large K

2014-04-28 Thread Chester Chen
David, just curious to know what kind of use cases demand such large K clusters. Chester Sent from my iPhone On Apr 28, 2014, at 9:19 AM, "Buttler, David" wrote: > Hi, > I am trying to run the K-means code in mllib, and it works very nicely with > small K (less than 1000). However, when I

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
Yes, this is what we've done as of now (if you read earlier threads). And we were saying that we'd prefer it if Spark supported persistent worker memory management in a slightly less hacky way ;) On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell wrote: > A mutable map in an object should do what yo

RE: K-means with large K

2014-04-28 Thread Buttler, David
One thing I have used this for was to create codebooks for SIFT features in images. It is a common, though fairly naïve, method for converting high-dimensional features into simple word-like features. Thus, if you have 200 SIFT features for an image, you can reduce that to 200 ‘words’ that c

What is the recommended way to store state across RDDs?

2014-04-28 Thread Adrian Mocanu
What is the recommended way to store state across RDDs as you traverse a DStream and go from one RDD to another? Consider a trivial example of a moving average. Between RDDs, should the average be saved in a cache (i.e. Redis), or is there another global var type available in Spark? Accumulators are on

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD[Double] and zip those with the RDD of examples, and update the losses. You're doing a copy (and GC) on the RDD of losses
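
A rough sketch of the zipping approach under discussion, with placeholder data and a dummy loss update (the real computation would go where the comment indicates):

    import org.apache.spark.SparkContext

    object ZipLossSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "zip-loss-sketch")
        // Stand-ins for the real data: examples never change, losses are small and updated.
        val examples = sc.parallelize(1 to 1000, 4).map(i => Array.fill(10)(i.toDouble)).cache()
        var losses = examples.map(_ => 0.0)

        for (iter <- 1 to 3) {
          // zip relies on both RDDs having the same partitioning and element order,
          // which holds here because losses was derived from examples by map.
          losses = examples.zip(losses).map { case (x, prevLoss) =>
            prevLoss + x.sum * 1e-3   // dummy update in place of the real loss computation
          }
          losses.cache()
        }
        println(losses.reduce(_ + _))
        sc.stop()
      }
    }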

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Chester Chen
Tom, Are you suggesting two RDDs, one with the loss and another for the rest of the info, using zip to tie them together, but doing the update on the loss RDD (copy)? Chester Sent from my iPhone On Apr 28, 2014, at 9:45 AM, Tom Vacek wrote: > I'm not sure what I said came through. RDD zip is not hacky at al

Re: K-means with large K

2014-04-28 Thread Matei Zaharia
Try turning on the Kryo serializer as described at http://spark.apache.org/docs/latest/tuning.html. Also, are there any exceptions in the driver program’s log before this happens? Matei On Apr 28, 2014, at 9:19 AM, Buttler, David wrote: > Hi, > I am trying to run the K-means code in mllib, an
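
For reference, switching to Kryo as the tuning guide describes looks roughly like this (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Swap the default Java serializer for Kryo; classes that dominate the
    // shuffled data can additionally be registered via a Kryo registrator.
    val conf = new SparkConf()
      .setAppName("kmeans-large-k")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)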

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Right---They are zipped at each iteration. On Mon, Apr 28, 2014 at 11:56 AM, Chester Chen wrote: > Tom, > Are you suggesting two RDDs, one with loss and another for the rest > info, using zip to tie them together, but do update on loss RDD (copy) ? > > Chester > > Sent from my iPhone > > On

running SparkALS

2014-04-28 Thread Diana Carroll
Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm missing something). Can someone help me out? I'm particularly interested in running SparkALS, which wants parameters: M U F iter slices What are these variables? They appear to

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Ian, I tried playing with your suggestion, but I get a task not serializable error (and some obvious things didn't fix it). Can you get that working? On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek wrote: > As to your last line: I've used RDD zipping to avoid GC since MyBaseData > is large and doe

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Sung Hwan Chung
That might be a good alternative to what we are looking for. But I wonder whether this would be as efficient as we want. For instance, will RDDs of the same size usually get partitioned to the same machines, thus not triggering any cross-machine aligning, etc.? We'll explore it, but I would still ver

Re: running SparkALS

2014-04-28 Thread Xiangrui Meng
Hi Diana, SparkALS is an example implementation of ALS. It doesn't call the ALS algorithm implemented in MLlib. M, U, and F are used to generate synthetic data. I'm updating the examples. In the meantime, you can take a look at the updated MLlib guide: http://50.17.120.186:4000/mllib-collaborativ

Re: running SparkALS

2014-04-28 Thread Sean Owen
Yeah you'd have to look at the source code in this case; it's not explained in the scaladoc or usage message as far as I can see either. The args refer specifically to the example of recommending Movies to Users. This example makes up a bunch of ratings and then makes recommendations using ALS. M

Re: running SparkALS

2014-04-28 Thread Debasish Das
Diana, Here are the parameters: ./bin/spark-class org.apache.spark.mllib.recommendation.ALS Usage: ALS <master> <ratings_file> <rank> <iterations> <output_dir> [<lambda>] [<implicitPrefs>] [<alpha>] [<blocks>] Master: Local/Deployed spark cluster master ratings_file: Netflix format data rank: Reduced dimension of the User and Product factors iterations: How many ALS iterations you wo
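
For comparison, calling MLlib's ALS programmatically (rather than through the example runner) looks roughly like the sketch below; the file path, input format, rank, iteration count, and lambda value are placeholders.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val sc = new SparkContext("local[2]", "als-sketch")
    // Expecting lines of the form "user,product,rating"
    val ratings = sc.textFile("path/to/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val rank = 10        // reduced dimension of the user/product factors
    val iterations = 20  // number of ALS iterations
    val model = ALS.train(ratings, rank, iterations, 0.01)   // 0.01 = lambda
    val score = model.predict(1, 42)                         // rating for user 1, product 42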

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
If you create your auxiliary RDD as a map from the examples, the partitioning will be inherited. On Mon, Apr 28, 2014 at 12:38 PM, Sung Hwan Chung wrote: > That might be a good alternative to what we are looking for. But I wonder > if this would be as efficient as we want to. For instance, will

Re: running SparkALS

2014-04-28 Thread Li Pu
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1 One thing which is undocumented: the integers representing users and items have to be positive. Otherwise it throws exceptions. Li On 28 avr. 2014, at 10:30, Diana Carroll wrote: > Hi everyone. I'm trying to run som

Re: running SparkALS

2014-04-28 Thread Diana Carroll
Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS, which is not in the mllib examples, and does not take any file parameters. I don't see the class you refer to in the examples ...however, if I did want to run that example, where would I find the file in question? It would be g

Read from list of files in parallel

2014-04-28 Thread Pat Ferrel
Warning noob question: The sc.textFile(URI) method seems to support reading from files in parallel but you have to supply some wildcard URI, which greatly limits how the storage is structured. Is there a simple way to pass in a URI list or is it an exercise left for the student?

Re: K-means with large K

2014-04-28 Thread Dean Wampler
You might also investigate other clustering algorithms, such as canopy clustering and nearest neighbors. Some of them are less accurate, but more computationally efficient. Often they are used to compute approximate clusters followed by k-means (or a variant thereof) for greater accuracy. dean O

RE: help

2014-04-28 Thread Laird, Benjamin
Joe, Do you have your SPARK_HOME variable set correctly in the spark-env.sh script? I was getting that error when I was first setting up my cluster, turned out I had to make some changes in the spark-env script to get things working correctly. Ben -Original Message- From: Joe L [mailt

RE: running SparkALS

2014-04-28 Thread Laird, Benjamin
Good clarification Sean. Diana, I was also referring to this example when setting up some of my bigger ALS runs. I don't think this particular example is very helpful, as it is creating the initial matrix locally in memory before parallelizing in Spark. So (unless I'm misunderstanding), it is an ok ex

Re: running SparkALS

2014-04-28 Thread Debasish Das
Don't use SparkALS... that's the first version of the code and does not scale... Li is right... you have to do the dictionary generation on users and products and then generate an indexed file... I wrote some utilities but it looks like it is application dependent... the indexed Netflix format is more generi

Re: running SparkALS

2014-04-28 Thread Diana Carroll
Should I file a JIRA to remove the example? I think it is confusing to include example code without explanation of how to run it, and it sounds like this one isn't worth running or reviewing anyway. On Mon, Apr 28, 2014 at 2:34 PM, Debasish Das wrote: > Don't use SparkALS...that's the first v

Re: pySpark memory usage

2014-04-28 Thread Jim Blomo
FYI, it looks like this "stdin writer to Python finished early" error was caused by a break in the connection to S3, from which the data was being pulled. A recent commit to PythonRDD noted that t

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
Hey Jim, This IOException thing is a general issue that we need to fix and your observation is spot-on. There is actually a JIRA for it here I created a few days ago: https://issues.apache.org/jira/browse/SPARK-1579 Aaron is assigned on that one but not actively working on it, so we'd welcome a P

Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
After upgrading to Spark 0.9.1, sbt assembly is failing. I'm trying to fix it with merge strategy, etc., but is anyone else seeing this? For example, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-0-9-1-assembly-fails-tp4979.html Sent from the Apa

Re: Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
So, good news and bad news. I have a customized Build.scala that allows me to use the 'run' and 'assembly' commands in sbt without toggling the 'pro

Re: Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Tathagata Das
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using SparkFlumeEvent.event. That should probably give you all the original text data. On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram wrote: > Hi Spark-users, > > Within my Spark Streaming program, I am able to ingest data
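
In Scala, pulling the text back out of a SparkFlumeEvent looks roughly like the sketch below (the original thread uses the Java API; this assumes the event body is an array-backed ByteBuffer, which it normally is for Avro-sourced events):

    import org.apache.spark.streaming.flume.SparkFlumeEvent

    def toLogLine(sparkEvent: SparkFlumeEvent): String = {
      val avroEvent = sparkEvent.event   // the underlying AvroFlumeEvent
      val body = avroEvent.getBody       // java.nio.ByteBuffer with the raw record
      // Decode, respecting the buffer's offset and length (assumes a heap buffer).
      new String(body.array(), body.arrayOffset() + body.position(), body.remaining(), "UTF-8")
    }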

Re: Spark 0.9.1 -- assembly fails?

2014-04-28 Thread kamatsuoka
Um. When I updated the spark dependency, I unintentionally deleted the "provided" attribute. Oops. Nothing to see here . . . -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-0-9-1-assembly-fails-tp4979p4982.html Sent from the Apache Spark User List mai

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
Matei, thank you. That seemed to work but I'm not able to import a class from my jar. Using the verbose options, I can see that my jar should be included Parsed arguments: ... jars /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar And I see the class I want to load in t

File list read into single RDD

2014-04-28 Thread Pat Ferrel
sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is
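
One way to do it, sketched below: list the directory with the Hadoop FileSystem API, filter by a regex, and hand sc.textFile() a comma-separated list of paths. The directory and pattern are placeholders, and a full directory-tree walk would need recursion on top of this single-level listing.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "file-list-sketch")
    val fs = FileSystem.get(new Configuration())
    val matching = fs.listStatus(new Path("/data/logs"))   // placeholder directory
      .map(_.getPath.toString)
      .filter(_.matches(""".*\.log$"""))                   // placeholder pattern
    val rdd = sc.textFile(matching.mkString(","))          // all matched files in one RDD
    println(rdd.count())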

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
A couple of issues: 1) the jar doesn't show up on the classpath even though SparkSubmit had it in the --jars options. I tested this by running > :cp in spark-shell 2) After adding it the classpath using (:cp /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar), it still fails.

processing s3n:// files in parallel

2014-04-28 Thread Art Peel
I’m trying to process 3 S3 files in parallel, but they always get processed serially. Am I going about this the wrong way? Full details below. Regards, Art I’m trying to do the following on a Spark cluster with 3 slave nodes. Given a list of N URLS for files in S3 (with s3n:// urls), For eac

launching concurrent jobs programmatically

2014-04-28 Thread ishaaq
Hi all, I have a central app that currently kicks off old-style Hadoop M/R jobs either on-demand or via a scheduling mechanism. My intention is to gradually port this app over to using a Spark standalone cluster. The data will remain on HDFS. Couple of questions: 1. Is there a way to get Spark j

Re: processing s3n:// files in parallel

2014-04-28 Thread Andrew Ash
The way you've written it there, I would expect that to be serial runs. The reason is, you're looping over matches with a driver-level map, which is serialized. Then calling foreachWith on the RDDs executes the action in a blocking way, so you don't get a result back until the cluster finishes.

Re: launching concurrent jobs programmatically

2014-04-28 Thread Andrew Ash
For the second question, you can submit multiple jobs through the same SparkContext via different threads and this is a supported way of interacting with Spark. From the documentation: Second, *within* each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they we
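
A small sketch of that pattern: two independent actions submitted from separate threads against one SparkContext (the data, futures, and timeouts are arbitrary).

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "concurrent-jobs-sketch")
    val data = sc.parallelize(1 to 1000000, 8).cache()

    // Each Future runs its action on its own thread; the scheduler interleaves
    // the jobs (FIFO by default, FAIR if configured).
    val jobA = Future { data.map(_ * 2L).reduce(_ + _) }
    val jobB = Future { data.filter(_ % 3 == 0).count() }

    println(Await.result(jobA, 10.minutes))
    println(Await.result(jobB, 10.minutes))
    sc.stop()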

Re: how to declare tuple return type

2014-04-28 Thread wxhsdp
you need to import org.apache.spark.rdd.RDD to include RDD. http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD here are some examples you can learn https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib SK wrote > I am a new user of
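
A short sketch combining both points: the tuple return type from the original question and an explicit RDD type, which needs the import mentioned above (the method body is a stand-in):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def summarize(sc: SparkContext): (Int, Int, Int) = {
      val nums: RDD[Int] = sc.parallelize(1 to 100)   // naming the RDD type needs the import
      (nums.count().toInt, nums.reduce((a, b) => math.min(a, b)), nums.reduce((a, b) => math.max(a, b)))
    }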

Re: questions about debugging a spark application

2014-04-28 Thread wxhsdp
Thanks for your reply, Daniel. What do you mean by "the logs contain everything to reconstruct the same data"? I also spent time looking into the logs, but only got a little out of them. As far as I can see, they log the flow of running the application, but there are no more details about each task; for example, see the

RE: questions about debugging a spark application

2014-04-28 Thread Liu, Raymond
If you are using the trunk code, you should be able to configure Spark to use the event log to record the application/task UI contents into the history server, and then check out the application/task details later. Different configs need to be set for standalone mode vs. yarn/mesos mode.

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
It would be useful to have some way to open multiple files at once into a single RDD (e.g. sc.textFile(iterable_over_uris)). Logically, it would be equivalent to opening a single file which is made by concatenating the various files together. This would only be useful, of course, if the source file

Re: processing s3n:// files in parallel

2014-04-28 Thread Andrew Ash
This is already possible with the sc.textFile("/path/to/filedir/*") call. Does that work for you? Sent from my mobile phone On Apr 29, 2014 2:46 AM, "Nicholas Chammas" wrote: > It would be useful to have some way to open multiple files at once into a > single RDD (e.g. sc.textFile(iterable_over_

Re: processing s3n:// files in parallel

2014-04-28 Thread Matei Zaharia
Actually wildcards work too, e.g. s3n://bucket/file1*, and I believe so do comma-separated lists (e.g. s3n://file1,s3n://file2). These are all inherited from FileInputFormat in Hadoop. Matei On Apr 28, 2014, at 6:05 PM, Andrew Ash wrote: > This is already possible with the sc.textFile("/path/

How to declare Tuple return type for a function

2014-04-28 Thread SK
Hi, I am a new user of Spark. I have a class that defines a function as follows. It returns a tuple: (Int, Int, Int). class Sim extends VectorSim { override def input(master:String): (Int,Int,Int) = { sc = new SparkContext(master, "Test") val ratings = s

question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
In classic MapReduce/Hadoop, you may optionally define setup() and cleanup() methods. They ( setup() and cleanup() ) are called for each task, so if you have 20 mappers running, the setup/cleanup will be called for each one. What is the equivalent of these in Spark? Thanks, best regards, Mahmoud

RE: help

2014-04-28 Thread Joe L
Yes, here it is. I set it up like this: export STANDALONE_SPARK_MASTER_HOST=`hostname` export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST ### Let's run everything with JVM runtime, instead of Scala export SPARK_LAUNCH_WITH_SCALA=10.2 export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib export SCALA_LIBRA

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the Repl. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover wrote: > A couple of issues: > 1) the

Re: question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Ameet Kini
I don't think there is a setup() or cleanup() in Spark but you can usually achieve the same using mapPartitions and having the "setup" code at the top of the mapPartitions and "cleanup" at the end. The reason why this usually works is that in Hadoop map/reduce, each map task runs over an input spl
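
A sketch of that pattern; the "setup" and "cleanup" here are stand-ins, and the partition is materialized so the cleanup really runs after all records are processed (with a lazy iterator, code after the map would otherwise run before the elements are consumed):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "setup-cleanup-sketch")
    val lines = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    val result = lines.mapPartitions { iter =>
      // "setup": open connections, compile regexes, load lookup tables, ...
      val prefix = "processed-"
      val out = iter.map(line => prefix + line).toList   // materialize before cleanup
      // "cleanup": close connections, flush buffers, ...
      out.iterator
    }
    result.collect().foreach(println)
    sc.stop()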

Shuffle phase is very slow, any help, thx!

2014-04-28 Thread gogototo
I have an application using GraphX, and some phases are very slow.
Stage Id | Description | Submitted | Duration ▴ | Tasks: Succeeded/Total | Shuffle Read | Shuffle Write
282 | reduce at VertexRDD.scala:91 | 2014/04/28 14:07:13 | 5.20 h | 100/100 | 3.8 MB
419 | zipPartitions at

Re: processing s3n:// files in parallel

2014-04-28 Thread Nicholas Chammas
Oh snap! I didn’t know that! Confirmed that both the wildcard syntax and the comma-separated syntax work in PySpark. For example: sc.textFile('s3n://file1,s3n://file2').count() Art, Would this approach work for you? It would let you load your 3 files into a single RDD, which your workers could

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example: sc.textFile('/path/to/file1,/path/to/file2') So once you have your list of files, concatenate their paths like that and pass the single string to textFile().

Re: File list read into single RDD

2014-04-28 Thread Pat Ferrel
Perfect. BTW just so I know where to look next time, was that in some docs? On Apr 28, 2014, at 7:04 PM, Nicholas Chammas wrote: Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load. For example: sc.textFile('/path/to/

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs. http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html But that's not ob

Why Spark require this object to be serializerable?

2014-04-28 Thread Earthson
The problem is this object can't be Serializable: it holds an RDD field and a SparkContext. But Spark shows an error that it needs serialization. The order of my debug output is really strange. ~ Training Start! Round 0 Hehe? Hehe? started? failed? Round 1 Hehe? ~ here is my code 69 impo

Re: question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
Thank you very much Ameet! Can you please point me to an example? Best, Mahmoud Sent from my iPhone On Apr 28, 2014, at 6:32 PM, "Ameet Kini" mailto:ameetk...@gmail.com>> wrote: I don't think there is a setup() or cleanup() in Spark but you can usually achieve the same using mapPartitions and

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set spark.executor.extraLibraryPath. On Mon, Apr 28, 2014 at 9:16 AM, Shubham Chopra wrote: > I am trying to use Spark/MLLib on Yarn and do not have libgfortran > installed on my cluster. Is there any way I can set LD_LIBRARY_PATH s

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread Earthson
Or what is the action that makes the RDD run? I don't want to save it as a file, and I've tried cache(); it seems to be some kind of lazy too. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-require-this-object-to-be-serializerable-tp5009p5011.html Se

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
This can only be a local filesystem though, it can't refer to an HDFS location. This is because it gets passed directly to the JVM. On Mon, Apr 28, 2014 at 9:55 PM, Patrick Wendell wrote: > Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set > spark.executor.extraLibraryPath. >

Re: NullPointerException when run SparkPI using YARN env

2014-04-28 Thread Patrick Wendell
This was fixed in master. I think this happens if you don't set HADOOP_CONF_DIR to the location where your hadoop configs are (e.g. yarn-site.xml). On Sun, Apr 27, 2014 at 7:40 PM, martin.ou wrote: > 1.my hadoop 2.3.0 > 2.SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > 3.SPARK_YARN

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread Earthson
The RDD holds "this" in its closure? How do I fix such a problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-require-this-object-to-be-serializerable-tp5009p5015.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
In general, as Andrew points out, it's possible to submit jobs from multiple threads and many Spark applications do this. One thing to check out is the job server from Ooyala, this is an application on top of Spark that has an automated submission API: https://github.com/ooyala/spark-jobserver Yo

Re: Shuffle Spill Issue

2014-04-28 Thread Patrick Wendell
Could you explain more about what your job is doing and what data types you are using? These numbers alone don't necessarily indicate something is wrong. The relationship between the in-memory and on-disk shuffle amount is definitely a bit strange; the data gets compressed when written to disk, but unles

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Roger Hoover
Patrick, Thank you for replying. That didn't seem to work either. I see the option parsed using verbose mode. Parsed arguments: ... driverExtraClassPath /Users/rhoover/Work/spark-etl/target/scala-2.10/spark-etl_2.10-1.0.jar But the jar still doesn't show up if I run ":cp" in the repl and t

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread DB Tsai
You are using some objects outside the scope of the train method, so Spark has to serialize the ADLDA model. You can just keep a local copy or reference of those objects in the train method. On Apr 28, 2014 8:48 PM, "Earthson" wrote: > The problem is this object can't be Serializerable, it holds a RDD fie

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread Earthson
I've moved SparkContext and RDD to parameters of train, and now it tells me that SparkContext needs to serialize! I think the problem is that the RDD is trying to make itself lazy, and some broadcast objects need to be generated dynamically, so the closure has SparkContext inside, so the task complete faile

RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi Patrick, I am just doing a simple word count; the data is generated by the Hadoop random text writer. This seems to me not quite related to compression. If I turn off compression on shuffle, the metrics are something like below for the smaller 240MB dataset. Executor ID Address Ta

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread DB Tsai
Your code is unformatted. Can you paste the whole file in a gist and I can take a look for you. On Apr 28, 2014 10:42 PM, "Earthson" wrote: > I've moved SparkContext and RDD as parameter of train. And now it tells me > that SparkContext need to serialize! > > I think the the problem is RDD is trying to

How to run spark well on yarn

2014-04-28 Thread Sophia
Hi, I am Sophia. I followed a blog from the Internet to configure and test Spark on YARN, which has Hadoop 2.0.0-CDH4 configured. The Spark version is 0.9.1, the Scala version is 2.11.0-RC4. cd spark-0.9.1 SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 SPARK_YARN=true sbt/sbt assembly This cannot work, Invalid o

Re: Why Spark require this object to be serializerable?

2014-04-28 Thread Earthson
The code is here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala I've changed it from Broadcast to Serializable. Now it works :) But there are too many RDD caches. Is that the problem? -- View this message in context: http://apache-spark-user-list.1

RE: Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Kulkarni, Vikram
Thanks Tathagata. Here’s the code snippet: // insert the records read in this batch interval into DB flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() { @Override public Void call(JavaRDD<SparkFlumeEvent> eventsData) throws Exception { String logRecord = null

Re: What is the recommended way to store state across RDDs?

2014-04-28 Thread Gerard Maas
Have you tried 'updateStateByKey' [1]? I think it's meant to cover the use case you mention. [1] http://spark.apache.org/docs/0.9.1/streaming-programming-guide.html#transformations -kr, Gerard. On Mon, Apr 28, 2014 at 6:44 PM, Adrian Mocanu wrote: > What is the recommended way to st
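
A sketch of updateStateByKey keeping a running (count, sum) per key across batches, from which an average can be derived; the socket source, the single key, and the checkpoint path are placeholders.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext("local[2]", "running-average-sketch", Seconds(5))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful transformations

    // One value per line arriving on a socket, all under a single key for simplicity.
    val values = ssc.socketTextStream("localhost", 9999).map(line => ("metric", line.toDouble))

    // State carried across batches: (count, sum); the average falls out of it.
    val averages = values.updateStateByKey[(Long, Double)] { (newValues, state) =>
      val (count, sum) = state.getOrElse((0L, 0.0))
      Some((count + newValues.size, sum + newValues.sum))
    }.mapValues { case (count, sum) => if (count == 0) 0.0 else sum / count }

    averages.print()
    ssc.start()
    ssc.awaitTermination()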