openstack swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in openstack/swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

Re: Normalizations in MLBase

2014-06-13 Thread Aslan Bekirov
Thanks a lot DB. I will test it and let you know the results. BR, Aslan On Fri, Jun 13, 2014 at 12:34 AM, DB Tsai wrote: > Hi Aslan, > > I'm not sure if mlbase code is maintained for the current spark > master. The following is the code we use for standardization in my > company. I'm intended

Spark 1.0.0 on yarn cluster problem

2014-06-13 Thread Sophia
With the yarn-client mode, I submit a job from the client to YARN, and the Spark file spark-env.sh: export HADOOP_HOME=/usr/lib/hadoop export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop SPARK_EXECUTOR_INSTANCES=4 SPARK_EXECUTOR_CORES=1 SPARK_EXECUTOR_MEMORY=1G SPARK_DRIVER_MEMORY=2G SPARK_YARN_APP_NAME="Spar

Convert text into tfidf vectors for Classification

2014-06-13 Thread Stuti Awasthi
Hi all, I wanted to perform Text Classification using Spark 1.0 Naïve Bayes. I was looking for a way to convert text into sparse vectors with a TFIDF weighting scheme. I found that the MLI library supports that, but it is only compatible with Spark 0.8. What are all the options available to achieve text ve

Re: Convert text into tfidf vectors for Classification

2014-06-13 Thread Xiangrui Meng
You can create tf vectors and then use RowMatrix.computeColumnSummaryStatistics to get df (numNonzeros). For tokenizer and stemmer, you can use scalanlp/chalk. Yes, it is worth having a simple interface for it. -Xiangrui On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi wrote: > Hi all, > > > > I wa
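
A minimal sketch of that suggestion with the Spark 1.0-era APIs (the whitespace tokenizer, the 10,000-feature hashing size, and the file path are placeholders; the built-in HashingTF/IDF helpers came later, so the TF vectors are built by hand here):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val sc = new SparkContext("local", "tfidf-sketch")
    val numFeatures = 10000

    // Hand-rolled term-frequency vectors via the hashing trick.
    val tf = sc.textFile("docs.txt").map { line =>
      val counts = scala.collection.mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
      line.toLowerCase.split("\\s+").foreach { t =>
        counts(math.abs(t.hashCode) % numFeatures) += 1.0
      }
      Vectors.sparse(numFeatures, counts.toSeq)
    }.cache()

    // Document frequency per term = number of non-zero entries in each column.
    val numDocs = tf.count()
    val df = new RowMatrix(tf).computeColumnSummaryStatistics().numNonzeros.toArray
    val idf = df.map(d => if (d > 0) math.log(numDocs / d) else 0.0)

    // Re-weight each TF vector by its terms' IDF values.
    val tfidf = tf.map { case v: SparseVector =>
      Vectors.sparse(v.size, v.indices,
        v.indices.zip(v.values).map { case (i, tfVal) => tfVal * idf(i) })
    }

A tokenizer/stemmer such as scalanlp/chalk could be plugged into the map in place of the whitespace split.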

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-13 Thread visenger
Hi guys, I ran into the same exception (while trying the same example), and after overriding the hadoop-client artifact in my pom.xml, I got another error (below). System config: Ubuntu 12.04, IntelliJ 13, Scala 2.10.3, Maven: org.apache.spark spark-core_2.10 1.

list of persisted rdds

2014-06-13 Thread mrm
Hi, How do I check the RDDs that I have persisted? I have some code that looks like: "rd1.cache() rd2.cache() ... rdN.cache()" How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)? Thank you!

Re: list of persisted rdds

2014-06-13 Thread Daniel Darabos
Check out SparkContext.getPersistentRDDs! On Fri, Jun 13, 2014 at 1:06 PM, mrm wrote: > Hi, > > How do I check the rdds that I have persisted? I have some code that looks > like: > "rd1.cache() > > rd2.cache() > ... > > rdN.cache()" > > How can I unpersist all rdd's at once? And is it
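
For the "unpersist all at once" part of the question, a minimal sketch on top of that call (Scala API):

    // getPersistentRDDs returns a Map[Int, RDD[_]] keyed by RDD id.
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"RDD $id: ${rdd.name} at ${rdd.getStorageLevel}")
      rdd.unpersist()   // drop it from the cache
    }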

Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

2014-06-13 Thread Yana Kadiyska
Hi, I see this has been asked before but has not gotten any satisfactory answer so I'll try again: (here is the original thread I found: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E ) I have a set of workers dying and coming back agai

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Daniel, Thank you for your help! This is the sort of thing I was looking for. However, when I type "sc.getPersistentRDDs", i get the error "AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'". I don't get any error when I type "sc.defaultParallelism" for example. I wou

Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same. java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFil

BUG? Why does MASTER have to be set to spark://hostname:port?

2014-06-13 Thread Hao Wang
Hi, all When I try to run Spark PageRank using: ./bin/spark-submit \ --master spark://192.168.1.12:7077 \ --class org.apache.spark.examples.bagel.WikipediaPageRank \ ~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \ hdfs://192.168.1.12:9000/freebase-13G 0.05 100 Tru

Re: how to set spark.executor.memory and heap size

2014-06-13 Thread Hao Wang
Hi, Laurent You could set spark.executor.memory and heap size by the following methods: 1. in your conf/spark-env.sh: *export SPARK_WORKER_MEMORY=38g* *export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m"* 2. you could also add modification for
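
For completeness, the executor memory setting can also be passed through SparkConf or spark-submit; a small sketch with placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    // The 2g value is only an example; pick what fits your nodes.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    // Equivalent on the command line: ./bin/spark-submit --executor-memory 2g ...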

Re: Transform pair to a new pair

2014-06-13 Thread lalit1303
Hi, You can use map functions like flatMapValues and mapValues, which will apply the map function on each pair RDD contained in your input pair DStream and return a pair DStream On Fri, Jun 13, 2014 at 8:48 AM, ryan_seq [via Apache Spark User List] < ml-node+s1001560n7550...@n3.nabble.com> wr
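
A small sketch of those two calls on a pair DStream (the socket source, port, and comma-separated format are just placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // brings in the pair DStream functions

    val ssc = new StreamingContext(sc, Seconds(10))
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => (a(0), a(1).toInt))

    val doubled  = pairs.mapValues(_ * 2)                  // one value in, one out, key unchanged
    val expanded = pairs.flatMapValues(v => Seq(v, v + 1)) // one value in, zero or more out

    doubled.print()
    expanded.print()
    ssc.start()
    ssc.awaitTermination()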

Re: list of persisted rdds

2014-06-13 Thread Mayur Rustagi
val myRdds = sc.getPersistentRDDs assert(myRdds.size === 1) It'll return a map. It's pretty old; available from 0.8.0 onwards. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Jun 13, 2014 at 9:42 AM, mrm w

Re: Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

2014-06-13 Thread Mayur Rustagi
I have also had trouble with workers joining the working set. I have typically moved to a Mesos-based setup. Frankly, for high availability you are better off using a cluster manager. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: multiple passes in mapPartitions

2014-06-13 Thread Mayur Rustagi
Sorry if this is a dumb question, but why not make several calls to mapPartitions sequentially? Are you looking to avoid function serialization, or is your function damaging partitions? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
This appears to be missing from PySpark. Reported in SPARK-2141. On Fri, Jun 13, 2014 at 10:43 AM, Mayur Rustagi wrote: > > > val myRdds = sc.getPersistentRDDs > > assert(myRdds.size === 1) > > > > It'll return a map. Its pretty old 0.

Re: specifying fields for join()

2014-06-13 Thread Mayur Rustagi
You can resolve the columns to create keys from them, then join. Is that what you did? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Jun 12, 2014 at 9:24 PM, SK wrote: > This issue is resolved. > > > > -- > Vi

Re: Java Custom Receiver onStart method never called

2014-06-13 Thread jsabin
I just forgot to call start on the context. Works now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Java-Custom-Receiver-onStart-method-never-called-tp7525p7579.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/list-of-persisted-rdds-tp7564p7581.html Sent from the Apache Spark User List mailing list archive at Nabble

process local vs node local subtlety question/issue

2014-06-13 Thread Albert Chu
There is probably a subtlety between the ability to run tasks with data process-local and node-local that I think I'm missing. I'm doing a basic test which is the following: 1) Copy a large text file from the local file system into HDFS using hadoop fs -copyFromLocal 2) Run Spark's wordcount exa

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
Yeah, unfortunately PySpark still lags behind the Scala API a bit, but it's being patched up at a good pace. On Fri, Jun 13, 2014 at 1:43 PM, mrm wrote: > Hi Nick, > > Thank you for the reply, I forgot to mention I was using pyspark in my > first > message. > > Maria > > > > -- > View this mess

Re: Spilled shuffle files not being cleared

2014-06-13 Thread Michael Chang
Thanks Saisai, I think I will just try lowering my spark.cleaner.ttl value - I've set it to an hour. On Thu, Jun 12, 2014 at 7:32 PM, Shao, Saisai wrote: > Hi Michael, > > > > I think you can set up spark.cleaner.ttl=xxx to enable time-based metadata > cleaner, which will clean old un-used shu
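
A sketch of where that setting would go (the one-hour value mirrors what is described above):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("streaming-job")
      .set("spark.cleaner.ttl", "3600")  // seconds; periodically drops metadata and shuffle files older than this
    val sc = new SparkContext(conf)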

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-13 Thread Michael Chang
I'm interested in this issue as well. I have Spark Streaming jobs that seem to run well for a while, but slowly degrade and don't recover. On Wed, Jun 11, 2014 at 11:08 PM, Boduo Li wrote: > It seems that the slow "reduce" tasks are caused by slow shuffling. Here is > the logs regarding one s

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-13 Thread Congrui Yi
? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Missing-Regularization-Parameter-and-Intercept-for-Logistic-Regression-tp7522p7586.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-13 Thread Manish Amde
Hi Suraj, I can't answer 1) without knowing the data. However, the results for 2) are surprising indeed. We have tested with a billion samples for regression tasks, so I am perplexed by the behavior. Could you try the latest Spark master to see whether this problem goes away? It has code that li

MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-13 Thread Congrui Yi
Hi All, I'm new to Spark and currently exploring around the ML package mllib. Two questions about LogisticRegressionWithSGD in MLlib classification package. 1. I have checked the source code on github and found that the class LogisticRegressionWithSGD has "regParam" as its private argument. I canno

MLlib-a problem of example code for L-BFGS

2014-06-13 Thread Congrui Yi
Hi All, I'm new to Spark. Just tried out the example code on the Spark website for L-BFGS. But the code "val model = new LogisticRegressionModel(..." gave me an error: :19: error: constructor LogisticRegressionModel in class LogisticRegressionModel cannot be accessed in class $iwC val model

Re: How to specify executor memory in EC2 ?

2014-06-13 Thread Aliaksei Litouka
Aaron, spark.executor.memory is set to 2454m in my spark-defaults.conf, which is a reasonable value for EC2 instances which I use (they are m3.medium machines). However, it doesn't help and each executor uses only 512 MB of memory. To figure out why, I examined spark-submit and spark-class scripts

Re: specifying fields for join()

2014-06-13 Thread SK
I used groupBy to create the keys for both RDDs. Then I did the join. I think, though, it would be useful if in the future Spark allowed us to specify the fields on which to join, even when the keys are different. Scalding allows this feature. -- View this message in context: http://apache-spark-

spark-submit fails to get jar from http source

2014-06-13 Thread lbustelo
I'm running a 1.0.0 standalone cluster based on amplab/dockerscripts with 3 workers. I'm testing out spark-submit and I'm getting errors using *--deploy-mode cluster* and using an http:// url to my JAR. I'm getting the following error back. Sending launch command to spark://master:7077 Driver succ

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui, Since it's private to the mllib package, one workaround would be to write your code in a Scala file under the mllib package in order to use the constructor of LogisticRegressionModel. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn
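
A sketch of that workaround (the object and method names here are made up; the point is only that the file declares itself inside the org.apache.spark.mllib namespace so the private[mllib] constructor becomes visible when compiled into your own jar):

    // Compile this into your own project, but under the mllib package namespace.
    package org.apache.spark.mllib.classification

    import org.apache.spark.mllib.linalg.Vector

    object LogisticRegressionModelFactory {
      // weights and intercept would come from your own optimizer run, e.g. L-BFGS.
      def create(weights: Vector, intercept: Double): LogisticRegressionModel =
        new LogisticRegressionModel(weights, intercept)
    }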

Re: Command exited with code 137

2014-06-13 Thread Jim Blomo
I've seen these caused by the OOM killer. I recommend checking /var/log/syslog to see if it was activated due to lack of system memory. On Thu, Jun 12, 2014 at 11:45 PM, libl <271592...@qq.com> wrote: > I use standalone mode submit task.But often,I got an error.The stacktrace as > > 2014-06-12 11

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread Congrui Yi
Hi DB, Thank you for the help! I'm new to this, so could you give a bit more details how this could be done? Sincerely, Congrui Yi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html Sent from the

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
This is very odd. If it is running fine on Mesos, I don't see an obvious reason why it won't work on a Spark standalone cluster. Is the .4 million file already present in the monitored directory when the context is started? In that case, the file will not be picked up (unless textFileStream is created w

Re: Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

2014-06-13 Thread Gino Bustelo
I get the same problem, but I'm running in a dev environment based on docker scripts. The additional issue is that the worker processes do not die and so the docker container does not exit. So I end up with worker containers that are not participating in the cluster. On Fri, Jun 13, 2014 at 9:44

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread praveshjain1991
There doesn't seem to be any obvious reason - that's why it looks like a bug. The .4 million file is present in the directory when the context is started - same as for all other files (which are processed just fine by the application). In the logs we can see that the file is being picked up by the

How Spark Choose Worker Nodes for respective HDFS block

2014-06-13 Thread anishs...@yahoo.co.in
Hi All I am new to Spark, working on a 3-node test cluster. I am trying to explore Spark's scope in analytics; my Spark code interacts with HDFS mostly. I am confused about how Spark chooses the node on which it will distribute its work. Since we assume that it can be an alternative to Hadoop MapRe

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
In the logs you posted (the 2nd set), I don't see the file being picked up. The lines having "FileInputDStream: Finding new files ..." should show the file name that has been picked up, and I don't see any file in the second set of logs. If the file is already present in the directory by the time streami

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread praveshjain1991
If you look at the file 400k.output, you'll see the string file:/newdisk1/praveshj/pravesh/data/input/testing4lk.txt This file contains 0.4 mn records. So the file is being picked up but the app goes on to hang later on. Also you mentioned the term "Standalone cluster" in your previous reply

Re: process local vs node local subtlety question/issue

2014-06-13 Thread Nicholas Chammas
On Fri, Jun 13, 2014 at 1:55 PM, Albert Chu wrote: > 1) How is this data process-local? I *just* copied it into HDFS. No > spark worker or executor should have loaded it. > Yeah, I thought that PROCESS_LOCAL meant the data was already in the JVM on the worker node, but I do see the same thing

guidance on simple unit testing with Spark

2014-06-13 Thread SK
Hi, I have looked through some of the test examples and also the brief documentation on unit testing at http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but I still don't have a good understanding of writing unit tests using the Spark framework. Previously, I have written uni

spark.eventLog.enabled not working on spark on AWS EC2

2014-06-13 Thread zhen
I have been trying to record event logging in a standalone application submitted to spark on AWS EC2. However, the application keeps failing when trying to write to the event logs. I tried various different logging directories by setting spark.eventLog.dir but it does not work. I tried the followi

convert List to RDD

2014-06-13 Thread SK
Hi, I have a List[ (String, Int, Int) ] that I would like to convert to an RDD. I tried to use sc.parallelize and sc.makeRDD, but in each case the original order of items in the List gets modified. Is there a simple way to convert a List to an RDD without using SparkContext? thanks -- View this

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
I may be wrong, but I think RDDs must be created inside a SparkContext. To somehow preserve the order of the list, perhaps you could try something like: sc.parallelize((1 to xs.size).zip(xs)) On Fri, Jun 13, 2014 at 6:08 PM, SK wrote: > Hi, > > I have a List[ (String, Int, Int) ] that I would li
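
Spelled out a bit, that index-carrying idea might look like this (the sample list and the sortByKey step are just one way to do it):

    import org.apache.spark.SparkContext._  // pair RDD functions such as sortByKey

    val xs = List(("a", 1, 2), ("b", 3, 4), ("c", 5, 6))

    // Keep the original position as the key.
    val indexed = sc.parallelize(xs.zipWithIndex.map { case (x, i) => (i, x) })

    // ... transformations that preserve the index key ...

    // Restore the original order when collecting back.
    val backInOrder = indexed.sortByKey().map(_._2).collect()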

spark master UI does not keep detailed application history

2014-06-13 Thread zhen
I have been trying to get detailed history of previous spark shell executions (after exiting spark shell). In standalone mode and Spark 1.0, I think the spark master UI is supposed to provide detailed execution statistics of all previously run jobs. This is supposed to be viewable by clicking on th

Re: convert List to RDD

2014-06-13 Thread SK
Thanks. But that did not work. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/convert-List-to-RDD-tp7606p7609.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
Sorry I wasn't being clear. The idea off the top of my head was that you could append an original position index to each element (using the line above), and modify whatever processing functions you have in mind to make them aware of these indices. And I think you are right that RDD collections a

printing in unit test

2014-06-13 Thread SK
Hi, My unit test is failing (the output is not matching the expected output). I would like to printout the value of the output. But rdd.foreach(r=>println(r)) does not work from the unit test. How can I print or write out the output to a file/screen? thanks. -- View this message in context: h
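
A small sketch of the usual workaround for tests with small RDDs: collect to the driver and print or assert on the result there (expectedOutput is a placeholder for whatever the test expects):

    // Bring the data back to the driver first, then print it.
    rdd.collect().foreach(println)

    // Or materialize it and compare against what the test expects.
    val result = rdd.collect().toList
    assert(result == expectedOutput)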

Re: guidance on simple unit testing with Spark

2014-06-13 Thread Matei Zaharia
You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you can’t pass one from outside, and your code has to read data from a file and write it
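
A sketch of that factoring, assuming ScalaTest (the names are placeholders): the logic lives in a function that takes and returns RDDs, and the test wraps it in a local SparkContext.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // for reduceByKey
    import org.apache.spark.rdd.RDD
    import org.scalatest.FunSuite

    object WordCountLogic {
      // The testable piece: no SparkContext creation, no file I/O.
      def countWords(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    }

    class WordCountLogicSuite extends FunSuite {
      test("counts words") {
        val sc = new SparkContext("local", "test")
        try {
          val out = WordCountLogic.countWords(sc.parallelize(Seq("a b a"))).collect().toMap
          assert(out === Map("a" -> 2, "b" -> 1))
        } finally {
          sc.stop()   // release the context so other tests can create one
        }
      }
    }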

Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2

MLLib : Decision Tree with minimum points per node

2014-06-13 Thread Justin Yip
Hello, I have been playing around with mllib's decision tree library. It is working great, thanks. I have a question regarding overfitting. It appears to me that the current implementation doesn't allow the user to specify the minimum number of samples per node. This results in some nodes only conta

Re: multiple passes in mapPartitions

2014-06-13 Thread zhen
Thank you for your suggestion. We will try it out and see how it performs. We think the single call to mapPartitions will be faster but we could be wrong. It would be nice to have a "clone method" on the iterator. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions() works fine locally. Now I can do groupByKey et al. I am not sure if it is scalable to millions of records and memory efficient. Cheers On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar wrote: > Hi, > Would appreciate insights

Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-06-13 Thread anishs...@yahoo.co.in
Hi All Is there any communication between Spark MASTER node and Hadoop NameNode while distributing work to WORKER nodes, like we have in MapReduce? Please suggest TIA -- Anish Sneh "Experience is the best teacher." http://in.linkedin.com/in/anishsneh

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total and unique counts. The question: is it scalable and efficient? Would appreciate insights. Cheers On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar wrote: > Answered one of my questi
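
On the scalability question: groupByKey pulls each key's full value list into memory. A hedged alternative sketch that only keeps running counts per key (assuming pairs is the (key, value) RDD built earlier):

    // Total occurrences per key.
    val totals = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

    // Unique values per key: de-duplicate the (key, value) pairs first.
    val uniques = pairs.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

    // (key, (total, unique)), matching the total and unique numbers above.
    val res = totals.join(uniques)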

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-13 Thread Xiangrui Meng
1. "examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala" contains example code that shows how to set regParam. 2. A static method with more than 3 parameters becomes hard to remember and hard to maintain. Please use LogistricRegressionWithSGD's default constructor a