Re: key class requirement for PairedRDD ?

2014-10-17 Thread Sonal Goyal
We use our custom classes, which are Serializable and have well-defined hashCode and equals methods, through the Java API. What's the issue you are getting? Best Regards, Sonal Nube Technologies On Fri, Oct 17, 2014 at 12:28 PM, Jaon
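
A minimal Scala sketch of a key that satisfies these requirements (PersonKey and the numbers are made up for illustration; a case class gets Serializable plus structural equals/hashCode for free, which is what shuffle-based pair RDD operations need):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

    // Two PersonKey("1", "a") values compare equal and hash the same,
    // so they land in the same reduce bucket.
    case class PersonKey(id: String, group: String)

    def countPerPerson(sc: SparkContext) = {
      val events = sc.parallelize(Seq(
        (PersonKey("1", "a"), 10),
        (PersonKey("1", "a"), 5),
        (PersonKey("2", "b"), 7)))
      events.reduceByKey(_ + _)   // the two PersonKey("1", "a") entries collapse to one
    }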

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
Here is what I'm trying to do. My case class is the following: case class PersonID(id: String, group: String, name: String) I want to use PersonID as a key in a PairedRDD. But I think the default equals function doesn't fit my need, because two PersonID("a","a","a") are not the same. When I use a tu

MLlib and pyspark features

2014-10-17 Thread poiuytrez
Hello, I would like to use areaUnderROC from MLlib in Apache Spark. I am currently running Spark 1.1.0, and this function is not available in pyspark but is available in Scala. Is there a feature tracker that tracks the progress of porting Scala APIs to Python APIs? I have tried to search in t

Re: MLlib linking error Mac OS X

2014-10-17 Thread poiuytrez
Hello MLnick, Have you found a solution for installing MLlib on Mac OS? I also have some trouble installing the dependencies. Best, poiuytrez

why do RDD's partitions migrate between worker nodes in different iterations

2014-10-17 Thread randylu
Dear all, In my test program, there are 3 partitions for each RDD, and the iteration procedure is as follows: var rdd_0 = ... // init for (...) { rdd_1 = rdd_0.reduceByKey(...).partitionBy(p) // calculate rdd_1 from rdd_0 rdd_0 = rdd_0.partitionBy(p).join(rdd_1)... // upd

Re: How does reading the data from Amazon S3 works?

2014-10-17 Thread jan.zikes
Hi, I have seen in the video from the Spark Summit that (when I use HDFS) data are usually distributed across the whole cluster and computations usually go to the data. My question is: how does it work when I read the data from Amazon S3? Is the whole input dataset read by the master node and

How to assure that there will be run only one map per cluster node?

2014-10-17 Thread jan.zikes
Hi, I have a cluster that has several nodes, and every node has several cores. I'd like to run a multi-core algorithm within every map, so I'd like to ensure that only one map runs per cluster node. Is there some way to ensure this? It seems to me that it should be possible b

Gracefully stopping a Spark Streaming application

2014-10-17 Thread Massimiliano Tomassi
Hi all, I have a Spark Streaming application running on a cluster, deployed with the spark-submit script. I was reading here that it's possible to gracefully shut down the application in order to allow the deployment of a new one: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Re: Help required on exercise Data Exploratin using Spark SQL

2014-10-17 Thread neeraj
Hi, When I run the given Spark SQL commands in the exercise, they return unexpected results. I'm explaining the results below for quick reference: 1. The output of the query wikiData.count() shows some count from the file. 2. After running the following query: sqlContext.sql("SELECT username, COUNT(*) A

What's wrong with my spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-17 Thread shahab
Hi, Probably I am missing a very simple principle, but something is wrong with my filter; I get an "org.apache.spark.SparkException: Task not serializable" exception. Here is my filter function: object OBJ { def f1(): Boolean = { var i = 1; for (j <- 1 to 10) i = i + 1; true

Re: What's wrong with my spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-17 Thread Sourav Chandra
It might be because the object is nested within some class which may not be serializable. Also, you can run the application using this JVM parameter to see detailed info about serialization: -Dsun.io.serialization.extendedDebugInfo=true On Fri, Oct 17, 2014 at 4:07 PM, shahab wrote: > Hi, > > Pro

Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi all, I need to compute a similarity between elements of two large sets of high-dimensional feature vectors. Naively, I create all possible pairs of vectors with features1.cartesian(features2) and then map the resulting paired RDD with my similarity function. The problem is that the cartesian

Designed behavior when master is unreachable.

2014-10-17 Thread preeze
Hi all, I am running a standalone Spark cluster with a single master. No HA or failover is configured explicitly (no ZooKeeper etc). What is the designed default behavior for submitting new jobs when the single master has gone down or become unreachable? I couldn't find it documented anywhere. Than

Regarding using spark sql with yarn

2014-10-17 Thread twinkle sachdeva
Hi, I have been using Spark SQL with YARN. It works fine in yarn-client mode, but with yarn-cluster mode we are facing 2 issues. Is yarn-cluster mode not recommended for Spark SQL using HiveContext? Problem #1: We are not able to use any query with a very simple filtering operation "like",

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there is a way you can reduce the problem space to make sure you only join subsets with each other, that may be helpful. Maybe if you explain your problem in more detail, people on the list can come up with more suggestions. Best R

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, Thanks a lot for your reply. It is true that the Python API has default parameters except rank (the default iterations is 5). At the very beginning, in order to estimate the speed of ALS.trainImplicit, I used ALS.trainImplicit(ratings, rank, 1) and it worked. So I tried ALS with more iterations,

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, Today, I tried again with the following code, but it didn't work... Could you please tell me your running environment? from pyspark.mllib.recommendation import ALS from pyspark import SparkContext sc = SparkContext() r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.paral

Re: Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-17 Thread Michael Campbell
For posterity's sake, I solved this. The problem was the Cloudera cluster I was submitting to is running 1.0, and I was compiling against the latest 1.1 release. Downgrading to 1.0 on my compile got me past this. On Tue, Oct 14, 2014 at 6:08 PM, Michael Campbell < michael.campb...@gmail.com> wro

What is akka-actor_2.10-2.2.3-shaded-protobuf.jar?

2014-10-17 Thread Ruebenacker, Oliver A
Hello, My SBT pulls in, among others, the following dependency for Spark 1.1.0: akka-actor_2.10-2.2.3-shaded-protobuf.jar What is this? How is this different from the regular Akka Actor JAR? How do I reconcile with other libs that use Akka, such as Play? Thanks! Best, Olive

Strange duplicates in data when scaling up

2014-10-17 Thread Jacob Maloney
The issue was solved by clearing the HashMap and HashSet at the beginning of the call method. From: Jacob Maloney [mailto:jmalo...@conversantmedia.com] Sent: Thursday, October 16, 2014 5:09 PM To: user@spark.apache.org Subject: Strange duplicates in data when scaling up I have a flatmap function that s

Re: Join with large data set

2014-10-17 Thread Ankur Srivastava
Hi Sonal, Thank you for the response, but since we are joining to reference data, different partitions of application data would need to join with the same reference data, and thus I am not sure if a Spark join would be a good fit for this. E.g. our application data has a person with zip code and then the refe

Re: Help required on exercise Data Exploratin using Spark SQL

2014-10-17 Thread Michael Armbrust
Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration sqlContext.sql("set spark.sql.parquet.bi
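
The command is cut off above; the flag in question is presumably spark.sql.parquet.binaryAsString. A minimal sketch of setting it before loading the exercise data (assuming an existing SparkContext sc and the exercise's data/wiki_parquet path):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    def loadWikiData(sc: SparkContext) = {
      val sqlContext = new SQLContext(sc)
      // Interpret Parquet BINARY columns as strings, as needed for files
      // written by older Spark SQL versions.
      sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
      sqlContext.parquetFile("data/wiki_parquet")
    }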

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, I created an issue in JIRA. https://issues.apache.org/jira/browse/SPARK-3990 I uploaded the error information in JIRA. Thanks in advance for your help. Best Gen Davies Liu-2 wrote > It seems a bug, Could you create a JIRA for it? thanks!

bug with MapPartitions?

2014-10-17 Thread davidkl
Hello, Maybe there is something I do not understand, but I believe this code should not throw any serialization error when I run it in the spark shell. Using similar code with map instead of mapPartitions works just fine. import java.io.BufferedInputStream import java.io.FileInputStream

Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
Thank you for sharing this Cheng! This is fantastic. I was able to implement it and it seems like it's working quite well. I'm definitely on the right track now! I'm still having a small problem with the rows inside each partition being out of order - but I suspect this is because in the code curr

Re: Folding an RDD in order

2014-10-17 Thread Cheng Lian
Hm, a little confused here. What ordering exactly do you expect? It seems that you want all the elements in the RDD to be sorted first by timestamp and then by user_id. If this is true, then you can simply do this: rawData.map { case (time, user, amount) => (time, user) -> amount }.sortBy
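
The snippet is truncated above; a sketch of the idea, assuming rawData holds (timestamp, userId, amount) triples:

    import org.apache.spark.SparkContext._   // sortByKey on pair RDDs in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Tuple keys sort lexicographically: first by time, then by user.
    def sortByTimeThenUser(rawData: RDD[(Long, String, Double)]) =
      rawData
        .map { case (time, user, amount) => ((time, user), amount) }
        .sortByKey()

    // Alternative that keeps the original element shape.
    def sortByTimeThenUser2(rawData: RDD[(Long, String, Double)]) =
      rawData.sortBy { case (time, user, _) => (time, user) }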

Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
My goal is for rows to be partitioned according to timestamp bins (e.g. with each partition representing an even interval of time), and then ordered by timestamp *within* each partition. Ordering by user ID is not important. In my aggregate function, in the seqOp function, I am checking to verify t

Re: object in an rdd: serializable?

2014-10-17 Thread Duy Huynh
interesting. why does case class work for this? thanks boromir! On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas wrote: > make it a case class should work. > > On Thu, Oct 16, 2014 at 8:30 PM, ll wrote: > >> i got an exception complaining about serializable. the sample code is >> below... >>

Re: Spark in cluster and errors

2014-10-17 Thread Cheuk Lam
I wasn't the original person who posted the question, but this helped me! :) Thank you. I had a similar issue today when I tried to connect using the IP address (spark://:7077). I got it resolved by replacing it with the URL displayed in the Spark web console - in my case it is (spark://:7077).

Re: Gracefully stopping a Spark Streaming application

2014-10-17 Thread Sean Owen
You will have to write something in your app, like an endpoint or button, that triggers this shutdown code. Hi all, I have a Spark Streaming application running on a cluster, deployed with the spark-submit script. I was reading here that it's possible to gracefully shut down the application in orde
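
For reference, a sketch of the stop call the streaming guide describes (how it gets triggered, e.g. an HTTP endpoint, a signal handler, or a marker file, is up to the application):

    import org.apache.spark.streaming.StreamingContext

    // Finish processing the data already received before shutting down,
    // and stop the underlying SparkContext as well.
    def shutdown(ssc: StreamingContext): Unit =
      ssc.stop(stopSparkContext = true, stopGracefully = true)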

Re: What is akka-actor_2.10-2.2.3-shaded-protobuf.jar?

2014-10-17 Thread Chester @work
They should be the same except the package names are changed to avoid a protobuf conflict. You can use it just like other Akka jars. Chester Sent from my iPhone > On Oct 17, 2014, at 5:56 AM, "Ruebenacker, Oliver A" > wrote: > > > Hello, > > My SBT pulls in, among others, the followi

Attaching schema to RDD created from Parquet file

2014-10-17 Thread Akshat Aranya
Hi, How can I convert an RDD loaded from a Parquet file back into its original type? case class Person(name: String, age: Int) val rdd: RDD[Person] = ... rdd.saveAsParquetFile("people") val rdd2 = sqlContext.parquetFile("people") How can I map rdd2 back into an RDD[Person]? All of the examples just
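
One possible sketch (assuming the Spark 1.1 SchemaRDD/Row API, and that the columns come back in the same order the case class was written with):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.rdd.RDD

    case class Person(name: String, age: Int)

    // Map each generic Row back into the case class; positions and types
    // must match how the file was written.
    def loadPeople(sc: SparkContext): RDD[Person] = {
      val sqlContext = new SQLContext(sc)
      sqlContext.parquetFile("people").map(row => Person(row.getString(0), row.getInt(1)))
    }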

Re: bug with MapPartitions?

2014-10-17 Thread Akshat Aranya
There seems to be some problem with what gets captured in the closure that's passed into the mapPartitions (myfunc in your case). I've had a similar problem before: http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html Try

Spark/HIVE Insert Into values Error

2014-10-17 Thread arthur.hk.c...@gmail.com
Hi, When trying to insert records into Hive, I got an error. My Spark is 1.1.0 and Hive is 0.12.0. Any idea what would be wrong? Regards Arthur hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int); OK hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1); NoViable

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Andrew Or
Hm, it works for me. Are you sure you have provided the right jars? What happens if you pass in the `--verbose` flag? 2014-10-16 23:51 GMT-07:00 eric wong : > Hi, > > i using the comma separated style for submit multiple jar files in the > follow shell but it does not work: > > bin/spark-submit -

Re: how can I make the sliding window in Spark Streaming driven by data timestamp instead of absolute time

2014-10-17 Thread st553
I believe I have a similar question to this. I would like to process an offline event stream for testing/debugging. The stream is stored in a CSV file where each row in the file has a timestamp. I would like to feed this file into Spark Streaming and have the concept of time be driven by the timest

complexity of each action / transformation

2014-10-17 Thread ll
hello... is there a list that shows the complexity of each action/transformation? for example, what is the complexity of RDD.map()/filter() or RowMatrix.multiply() etc? that would be really helpful. thanks!

spark best practices / guidelines?

2014-10-17 Thread ll
i'm writing distributed algorithms for the first time and also using spark for the first time. my code looks extremely long and messy. the code runs, but it looks extremely inefficient. there gotta be a better way to write these distributed algorithms. is there a list of guidelines / best pra

Re: why do RDD's partitions migrate between worker nodes in different iterations

2014-10-17 Thread Sean Owen
The RDDs aren't changing; you are assigning new RDDs to rdd_0 and rdd_1. Operations like join and reduceByKey are making distinct, new partitions that don't correspond 1-1 with old partitions anyway. On Fri, Oct 17, 2014 at 5:32 AM, randylu wrote: > Dear all, > In my test programer, there are 3

Re: how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-17 Thread Sean Owen
It doesn't contain commons math3 since Spark does not depend on it. Its tests do, but tests are not built into the Spark assembly. On Thu, Oct 16, 2014 at 9:57 PM, Henry Hung wrote: > HI All, > > > > I try to build spark 1.1.0 using sbt with command: > > sbt/sbt -Dhadoop.version=2.2.0 -Pyarn asse

Re: bug with MapPartitions?

2014-10-17 Thread Sean Owen
You are making an RDD out of a local object, c. This means it has to be serialized to send it from the driver to workers. c is not serializable because it contains a BufferedInputStream. This isn't serializable and that makes sense as it's a stream from a local file. This is what the error is telli
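
A common workaround for this class of error, sketched here with a hypothetical local file (it is not specific to davidkl's code): construct the stream inside mapPartitions so it lives only on the executor and never has to be serialized with the closure.

    import java.io.{BufferedInputStream, FileInputStream}
    import org.apache.spark.rdd.RDD

    // Assumes /tmp/lookup.bin exists on every worker node.
    def tagWithLookupSize(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        val in = new BufferedInputStream(new FileInputStream("/tmp/lookup.bin"))
        val data = Stream.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
        in.close()
        // The stream never leaves this function, so nothing non-serializable
        // is shipped with the task.
        iter.map(line => s"$line (lookup bytes: ${data.length})")
      }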

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Marcelo Vanzin
On top of what Andrew said, you shouldn't need to manually add the mllib jar to your jobs; it's already included in the Spark assembly jar. On Thu, Oct 16, 2014 at 11:51 PM, eric wong wrote: > Hi, > > i using the comma separated style for submit multiple jar files in the > follow shell but it doe

Re: Designed behavior when master is unreachable.

2014-10-17 Thread Andrew Ash
I'm not sure what the design is, but I think the current behavior if the driver can't reach the master is to attempt to connect once and fail if that attempt fails. Is that what you're observing? (What version of Spark also?) On Fri, Oct 17, 2014 at 3:51 AM, preeze wrote: > Hi all, > > I am ru

Re: complexity of each action / transformation

2014-10-17 Thread Alec Ten Harmsel
On 10/17/2014 02:08 PM, ll wrote: > hello... is there a list that shows the complexity of each > action/transformation? for example, what is the complexity of > RDD.map()/filter() or RowMatrix.multiply() etc? that would be really > helpful. > > thanks! I'm pretty new to Spark, so I only know ab

Unable to connect to Spark thrift JDBC server with pluggable authentication

2014-10-17 Thread Jenny Zhao
Hi, if the Spark Thrift JDBC server is started in non-secure mode, it works fine. With a secured mode, in the case of pluggable authentication, I placed the authentication class configuration in conf/hive-site.xml: hive.server2.authentication = CUSTOM, hive.server2.custom.authentication.cla

Re: bug with MapPartitions?

2014-10-17 Thread davidkl
Hi Sowen, the constructor just reads from the stream and stores only the read data, but it's not keeping a reference to the stream itself, so I think this should work as it is. Akshat, will try your suggestion in any case, thanks!

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Reza Zadeh
Hi Jaonary, What are the numbers, i.e. number of points you're trying to do all-pairs on, and the dimension of each? Have you tried the new implementation of columnSimilarities in RowMatrix? Setting the threshold high enough (potentially above 1.0) might solve your problem, here is an example
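
The example is cut off above; a rough sketch of the call being suggested (assuming a Spark build where the thresholded RowMatrix.columnSimilarities is available, which in October 2014 meant a recent master/1.2 snapshot, and an existing SparkContext sc):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // columnSimilarities computes cosine similarities between *columns*, so lay
    // the vectors out as columns to get all-pairs similarities between points.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0)))
    val mat = new RowMatrix(rows)

    // A positive threshold enables the sampling-based (DIMSUM) implementation
    // and skips pairs unlikely to exceed it.
    val sims = mat.columnSimilarities(0.5)   // CoordinateMatrix of (i, j, similarity)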

Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-17 Thread Gerard Maas
Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces. We've been following the pattern: dstream.foreachRDD(rdd => val records = rdd.map(elem => record(elem)) targets.foreach(tar
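
For context, the two shapes being compared are roughly the following (record() and save() are stand-ins for the mapping and the Cassandra write in the original snippet):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def record(elem: String): String = elem.toUpperCase        // hypothetical mapping
    def save(rdd: RDD[String]): Unit = rdd.foreach(println)    // hypothetical sink

    // Pattern A: transform each RDD inside foreachRDD, as in the post.
    def patternA(dstream: DStream[String]): Unit =
      dstream.foreachRDD(rdd => save(rdd.map(record)))

    // Pattern B: transform the DStream itself and keep foreachRDD for output only.
    // Both schedule the same per-batch work; the difference is mainly where the
    // transformation is declared.
    def patternB(dstream: DStream[String]): Unit =
      dstream.map(record).foreachRDD(save)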

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi Reza, Thank you for the suggestion. The number of points is not that large, about 1000 for each set, so I have 1000x1000 pairs. But my similarity is obtained using metric learning to rank, and from Spark it is viewed as a black box. So my idea was just to distribute the computation of the 1000

Re: MLlib and pyspark features

2014-10-17 Thread Xiangrui Meng
Davies is porting features from mllib.feature.* to pyspark (https://github.com/apache/spark/pull/2819). I'm not aware of anyone who is working on porting mllib.evaluation.*. So feel free to create a JIRA and someone may be interested in working on it. -Xiangrui On Fri, Oct 17, 2014 at 12:58 AM, po

Re: MLlib linking error Mac OS X

2014-10-17 Thread Xiangrui Meng
Could you post the error message? -Xiangrui On Fri, Oct 17, 2014 at 2:00 AM, poiuytrez wrote: > Hello MLnick, > > Have you found a solution on how to install MLlib for Mac OS ? I have also > some trouble to install the dependencies. > > Best, > poiuytrez > > > > -- > View this message in context:

How to write a RDD into One Local Existing File?

2014-10-17 Thread Parthus
Hi, I have a Spark map-reduce task which requires me to write the final RDD to an existing local file (appending to this file). I tried two ways but neither works well: 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to local files, but I never made it work. Moreover, the result

Re: Spark on YARN driver memory allocation bug?

2014-10-17 Thread Boduo Li
It may also cause a problem when running in yarn-client mode. If --driver-memory is large, YARN has to allocate a lot of memory to the AM container, but the AM doesn't really need the memory. Boduo

mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread ll
hello... i'm looking at the source code for mllib.linalg.Vectors and it looks like it's a wrapper around Breeze with very small changes (mostly changing the names). i don't have any problem with using spark wrapper around Breeze or Breeze directly. i'm just curious to understand why this wrapper

Re: mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread Nicholas Chammas
I don't know the answer for sure, but just from an API perspective I'd guess that the Spark authors don't want to tie their API to Breeze. If at a future point they swap out a different implementation for Breeze, they don't have to change their public interface. MLlib's interface remains consistent

PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
https://gist.github.com/rjurney/fd5c0110fe7eb686afc9 Any way I try to join my data fails. I can't figure out what I'm doing wrong. -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread Sean Owen
Yes, I think that's the logic, but then what does toBreezeVector return if it is not based on Breeze? And this is called a lot by client code, since you often have to do something nontrivial to the vector. I suppose you can still have that thing return a Breeze vector and use it for no other purpose.

input split size

2014-10-17 Thread Larry Liu
What is the default input split size? How to change it?

How to disable input split

2014-10-17 Thread Larry Liu
Is it possible to disable input split if input is already small?

Re: How to write a RDD into One Local Existing File?

2014-10-17 Thread Sean Owen
You can save to a local file. What are you trying and what doesn't work? You can output one file by repartitioning to 1 partition but this is probably not a good idea as you are bottlenecking the output and some upstream computation by disabling parallelism. How about just combining the files on
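
A sketch of the two options mentioned here (for a plain RDD[String]; appending to an existing local file still has to happen outside Spark):

    import org.apache.spark.rdd.RDD

    // Option 1: collapse to one partition so a single part-00000 file is written.
    // All output then funnels through one task, so keep this to small results.
    def writeAsSingleFile(rdd: RDD[String], dir: String): Unit =
      rdd.coalesce(1).saveAsTextFile(dir)

    // Option 2: keep the parallel output and merge afterwards, e.g.
    //   cat <dir>/part-* >> /existing/local/file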

Re: input split size

2014-10-17 Thread Andrew Ash
When reading out of HDFS it's the HDFS block size. On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu wrote: > What is the default input split size? How to change it? >
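
For reference, a sketch of influencing the partition count when reading (minPartitions is only a lower bound; it cannot merge blocks into fewer, larger splits):

    import org.apache.spark.SparkContext

    def readLogs(sc: SparkContext) = {
      // Default: roughly one partition per HDFS block.
      val logs = sc.textFile("hdfs:///data/logs")
      // Ask for at least 64 partitions, e.g. to get more parallelism out of
      // a small number of large blocks.
      val logsFiner = sc.textFile("hdfs:///data/logs", minPartitions = 64)
      (logs.partitions.length, logsFiner.partitions.length)
    }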

Re: PySpark joins fail - please help

2014-10-17 Thread Davies Liu
Hey Russell, join() can only work with RDDs of pairs (key, value), such as rdd1: (k, v1) rdd2: (k, v2) rdd1.join(rdd2) will be (k, (v1, v2)) Spark SQL will be more useful for you, see http://spark.apache.org/docs/1.1.0/sql-programming-guide.html Davies On Fri, Oct 17, 2014 at 5:01 PM, Russel

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
Is that not exactly what I've done in j3/j4? The keys are identical strings. The k is the same; the value in both instances is an associative array. devices = devices.map(lambda x: (dh.index('id'), {'deviceid': x[dh.index('id')], 'foo': x[dh.index('foo')], 'bar': x[dh.index('bar')]})) bytes_in_out

Re: input split size

2014-10-17 Thread Larry Liu
Thanks, Andrew. What about reading out of local? On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash wrote: > When reading out of HDFS it's the HDFS block size. > > On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu wrote: > >> What is the default input split size? How to change it? >> > >

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
There was a bug in the devices line: dh.index('id') should have been x[dh.index('id')]. On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney wrote: > Is that not exactly what I've done in j3/j4? The keys are identical > strings.The k is the same, the value in both instances is an associative > arra

Re: How to disable input split

2014-10-17 Thread Davies Liu
You can call coalesce() to merge the small splits into bigger ones. Davies On Fri, Oct 17, 2014 at 5:35 PM, Larry Liu wrote: > Is it possible to disable input split if input is already small?
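
A sketch of what that looks like (paths and numbers are illustrative):

    import org.apache.spark.SparkContext

    def mergeSmallSplits(sc: SparkContext) = {
      val smallInput = sc.textFile("hdfs:///data/many-small-files")
      // With shuffle = false (the default), coalesce glues existing splits
      // together instead of moving data across the network.
      smallInput.coalesce(4)
    }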