Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
Please use the following snippet. I am still working on making it a generic vector, so that the input does not always have to be Vector[String]. But String will work fine for now. def main(args:Array[String]) { val sc = new SparkContext("local", "OutlierDetection") val dir = "hdfs://localhost:54310/tra

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
Hi, I have a doubt regarding the input to your algorithm. val model = OutlierWithAVFModel.outliers(data :RDD[Vector[String]], percent : Double, sc :SparkContext) Here our input data is an RDD[Vector[String]]. How can we create this RDD
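One way to build such an RDD from a plain text file is sketched below; the HDFS path and the comma delimiter are placeholders, and it assumes each record is one delimited line:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("avf-input"))

    // Split each delimited line into a Vector[String] of fields, giving an
    // RDD[Vector[String]] that can be passed to OutlierWithAVFModel.outliers.
    val data = sc.textFile("hdfs://localhost:54310/data.csv")
      .map(line => line.split(",").toVector)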

Re: Re: Problems with spark.locality.wait

2014-11-13 Thread MaChong
Hi Mridul, I have tried your method, and it works fine for this case. I set locality.process to 30 minutes, and locality.node and locality.rack to 3 seconds. The loading stage's RACK-level tasks were submitted after the 3-second wait, and only PROCESS-level tasks ran after the loading stage. Hi Kay, Our case is exactly
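For reference, a minimal sketch of how those waits might be set programmatically, using the per-level keys from the Spark configuration docs and the values above (illustration only, not the exact configuration used here; the master would normally come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Per-level locality waits, in milliseconds: wait a long time for a
    // process-local slot, but give up on node/rack locality after 3 seconds.
    val conf = new SparkConf()
      .setAppName("locality-wait-example")
      .set("spark.locality.wait.process", "1800000") // 30 minutes
      .set("spark.locality.wait.node", "3000")       // 3 seconds
      .set("spark.locality.wait.rack", "3000")       // 3 seconds
    val sc = new SparkContext(conf)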

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Shixiong Zhu
OK. I'll take it. Best Regards, Shixiong Zhu 2014-11-14 12:34 GMT+08:00 Reynold Xin : > That seems like a great idea. Can you submit a pull request? > > > On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu wrote: > >> If we put the `implicit` into "package object rdd" or "object rdd", when >> we wri

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
That seems like a great idea. Can you submit a pull request? On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu wrote: > If we put the `implicit` into "package object rdd" or "object rdd", when > we write `rdd.groupbykey()`, because rdd is an object of RDD, Scala > compiler will search `object rdd`(

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew
Hi Ashutosh, Please edit the README file. I think the following function call has changed now: model = OutlierWithAVFModel.outliers(master:String, input dir:String, percentage:Double) Regards, Meethu Mathew, Engineer, Flytxt

Re: Re: Problems with spark.locality.wait

2014-11-13 Thread MaChong
In the specific example stated, the user had two tasksets if I understood right ... the first taskset reads off the db (dfs in your example), does some filter, etc., and caches it. The second works off the cached data (which is, now, process-local locality level aware) to do map, group, etc. The ta

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in a small batching manner; the batch size is controlled by the spark.sql.inMemoryColumnarStorage.batchSize property. The default value for this in master and branch-1.2 is 10,000 rows per batch. On 11/14/14 1:27 AM, Sadhan Sood wrote: Thanks Cheng, Just one
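A minimal sketch of setting that property before caching, assuming an existing SparkContext `sc`; the table name is a placeholder:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Rows per in-memory columnar batch; 10,000 is the default mentioned above.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    sqlContext.cacheTable("my_table") // placeholder table name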

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Shixiong Zhu
If we put the `implicit` into "package object rdd" or "object rdd", when we write `rdd.groupbykey()`, because rdd is an object of RDD, Scala compiler will search `object rdd`(companion object) and `package object rdd`(package object) by default. We don't need to import them explicitly. Here is a po
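A simplified illustration of that lookup rule, using toy stand-in types rather than the real Spark classes: an implicit conversion placed in the package object of the type's own package is part of the type's implicit scope, so no import is needed at the call site.

    package demo {
      class MyRDD[T](val data: Seq[T])

      class PairFunctions[K, V](self: MyRDD[(K, V)]) {
        def groupByKey(): Map[K, Seq[V]] =
          self.data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
      }
    }

    package object demo {
      // Lives in demo's package object, so it is in MyRDD's implicit scope:
      // new demo.MyRDD(Seq(("a", 1), ("a", 2))).groupByKey() compiles from any
      // other package without an explicit import.
      implicit def toPairFunctions[K, V](r: demo.MyRDD[(K, V)]): demo.PairFunctions[K, V] =
        new demo.PairFunctions(r)
    }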

Re: Problems with spark.locality.wait

2014-11-13 Thread Nathan Kronenfeld
This sounds like it may be exactly the problem we've been having (and about which I recently posted on the user list). Is there any way of monitoring its attempts to wait, giving up, and trying another level? In general, I'm trying to figure out why we can have repeated identical jobs, the firs

Re: TimSort in 1.2

2014-11-13 Thread Reza Zadeh
See https://issues.apache.org/jira/browse/SPARK-2045 and https://issues.apache.org/jira/browse/SPARK-3280 On Thu, Nov 13, 2014 at 4:19 PM, Debasish Das wrote: > Hi, > > I am noticing the first step for Spark jobs does a TimSort in 1.2 > branch...and there is some time spent doing the TimSort...I

RE: Spark- How can I run MapReduce only on one partition in an RDD?

2014-11-13 Thread Ganelin, Ilya
For testing purposes you can take a sample of your data with take() and then transform that smaller dataset into an rdd. -Original Message- From: Tim Chou [timchou@gmail.com] Sent: Thursday, November 13, 2014 06:41 PM Eastern Standard Time To: Ganelin, Il
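A minimal sketch of that pattern; the dataset and the sample size are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sample-test"))

    // Stand-in for the real dataset.
    val bigRdd = sc.parallelize(1 to 1000000)

    // take() pulls a small sample to the driver; parallelize() turns that
    // sample back into an RDD so transformations can be tested on a subset.
    val sample = bigRdd.take(1000)
    val sampleRdd = sc.parallelize(sample)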

TimSort in 1.2

2014-11-13 Thread Debasish Das
Hi, I am noticing that the first step of Spark jobs does a TimSort in the 1.2 branch... and there is some time spent doing the TimSort... Is this assigning the RDD blocks to different nodes based on a sort order? Could someone please point me to a JIRA about this change so that I can read more about it? Th

Re: Problems with spark.locality.wait

2014-11-13 Thread Mridul Muralidharan
In the specific example stated, the user had two tasksets if I understood right ... the first taskset reads off the db (dfs in your example), does some filter, etc., and caches it. The second works off the cached data (which is, now, process-local locality level aware) to do map, group, etc. The ta

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi Mridul, In the case Shivaram and I saw, and based on my understanding of MaChong's description, I don't think that completely fixes the problem. To be very concrete, suppose your job has two tasks, t1 and t2, and they each have input data (in HDFS) on h1 and h2, respectively, and that h1 and

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
Ah right. This is because I'm running Java 8. This was fixed in SPARK-3329 (https://github.com/apache/spark/commit/2b7ab814f9bde65ebc57ebd04386e56c97f06f4a#diff-7bfd8d7c8cbb02aa0023e4c3497ee832). Consider back-porting it if other reasons arise, but this is specific to tests and to Java 8. On Thu,

Re: Problems with spark.locality.wait

2014-11-13 Thread Mridul Muralidharan
Instead of setting spark.locality.wait, try setting the individual locality waits specifically. Namely, set spark.locality.wait.PROCESS_LOCAL to a high value (so that process-local tasks are always scheduled in case the task set has process-local tasks). Set spark.locality.wait.NODE_LOCAL and spark.locality

Re: Join operator in PySpark

2014-11-13 Thread Josh Rosen
We should implement this using cogroup(); it will just require some tracking to map Python partitioners into dummy Java ones so that Java Spark’s cogroup() operator respects Python’s partitioning.  I’m sure that there are some other subtleties, particularly if we mix datasets that use different
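For reference, a simplified Scala sketch of the join-on-cogroup shape (roughly what Scala Spark's PairRDDFunctions.join already does; the PySpark work discussed here would add the partitioner bookkeeping on top of this idea):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("cogroup-join"))
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // Inner join expressed via cogroup: the per-key cross product of the two
    // value iterables is exactly the join output for that key.
    val joined = left.cogroup(right).flatMapValues { case (vs, ws) =>
      for (v <- vs; w <- ws) yield (v, w)
    }
    joined.collect().foreach(println) // (a,(1,x)), (b,(2,y))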

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Andrew Or
Yeah, this seems to be somewhat environment specific too. The same test has been passing here for a while: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.1-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/lastBuild/consoleFull 2014-11-13 11:26 GMT-08:00 Michael Armbrust : > Hey Sean, > > Than

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
Do people usually import o.a.spark.rdd._? Also, in order to maintain source and binary compatibility, we would need to keep both, right? On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu wrote: > I saw many people asked how to convert an RDD to a PairRDDFunctions. I would > like to ask a question

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Michael Armbrust
Hey Sean, Thanks for pointing this out. Looks like a bad test where we should be doing Set comparison instead of Array. Michael On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen wrote: > LICENSE and NOTICE are fine. Signature and checksum is fine. I > unzipped and built the plain source distribution
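An illustration of the kind of fix described (not the actual patch), with placeholder values: comparing the results as Sets removes the dependence on ordering.

    // The returned output has no guaranteed ordering, so compare it as a Set
    // rather than as an ordered Array.
    val expected = Array("spark.sql.key1=a", "spark.sql.key2=b") // placeholder values
    val actual   = Array("spark.sql.key2=b", "spark.sql.key1=a")

    assert(actual.toSet == expected.toSet) // passes regardless of order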

Re: Problems with spark.locality.wait

2014-11-13 Thread Kay Ousterhout
Hi, Shivaram and I stumbled across this problem a few weeks ago, and AFAIK there is no nice solution. We worked around it by avoiding jobs that have tasks with two locality levels. To fix this problem, we really need to fix the underlying problem in the scheduling code, which currentl

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
On Thu, Nov 13, 2014 at 10:58 AM, Patrick Wendell wrote: >> That's true, but note the code I posted activates a profile based on >> the lack of a property being set, which is why it works. Granted, I >> did not test that if you activate the other profile, the one with the >> property check will be

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
> That's true, but note the code I posted activates a profile based on > the lack of a property being set, which is why it works. Granted, I > did not test that if you activate the other profile, the one with the > property check will be disabled. Ah yeah, good call - so then we'd trigger 2.11-vs

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
Hey Patrick, On Thu, Nov 13, 2014 at 10:49 AM, Patrick Wendell wrote: > I'm not sure chaining activation works like that. At least in my > experience activation based on properties only works for properties > explicitly specified at the command line rather than declared > elsewhere in the pom. T

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
Hey Marcelo, I'm not sure chaining activation works like that. At least in my experience, activation based on properties only works for properties explicitly specified at the command line rather than declared elsewhere in the pom. https://gist.github.com/pwendell/6834223e68f254e6945e In any case,

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Ashutosh
Hi Anant, Please see the changes. https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala I have changed the input format to Vector of String. I think we can also make it generic. Lines 59 & 72: that counter will not affect parallelism, since it

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Sandy Ryza
https://github.com/apache/spark/pull/3239 addresses this On Thu, Nov 13, 2014 at 10:05 AM, Marcelo Vanzin wrote: > Hello there, > > So I just took a quick look at the pom and I see two problems with it. > > - "activatedByDefault" does not work like you think it does. It only > "activates by defa

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Marcelo Vanzin
Hello there, So I just took a quick look at the pom and I see two problems with it. - "activatedByDefault" does not work like you think it does. It only "activates by default" if you do not explicitly activate other profiles. So if you do "mvn package", scala-2.10 will be activated; but if you do

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng, Just one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again, or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote: > Currently there’s no way to cache the c

Join operator in PySpark

2014-11-13 Thread 夏俊鸾
Hi all, I have noticed that the “Join” operator has been implemented with the union and groupByKey operators instead of the cogroup operator in PySpark; this change will probably generate more shuffle stages. For example: rdd1 = sc.makeRDD(...).partitionBy(2) rdd2 = sc.makeRDD(...).partitionBy(2) r

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
LICENSE and NOTICE are fine. Signature and checksum are fine. I unzipped and built the plain source distribution, which built. However, I am seeing a consistent test failure with "mvn -DskipTests clean package; mvn test". In the Hive module: - SET commands semantics for a HiveContext *** FAILED ***

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Krishna Sankar
+1 1. Compiled on OS X 10.10 (Yosemite): mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package, 10:49 min 2. Tested pyspark, mllib 2.1. statistics OK 2.2. Linear/Ridge/Lasso Regression OK 2.3. Decision Tree, Naive Bayes OK 2.4. KMeans OK 2.5. rdd operations OK 2.6. recommendation OK 2.7.

Problems with spark.locality.wait

2014-11-13 Thread MaChong
Hi, We are running a time-sensitive application with 70 partitions of about 800MB each. The application first loads data from a database in a different cluster, then applies a filter, caches the filtered data, then applies a map and a reduce, and finally collects results. The application will be finishe
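For reference, a rough sketch of the pipeline described; the synthetic load stands in for the database read, and the filter, map, and reduce are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("time-sensitive-job"))

    // Stand-in for the ~70-partition load from the remote database; the real
    // job would use a JDBC or Hadoop input source here.
    val raw = sc.parallelize(1 to 70000, numSlices = 70)

    // Filter, cache the filtered data, then map + reduce, and collect/print.
    val filtered = raw.filter(_ % 2 == 0).cache()
    val result   = filtered.map(_.toLong).reduce(_ + _)
    println(result)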