Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Hector Yee
I'm getting a lot of lost tasks with this build on a large Mesos cluster. Happens with both hash and sort shuffles. 14/11/20 18:08:38 WARN TaskSetManager: Lost task 9.1 in stage 1.0 (TID 897, i-d4d6553a.inst.aws.airbnb.com): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=9, message= org.apache.s

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
I'm still seeing the fetch failed error and updated https://issues.apache.org/jira/browse/SPARK-3633 On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin wrote: > +1 (non-binding) > > . ran simple things on spark-shell > . ran jobs in yarn client & cluster modes, and standalone cluster mode > > On W

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
I think it is a race condition caused by netty deactivating a channel while it is active. Switched to nio and it works fine: --conf spark.shuffle.blockTransferService=nio On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee wrote: > I'm still seeing the fetch failed error and updated

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia wrote: > Hector, is this a comment on 1.1.1 or on the 1.2 preview? > > Matei > > > On Nov 20, 2014, at 11:39 AM, Hector Yee wrote: > >

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Hector Yee
ice property doesn't > exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem? > > Matei > > On Nov 20, 2014, at 11:50 AM, Hector Yee wrote: > > This is whatever was in http://people.apache.org/~andrewor14/spark-1.1.1-rc2/ > > On Thu, Nov 20, 20

Re: over 10000 commits!

2015-03-06 Thread Hector Yee
Congrats! On Thu, Mar 5, 2015 at 1:34 PM, shane knapp wrote: > WOOT! > > On Thu, Mar 5, 2015 at 1:26 PM, Reynold Xin wrote: > > > We reached a new milestone today. > > > > https://github.com/apache/spark > > > > > > 10,001 commits now. Congratulations to Xiangrui for making the 1th > > comm

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
I use Thrift and then base64 encode the binary and save it as text file lines that are snappy or gzip encoded. It makes it very easy to copy small chunks locally and play with subsets of the data and not have dependencies on HDFS / hadoop for server stuff for example. On Thu, Mar 26, 2015 at 2:5
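The storage layout described above (binary records, base64 encoded, one per line in a compressed text file) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the raw byte strings stand in for Thrift-serialized structs, and the helper names `write_records` and `read_records` and the file name `part-00000.gz` are made up for the example, not taken from the thread.

```python
import base64
import gzip
import os
import tempfile


def write_records(path, records):
    """Write each binary record as one base64 text line in a gzip file."""
    with gzip.open(path, "wt", encoding="ascii") as out:
        for blob in records:
            # base64 output never contains a newline, so one record == one line
            out.write(base64.b64encode(blob).decode("ascii") + "\n")


def read_records(path, limit=None):
    """Read back up to `limit` records (all of them when limit is None)."""
    records = []
    with gzip.open(path, "rt", encoding="ascii") as src:
        for i, line in enumerate(src):
            if limit is not None and i >= limit:
                break
            records.append(base64.b64decode(line.strip()))
    return records


# Stand-in payloads (assumption): real ones would be Thrift-serialized bytes.
sample = [b"\x00\x01binary\nrecord", b"second", b""]
path = os.path.join(tempfile.mkdtemp(), "part-00000.gz")
write_records(path, sample)
roundtrip = read_records(path)      # every record, decoded back to bytes
head = read_records(path, limit=2)  # a small chunk to play with locally
```

Because each line is self-contained, a subset of the data can be pulled out with ordinary line-oriented tools and decoded without any HDFS or Hadoop dependency, which appears to be the convenience the message is pointing at.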

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
files instead of file lines? > > > > *From:* Hector Yee [mailto:hector@gmail.com] > *Sent:* Wednesday, April 01, 2015 11:36 AM > *To:* Ulanov, Alexander > *Cc:* Evan R. Sparks; Stephen Boesch; dev@spark.apache.org > > *Subject:* Re: Storing large data for MLlib mach

Re: Spark/Mesos

2015-05-05 Thread Hector Yee
Speaking as a user of Spark on Mesos: yes, it appears that each app appears as a separate framework on the Mesos master. In fine-grained mode the number of executors goes up and down, vs. fixed in coarse. I would not run fine-grained mode on a large cluster as it can potentially spin up a lot of execu

Jcenter / bintray support for spark packages?

2015-06-10 Thread Hector Yee
Hi Spark devs, Is it possible to add jcenter or bintray support for Spark packages? I'm trying to add our artifact which is on jcenter https://bintray.com/airbnb/aerosolve but I noticed in Spark packages it only accepts Maven coordinates. -- Yee Yang Li Hector google.com/+HectorYee -

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
I would say for bigdata applications the most useful would be hierarchical k-means with back tracking and the ability to support k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling wrote: > Hi all, > > MLlib currently has one clustering algorithm implementation, KMeans. > It would

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee wrote: > > > I would say for bigdata applications the most useful would be > hierarchical > > k-means with back tracking and the ability to support k nearest > centroids. > > > > > > On Tue, Jul 8, 20

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
. more interesting problem here is choosing k at each level. Kernel > methods seem to be most promising. > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee wrote: > > > No idea, never looked it up. Always just implemented it as doing k-means > > again on each cluster. > >
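The approach mentioned in the quoted message, running k-means again on each cluster, might look roughly like the sketch below. It is pure Python with a plain Lloyd's k-means, not MLlib code; the function names, the dictionary-based tree, and the `depth` and `min_size` parameters are assumptions made for illustration. It uses a fixed k at every level and does not include back tracking or k-nearest-centroid lookup.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                dim = len(members[0])
                centroids[i] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                )
    return centroids, clusters


def hierarchical_kmeans(points, k, depth, min_size=4):
    """Build a tree by re-running k-means inside each cluster."""
    node = {"size": len(points), "children": []}
    if depth == 0 or len(points) < max(k, min_size):
        node["points"] = points  # leaf: too deep or too small to split
        return node
    centroids, clusters = kmeans(points, k)
    for centroid, members in zip(centroids, clusters):
        if members:
            child = hierarchical_kmeans(members, k, depth - 1, min_size)
            child["centroid"] = centroid
            node["children"].append(child)
    return node


def leaf_points(node):
    """Collect the points stored in all leaves under a node."""
    if not node["children"]:
        return list(node["points"])
    collected = []
    for child in node["children"]:
        collected.extend(leaf_points(child))
    return collected


# Synthetic 2-D data (assumption), generated deterministically.
rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(60)]
tree = hierarchical_kmeans(data, k=3, depth=2)
```

Choosing k per level, which the thread calls the more interesting problem, is deliberately left out of this sketch.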

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
> Is something like that you were thinking Hector? > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov > wrote: > > sure. more interesting problem here is choosing k at each level. Kernel > > methods seem to be most promising. > > > > > > On Tue, Jul 8, 201

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Hector Yee
ower > communication overheads than, say, shuffling data around that belongs to > one cluster or another. Something like that could work here as well. > > I'm not super-familiar with hierarchical K-Means so perhaps there's a more > efficient way to implement it, though. > >