[cross-posted from hdfs-dev@hadoop, common-dev@hadoop]
We'd like to invite you to the
Consensus based replication in Hadoop: A deep dive
event that we are happy to hold in our San Ramon office on July 15th at noon.
We'd like to accommodate as many people as possible, but I think are physicall
Is there a way to run the Spark driver program without starting the
monitoring web UI in-process? I didn't see any config setting around it.
Hi all,
I was testing an addition to Catalyst today (reimplementing a Hive UDF) and ran
into some odd failures in the test suite. In particular, it seems that what
most of these have in common is that an array is spuriously reversed somewhere.
For example, the stddev tests in the HiveCompatib
Yeah, if one were to replace the objective function in a decision tree with
minimizing the variance of the leaf nodes, it would be a hierarchical
clusterer.
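To make that concrete, here is a minimal sketch of that criterion (hypothetical helper functions, not MLlib code): score a candidate split by the size-weighted variance of the two resulting leaves and prefer the split with the lowest score.

// Variance of the target values in one leaf (assumes a non-empty leaf).
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => math.pow(x - mean, 2)).sum / xs.size
}

// Size-weighted within-leaf variance of a candidate split; lower is better.
def splitScore(left: Seq[Double], right: Seq[Double]): Double = {
  val n = (left.size + right.size).toDouble
  (left.size / n) * variance(left) + (right.size / n) * variance(right)
}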
On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks
wrote:
> If you're thinking along these lines, have a look at the DecisionTree
> implementation
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass o
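For concreteness, a rough usage sketch of that DecisionTree API (the signature shown is the 1.0-era one and may differ across MLlib versions; the maxDepth value is illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Variance
import org.apache.spark.rdd.RDD

// Train a regression tree using the variance impurity; `trainingData` is
// assumed to be an RDD[LabeledPoint] that has already been prepared.
def trainTree(trainingData: RDD[LabeledPoint]) =
  DecisionTree.train(trainingData, Algo.Regression, Variance, 5 /* maxDepth */)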
No, I was thinking more top-down:
assuming a distributed k-means system already exists, recursively apply
the k-means algorithm to data already partitioned by the previous level of
k-means.
I haven't been much of a fan of bottom-up approaches like HAC, mainly
because they assume there is already a dis
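A rough sketch of that top-down recursion, reusing MLlib's KMeans (the function, its parameters, and the stopping rule are illustrative, not an existing API):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Top-down hierarchical k-means: cluster, split the data by assigned centroid,
// then recurse on each partition. A real implementation would record the
// centroids at each level to build the tree and support backtracking.
def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
  if (depth <= 0 || data.count() < k) return
  val model = KMeans.train(data, k, 20 /* maxIterations */)
  for (cluster <- 0 until k) {
    val subset = data.filter(v => model.predict(v) == cluster)
    hierarchicalKMeans(subset, k, depth - 1)
  }
}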
K doesn't matter much; I've tried anything from 2^10 to 10^3, and the
performance doesn't change much as measured by precision @ K (see Table 1,
http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3
k-means did outperform 2^10 hierarchical SVD slightly in terms of the
metrics, 2^10
The scikit-learn implementation may be of interest:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
It's a bottom-up approach; the pair of clusters to merge is
chosen to minimize variance.
Their code is under a BSD license so it can be used as
Sure. The more interesting problem here is choosing k at each level. Kernel
methods seem to be the most promising.
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee wrote:
> No idea, never looked it up. Always just implemented it as doing k-means
> again on each cluster.
>
> FWIW standard k-means with euclide
No idea, never looked it up. I've always just implemented it by doing k-means
again on each cluster.
FWIW, standard k-means with Euclidean distance has problems too with some
dimensionality reduction methods. Swapping out the distance metric for
negative dot product or cosine may help.
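A tiny sketch of the metric swap being suggested (plain Scala helpers, not an MLlib API; assumes dense feature arrays of equal length):

// Cosine distance and negative dot product as drop-in alternatives to
// squared Euclidean distance when assigning points to centroids.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

def cosineDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - dot(a, b) / (norm(a) * norm(b))

def negativeDotDistance(a: Array[Double], b: Array[Double]): Double =
  -dot(a, b)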
Other more useful clust
Hector, could you share the references for hierarchical k-means? Thanks.
On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee wrote:
> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul
Having a common framework for clustering makes sense to me. While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and as many of the
APIs as is reasonable.
On Tue,
Thanks, Hector! Your feedback is useful.
On Tuesday, July 8, 2014, Hector Yee wrote:
> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling > wrot
Hi Anish,
Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
and racks that the input blocks reside on.
-Sandy
On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:
> Hi All
>
> My apologies for very basic question, do we have full suppo
I would say for big-data applications the most useful would be hierarchical
k-means with backtracking and the ability to support k nearest centroids.
On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling wrote:
> Hi all,
>
> MLlib currently has one clustering algorithm implementation, KMeans.
> It would
This blog post probably clarifies a lot of things:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
On Tue, Jul 8, 2014 at 12:24 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:
> Hi All
>
> I read somewhere that Cloudera announce
Hi All
My apologies for a very basic question: do we have full support for data locality
in Spark MapReduce?
Please suggest.
--
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh
Hi All
I read somewhere that Cloudera announced Hive on Spark, since AMPLab already
has Shark. I was trying to understand whether it is a rebranding of Shark or
whether they are planning something new altogether.
Please suggest.
--
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh
Hi all,
MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.
I recently submitted a PR [1] for a MiniBatch KMeans
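For reference, the existing KMeans entry point is used roughly like this (the input path and parameter values are illustrative; `sc` is an existing SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse one space-separated point per line, then run the current MLlib KMeans.
// A shared clustering framework would ideally keep an entry point this simple.
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10 /* k */, 20 /* maxIterations */)
val cost = model.computeCost(points) // within-cluster sum of squared distances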
Yes. For Shark, two modes, "shark.cache=tachyon" and "shark.cache=memory",
have the same ser/de overhead. Shark loads data from outside of the process
in Tachyon mode with the following benefits:
- In-memory data sharing across multiple Shark instances (i.e. stronger
isolation)
- Instant
Shark's in-memory format is already serialized (it's compressed and
column-based).
On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
wrote:
> You are ignoring serde costs :-)
>
> - Mridul
>
> On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson wrote:
> > Tachyon should only be marginally less per
You are ignoring serde costs :-)
- Mridul
On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson wrote:
> Tachyon should only be marginally less performant than memory_only, because
> we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> the data over a pipe from Tachyon; we can di
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write your
own version of loadLibSVMFile - a loader function which is very similar to
the existing one with a few characters
removed
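A minimal sketch of such a loader, assuming space-separated "label index:value" records whose indices are already zero-based (the function name and the numFeatures parameter are illustrative, not MLlib's actual helper):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Like loadLibSVMFile, but keeps the feature indices as-is instead of
// converting from the one-based indices of standard libsvm files.
def loadZeroBasedLibSVM(sc: SparkContext, path: String, numFeatures: Int): RDD[LabeledPoint] = {
  sc.textFile(path).map { line =>
    val tokens = line.trim.split(' ')
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { item =>
      val Array(i, v) = item.split(':')
      (i.toInt, v.toDouble)
    }.unzip
    LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
  }
}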
Tachyon should only be marginally less performant than memory_only, because
we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
the data over a pipe from Tachyon; we can directly read from the buffers in
the same way that Shark reads from its in-memory columnar format.
On T
This is the correct page: http://spark.apache.org/community.html
Cheers
On Jul 8, 2014, at 4:43 AM, Ted Yu wrote:
> See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
>
> Cheers
>
> On Jul 8, 2014, at 4:17 AM, Leon Zhang wrote:
>
>>
See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
Cheers
On Jul 8, 2014, at 4:17 AM, Leon Zhang wrote:
>
Hi, when I create a table, I can set the cache strategy using shark.cache.
I think "shark.cache=memory_only" means the data are managed by Spark and
live in the same JVM as the executor, while "shark.cache=tachyon"
means the data are managed by Tachyon, which is off-heap, and the data are not in
the s
On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
zhengbing...@huawei.com> wrote:
>
> 1) I download the imdb data from
> http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> LBFGS
> 2) I find the imdb data are zero-based-index data
>
Since the method is for parsing t