[cross-posted from hdfs-dev@hadoop, common-dev@hadoop]
We'd like to invite you to the
Consensus based replication in Hadoop: A deep dive
event that we are happy to hold in our San Ramon office on July 15th at noon.
We'd like to accommodate as many people as possible, but I think are physicall
Is there a way to run the Spark driver program without starting the
monitoring web UI in-process? I didn't see any config setting around it.
Hi all,
I was testing an addition to Catalyst today (reimplementing a Hive UDF) and ran
into some odd failures in the test suite. In particular, it seems that what
most of these have in common is that an array is spuriously reversed somewhere.
For example, the stddev tests in the HiveCompatib
Yeah, if one were to replace the objective function in a decision tree with
minimizing the variance of the leaf nodes, it would be a hierarchical
clusterer.
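To make that concrete, here is a minimal sketch of that criterion (hypothetical helper functions, not MLlib code): score a candidate split by the size-weighted variance of the two resulting leaves and prefer the split with the lowest score.

// Variance of the target values in one leaf (assumes a non-empty leaf).
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => math.pow(x - mean, 2)).sum / xs.size
}

// Size-weighted within-leaf variance of a candidate split; lower is better.
def splitScore(left: Seq[Double], right: Seq[Double]): Double = {
  val n = (left.size + right.size).toDouble
  (left.size / n) * variance(left) + (right.size / n) * variance(right)
}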
On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks
wrote:
> If you're thinking along these lines, have a look at the DecisionTree
> implementation
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass o
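For concreteness, a rough usage sketch of that DecisionTree API (the signature shown is the 1.0-era one and may differ across MLlib versions; the maxDepth value is illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Variance
import org.apache.spark.rdd.RDD

// Train a regression tree using the variance impurity; `trainingData` is
// assumed to be an RDD[LabeledPoint] that has already been prepared.
def trainTree(trainingData: RDD[LabeledPoint]) =
  DecisionTree.train(trainingData, Algo.Regression, Variance, 5 /* maxDepth */)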
No, I was thinking more top-down:
assuming a distributed k-means system already exists, recursively apply
the k-means algorithm to data already partitioned by the previous level of
k-means.
I haven't been much of a fan of bottom-up approaches like HAC, mainly
because they assume there is already a dis
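A rough sketch of that top-down recursion, reusing MLlib's KMeans (the function, its parameters, and the stopping rule are illustrative, not an existing API):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Top-down hierarchical k-means: cluster, split the data by assigned centroid,
// then recurse on each partition. A real implementation would record the
// centroids at each level to build the tree and support backtracking.
def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
  if (depth <= 0 || data.count() < k) return
  val model = KMeans.train(data, k, 20 /* maxIterations */)
  for (cluster <- 0 until k) {
    val subset = data.filter(v => model.predict(v) == cluster)
    hierarchicalKMeans(subset, k, depth - 1)
  }
}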
K doesn't matter much; I've tried anything from 2^10 to 10^3, and the
performance doesn't change much as measured by precision @ K (see Table 1,
http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3
k-means did outperform 2^10 hierarchical SVD slightly in terms of the
metrics, 2^10
The scikit-learn implementation may be of interest:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
It's a bottom-up approach; the pair of clusters to merge is
chosen to minimize variance.
Their code is under a BSD license so it can be used as
Sure. The more interesting problem here is choosing k at each level. Kernel
methods seem to be the most promising.
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee wrote:
> No idea, never looked it up. Always just implemented it as doing k-means
> again on each cluster.
>
> FWIW standard k-means with euclide
No idea, never looked it up. I've always just implemented it by doing k-means
again on each cluster.
FWIW, standard k-means with Euclidean distance has problems too with some
dimensionality reduction methods. Swapping out the distance metric for
negative dot product or cosine may help.
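A tiny sketch of the metric swap being suggested (plain Scala helpers, not an MLlib API; assumes dense feature arrays of equal length):

// Cosine distance and negative dot product as drop-in alternatives to
// squared Euclidean distance when assigning points to centroids.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

def cosineDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - dot(a, b) / (norm(a) * norm(b))

def negativeDotDistance(a: Array[Double], b: Array[Double]): Double =
  -dot(a, b)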
Other more useful clust
Hector, could you share the references for hierarchical k-means? Thanks.
On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee wrote:
> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul
Having a common framework for clustering makes sense to me. While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much code and as many of the
APIs as is reasonable.
On Tue,
Thanks, Hector! Your feedback is useful.
On Tuesday, July 8, 2014, Hector Yee wrote:
> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling > wrot
Hi Anish,
Spark, like MapReduce, makes an effort to schedule tasks on the same nodes
and racks that the input blocks reside on.
-Sandy
On Tue, Jul 8, 2014 at 12:27 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:
> Hi All
>
> My apologies for very basic question, do we have full suppo
I would say for big-data applications the most useful would be hierarchical
k-means with backtracking and the ability to support k nearest centroids.
On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling wrote:
> Hi all,
>
> MLlib currently has one clustering algorithm implementation, KMeans.
> It would
This blog post probably clarifies a lot of things:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
On Tue, Jul 8, 2014 at 12:24 PM, anishs...@yahoo.co.in <
anishs...@yahoo.co.in> wrote:
> Hi All
>
> I read somewhere that Cloudera announce
Hi All
My apologies for a very basic question: do we have full support for data locality
in Spark MapReduce?
Please suggest.
--
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh
Hi All
I read somewhere that Cloudera announced Hive on Spark, since AMPLab already
has Shark. I was trying to understand whether it is a rebranding of Shark or
whether they are planning something new altogether.
Please suggest.
--
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh
Hi all,
MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.
I recently submitted a PR [1] for a MiniBatch KMeans
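For reference, the existing KMeans entry point is used roughly like this (the input path and parameter values are illustrative; `sc` is an existing SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse one space-separated point per line, then run the current MLlib KMeans.
// A shared clustering framework would ideally keep an entry point this simple.
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10 /* k */, 20 /* maxIterations */)
val cost = model.computeCost(points) // within-cluster sum of squared distances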
Yes. For Shark, two modes, "shark.cache=tachyon" and "shark.cache=memory",
have the same ser/de overhead. Shark loads data from outside of the process
in Tachyon mode with the following benefits:
- In-memory data sharing across multiple Shark instances (i.e. stronger
isolation)
- Instant
Shark's in-memory format is already serialized (it's compressed and
column-based).
On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
wrote:
> You are ignoring serde costs :-)
>
> - Mridul
>
> On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson wrote:
> > Tachyon should only be marginally less per
You are ignoring serde costs :-)
- Mridul
On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson wrote:
> Tachyon should only be marginally less performant than memory_only, because
> we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> the data over a pipe from Tachyon; we can di
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write your
own version of loadLibSVMFile - a loader function which is very similar to
the existing one with a few characters
removed
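A minimal sketch of such a loader, assuming space-separated "label index:value" records whose indices are already zero-based (the function name and the numFeatures parameter are illustrative, not MLlib's actual helper):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Like loadLibSVMFile, but keeps the feature indices as-is instead of
// converting from the one-based indices of standard libsvm files.
def loadZeroBasedLibSVM(sc: SparkContext, path: String, numFeatures: Int): RDD[LabeledPoint] = {
  sc.textFile(path).map { line =>
    val tokens = line.trim.split(' ')
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { item =>
      val Array(i, v) = item.split(':')
      (i.toInt, v.toDouble)
    }.unzip
    LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
  }
}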
Tachyon should only be marginally less performant than memory_only, because
we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
the data over a pipe from Tachyon; we can directly read from the buffers in
the same way that Shark reads from its in-memory columnar format.
On T
This is the correct page: http://spark.apache.org/community.html
Cheers
On Jul 8, 2014, at 4:43 AM, Ted Yu wrote:
> See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
>
> Cheers
>
> On Jul 8, 2014, at 4:17 AM, Leon Zhang wrote:
>
>>
See http://spark.apache.org/news/spark-mailing-lists-moving-to-apache.html
Cheers
On Jul 8, 2014, at 4:17 AM, Leon Zhang wrote:
>
Hi, when I create a table, I can set the cache strategy using shark.cache.
I think "shark.cache=memory_only" means the data are managed by Spark and
live in the same JVM as the executor, while "shark.cache=tachyon"
means the data are managed by Tachyon, which is off-heap, and the data are not in
the s
On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
zhengbing...@huawei.com> wrote:
>
> 1) I download the imdb data from
> http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> LBFGS
> 2) I find the imdb data are zero-based-index data
>
Since the method is for parsing t