Hi all,

2014-04-03 9:04 GMT+02:00 Hai Qian <hq...@gopivotal.com>:
> Hi Maxence,
>
> We are very glad that you are enthusiastic about MADlib. Your proposal
> looks very nice. The algorithms that you proposed will be very useful, and
> can serve as a good start. After learning about the C++ abstraction layer
> and testing suite, you should be able to quickly grasp how to program for
> MADlib. We can fill in the details of the plan later. The conflict in your
> time schedule shouldn't be a problem.
>
> It would be nice if you could provide a little bit more introduction to
> these algorithms and their importance in the proposal.
>
> Overall it is very good. And thank you again
>
> Hai

As you requested, I've described the clustering algorithms on my GitHub [0].
I'm copying the text at the end of this email, just in case. For OPTICS, I
still need some time to grasp the details of how it works, so I've only
included a basic description of it.

Regards,
Maxence

[0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst

Details about the different clustering algorithms
=================================================

This file describes the k-means algorithm, which is currently the only
clustering algorithm implemented in MADlib, as well as the k-medoids and
OPTICS algorithms, which I plan to implement this summer as a GSoC project.
I'll also explain the DBSCAN algorithm, as OPTICS is an extension of it.

The goal of all these algorithms is to find the clusters in a set of points.
While finding an optimal clustering is NP-hard, these algorithms compute
good approximate solutions efficiently, even though the result is usually
not perfect.

I've tried not to add too many details, and I have taken some shortcuts, so
there may be some inaccuracies. I wanted to keep this readable (not sure
that goal was achieved, though), but if anyone wants more details, I'll
gladly add them.

k-means
-------

The k-means algorithm is the one already implemented in MADlib. It selects
*k* points to use as initial centroids (the centers of the clusters), either
randomly among the dataset or as set by the user. *k* must be defined by the
user; the algorithm is unable to guess the number of clusters. All the
points in the dataset are then assigned to the nearest centroid, thus making
*k* clusters. The next step is to compute the new centroids as the "mean" of
all the points in a cluster. The points are then reassigned to the new
centroids, and the process starts over again until the cluster assignments
no longer change.

This algorithm is usually pretty fast (even though it can be made to take
exponential time to converge). Still, it is easy to get a wrong result
because of local convergence, e.g. one cluster split in two parts by the
algorithm, or two clusters merged into one. It is also pretty sensitive to
outliers (points that don't obviously belong to any cluster), and the final
result depends greatly on the initial centroids. In general, this algorithm
doesn't work well with non-Euclidean distances. Another thing to note is
that k-means produces Voronoi-cell-shaped clusters, which is not always what
we want.
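To make this concrete, here is a minimal Python sketch of the assign/update
loop described above (my own illustration under simplifying assumptions,
not MADlib's in-database implementation)::

    import random

    def kmeans(points, k, dist):
        """Minimal k-means: `points` is a list of coordinate tuples,
        `dist` a distance function. Returns the final centroids and the
        cluster index assigned to every point."""
        centroids = random.sample(points, k)    # random initial centroids
        assignment = None
        while True:
            # Assignment step: attach every point to its nearest centroid.
            new = [min(range(k), key=lambda c: dist(p, centroids[c]))
                   for p in points]
            if new == assignment:               # no change: converged
                return centroids, assignment
            assignment = new
            # Update step: each centroid becomes the mean of its cluster.
            for c in range(k):
                cluster = [p for p, a in zip(points, assignment) if a == c]
                if cluster:                     # keep old centroid if empty
                    centroids[c] = tuple(sum(xs) / len(cluster)
                                         for xs in zip(*cluster))

For example, ``kmeans(data, 3, lambda a, b: sum((x - y) ** 2 for x, y in
zip(a, b)))`` splits ``data`` into three clusters using squared Euclidean
distance.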
k-medoids
---------

The k-medoids algorithm works the same way as k-means, save for a few
exceptions. With k-medoids, the centroids are always points of the dataset
(and are then called medoids). The new medoids are the points that minimize
the sum of pairwise dissimilarities within their cluster (in other words,
the point whose average distance to the other points of the cluster is
minimal). This makes the algorithm less sensitive to outliers than k-means.

This algorithm is computationally more intensive than k-means, but still
fast. As with k-means, it is possible to run the algorithm several times
with different initial medoids and to keep the best result (i.e. the one
that minimizes the sum of distances from the points to their medoids). A
rough sketch of the medoid update step is appended at the end of this
message.

DBSCAN
------

DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*) is a
clustering algorithm based on the density of clusters. This addresses
several limitations of k-means and k-medoids: it does not assign outliers to
any cluster, and it allows the detection of arbitrarily shaped clusters.
Moreover, it doesn't need to be told the number of clusters.

A point is called dense if enough other points are close enough to it, where
"enough other points" and "close enough" are defined by the parameters
*min_points* and *epsilon*. *min_points* therefore represents the minimum
number of points required to make a cluster, and *epsilon* is the maximum
distance a point can be from a cluster to be considered part of it.

For every unassigned point, count its *epsilon*-neighbours (not sure this
term exists, but I'll use it for convenience). If there are too few,
consider the point an outlier; otherwise, create a new cluster and put the
current point in it, along with its neighbours, the neighbours of its
neighbours, and so on (a sketch of this expansion loop is appended at the
end of this message).

The main problem with DBSCAN is that it doesn't work well for clusters with
very different densities.

OPTICS
------

OPTICS (*Ordering Points To Identify the Clustering Structure*) addresses
this issue. The first step in OPTICS is to order the points so that points
that are close to each other end up in nearby positions, and to store the
smallest reachability distance of each point (i.e. the distance at which the
point would be considered dense by DBSCAN). The next step is to determine
the clusters from the data obtained in the first step. This can be done
visually: plotting the ordered points on *x* and their smallest reachability
distance on *y*, the clusters are the valleys formed by the curve (a toy
illustration of this extraction is appended at the end of this message).
I'll add more details to this section when I understand better how all of
this works.

Bibliography
------------

http://en.wikipedia.org : where clustering algorithms are well detailed
http://www.stat.cmu.edu/~ryantibs/datamining/lectures/04-clus1-marked.pdf : introduction to k-means and k-medoids
http://www.vitavonni.de/blog/201211/2012110201-dbscan-and-optics-clustering.html : short introduction to DBSCAN and OPTICS
http://www.cs.uiuc.edu/class/fa05/cs591han/papers/ankerst99.pdf : original paper presenting OPTICS

--
Maxence Ahlouche
06 06 66 97 00
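P.S. Here are the rough Python sketches promised above. They are my own
illustrations under simplifying assumptions, not MADlib code or the
reference implementations of these algorithms. First, the medoid update step
that replaces the mean computation of k-means::

    def update_medoid(cluster, dist):
        """Pick the point of `cluster` that minimizes the sum of
        dissimilarities to the other points of the cluster."""
        return min(cluster, key=lambda m: sum(dist(m, p) for p in cluster))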
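Next, the DBSCAN expansion loop, in its naive quadratic form (a real
implementation would answer the neighbourhood queries with a spatial
index)::

    def dbscan(points, epsilon, min_points, dist):
        """Map each point index to a cluster id; outliers are left out."""
        labels = {}
        next_id = 0
        for i in range(len(points)):
            if i in labels:
                continue
            seeds = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= epsilon]
            if len(seeds) < min_points:
                continue                  # too few neighbours: outlier
            labels[i] = next_id
            # Grow the cluster: neighbours, neighbours of neighbours, ...
            queue = [j for j in seeds if j != i]
            while queue:
                j = queue.pop()
                if j in labels:
                    continue
                labels[j] = next_id
                neighbours = [m for m in range(len(points))
                              if dist(points[j], points[m]) <= epsilon]
                if len(neighbours) >= min_points:   # j is dense: expand
                    queue.extend(m for m in neighbours if m not in labels)
            next_id += 1
        return labels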
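Finally, a toy version of the "valleys" extraction from an OPTICS
reachability plot: cut the plot at a fixed threshold, and each run of
consecutive points below the cut forms one cluster. The extraction described
in the OPTICS paper is more subtle; this only illustrates the idea::

    def clusters_from_reachability(order, reachability, threshold):
        """`order` is the point ordering computed by OPTICS, `reachability`
        the reachability distance of each point; points above the cut are
        treated as noise separating the valleys."""
        clusters, current = [], []
        for i in order:
            if reachability[i] <= threshold:
                current.append(i)
            elif current:
                clusters.append(current)
                current = []
        if current:
            clusters.append(current)
        return clusters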