On Sunday, May 17, 2015 at 6:18:51 PM UTC+2, pegah Aliz wrote:
> Hello Everybody,
>
> This question seems simple, but I can't find the solution:
>
> I use scipy.cluster.hierarchy to do a hierarchical clustering on a set of
> points using the "cosine" similarity metric. As an example, I have:
>
>
> import numpy as np
> import scipy.cluster.hierarchy as hac
> import matplotlib.pyplot as plt
>
> Points = np.array([[ 0.         ,  0.23508573],
>                    [ 0.00754775 ,  0.26717266],
>                    [ 0.00595464 ,  0.27775905],
>                    [ 0.01220563 ,  0.23622067],
>                    [ 0.00542628 ,  0.14185873],
>                    [ 0.03078922 ,  0.11273108],
>                    [ 0.06707743 , -0.1061131 ],
>                    [ 0.04411757 , -0.10775407],
>                    [ 0.01349434 ,  0.00112159],
>                    [ 0.04066034 ,  0.11639591],
>                    [ 0.         ,  0.29046682],
>                    [ 0.07338036 ,  0.00609912],
>                    [ 0.01864988 ,  0.0316196 ],
>                    [ 0.         ,  0.07270636],
>                    [ 0.         ,  0.        ]])
>
>
> z = hac.linkage(Points, metric='cosine', method='complete')
> labels = hac.fcluster(z, 0.1, criterion="distance")
>
>
> plt.scatter(Points[:, 0], Points[:, 1], c=labels.astype(float))
> plt.show()
>
>
> Since I use the cosine metric, in some cases the dot product of two vectors
> can be negative or the norm of some vectors can be zero. It means the z
> output will have some negative or infinite elements, which is not valid for
> fcluster (as below):
>
> z =
> [[  0.00000000e+00   1.00000000e+01   0.00000000e+00   2.00000000e+00]
>  [  1.30000000e+01   1.50000000e+01   0.00000000e+00   3.00000000e+00]
>  [  8.00000000e+00   1.10000000e+01   4.26658708e-13   2.00000000e+00]
>  [  1.00000000e+00   2.00000000e+00   2.31748880e-05   2.00000000e+00]
>  [  3.00000000e+00   4.00000000e+00   8.96700489e-05   2.00000000e+00]
>  [  1.60000000e+01   1.80000000e+01   3.98805492e-04   5.00000000e+00]
>  [  1.90000000e+01   2.00000000e+01   1.33225099e-03   7.00000000e+00]
>  [  5.00000000e+00   9.00000000e+00   2.41120340e-03   2.00000000e+00]
>  [  6.00000000e+00   7.00000000e+00   1.52914684e-02   2.00000000e+00]
>  [  1.20000000e+01   2.20000000e+01   3.52441432e-02   3.00000000e+00]
>  [  2.10000000e+01   2.40000000e+01   1.38662986e-01   1.00000000e+01]
>  [  1.70000000e+01   2.30000000e+01   6.99056531e-01   4.00000000e+00]
>  [  2.50000000e+01   2.60000000e+01   1.92543748e+00   1.40000000e+01]
>  [ -1.00000000e+00   2.70000000e+01              inf   1.50000000e+01]]
>
> To solve this problem, I looked at the linkage() function, and inside it I
> needed to check the _hierarchy.linkage() method. I use the PyCharm editor,
> and when I asked for the "linkage" source code, it opened a Python file
> named "_hierarchy.py" in a directory like the following:
>
> .PyCharm40/system/python_stubs/-1247972723/scipy/cluster/_hierarchy.py
>
> This Python file doesn't contain the implementation of any of the
> functions it lists.
> I am wondering where the correct source of this function is, so that I can
> revise it and solve my problem.
> I would appreciate it if someone could help me find the correct source.
>
> Thanks and Regards
> Pegah
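For reference, here is a minimal sketch of the quantities involved, assuming the usual definition of cosine distance, 1 - u.v / (|u| |v|). The helper name cosine_distance is hypothetical and is not SciPy's implementation. Under this definition a negative dot product gives a distance between 1 and 2 rather than a negative value, while a zero-norm vector leaves the distance undefined (0/0); the helper returns NaN in that case. Among the rows shown in the question, only the last one, [0, 0], has zero norm.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - u.v / (|u| |v|); NaN if either vector has zero norm.

    Hypothetical helper for illustration, not SciPy's own code.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")  # undefined: 0/0
    return 1.0 - float(np.dot(u, v)) / denom

# Three rows copied from the Points array in the question.
p0 = [0.0, 0.23508573]          # row 0
p6 = [0.06707743, -0.1061131]   # row 6
p14 = [0.0, 0.0]                # row 14, the zero vector

d_neg = cosine_distance(p0, p6)    # negative dot product
d_zero = cosine_distance(p0, p14)  # zero-norm vector

# Mask of rows whose norm is nonzero, i.e. rows with a defined cosine distance.
rows = np.array([p0, p6, p14])
keep = np.linalg.norm(rows, axis=1) > 0

print(d_neg, d_zero, keep)
```

One possible workaround is to drop or separately handle the zero-norm rows (for example with a mask like keep) before calling linkage. Whether that is appropriate depends on what the zero vector means in the data.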
1 - The platform is Linux.

2 - After downloading the .tar file, building, and configuring, I run
    pycharm.sh.

3 - These are the contents of _hierarchy.py:

    # encoding: utf-8
    # module scipy.cluster._hierarchy
    # from /users/alizadeh/.local/lib/python2.7/site-packages/scipy/cluster/_hierarchy.so
    # by generator 1.136
    # no doc

    # imports
    import __builtin__ as __builtins__ # <module '__builtin__' (built-in)>
    import numpy as np # /usr/lib/pymodules/python2.7/numpy/__init__.pyc

    # functions

    def calculate_cluster_sizes(*args, **kwargs): # real signature unknown
        """
        Calculate the size of each cluster. The result is the fourth column of
        the linkage matrix.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix. The fourth column can be empty.
        cs : ndarray
            The array to store the sizes.
        n : ndarray
            The number of observations.
        """
        pass

    def cluster_dist(*args, **kwargs): # real signature unknown
        """
        Form flat clusters by distance criterion.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        T : ndarray
            The array to store the cluster numbers. The i'th observation
            belongs to cluster `T[i]`.
        cutoff : double
            Clusters are formed when distances are less than or equal to
            `cutoff`.
        n : int
            The number of observations.
        """
        pass

    def cluster_in(*args, **kwargs): # real signature unknown
        """
        Form flat clusters by inconsistent criterion.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        R : ndarray
            The inconsistent matrix.
        T : ndarray
            The array to store the cluster numbers. The i'th observation
            belongs to cluster `T[i]`.
        cutoff : double
            Clusters are formed when the inconsistent values are less than or
            equal to `cutoff`.
        n : int
            The number of observations.
        """
        pass

    def cluster_maxclust_dist(*args, **kwargs): # real signature unknown
        """
        Form flat clusters by maxclust criterion.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        T : ndarray
            The array to store the cluster numbers. The i'th observation
            belongs to cluster `T[i]`.
        n : int
            The number of observations.
        mc : int
            The maximum number of clusters.
        """
        pass

    def cluster_maxclust_monocrit(*args, **kwargs): # real signature unknown
        """
        Form flat clusters by maxclust_monocrit criterion.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        MC : ndarray
            The monotonic criterion array.
        T : ndarray
            The array to store the cluster numbers. The i'th observation
            belongs to cluster `T[i]`.
        n : int
            The number of observations.
        max_nc : int
            The maximum number of clusters.
        """
        pass

    def cluster_monocrit(*args, **kwargs): # real signature unknown
        """
        Form flat clusters by monocrit criterion.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        MC : ndarray
            The monotonic criterion array.
        T : ndarray
            The array to store the cluster numbers. The i'th observation
            belongs to cluster `T[i]`.
        cutoff : double
            Clusters are formed when the MC values are less than or equal to
            `cutoff`.
        n : int
            The number of observations.
        """
        pass

    def cophenetic_distances(*args, **kwargs): # real signature unknown
        """
        Calculate the cophenetic distances between each observation

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        d : ndarray
            The condensed matrix to store the cophenetic distances.
        n : int
            The number of observations.
        """
        pass

    def get_max_dist_for_each_cluster(*args, **kwargs): # real signature unknown
        """
        Get the maximum inconsistency coefficient for each non-singleton
        cluster.

        Parameters
        ----------
        Z : ndarray
            The linkage matrix.
        MD : ndarray
            The array to store the result.
        n : int
            The number of observations.
        """
        pass

4 - Because hierarchy.py contains a line like this:

    _hierarchy.linkage(dm, Z, n, int(_cpy_non_euclid_methods[method]))

    and the value of Z is different before and after that call.
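The stub header above says the module was generated "from .../scipy/cluster/_hierarchy.so", which suggests that _hierarchy is a compiled extension and that the PyCharm file only mirrors its names and docstrings. Below is a minimal sketch for checking where the imported module actually lives. It assumes SciPy is installed and that the private module is still named scipy.cluster._hierarchy (private names can change between versions). The helpers locate and is_compiled_extension are hypothetical names introduced here for illustration.

```python
import importlib.machinery
import importlib.util

def locate(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package (here: scipy) is not installed.
        return None
    return None if spec is None else spec.origin

def is_compiled_extension(path):
    """True if the path ends with one of this interpreter's extension suffixes."""
    return any(path.endswith(suffix)
               for suffix in importlib.machinery.EXTENSION_SUFFIXES)

# Assumption: the private module name matches the one in the stub header.
origin = locate("scipy.cluster._hierarchy")
if origin is None:
    print("scipy.cluster._hierarchy was not found in this environment")
else:
    print(origin, "compiled extension:", is_compiled_extension(origin))
```

If the reported file is a compiled extension, there is no Python source for it inside site-packages. To read or modify the implementation, one would need the SciPy source distribution rather than the installed package or the IDE stub.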