Re: [R] clustering in R

Joris Meys Fri, 28 May 2010 16:17:09 -0700

Ah OK, I didn't get your question then.

A dist object is actually a vector of numbers with a couple of attributes
(such as Size). You can't just cut out values like that: plain subsetting
drops those attributes, and the hclust function needs a complete distance
object to do its calculations.
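
To make that concrete, here is a minimal sketch on made-up random data (the objects q, f and g below are only illustrative, not the poster's data). It shows the attributes a dist object carries and what is left after subsetting it:

```r
set.seed(1)                       # made-up data, for illustration only
q <- matrix(rnorm(20), nrow = 5)  # 5 rows, 4 columns
f <- dist(t(q))                   # distances between the 4 columns

class(f)          # "dist"
attr(f, "Size")   # 4
length(f)         # 6 = 4 * 3 / 2 pairwise distances

g <- f[f < 2]     # plain subsetting returns a bare numeric vector
class(g)          # "numeric"
attr(g, "Size")   # NULL, which is what trips up hclust()
```

Since hclust reads the Size attribute to find out how many objects there are, a vector without it triggers the "argument is of length zero" error.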


The shortcut is easy: just do f <- 2*f/max(f), and no value exceeds 2 (note
that this rescales the distances rather than removing any).
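
A small caution when writing such a rescaling: R evaluates / and * from left to right, so the order of the terms matters. The sketch below uses made-up one-dimensional points (an assumption for illustration) to compare two spellings:

```r
f <- dist(matrix(c(0, 3, 7, 12), ncol = 1))  # made-up 1-D points

a <- f / 2 * max(f)   # parsed as (f / 2) * max(f): the values grow
b <- 2 * f / max(f)   # rescaled so that the largest distance is 2

max(a)   # 72
max(b)   # 2
class(b) # still "dist", so it can go straight into hclust()
```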

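Otherwise this function could build a dist object from your list of pairs:

to.dist <- function(x){
  # accept a matrix (such as distB) as well as a data frame
  x <- as.data.frame(x)
  # all item labels that occur in either of the first two columns
  x.names <- sort(unique(c(x[[1]], x[[2]])))
  n <- length(x.names)
  # pairs that are not listed in x keep the initial value 0
  x.dist <- matrix(0, n, n)
  dimnames(x.dist) <- list(x.names, x.names)
  # fill in both (i,j) and (j,i) so that the matrix is symmetric
  x.ind <- rbind(cbind(match(x[[1]], x.names), match(x[[2]], x.names)),
                 cbind(match(x[[2]], x.names), match(x[[1]], x.names)))
  x.dist[x.ind] <- rep(x[[3]], 2)
  x.dist <- as.dist(x.dist)
  return(x.dist)
}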

 d <- to.dist(distB)
 hclust(d)
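
One caveat with a zero-initialised matrix is that pairs absent from the list end up at distance 0, so single linkage would merge them first. A possible variant, sketched here under my own assumptions, fills the missing pairs with a value well above the cutoff and then cuts the tree at the cutoff. The name edges_to_dist, the fill argument, and the toy edge list are all hypothetical:

```r
# Hypothetical variant: unlisted pairs get `fill` instead of 0
edges_to_dist <- function(x, fill) {
  x <- as.data.frame(x)                    # works for matrices too
  ids <- sort(unique(c(x[[1]], x[[2]])))   # all item labels
  n <- length(ids)
  m <- matrix(fill, n, n,
              dimnames = list(as.character(ids), as.character(ids)))
  diag(m) <- 0                             # an item is at distance 0 from itself
  i <- match(x[[1]], ids)
  j <- match(x[[2]], ids)
  m[cbind(i, j)] <- x[[3]]                 # fill both triangles
  m[cbind(j, i)] <- x[[3]]
  as.dist(m)
}

cutoff <- 2
# made-up edge list: item, item, distance (all below the cutoff)
edges <- rbind(c(1, 2, 0.5), c(2, 3, 0.8), c(4, 5, 0.6))

d <- edges_to_dist(edges, fill = 10 * cutoff)
hc <- hclust(d, method = "single")
groups <- cutree(hc, h = cutoff)   # cut below the fill value
groups   # items 1, 2 and 3 share one cluster; 4 and 5 form another
```

Cutting the single-linkage tree at the cutoff gives the "friends of friends" groups: two items share a cluster whenever a chain of listed pairs connects them.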


Cheers
Joris



On Sat, May 29, 2010 at 12:04 AM, Ayesha Khan
<[email protected]> wrote:

> Yes Joris. I did try that and it does produce the results. I am now
> wondering why I wanted a matrix-like structure in the first place. However,
> I do want 'f' to contain only values less than 2. But when I try to get rid
> of values greater than 2 by doing N <- f[f<2], the structure of f is
> disrupted and hclust doesn't recognize it any more, because obviously the
> data structure changes with that. Any ideas on how to do that?
>
>
> On Fri, May 28, 2010 at 4:13 PM, Joris Meys <[email protected]> wrote:
>
>> errr, forget about the output of dput(q), but keep it in mind for next
>> time.
>>
>> f = dist(t(q))
>> hclust(f,method="single")
>>
>> it's as simple as that.
>> Cheers
>> Joris
>>
>>
>> On Fri, May 28, 2010 at 10:39 PM, Ayesha Khan <
>> [email protected]> wrote:
>>
>>> v <- dput(x,"sampledata.txt")
>>> dim(v)
>>> q <- v[1:10,1:10]
>>> f =as.matrix(dist(t(q)))
>>>
>>> distB=NULL
>>> for(k in 1:(nrow(f)-1)) for( m in (k+1):ncol(f)) {
>>> if(f[k,m] <2) distB=rbind(distB,c(k,m,f[k,m]))
>>> }
>>> #now distB looks like this
>>>
>>> > distB
>>>       [,1] [,2]      [,3]
>>>  [1,]    1    2  1.6275568
>>>  [2,]    1    3  0.5252058
>>>  [3,]    1    4  0.7323116
>>>  [4,]    1    5  1.9966001
>>>  [5,]    1    6  1.6664110
>>>  [6,]    1    7  1.0800540
>>>  [7,]    1    8  1.8698925
>>>  [8,]    1   10  0.5161808
>>>  [9,]    2    3  1.7325811
>>> [10,]    2    5  0.8267843
>>> [11,]    2    6  0.5963280
>>> [12,]    2    7  0.8787230
>>>
>>> # Now from this output, I want to cluster all 1's, friends of 1 and
>>> friends of friends of 1 in one cluster. The same goes for 2, 3 and so on.
>>> But when I do that using hclust, I get the following error. I think what
>>> I need to do is convert my current matrix somehow into a format that would
>>> be accepted by the hclust function, but I don't know how to achieve that.
>>>
>>> distclust <- hclust(distB,method="single")
>>>
>>> Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
>>>   argument is of length zero
>>>
>>> P.S.: Please let me know if this makes things clearer, because I don't
>>> know how looking at the original data set would help: the matrix under
>>> consideration right now is the distance matrix, and the question is how it
>>> can be altered. I have tried as.dist; it doesn't work because my matrix, as
>>> I mentioned earlier, is not a square matrix.
>>>
>>> On Fri, May 28, 2010 at 2:37 PM, Tal Galili <[email protected]> wrote:
>>>
>>>> Hi Ayesha,
>>>> I wish to help you, but without a simple self contained example that
>>>> shows your issue, I will not be able to help.
>>>> Try using the ?dput command to create some simple data, and let us see
>>>> what you are doing.
>>>>
>>>> Best,
>>>> Tal
>>>> ----------------Contact Details:----------------
>>>> Contact me: [email protected] |  972-52-7275845
>>>> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew)
>>>> | www.r-statistics.com (English)
>>>> ------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>   On Fri, May 28, 2010 at 9:04 PM, Ayesha Khan <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Tal & Joris!
>>>>> I created my distance matrix distA by using the dist() function in R
>>>>> and manipulating its output in order to get a matrix:
>>>>> distA =as.matrix(dist(t(x2))) # x2 being my original dataset
>>>>> as according to the documentation on dist():
>>>>>
>>>>> For the default method, a "dist" object, or a matrix (of distances) or
>>>>> an object which can be coerced to such a matrix using as.matrix()
>>>>>
>>>>>   On Fri, May 28, 2010 at 6:34 AM, Joris Meys <[email protected]> wrote:
>>>>>
>>>>>> As Tal said.
>>>>>>
>>>>>> Next to that, I read that column1 (and column2?) are supposed to be
>>>>>> seen as factors, not as numerical variables. Did you take that into
>>>>>> account somehow?
>>>>>>
>>>>>> It's easy to reproduce the error:
>>>>>> > n <- NULL
>>>>>> > if(n<2)print("This is OK")
>>>>>> Error in if (n < 2) print("This is OK") : argument is of length zero
>>>>>>
>>>>>> In the hclust code, you find the following line:
>>>>>> n <- as.integer(attr(d, "Size"))
>>>>>> where d is the distance object passed to the hclust function. Looking
>>>>>> at the error you get, this means that the Size attribute of your
>>>>>> distance object is NULL, which tells me that distA is not a dist object.
>>>>>>
>>>>>> > A <- matrix(1:4,ncol=2)
>>>>>> > A
>>>>>>      [,1] [,2]
>>>>>> [1,]    1    3
>>>>>> [2,]    2    4
>>>>>> > hclust(A,method="single")
>>>>>>
>>>>>> Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
>>>>>>   argument is of length zero
>>>>>>
>>>>>> Did you actually put in a distance object? See also ?dist or ?as.dist.
>>>>>>
>>>>>> Cheers
>>>>>> Joris
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Fri, May 28, 2010 at 1:41 AM, Ayesha Khan <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I have a matrix with the following dimensions
>>>>>>> 136   3
>>>>>>>
>>>>>>> and it looks something like
>>>>>>>
>>>>>>>         [,1] [,2]     [,3]
>>>>>>>  [1,]  402  675 1.802758
>>>>>>>  [2,]  402  696 1.938902
>>>>>>>  [3,]  402  699 1.994253
>>>>>>>  [4,]  402  945 1.898619
>>>>>>>  [5,]  424  470 1.812857
>>>>>>>  [6,]  424  905 1.816345
>>>>>>>  [7,]  470  905 1.871252
>>>>>>>  [8,]  504  780 1.958191
>>>>>>>  [9,]  504  848 1.997111
>>>>>>>  ...
>>>>>>>
>>>>>>> So you get the idea. I want to group similar items in one group/cluster
>>>>>>> following the "friends of friends" approach. I tried doing
>>>>>>>
>>>>>>> distclust <- hclust(distA,method="single")
>>>>>>>
>>>>>>> However, I got the following error:
>>>>>>>
>>>>>>> Error in if (n < 2) stop("must have n >= 2 objects to cluster") :
>>>>>>>   argument is of length zero
>>>>>>>
>>>>>>> which probably means there's something wrong with my input here. Is
>>>>>>> there another way of doing this kind of clustering without getting into
>>>>>>> all the looping and ifelse etc.? Basically, if 402 is close to 675, 696
>>>>>>> and 699, and they thus fall in cluster A, then all items close to 675,
>>>>>>> 696 and 699 should also fall into the same cluster A, following a
>>>>>>> friends-of-friends strategy. Any help would be highly appreciated.
>>>>>>>
>>>>>>> --
>>>>>>> Ayesha Khan
>>>>>>>
>>>>>>> MS Bioengineering
>>>>>>> Dept. of Bioengineering
>>>>>>> Rice University, TX
>>>>>>>
>>>>>>>        [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> [email protected] mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Joris Meys
>>>>>> Statistical Consultant
>>>>>>
>>>>>> Ghent University
>>>>>> Faculty of Bioscience Engineering
>>>>>> Department of Applied mathematics, biometrics and process control
>>>>>>
>>>>>> Coupure Links 653
>>>>>> B-9000 Gent
>>>>>>
>>>>>> tel : +32 9 264 59 87
>>>>>> [email protected]
>>>>>> -------------------------------
>>>>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>



-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
[email protected]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

