Re: [R] Converting unique strings to unique numbers

Hervé Pagès Fri, 29 May 2015 16:23:53 -0700

Hi Bill,

On 05/29/2015 01:48 PM, William Dunlap wrote:

I'm not sure why which particular ID gets assigned to each string would
matter but maybe I'm missing something. What really matters is that each
string receives a unique ID. match(x, x) does that.


I think each row of the OP's dataset represented an individual (column 2)
followed by its mother and father (columns 3 and 4).  I assume that the
marker "0" means that a parent is not in the dataset.  If you match against
the strings in column 2 only, in their original order, then the resulting
numbers give the row number of an individual,


Note that the code I gave happens to do exactly that (assuming that
column 2 contains no duplicates, but your code is also relying on that
assumption in order to have the ids match the row numbers).

We're discussing the merit of match(x, x) versus match(x, unique(x)).
All I'm trying to say is that the unique(x) step (which doubles the cost
of the whole operation, because it also uses hashing, like match() does)
is generally not needed. It doesn't seem to be needed in Kate's use
case.
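
For illustration, here is a small sketch of the difference on a made-up vector (not Kate's data). Both calls give equal strings equal ids. match(x, x) uses the position of the first occurrence, while match(x, unique(x)) hands out consecutive ids in order of first appearance. When x has no duplicates the two results are the same.

```r
# Toy vector, invented for illustration only.
x <- c("b", "a", "b", "c", "a")

ids_self <- match(x, x)          # id = position of first occurrence in x
ids_uniq <- match(x, unique(x))  # id = rank by order of first appearance

ids_self
# [1] 1 2 1 4 2
ids_uniq
# [1] 1 2 1 3 2

# With no duplicates, both reduce to seq_along(y).
y <- c("BYX859", "BYX894", "BYX862")
identical(match(y, y), match(y, unique(y)))
# [1] TRUE
```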

H.

making it straightforward to look up
information regarding the ancestors of an individual.  Hence the choice of
numeric ID's may be important.
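
A minimal sketch of that kind of lookup, assuming (as read above) that columns 3 and 4 hold the mother and the father. The data frame and its column names (id, mother, father) are hypothetical, and only four of the OP's rows are used.

```r
# Hypothetical column names; rows taken from the pedigree in the question.
ped <- data.frame(
  id     = c("BYX859", "BYX894", "BYX862", "BYX863"),
  mother = c("0", "0", "BYX894", "BYX894"),
  father = c("0", "0", "BYX859", "BYX859"),
  stringsAsFactors = FALSE
)

# Match parents against the id column only: a result k > 0 is the row
# number of that parent, and 0 means the parent is not in the dataset.
mother_row <- match(ped$mother, ped$id, nomatch = 0L)
father_row <- match(ped$father, ped$id, nomatch = 0L)

mother_row
# [1] 0 0 2 2
father_row
# [1] 0 0 1 1

# Full record of the mother of the individual in row 3.
ped[mother_row[3], ]
```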

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, May 29, 2015 at 1:29 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

    Hi Sarah,

    On 05/29/2015 12:04 PM, Sarah Goslee wrote:

        On Fri, May 29, 2015 at 2:16 PM, Hervé Pagès
        <hpa...@fredhutch.org> wrote:

            Hi Kate,

            I found that matching the character vector to itself is a very
            effective way to do this:

                x <- c("a", "bunch", "of", "strings", "whose", "exact",
                       "content", "is", "of", "little", "interest")
                ids <- match(x, x)
                ids
                # [1]  1  2  3  4  5  6  7  8  3 10 11

            By using this trick, many manipulations on character vectors can
            be replaced by manipulations on integer vectors, which are
            sometimes way more efficient.
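
As a hedged illustration of what working on the integer ids can look like (these follow-up steps are my own example, not part of the original post): the ids index back into the vector, repeats can be spotted by comparing each id with its position, and occurrences can be counted with integer-only code.

```r
x <- c("a", "bunch", "of", "strings", "whose", "exact", "content",
       "is", "of", "little", "interest")
ids <- match(x, x)

# The ids index back into x, so the strings can always be recovered.
identical(x[ids], x)
# [1] TRUE

# An element is a repeat exactly when its id differs from its own position.
is_repeat <- ids != seq_along(x)
identical(is_repeat, duplicated(x))
# [1] TRUE

# Count occurrences per id; "of" has id 3 and appears twice.
counts <- tabulate(ids, nbins = length(x))
counts[3]
# [1] 2
```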


        Hm. I hadn't thought of that approach - I use the
        as.numeric(factor(...)) approach.

        So I was curious, and compared the two:


        set.seed(43)
        x <- sample(letters, 10000, replace=TRUE)

        system.time({
            for(i in seq_len(20000)) {
            ids1 <- match(x, x)
        }})

        #   user  system elapsed
        #  9.657   0.000   9.657

        system.time({
            for(i in seq_len(20000)) {
            ids2 <- as.numeric(factor(x, levels=letters))
        }})

        #   user  system elapsed
        #   6.16    0.00    6.16

        Using factor() is faster.


    That's an unfair comparison, because you already know what the levels
    are so you can supply them to your call to factor(). Most of the time
    you don't know what the levels are so either you just do factor(x) and
    let the factor() constructor compute the levels for you, or you compute
    them yourself upfront with something like factor(x, levels=unique(x)).

       library(microbenchmark)

       microbenchmark(
         {ids1 <- match(x, x)},
         {ids2 <- as.integer(factor(x, levels=letters))},
         {ids3 <- as.integer(factor(x))},
         {ids4 <- as.integer(factor(x, levels=unique(x)))}
       )
       Unit: microseconds
                                                      expr     min       lq     mean   median      uq     max neval
                                   { ids1 <- match(x, x) } 245.979 262.2390 267.3210 264.4845 268.348 293.894   100
       { ids2 <- as.integer(factor(x, levels = letters)) } 214.115 219.2320 226.9913 220.9870 226.147 314.875   100
                         { ids3 <- as.integer(factor(x)) } 380.782 388.7295 402.2242 394.7165 412.075 481.410   100
     { ids4 <- as.integer(factor(x, levels = unique(x))) } 332.250 342.6630 349.7405 345.3090 353.162 383.002   100
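
The factor() variants timed above also differ in the ids they produce, which a toy vector (invented here, not from the thread) can show. With no levels argument, factor(x) sorts the levels; with levels = unique(x) the ids follow the order of first appearance and so agree with match(x, unique(x)).

```r
# Toy vector, invented for illustration only.
x <- c("m", "x", "b", "m", "b")

ids_sorted <- as.integer(factor(x))                      # levels: b, m, x
ids_first  <- as.integer(factor(x, levels = unique(x)))  # levels: m, x, b

ids_sorted
# [1] 2 3 1 2 1
ids_first
# [1] 1 2 3 1 3

identical(ids_first, match(x, unique(x)))
# [1] TRUE
```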

        More importantly, using factor() lets you
        set the order of the indices in an expected fashion, where match()
        assigns them in the order of occurrence.

        head(data.frame(x, ids1, ids2))

            x ids1 ids2
        1 m    1   13
        2 x    2   24
        3 b    3    2
        4 s    4   19
        5 i    5    9
        6 o    6   15

        In a problem like Kate's where there are several columns for which
        the same ordering of indices is desired, that becomes really
        important.


    I'm not sure why which particular ID gets assigned to each string would
    matter but maybe I'm missing something. What really matters is that each
    string receives a unique ID. match(x, x) does that.

    In Kate's problem, where the strings are in more than one column,
    and you want the ID to be unique across the columns, you need to do
    match(x, x) where 'x' contains the strings from all the columns
    that you want to replace:

       m <- matrix(c(
         "X0001", "BYX859",        0,        0,  2,  1, "BYX859",
         "X0001", "BYX894",        0,        0,  1,  1, "BYX894",
         "X0001", "BYX862", "BYX894", "BYX859",  2,  2, "BYX862",
         "X0001", "BYX863", "BYX894", "BYX859",  2,  2, "BYX863",
         "X0001", "BYX864", "BYX894", "BYX859",  2,  2, "BYX864",
         "X0001", "BYX865", "BYX894", "BYX859",  2,  2, "BYX865"
       ), ncol=7, byrow=TRUE)

       x <- m[ , 2:4]
       id <- match(x, x, nomatch=0, incomparables="0")
       m[ , 2:4] <- id

    No factor needed. No loop needed. ;-)
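
To make the roles of the two extra arguments explicit, here is a small sketch on a made-up vector. incomparables = "0" stops "0" from matching anything, and nomatch = 0 is the value such entries then receive. One thing to keep in mind, shown at the end: writing integer ids back into a character matrix stores them as strings, so an as.integer() step may be wanted afterwards.

```r
# Toy vector, invented for illustration only.
x <- c("A", "0", "B", "A", "0")

match(x, x)
# [1] 1 2 3 1 2

# "0" is declared incomparable, so it gets the nomatch value instead of an id.
ids <- match(x, x, nomatch = 0, incomparables = "0")
ids
# [1] 1 0 3 1 0

# Assigning the integer ids into a character matrix coerces them to strings.
m <- matrix(x, nrow = 1)
m[1, ] <- ids
storage.mode(m)
# [1] "character"
```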

    Cheers,
    H.


        If you take Bill Dunlap's modification of the match() approach, it
        resolves both problems: matching against the pooled unique values is
        both faster than the factor() version and gives the same result:


        On Fri, May 29, 2015 at 1:31 PM, William Dunlap
        <wdun...@tibco.com> wrote:

            match() will do what you want.  E.g., run your data through
            the following function.

        f <- function (data)
        {
              uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
              uniqStrings <- setdiff(uniqStrings, "0")
              for (j in 2:4) {
                  data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
              }
              data
        }
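
For reference, a sketch of what this function returns on the pedigree rows from the original question. The column names V1 to V7 are an assumption (what read.table() would assign by default), and recode_ids below is only a copy of the function above under a hypothetical name so that the snippet runs on its own.

```r
# Pedigree rows from the question; column names V1..V7 are assumed.
ped <- data.frame(
  V1 = "X0001",
  V2 = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  V3 = c("0", "0", "BYX894", "BYX894", "BYX894", "BYX894"),
  V4 = c("0", "0", "BYX859", "BYX859", "BYX859", "BYX859"),
  V5 = c(2, 1, 2, 2, 2, 2),
  V6 = c(1, 1, 2, 2, 2, 2),
  V7 = c("BYX859", "BYX894", "BYX862", "BYX863", "BYX864", "BYX865"),
  stringsAsFactors = FALSE
)

# Copy of the function above (hypothetical name), for self-containment.
recode_ids <- function(data) {
  uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
  uniqStrings <- setdiff(uniqStrings, "0")   # "0" marks a missing parent
  for (j in 2:4) {
    data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
  }
  data
}

out <- recode_ids(ped)
out$V2
# [1] 1 2 3 4 5 6
out$V3
# [1] 0 0 2 2 2 2
out$V4
# [1] 0 0 1 1 1 1
```

Because column 2 is pooled first and lists every individual once, the ids here coincide with the row numbers.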

        ##

        y <- data.frame(id = 1:5000,
                        v1 = sample(letters, 5000, replace=TRUE),
                        v2 = sample(letters, 5000, replace=TRUE),
                        v3 = sample(letters, 5000, replace=TRUE),
                        stringsAsFactors=FALSE)


        system.time({
            for(i in seq_len(20000)) {
              ids3 <- f(data.frame(y))
        }})

        #   user  system elapsed
        # 22.515   0.000  22.518



        ff <- function(data)
        {
              uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
              uniqStrings <- setdiff(uniqStrings, "0")
              for (j in 2:4) {
                  data[[j]] <- as.numeric(factor(data[[j]],
                                                 levels=uniqStrings))
              }
              data
        }

        system.time({
            for(i in seq_len(20000)) {
              ids4 <- ff(data.frame(y))
        }})

        #    user  system elapsed
        #  26.083   0.002  26.090

        head(ids3)

            id v1 v2 v3
        1  1  1  2  8
        2  2  2 19 22
        3  3  3 21 16
        4  4  4 10 17
        5  5  1  8 18
        6  6  1 12 26

        head(ids4)

            id v1 v2 v3
        1  1  1  2  8
        2  2  2 19 22
        3  3  3 21 16
        4  4  4 10 17
        5  5  1  8 18
        6  6  1 12 26

        Kate, if you're getting all zeros, check str(yourdataframe) - it's
        likely that when you imported your data into R the strings were
        already converted to factors, which is not what you want (ask me
        how I know this!).
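
A sketch of that check and one possible fix. The data frame is made up, and stringsAsFactors = TRUE is set explicitly here to mimic an import that turned the strings into factors.

```r
# Made-up data; stringsAsFactors = TRUE reproduces the factor import.
df <- data.frame(id     = c("BYX859", "BYX894"),
                 mother = c("0", "BYX859"),
                 stringsAsFactors = TRUE)

str(df)   # both columns are reported as Factor

# Convert every factor column back to character before calling match().
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

all(vapply(df, is.character, logical(1)))
# [1] TRUE
```

Reading the file with stringsAsFactors = FALSE in the first place avoids the conversion step.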

        Sarah



            On 05/29/2015 09:58 AM, Kate Ignatius wrote:


                I have a pedigree file as so:

                X0001 BYX859      0      0  2  1 BYX859
                X0001 BYX894      0      0  1  1 BYX894
                X0001 BYX862 BYX894 BYX859  2  2 BYX862
                X0001 BYX863 BYX894 BYX859  2  2 BYX863
                X0001 BYX864 BYX894 BYX859  2  2 BYX864
                X0001 BYX865 BYX894 BYX859  2  2 BYX865

                And I was hoping to change all unique string values to numbers.

                That is:

                BYX859 = 1
                BYX894 = 2
                BYX862 = 3
                BYX863 = 4
                BYX864 = 5
                BYX865 = 6

                But only in columns 2 - 4.  Essentially I would like the
                data to look like this:

                X0001 1 0 0  2  1 BYX859
                X0001 2 0 0  1  1 BYX894
                X0001 3 2 1  2  2 BYX862
                X0001 4 2 1  2  2 BYX863
                X0001 5 2 1  2  2 BYX864
                X0001 6 2 1  2  2 BYX865

                Is this possible with factors?

                Thanks!

                K.




    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fredhutch.org
    Phone: (206) 667-5791
    Fax: (206) 667-1319

    ______________________________________________
    R-help@r-project.org <mailto:R-help@r-project.org> mailing list --
    To UNSUBSCRIBE and more, see
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide
    http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.


