On Mon, Sep 21, 2009 at 8:07 AM, Anne-Marie Ternes <amter...@gmail.com> wrote:

> Dear mailing list,
>
> I'm stuck with a tricky problem here - at least it seems tricky to me,
> being not really talented in pattern matching and regex matters.
>
> I'm analysing amino acid mutations by position and type of mutation.
> E.g. (fictitious example) in position 92, I can find L92V, L92MV,
> L92I... L is in this example the wild-type amino-acid, and everything
> behind the position number is a mutation (single amino acid or
> mixture). I'm only interested in the mutation information, so:
>
> Say I've got this vector:
> bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
>
> I'd like to count only those elements that are "truly unique"
> mutations, i.e. count "V", "MV" as 1, "I", "IL" as 1, "PT" as 1, "M" as
> 1, "E" as 1, not count "OM".
>
> I could do it iteratively:
> Element 1: V. Keep.
> Element 2: MV. Match Keep vs New -> 1. I got already a V, so don't count.
> Element 3: I. Match Keep vs New -> 0. I is new, keep. Keep = V,I
> Element 4: IL. Match Keep vs New -> 1. I got already an I, so don't count.
> Element 5: PT. Match Keep vs New -> 0. PT is new, keep. Keep = V,I,PT
> Element 6: M: Match Keep vs New -> 0. M is new, keep. Keep = V,I,PT,M
> Element 7: E. Match Keep vs New -> 0. E is new, keep. Keep = V,I,PT,M,E
> Element 8: OM. Match Keep vs New -> 1. I got already M, so don't count.
>
> Keep vector= (V,I,PT,M,E), count =5
>
> OK. There must be a more elegant way to do this! Something with
> vector-wise pattern matching or so?... By the way, I don't care e.g.
> which of "V" or "MV" is counted, what is important is that they are
> only counted as 1.
>
> Thanks for your help!
>
> Anne-Marie
>
>
I'm on my first cup of caffeinated beverage today so I don't know how
helpful I will be-- but I'll give it a shot. I would approach this problem
by:

1. Creating a function that uses grep to search the vector of acids for the
components that match a certain letter or combination. This function would
return 1 if any matches are found and 0 if none are found. The test
for any matching mutations would be done by the appropriately-named any()
function.

2. Using an apply function to execute the matching function for each
possibility I want to search for.

Here's an example for your case:

# Your data
acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

# The letters you are interested in
to.count <- c('V','I','PT','M')

counts <- sapply( to.count, function( to.match ){

  did.match <- grep( to.match, acids )

  if( any( did.match ) ){
    return(1)
  }else{
    return(0)
  }

})

# The result
counts
 V  I PT  M
 1  1  1  1


If TRUE/FALSE answers would suffice, you could shorten the above code a
little by just returning the value of any():


counts <- sapply( to.count, function( to.match ){

  did.match <- grep( to.match, acids )

  return( any( did.match ) )

})

counts
   V    I   PT    M
TRUE TRUE TRUE TRUE


Actually, you could use as.integer() to achieve the same thing and get 1s
and 0s (sorry, I ramble a lot in the early morning.)

counts <- sapply( to.count, function( to.match ){

  did.match <- grep( to.match, acids )

  return( as.integer( any( did.match ) ) )

})


Here's a function that packs the above code up nicely:

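# acids:    character vector of observed mutations
# to.count: character vector of letters or patterns to look for
countMutations <- function( acids, to.count ){

  count <- sapply( to.count, function(to.match){

    # grep() returns the positions of the elements that match
    did.match <- grep(to.match, acids)

    # any() is TRUE if at least one position came back; store it as 1 or 0
    return(as.integer(any(did.match)))

  })

  return(count)

}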


There is one problem with the above method that I can think of- if 'M' and
'OM' were missing from your data, M would still be counted due to the
presence of 'MV':

# Data set without M and OM
acids <- c("V", "MV", "I", "IL", "PT", "E" )

countMutations( acids, to.count )

# Doh! M still counted...
 V  I PT  M
 1  1  1  1

The remedy to this is to add a little regex pixie dust to the to.count
vector. For this conflict between 'MV' and 'M', we could impose the
following rules-- we only want M to match if it is by itself or preceded
by an O, and we only want V to match if it is by itself or preceded by M. We
indicate this by changing the contents of to.count to include some regular
expression voodoo:

# New to.count vector- contains voodoo
to.count <- c( '[M]?[V]$', 'I', 'PT', '[O]?[M]$')

countMutations( acids, to.count )

# Looks funky, but you could fix that by slapping it with names()
[M]?[V]$        I       PT [O]?[M]$
       1        1        1        0
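
The names() fix mentioned in the comment could look like the sketch below. It
rebuilds the counts inline with grepl() so that it runs on its own; the
variable names patterns and labels, and the labels themselves, are my own
choices for illustration.

```r
# Data set without M and OM, as above
acids <- c("V", "MV", "I", "IL", "PT", "E")

# The regex patterns, plus a readable label for each one (labels are made up)
patterns <- c("[M]?[V]$", "I", "PT", "[O]?[M]$")
labels   <- c("V", "I", "PT", "M")

# 1 if any element of acids matches the pattern, 0 otherwise
counts <- sapply(patterns, function(p) as.integer(any(grepl(p, acids))))

# Replace the regex names with the readable labels
names(counts) <- labels
counts
```

The counts themselves are unchanged; only the names attached to them differ.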

Basically, what happened with the regexes:

[M]?[V]$

The [] are character classes, each listing the characters allowed at one
position-- in this case each class only contains one character. The [M]?
means that there may be an M directly before the V, and the [V]$ means that
the string must end in a V. Note that without a leading ^ the pattern is not
tied to the start of the string, so it matches anything that ends in V. The
[O]?[M]$ expression works exactly the same way.
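
To see what the $ anchor does and does not guarantee, here is a small sketch
comparing the pattern as written with a variant that also has a ^ at the
front. The ^ version and the test vector variants are my additions, not part
of the code above.

```r
# A few made-up test strings
variants <- c("V", "MV", "PV", "VM")

# Pattern as used above: only the end of the string is pinned down
unanchored <- grepl("[M]?[V]$", variants)

# Stricter variant: the whole string must be "V" or "MV"
anchored <- grepl("^[M]?[V]$", variants)

data.frame(variants, unanchored, anchored)
```

'PV' passes the first pattern because it ends in V, but fails the second one
because P is not allowed in front.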

If you had multiple variants for V, such as:

'V'
'MV'
'PV'

You can add more characters into the first set of brackets: [MP]?[V]$ will
match anything possibly preceded by 'M' or 'P' and terminated by 'V'.
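
A quick way to check a bracketed pattern like this is to ask grep() for the
matching values instead of their positions. In this sketch the test vector
candidates and the leading ^ are my own additions; the ^ keeps entries such
as 'QV' out.

```r
# Made-up candidates, including two that should not count as V variants
candidates <- c("V", "MV", "PV", "QV", "I")

# value = TRUE returns the matching strings themselves
loose  <- grep("[MP]?[V]$",  candidates, value = TRUE)
strict <- grep("^[MP]?[V]$", candidates, value = TRUE)

loose
strict
```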

If you need something more advanced, I would suggest investing some time in
studying regular expressions-- they are incredibly powerful yet powerfully
cryptic.
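
For comparison, the bookkeeping from the original question can also be
written out directly, without regular expressions. This is only a sketch, and
it assumes that two entries count as the same mutation whenever they share at
least one letter (that is how "MV" gets folded into "V" and "OM" into "M" in
the walk-through). The names keep, new.letters and kept.letters are made up
for illustration.

```r
bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

keep <- character(0)

for (x in bla) {
  # Letters of the candidate, and all letters of the entries kept so far
  new.letters  <- strsplit(x, "")[[1]]
  kept.letters <- unlist(strsplit(keep, ""))

  # Keep the candidate only if none of its letters has been seen before
  if (!any(new.letters %in% kept.letters)) {
    keep <- c(keep, x)
  }
}

keep          # entries counted as distinct mutations
length(keep)  # number of distinct mutations
```

Unlike the grep approach, this needs no list of patterns up front, but the
result depends on the order of the input.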

Well, coffee break's over- hope this helps!

-Charlie
