On Mon, Sep 21, 2009 at 8:07 AM, Anne-Marie Ternes <amter...@gmail.com> wrote:
> Dear mailing list,
>
> I'm stuck with a tricky problem here - at least it seems tricky to me,
> being not really talented in pattern matching and regex matters.
>
> I'm analysing amino acid mutations by position and type of mutation.
> E.g. (fictitious example) in position 92, I can find L92V, L92MV,
> L92I... L is in this example the wild-type amino-acid, and everything
> behind the position number is a mutation (single amino acid or
> mixture). I'm only interested in the mutation information, so:
>
> Say I've got this vector:
> bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
>
> I'd like to count only those elements that are "truly unique"
> mutations, i.e. count "V", "MV" as 1, "I", "IL" as 1, "PT" as 1, "M" as
> 1, "E" as 1, not count "OM".
>
> I could do it iteratively:
> Element 1: V. Keep.
> Element 2: MV. Match Keep vs New -> 1. I got already a V, so don't count.
> Element 3: I. Match Keep vs New -> 0. I is new, keep. Keep = V,I
> Element 4: IL. Match Keep vs New -> 1. I got already an I, so don't count.
> Element 5: PT. Match Keep vs New -> 0. PT is new, keep. Keep = V,I,PT
> Element 6: M: Match Keep vs New -> 0. M is new, keep. Keep = V,I,PT,M
> Element 7: E. Match Keep vs New -> 0. E is new, keep. Keep = V,I,PT,M,E
> Element 8: OM. Match Keep vs New -> 1. I got already M, so don't count.
>
> Keep vector= (V,I,PT,M,E), count =5
>
> OK. There must be a more elegant way to do this! Something with
> vector-wise pattern matching or so?... By the way, I don't care e.g.
> which of "V" or "MV" is counted, what is important is that they are
> only counted as 1.
>
> Thanks for your help!
>
> Anne-Marie
>

I'm on my first cup of caffeinated beverage today so I don't know how
helpful I will be, but I'll give it a shot.

I would approach this problem by:

1. Creating a function that uses grep to search the vector of acids for
   the components that match a certain letter or combination.
   This function would return 1 if any matches are found and 0 if none
   are. The test for any matching mutations would be done by the
   appropriately-named any() function.

2. Using an apply function to execute the matching function for each
   possibility I want to search for.

Here's an example for your case:

  # Your data
  acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

  # The letters you are interested in
  to.count <- c('V','I','PT','M')

  counts <- sapply( to.count, function( to.match ){
    # grepl() gives one TRUE/FALSE per element of acids, which is the
    # logical input any() expects; plain grep() returns integer
    # positions, and any() warns when it has to coerce those
    did.match <- grepl( to.match, acids )
    if( any( did.match ) ){
      return(1)
    }else{
      return(0)
    }
  })

  # The result
  counts
   V  I PT  M
   1  1  1  1

If TRUE/FALSE answers would suffice, you could shorten the above code a
little by just returning the value of any():

  counts <- sapply( to.count, function( to.match ){
    did.match <- grepl( to.match, acids )
    return( any( did.match ) )
  })

  counts
     V    I   PT    M
  TRUE TRUE TRUE TRUE

Actually, you could wrap that in as.integer() to get 1s and 0s back
(sorry, I ramble a lot in the early morning):

  counts <- sapply( to.count, function( to.match ){
    did.match <- grepl( to.match, acids )
    return( as.integer( any( did.match ) ) )
  })

Here's a function that packs the above code up nicely:

  countMutations <- function( acids, to.count ){
    count <- sapply( to.count, function(to.match){
      did.match <- grepl(to.match, acids)
      return(as.integer(any(did.match)))
    })
    return(count)
  }

There is one problem with the above method that I can think of: if 'M'
and 'OM' were missing from your data, M would still be matched because
of the presence of 'MV':

  # Data set without M and OM
  acids <- c("V", "MV", "I", "IL", "PT", "E" )

  countMutations( acids, to.count )
  # Doh! M still counted...
   V  I PT  M
   1  1  1  1

The remedy to this is to add a little regex pixie dust to the to.count
vector. For this conflict between 'MV' and 'M', we could impose the
following rules: we only want M to match if it is by itself or preceded
by an O, and we only want V to match if it is by itself or preceded by
an M.
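Written out literally, before any regex gets involved, those two rules amount to listing the exact strings each mutation is allowed to appear as. Here is a rough sketch of that idea; the `allowed` list and the `countExact` name are made up for illustration and are not part of the code above:

```r
# Data set without M and OM
acids <- c("V", "MV", "I", "IL", "PT", "E")

# Hypothetical lookup: for each mutation, the exact strings that count
allowed <- list(V = c("V", "MV"), M = c("M", "OM"))

# 1 if any element of acids is exactly one of the allowed strings, else 0
countExact <- function(acids, allowed) {
  sapply(allowed, function(forms) as.integer(any(acids %in% forms)))
}

countExact(acids, allowed)
# V M
# 1 0
```

That works, but listing every allowed form by hand gets tedious as the number of variants grows, which is where a pattern earns its keep.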
We indicate these rules by changing the contents of to.count to include
some regular expression voodoo:

  # New to.count vector- contains voodoo
  to.count <- c( '^[M]?[V]$', 'I', 'PT', '^[O]?[M]$')

  countMutations( acids, to.count )
  # Looks funky, but you could fix that by slapping it with names()
  ^[M]?[V]$         I        PT ^[O]?[M]$
          1         1         1         0

Basically, what happened with the regexes:

  ^[M]?[V]$

The [] are character classes: each one lists the characters allowed at
that position, and in this case each class only contains one character.
The ^ anchors the match to the start of the string. The [M]? means that
there may possibly be an M at the start of the sequence, and the [V]$
means that the sequence is terminated by a V. Without the ^, the pattern
would match any string that ends in V, whatever comes before it. The
^[O]?[M]$ expression works exactly the same way.

If you had multiple variants for V, such as:

  'V' 'MV' 'PV'

you can add more characters into the first set of brackets:

  ^[MP]?[V]$

will match a V that is possibly preceded by 'M' or 'P', and by nothing
else.

If you need something more advanced, I would suggest investing some time
in studying regular expressions: they are incredibly powerful yet
powerfully cryptic.

Well, coffee break's over- hope this helps!

-Charlie

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
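P.S. For comparison, the iterative "keep" procedure from the original question can also be written down more or less directly. This is only a sketch: it assumes that "match" in the walk-through means "shares at least one letter with something already kept", which reproduces the eight steps listed there, and `shares.letter` is a helper name invented here.

```r
# The vector from the original question
bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

# Hypothetical helper: TRUE if strings a and b have a letter in common
shares.letter <- function(a, b) {
  length(intersect(strsplit(a, "")[[1]], strsplit(b, "")[[1]])) > 0
}

keep <- character(0)
for (mutation in bla) {
  # Compare the new element against everything kept so far
  already.seen <- any(vapply(keep, shares.letter, logical(1), b = mutation))
  if (!already.seen) keep <- c(keep, mutation)
}

keep          # "V" "I" "PT" "M" "E"
length(keep)  # 5
```

Note that the result depends on the order of the elements, since an element is only compared against what has been kept before it.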