[R] using (g)sub for efficient string handling (was Re: transforming one column into 2 columns)

Benilton Carvalho Sat, 02 Feb 2008 10:41:44 -0800

That actually reminds me of a problem I had to tackle a while ago.


Say I have the following:

txt <- c("Variation_0001 // chr1:1083805-1283805 // Array CGH // 15286789 // Iafrate et al. (2004) // CopyNumber /// Variation_5452 // chr1:1142956-1147823 // Computational mapping of resequencing traces // 16902084 // Mills et al. (2006) // CopyNumber",
         "Variation_4192 // chr1:2062347-2242269 // Array CGH // 17160897 // Wong et al. (2007) // CopyNumber /// Variation_4193 // chr1:2145626-2314237 // Array CGH // 17160897 // Wong et al. (2007) // CopyNumber /// Variation_8246 // chr1:2224111-3755284 // Affymetrix 500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos et al. (2007) // CopyNumber",
         "Variation_8246 // chr1:2224111-3755284 // Affymetrix 500K and 100K SNP Mapping Arrays // 17638019 // Zogopoulos et al. (2007) // CopyNumber")


For each record, I'm interested in keeping the following:

results <- c("Variation_0001;Variation_5452","Variation_4192;Variation_4193;Variation_8246", "Variation_8246")


My solution was:

theNames <- function(tmp)
  sapply(strsplit(tmp, " /+ "),
         function(y)
         paste(y[grep("Variation_", y)],
               collapse=";"))

But my wish was to know the regular expression I would need to select everything but "Variation_\\d+". For example, something like:


gsub( NOT "Variation_\\d+", ";", txt, perl=TRUE)

Suggestions?

b
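
One possible way to get the result asked for above, offered as a sketch rather than as the answer the thread settled on: instead of negating the pattern, extract the matches directly. This assumes regmatches(), which is only available in R releases later than the one current when this was posted, and it uses a shortened, made-up txt_demo in place of the full records. The second variant stays with gsub()/sub(), as originally wished, and assumes the records contain no ";" of their own.

```r
# Shortened stand-in for the records above (hypothetical demo data).
txt_demo <- c(
  "Variation_0001 // chr1:1083805-1283805 // Array CGH /// Variation_5452 // chr1:1142956-1147823 // CopyNumber",
  "Variation_8246 // chr1:2224111-3755284 // CopyNumber"
)

# Option 1: pull out every match, then glue the matches of each record together.
hits <- regmatches(txt_demo, gregexpr("Variation_\\d+", txt_demo, perl = TRUE))
ids_extract <- sapply(hits, paste, collapse = ";")

# Option 2: gsub()/sub() only. Lazily consume everything up to each ID and keep
# the ID followed by ";", then drop the last ";" and the unmatched tail after it.
tagged <- gsub(".*?(Variation_\\d+)", "\\1;", txt_demo, perl = TRUE)
ids_gsub <- sub(";[^;]*$", "", tagged)

print(ids_extract)
print(ids_gsub)
```

Both variants avoid the strsplit() step; which one is faster on a large vector would need to be measured.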

On Feb 2, 2008, at 1:03 PM, Peter Dalgaard wrote:

Benilton Carvalho wrote:

help("strsplit")
b

Yes, but...

The postprocessing gets a bit awkward. It might be easier to use sub() to get rid of the first/last bit of the string, i.e.


C2 <- sub("^.*:", "",  Col)
C1 <- sub(":.*$", "",  Col)
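
The two sub() calls can be tried on a few of the values from the original question. This is a minimal sketch; the names Col_a, Col_b and split_cols are only illustrative, and converting the second part to numeric is an assumption about what is wanted.

```r
# A few of the values from the original question.
Col <- c("chr1:71310034", "chr15:37759058", "chr10_random:4369261")

C1 <- sub(":.*$", "", Col)   # drop the colon and everything after it
C2 <- sub("^.*:", "", Col)   # drop everything up to and including the colon

# Assemble the two new columns; the position is converted to a number here.
split_cols <- data.frame(Col_a = C1, Col_b = as.numeric(C2),
                         stringsAsFactors = FALSE)
print(split_cols)
```

If Col is stored as a factor, wrapping it in as.character() first keeps the intent explicit.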

An orthogonal idea is

con <- textConnection(Col)  # pass the character vector itself, not the string "Col"
read.table(con, sep=":")
close(con)
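
A minimal sketch of the read.table() route on a few of the values from the original question, assuming Col is a character vector. The column names passed via col.names and the name parts are illustrative choices, not part of the suggestion itself.

```r
# A few of the values from the original question.
Col <- c("chr1:71310034", "chr15:37759058", "chr10_random:4369261")

# Treat the character vector as if it were the lines of a file.
con <- textConnection(Col)
parts <- read.table(con, sep = ":", col.names = c("Col_a", "Col_b"),
                    stringsAsFactors = FALSE)
close(con)

print(parts)
```

Here read.table() decides the column types itself, so the second column comes back numeric without an explicit conversion.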

On Feb 2, 2008, at 12:43 PM, joseph wrote:



Hello

I have a data frame and one of its columns is as follows:

Col
chr1:71310034
chr14:23354088
chr15:37759058
chr22:18262638
chrUn:31337214
chr10_random:4369261
chrUn:3545097

I would like to get rid of the colon (:) and replace this column with two new columns containing the terms on each side of the colon. The new columns should look as follows:




Col_a          Col_b
chr1           71310034
chr14          23354088
chr15          37759058
chr22          18262638
chrUn          31337214
chr10_random   4369261
chrUn          3545097

Any help will be much appreciated


Joseph

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45)35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45)35327907
