[R] Regular Expression

2012-07-24 Thread Fred G
Hi--

I have three columns in an input file:
MONTH     QUARTER  YEAR
2012-07   2012-3   2012
2001-07   2001-3   2001
2002-01   2002-1   2002

I want to make output like so:
MONTH   QUARTER  YEAR
07      3        2012
07      3        2001
01      1        2002

I was having some trouble getting the regular expression to work.  I think
it should be something like the following:
tmp <- uncurated$MONTH
tmp <- gsub("[^-\\d\\d]","",tmp,perl=TRUE)
tmp[tmp=="-"] <- ""
curated$MONTH <- tmp

tmp <- uncurated$QUARTER
tmp <- gsub("[^-\\d]","",tmp,perl=TRUE)
tmp[tmp=="-"] <- ""
curated$QUARTER <- tmp

but it's not quite working. I want to be able to isolate any digits that
occur after the hyphen and to delete everything before and including the
hyphen. Would greatly appreciate any clarification anyone can provide.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Regular Expression

2012-07-24 Thread Fred G
Thank you! :)

On Tue, Jul 24, 2012 at 1:42 PM, Sarah Goslee wrote:

> To delete everything from the beginning of the string to and including
> the hyphen, use
> sub("^.*-", "", tmp)
>
> Sarah
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>
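The suggestion above can be tried on the three sample rows. This is only a sketch: the `uncurated` data frame below is typed in by hand from the post (the real input file is not shown), and `unchanged` is a made-up name used to show why the original character class deleted nothing.

```r
# Hand-typed stand-in for the input file described in the post
uncurated <- data.frame(
  MONTH   = c("2012-07", "2001-07", "2002-01"),
  QUARTER = c("2012-3", "2001-3", "2002-1"),
  YEAR    = c(2012, 2001, 2002),
  stringsAsFactors = FALSE
)

curated <- uncurated
# "^.*-" matches from the start of the string through the last hyphen
curated$MONTH   <- sub("^.*-", "", uncurated$MONTH)
curated$QUARTER <- sub("^.*-", "", uncurated$QUARTER)

# The negated class [^-\d\d] only matches characters that are neither a
# hyphen nor a digit, so gsub() finds nothing to delete in "2012-07"
unchanged <- gsub("[^-\\d\\d]", "", uncurated$MONTH, perl = TRUE)

print(curated)
```

Because `.*` is greedy, the pattern removes everything up to the last hyphen, which is the same as the first hyphen for values with a single hyphen like these.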



[R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Hi all,

My code looks like the following:
inname = read.csv("ID_error_checker.csv", as.is=TRUE)
outname = read.csv("output.csv", as.is=TRUE)

#My algorithm is the following:
#for line in inname
#if first string up to whitespace in row in inname$name = first string up
#to whitespace in row + 1 in inname$name
#AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row
#below it
#copy these two lines to a new file

In other words, if the name (up to the first whitespace) in the first row
equals the name in the second row (etc for whole file) and the ID in the
first row does not equal the ID in the second row, copy both of these rows
in full to a new file.  Only caveat is that I want a regular expression not
to take the full names, but just the first string up to the first
whitespace in the inname$name column (ie if row1 has a name of: New York
Mets and row2 has a name of New York Yankees, I would want both of these
rows to be copied in full since "New" is the same in both...)

Here is some example data:
ID  NAME                  YEAR  SOURCE       NOTES
1   New York Mets         1900  ESPN
2   New York Yankees      1920  Cooperstown
3   Boston Redsox         1918  ESPN
4   Washington Nationals  2010  ESPN
5   Detroit Tigers        1990  ESPN

The desired output would be:
ID  NAME              YEAR  SOURCE
1   New York Mets     1900  ESPN
2   New York Yankees  1920  Cooperstown

Thanks so much!



Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks Arun. The only issue is that I need the code to be very
generalizable, such that the grep() really has to be if the first string up
to the whitespace in a row (ie "New", "Boston", "Washington", "Detroit"
below) is the same as the first string up to the whitespace in the row
directly below it, AND the IDs are different, then copy.  The actual file
has thousands of different IDs and names...
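One way to express that rule without hard-coding any name is to compare each row's first word and ID with those of the row directly below it. The following is a rough sketch, not code from the thread: the data frame is typed in from the example data, and the names `first_word`, `pair_starts`, `keep` and `flagged` are made up for illustration.

```r
# Example data typed in from the thread
d <- data.frame(
  ID     = 1:5,
  NAME   = c("New York Mets", "New York Yankees", "Boston Redsox",
             "Washington Nationals", "Detroit Tigers"),
  YEAR   = c(1900, 1920, 1918, 2010, 1990),
  SOURCE = c("ESPN", "Cooperstown", "ESPN", "ESPN", "ESPN"),
  stringsAsFactors = FALSE
)

# First string up to the first whitespace in each name
first_word <- sub("[[:space:]].*$", "", d$NAME)

n <- nrow(d)
# TRUE at position i when rows i and i + 1 share a first word but not an ID
pair_starts <- first_word[-n] == first_word[-1] & d$ID[-n] != d$ID[-1]

# Keep both members of every flagged pair
keep    <- c(pair_starts, FALSE) | c(FALSE, pair_starts)
flagged <- d[keep, ]
print(flagged)
```

The flagged rows could then be written out with `write.csv(flagged, "output.csv", row.names = FALSE)`, assuming "output.csv" is the intended destination file.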

On Fri, Aug 10, 2012 at 2:01 PM, arun  wrote:

>
>
> Hi,
>
> Try this:
> dat1<-read.table(text="
> ID,NAME,YEAR,SOURCE
> 1,New York Mets,1900,ESPN
> 2,New York Yankees,1920,Cooperstown
> 3,Boston Redsox,1918,ESPN
> 4,Washington Nationals,2010,ESPN
> 5,Detroit Tigers,1990,ESPN
> ",sep=",",header=TRUE,stringsAsFactors=FALSE)
>
> index<-grep("New York.*",dat1$NAME)
> dat1[index,]
> #  ID             NAME YEAR      SOURCE
> #1  1    New York Mets 1900        ESPN
> #2  2 New York Yankees 1920 Cooperstown
>
> A.K.
>
>
>



Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks so much, and thanks for the clarification. "New York" ---> "New"
should not match "Other New" because "New" is not the first.

Thanks so much, testing it on my data now.

On Fri, Aug 10, 2012 at 2:35 PM, Rui Barradas  wrote:

> Hello,
>
> My code doesn't anticipate a point you've made clear in this post. Inline.
> On 10-08-2012 19:05, Fred G wrote:
>
>  Thanks Arun. The only issue is that I need the code to be very
>> generalizable, such that the grep() really has to be if the first string
>> up
>> to the whitespace in a row (ie "New", "Boston", "Washington", "Detroit
>> below) is the same as the first string up to the whitespace in the row
>> directly below it
>>
>
> Does this mean that "New York" ---> "New" in one row shouldn't match
> "Other New" in the next row because "New" is not the first string up to the
> whitespace? If this is the case, modify my earlier code to
>
>
>
> fun <- function(i, x){
>     if(x[i, "ID"] != x[i + 1, "ID"]){
>         # keep first string
>         s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
>         s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
>         if(grepl(s1, s2)) return(TRUE)
>     }
>     FALSE
> }
>
> If it isn't the case, do nothing.
>
> Rui Barradas
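A point to check when adapting the function above: grepl(s1, s2) asks whether s1 occurs anywhere inside s2, so a first word such as "New" would also match a first word such as "Newark". If the first words must be identical, a plain equality test may be closer to the stated rule. The sketch below is a hypothetical variant, not part of the original reply; `fun_exact` and the small data frame are invented for illustration.

```r
s1 <- "New"
s2 <- "Newark"   # hypothetical first word of the next row

substring_match <- grepl(s1, s2)   # pattern match: is "New" found inside "Newark"?
exact_match     <- s1 == s2        # equality: are the two first words identical?

# Variant of fun() that compares the first words for equality
fun_exact <- function(i, x) {
  if (x[i, "ID"] != x[i + 1, "ID"]) {
    s1 <- strsplit(x[i, "NAME"], "[[:space:]]+")[[1]][1]      # first word, row i
    s2 <- strsplit(x[i + 1, "NAME"], "[[:space:]]+")[[1]][1]  # first word, row i + 1
    return(s1 == s2)
  }
  FALSE
}

# Made-up rows to exercise the difference
d <- data.frame(
  ID   = 1:3,
  NAME = c("New York Mets", "Newark Bears", "Newark Eagles"),
  stringsAsFactors = FALSE
)
res <- sapply(seq_len(nrow(d) - 1), fun_exact, d)
```

Using `==` also avoids surprises when a name contains characters that are special in regular expressions, such as "." or "(".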

Re: [R] Regular Expressions + Matrices

2012-08-10 Thread Fred G
Thanks Bill! Works great! Thanks again guys!

On Fri, Aug 10, 2012 at 2:43 PM, William Dunlap  wrote:

> If you think about this as a runs problem you can get a loopless solution
> that I think is easier to read (once the requisite functions are defined).
>
> First define the function to canonicalize the name
>   nickname <- function(x) sub(" .*", "", x)
> then define some handy runs functions
>   isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
>   isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical
> then use those functions on your dataset
>   > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$YEAR)
>   > d[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
> See how it works with triplicates as well
>   > dd <- rbind(d, data.frame(ID=6:8,
>         NAME=c("Chicago Blacksox", "Chicago Cubs", "Chicago Whitesox"),
>         YEAR=1701:1703, SOURCE=rep("made up", 3)))
>   > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$YEAR)
>   > dd[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
>   6  6 Chicago Blacksox 1701     made up
>   7  7     Chicago Cubs 1702     made up
>   8  8 Chicago Whitesox 1703     made up
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
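The session above tests YEAR for a change between adjacent rows, while the original question asks for differing IDs. A self-contained sketch of the same runs idea keyed on ID might look like this; the three helper functions are copied from the reply, and the data frame is made up so that one adjacent pair shares both a first word and an ID.

```r
# Helper functions as defined in the reply above
nickname     <- function(x) sub(" .*", "", x)
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isJustBefore <- function(x) c(x[-1], FALSE)  # x should be logical

# Made-up data: the two Chicago rows share an ID, so they should not be flagged
d <- data.frame(
  ID   = c(1, 2, 3, 4, 4),
  NAME = c("New York Mets", "New York Yankees", "Boston Redsox",
           "Chicago Cubs", "Chicago Whitesox"),
  stringsAsFactors = FALSE
)

# Same first word as the row above, but a different ID from the row above
nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$ID)
result  <- d[nearDup | isJustBefore(nearDup), ]
print(result)
```

Only the New York pair is kept here, because the Chicago rows agree on ID as well as on the first word.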
>
>
> > -Original Message-
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf
> > Of Rui Barradas
> > Sent: Friday, August 10, 2012 11:18 AM
> > To: Fred G
> > Cc: r-help
> > Subject: Re: [R] Regular Expressions + Matrices
> >
> > Hello,
> >
> > Try the following.
> >
> >
> > d <- read.table(textConnection("
> > ID NAME  YEAR SOURCE
> > 1  'New York Mets'   1900  ESPN
> > 2  'New York Yankees'  1920 Cooperstown
> > 3  'Boston Redsox'   1918  ESPN
> > 4  'Washington Nationals'  2010 ESPN
> > 5  'Detroit Tigers'  1990  ESPN
> > "), header=TRUE)
> >
> > d$NAME <- as.character(d$NAME)
> >
> > fun <- function(i, x){
> >  if(x[i, "ID"] != x[i + 1, "ID"]){
> >  s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
> >  if(grepl(s, x[i + 1, "NAME"])) return(TRUE)
> >  }
> >  FALSE
> > }
> >
> > inx <- sapply(seq_len(nrow(d) - 1), fun, d)
> > inx <- c(inx, FALSE) | c(FALSE, inx)
> > d[inx, ]
> >
> > Hope this helps,
> >
> > Rui Barradas

[R] rbind()

2012-01-20 Thread Fred G
Hello there,

Much thanks in advance for any help.  I have a few questions:

1) Why do I keep getting the following error:

File1 <- read.csv("../RawData/File1.csv",as.is=TRUE,row.names=1)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../RawData/File1.csv': No such file or directory

?

More specifically, my directories are set up in the following way:
 SampleProject
 RawData   SampleCode

The current script is in the SampleCode folder.  File1.csv is in the
RawData folder.  I'm a bit confused why this error keeps occurring.  I
googled it and found many other people getting the same error, but was not
sure why mine remained incorrect...

2) Ultimately what I want to do is take File1.csv, File2.csv and File3.csv
(all in the RawData folder) and basically add them together such that it
was as if they were all on one big csv file to begin with.  I thought I
knew how to do this but I'm using a mac now-- is there something different
between the code to do this with R Studio and on a Mac and using Tinn R on
Windows?

In any case, I would really very much appreciate any help on both these
issues.

Thank you again.

benjamin
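For what it is worth, a relative path such as "../RawData/File1.csv" is resolved against R's working directory (see getwd()), which is not necessarily the folder the script is saved in. The sketch below rebuilds the described folder layout in a temporary directory with made-up data so that it can run anywhere; the file contents, the `project`, `raw_dir` and `code_dir` names, and the use of do.call(rbind, ...) are illustrative assumptions, not something taken from the post.

```r
# Hypothetical copy of the layout from the post, created under tempdir()
project  <- file.path(tempdir(), "SampleProject")
raw_dir  <- file.path(project, "RawData")
code_dir <- file.path(project, "SampleCode")
dir.create(raw_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(code_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up contents for File1.csv, File2.csv and File3.csv
for (k in 1:3) {
  write.csv(data.frame(x = k, y = letters[k], row.names = paste0("r", k)),
            file.path(raw_dir, sprintf("File%d.csv", k)))
}

# The relative path only works once the working directory is SampleCode
old_wd <- setwd(code_dir)
files  <- file.path("..", "RawData", sprintf("File%d.csv", 1:3))
found  <- file.exists(files)

# Read each file the same way, then stack the pieces row-wise
pieces   <- lapply(files, read.csv, as.is = TRUE, row.names = 1)
combined <- do.call(rbind, pieces)
setwd(old_wd)

print(combined)
```

rbind() on data frames expects the files to have the same column names, and with row.names = 1 the row names must stay unique across files. None of these calls is specific to Mac, Windows, RStudio or Tinn-R.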



[R] regular expression

2012-02-29 Thread Fred G
Computer Friends,

with the following example lines:

[107] "98-610: Cell type: S; Surv(months): 6; STATUS(0=alive, 1=dead): 1"

[108] "99-625: Cell type: S; Surv(months): 21; STATUS(0=alive, 1=dead): 1"

I want to be able to isolate the number of months of survival for each row.

Is there a regular expression that can find the first instance of a ";",
delete everything in front of it, and find the second instance of a ";"
and delete everything behind it? In Python there is a function line.find();
I would be grateful to hear the R equivalent, or any better alternative to
get the number of months of survival stored as a variable.

Much Thank You!
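A possible approach, sketched on the two example lines only: capture the digits after "Surv(months):" with sub(), or split each line on ";" and clean up the second field. It assumes every line follows the same layout as the two shown; `months`, `months2` and `first_semicolon` are names made up for this sketch.

```r
x <- c(
  "98-610: Cell type: S; Surv(months): 6; STATUS(0=alive, 1=dead): 1",
  "99-625: Cell type: S; Surv(months): 21; STATUS(0=alive, 1=dead): 1"
)

# Option 1: capture the digits that follow "Surv(months):" and drop the rest
months <- as.numeric(
  sub(".*Surv\\(months\\):[[:space:]]*([0-9]+).*", "\\1", x)
)

# Option 2: closer to the "between the semicolons" idea; split on ";" and
# take the second field, then strip the label up to the colon
fields  <- strsplit(x, ";", fixed = TRUE)
second  <- vapply(fields, function(f) f[2], character(1))
months2 <- as.numeric(sub(".*:", "", second))

# regexpr() is roughly the counterpart of Python's str.find(): it gives the
# 1-based position of the first match, or -1 when there is no match
first_semicolon <- as.integer(regexpr(";", x, fixed = TRUE))
```

Option 1 does not depend on the position of the field within the line, only on the "Surv(months):" label, so it may be the more robust of the two if other fields vary.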
