Something like this?

# Remove everything after ; to give the status
status <- sub(';.*$', '', data$cancer.problems)

# Remove everything before the last ; to give tissue
# If there is no ; in the string this goes wrong; correct for that
tissue <- sub('^.*;[ \n]*', '', data$cancer.problems)
tissue[!grepl(';', data$cancer.problems)] <- ''

# Select the part between ;'s to give age
indices <- regexpr(';.*;', data$cancer.problems)
lengths <- attr(indices, "match.length")
age <- rep(NA, length(data$cancer.problems))
age[indices>0] <- substring(data$cancer.problems[indices>0],
    indices[indices>0]+1, indices[indices>0]+lengths[indices>0]-2)

Hope it helps.

Regards,
Jan

On Fri, Apr 23, 2010 at 6:24 AM, Min-Han Tan <minhan.scie...@gmail.com> wrote:
> Dear fellow R-help members,
>
> I hope to seek your advice on how to parse/manage a dataset with hundreds of
> columns. Two examples of these columns, 'cancer.problems', and
> 'neuro.problems' are depicted below. Essentially, I need to parse this into
> a useful dataset, and unfortunately, I am not familiar with perl or any such
> language.
>
> data <- data.frame(id=c(1:10))
> data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;",
>   "Y; LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;",
>   "Y;DX AGE: 39; TESTIS")
> data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","",
>   "Y; DX AGE: 74; STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA",
>   "Y", "Y; WEAKNESS")
>
> As can be seen, the semi-colon delimiter follows its own set of rules, which
> are internally consistent - with all 3 elements of data, it should be
> "Status; Age; Tissue Type". However, if there is only tissue type, it is
> "Status; Tissue Type", without the trailing semi-colon. However, if there is
> Age available, it is "Status; Age;".
>
> The main challenge for me is how to parse/convert this dataset into a useful
> and consistent data.frame, or list, where I can capture Status, Age and
> Tissue Type as separate fields. Due to the varying application of the
> delimiter, I cannot use strsplit consistently.
> I have tried a convoluted method by identifying "AGE" as the character
> string identifying 3 element fields per below, but faced problems with
> unlist, given the empty fields.
>
> age.present <- grepl("AGE",data[,2])
> data.3column <- strsplit(data[age.present,2],";")
> data.2column <- strsplit(data[!age.present,2],";")
> data$cancer.status[age.present] <- unlist(data.3column)[(1:sum(age.present)*3)-2]
> ...
>
> Your advice is earnestly sought.
>
> Thanks.
>
> Min-Han
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
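
Since the dataset is said to have hundreds of such columns, the three steps from the reply (status before the first ;, tissue after the last ;, age between the ;'s) could be wrapped in one function and applied per column. The following is only a minimal sketch: the helper name parse_problems is hypothetical and not from the thread, and trimming whitespace with trimws() is an added assumption, not part of the original code.

```r
# Hypothetical helper (not from the original thread): wraps the three
# sub()/regexpr() steps and trims surrounding whitespace from each piece.
parse_problems <- function(x) {
  has_sep <- grepl(";", x, fixed = TRUE)

  # Status: everything before the first ';'
  status <- trimws(sub(";.*$", "", x))

  # Tissue: everything after the last ';' (empty when there is no ';')
  tissue <- trimws(sub("^.*;", "", x))
  tissue[!has_sep] <- ""

  # Age: the text between the first and the last ';' (NA when absent)
  m <- regexpr(";.*;", x)
  hit <- m > 0
  age <- rep(NA_character_, length(x))
  age[hit] <- trimws(substring(x[hit], m[hit] + 1,
                               m[hit] + attr(m, "match.length")[hit] - 2))

  data.frame(status = status, age = age, tissue = tissue,
             stringsAsFactors = FALSE)
}

# A few of the strings from the original question
cancer <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; LIVER", "Y")
parsed <- parse_problems(cancer)
print(parsed)
```

With these inputs, "Y; DX AGE: 28; COLON" should come out as status "Y", age "DX AGE: 28" and tissue "COLON", while "Y; LIVER" gets an NA age. Note that an entry such as "Y: DX AGE: 80-89;" in neuro.problems uses a colon after the Y, so this approach would return the whole "Y: DX AGE: 80-89" as the status; entries like that would need to be cleaned up first.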