Dear fellow R-help members,

I would like your advice on how to parse and manage a dataset with hundreds of columns. Two examples of these columns, 'cancer.problems' and 'neuro.problems', are shown below. I need to parse them into a usable dataset, and unfortunately I am not familiar with Perl or any similar language.

data <- data.frame(id=c(1:10))
data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; LIVER", "", "Y", "Y; DX AGE: 24;", "Y", "Y;DX AGE: 44;", "Y;DX AGE: 39; TESTIS")
data$neuro.problems <- c("Y: DX AGE: 80-89;", "Y", "", "Y; DX AGE: 74; STROKE", "Y; DEMENTIA", "Y", "", "Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS")

As can be seen, the semicolon delimiter follows its own set of rules, which are internally consistent. With all 3 elements of data present, the format is "Status; Age; Tissue Type". If only the tissue type is available, it is "Status; Tissue Type", without a trailing semicolon. If only the age is available, it is "Status; Age;", with a trailing semicolon.

The main challenge for me is how to convert this dataset into a useful and consistent data.frame, or list, where I can capture Status, Age and Tissue Type as separate fields. Because the delimiter is applied inconsistently, I cannot use strsplit consistently.

I have tried a convoluted method that uses "AGE" as the character string identifying the 3-element fields, as below, but ran into problems with unlist because of the empty fields.

age.present <- grepl("AGE", data[,2])
data.3column <- strsplit(data[age.present,2], ";")
data.2column <- strsplit(data[!age.present,2], ";")
data$cancer.status[age.present] <- unlist(data.3column)[(1:sum(age.present)*3)-2]
...

Your advice is earnestly sought.

Thanks.

Min-Han

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
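
One possible way to approach the "Status; Age; Tissue Type" layout described above, offered as a rough sketch rather than a definitive answer: instead of splitting on the semicolon, pull each field out with a regular expression, so that a missing age or tissue type simply becomes NA. The helper name parse_problems and the output column names are made up for illustration, and the sketch assumes that the age always follows the literal text "DX AGE:" and that the status is the leading run of letters. It uses base R only.

```r
# Hypothetical helper (not from the original post): turn one "problems"
# column into a data.frame with status, age and tissue columns.
parse_problems <- function(x) {
  x <- trimws(x)

  # Status: the leading run of letters (e.g. "Y"); empty entries become NA.
  status <- ifelse(nzchar(x), sub("^([A-Za-z]+).*$", "\\1", x), NA_character_)

  # Age: the text after "DX AGE:" up to the next semicolon. It is kept as
  # character so that ranges such as "80-89" survive.
  has_age <- grepl("DX AGE:", x, fixed = TRUE)
  age <- ifelse(has_age,
                trimws(sub("^.*DX AGE:([^;]*).*$", "\\1", x)),
                NA_character_)

  # Tissue: whatever is left once the status and the age field are removed.
  rest <- sub("^[A-Za-z]+[[:space:]]*[;:]?", "", x)
  rest <- sub("DX AGE:[^;]*;?", "", rest)
  tissue <- trimws(gsub(";", "", rest, fixed = TRUE))
  tissue[!nzchar(tissue)] <- NA_character_

  data.frame(status = status, age = age, tissue = tissue,
             stringsAsFactors = FALSE)
}

# A few of the values from the post, including the "Y:" variant.
sample_values <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;",
                   "Y; LIVER", "Y", "Y: DX AGE: 80-89;")
parsed <- parse_problems(sample_values)
print(parsed)
```

For hundreds of columns, the same helper could be applied to each one, for example with lapply(data[cols], parse_problems), where cols is a character vector of the relevant column names. That gives a list of data.frames, one per original column. Entries that do not follow the assumed layout would need to be checked by hand.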