Dear fellow R-help members,

I would like your advice on how to parse a dataset with hundreds of
columns. Two examples of these columns, 'cancer.problems' and
'neuro.problems', are shown below. Essentially, I need to parse these into
a usable dataset and, unfortunately, I am not familiar with Perl or any
similar language.

data <- data.frame(id = 1:10)
data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;",
                          "Y; LIVER", "", "Y", "Y; DX AGE: 24;", "Y",
                          "Y;DX AGE: 44;", "Y;DX AGE: 39; TESTIS")
data$neuro.problems <- c("Y: DX AGE: 80-89;", "Y", "",
                         "Y; DX AGE: 74; STROKE", "Y; DEMENTIA", "Y", "",
                         "Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS")

As can be seen, the semicolon delimiter follows its own set of rules, which
are internally consistent. With all 3 elements present, the field is
"Status; Age; Tissue Type". If there is only a tissue type, it is
"Status; Tissue Type", without a trailing semicolon. If there is an age
but no tissue type, it is "Status; Age;", with a trailing semicolon.
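
One possible way to express these rules in R is to match on content rather
than on position. The following is only a sketch under assumptions:
parse_problems is a hypothetical helper name, the age is kept as text
because ranges such as "80-89" occur, and either ';' or ':' is accepted
after the status because both appear in the sample data.

```r
# Hypothetical helper (illustration only): split one column into status,
# age and tissue by pattern matching instead of by position.
parse_problems <- function(x) {
  x <- trimws(x)

  # Status: the leading token before the first ';' or ':'.
  status <- trimws(sub("[;:].*$", "", x))
  status[status == ""] <- NA_character_

  # Age: the text after "DX AGE:" up to the next ';', kept as character.
  has_age <- grepl("DX AGE:", x, fixed = TRUE)
  age <- rep(NA_character_, length(x))
  age[has_age] <- trimws(sub("^.*DX AGE:([^;]*).*$", "\\1", x[has_age]))

  # Tissue: whatever remains once the status and the age clause are removed.
  rest <- sub("^[^;:]*[;:]?", "", x)       # drop status and first delimiter
  rest <- sub("DX AGE:[^;]*;?", "", rest)  # drop the age clause, if any
  tissue <- trimws(rest)
  tissue[tissue == ""] <- NA_character_

  data.frame(status = status, age = age, tissue = tissue,
             stringsAsFactors = FALSE)
}

# Sample values copied from the two example columns.
sample_values <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; LIVER",
                   "Y", "Y;DX AGE: 39; TESTIS", "Y: DX AGE: 80-89;")
parsed <- parse_problems(sample_values)
print(parsed)
```

Because the function takes a character vector and returns one row per
element, it could in principle be applied to each of the hundreds of
columns in turn, for example with lapply over the column names.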

The main challenge for me is how to convert this dataset into a useful and
consistent data.frame or list, in which Status, Age and Tissue Type are
captured as separate fields. Because the delimiter is applied in varying
ways, I cannot rely on strsplit alone. I have tried a convoluted method
that uses the string "AGE" to identify the 3-element fields, as shown
below, but ran into problems with unlist because of the empty fields.

age.present <- grepl("AGE",data[,2])
data.3column <- strsplit(data[age.present,2],";")
data.2column <- strsplit(data[!age.present,2],";")
data$cancer.status[age.present] <-
  unlist(data.3column)[(1:sum(age.present)*3)-2]
...
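
A possible explanation for the trouble with unlist, offered as a sketch
rather than a definitive diagnosis: strsplit does not return an empty final
piece after a trailing ';', so "Status; Age;" and "Status; Tissue Type"
both split into 2 pieces, and a fixed stride of 3 through the unlisted
vector no longer lines up. The names x, pieces, n_pieces and first_piece
are illustrative only.

```r
# Sample values copied from the cancer.problems column.
x <- c("Y; DX AGE: 28; COLON", "Y; DX AGE: 27;", "Y; LIVER", "Y", "")

pieces <- strsplit(x, ";")

# Number of pieces per record: the empty string after a trailing ';' is
# dropped, and an empty input yields zero pieces.
n_pieces <- lengths(pieces)
print(n_pieces)

# Take the first piece of each record, or NA when there is none, so the
# result always has exactly one entry per row (unlike unlist).
first_piece <- vapply(
  pieces,
  function(p) if (length(p) > 0) trimws(p[1]) else NA_character_,
  character(1)
)
print(first_piece)
```

Working per record with vapply keeps the result aligned with the rows of
the data.frame even when some fields are empty.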

Your advice is earnestly sought.

Thanks.

Min-Han


