I agree, R will be fine for this. Not being as expert with regex as Jeff, I would tend to do this in a few steps, something like:
library(XLConnect)
DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )

library(stringi)
## insert a marker between the text and the numbers
txt <- stri_replace_all_regex(DF[[2]], "([^\\d]{2,})(\\d+ )", "$1|||$2")
## separate the text from the numbers
stringNums <- stri_split_fixed(txt, "|||", 2, simplify = TRUE)
## split the numbers apart
nums <- stri_split_regex(stringNums[, 2], "[^\\d]+", n = 5, simplify=TRUE)
## put it all back together
extracted <- data.frame(DF[, 1], stringNums[, 1], apply(nums, 2, as.numeric))
## put the names back
names(extracted) <- c(names(DF)[1], paste(names(DF)[2], 1:6, sep = "_"))

Best,
Ista

On Wed, Jan 21, 2015 at 8:02 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> I think R is quite capable of doing this. You would have to learn a
> comparable number of fiddly bits to accomplish this in R, Python or Perl.
>
> That is not to say that learning Perl or Python is a bad idea... but in
> terms of "shortest path" I think they are of comparable complexity. All
> three languages support regular expressions, which would be the key bit of
> knowledge to acquire regardless of which tool you use.
>
> Other fiddly bits might involve handling the cyrillic strings as data,
> though you did not convey a desire to retain that information.
>
> One way (not extracting cyrillic text):
>
> library(XLConnect)
> DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
> pattern <- "^.*(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$"
> idx <- grep( pattern, DF[[2]] )
> dta <- sub( pattern, "\\1,\\2,\\3,\\4", DF[[2]][idx])
> dtamatrix <- apply( do.call( rbind
>                            , strsplit( dta, "," ) )
>                   , 2
>                   , as.numeric
>                   )
> extracted <- data.frame( V1=DF[[1]][idx], dtamatrix )
>
>
> On Wed, 21 Jan 2015, Collin Lynch wrote:
>
>> Dr. Polanski, I would recommend something else. Given the messy nature of
>> your data I would suggest using a language like Python or Perl to extract
>> it to an appropriate format.
>> Python has good regular expression support and unicode support. If you can
>> save your data as a csv file or even text line by line then it would be
>> possible to write some code to read the file, match the lines with a simple
>> regular expression, and then spit them back out as a csv file which you
>> could read into R.
>>
>> I realize that this means learning a new language or finding someone with
>> the requisite skills, but I would recommend that over attempting to use R's
>> text processing.
>>
>> Collin.
>>
>> On Wed, Jan 21, 2015 at 3:31 PM, Dr Polanski <n.polyans...@gmail.com> wrote:
>>
>>> Hi all!
>>>
>>> Sorry to bother you. I am trying to learn some R via coursera courses and
>>> other internet sources, yet haven't managed to go far.
>>>
>>> And now I need to do some, I hope, not too difficult things, which I think
>>> R can do, yet I have no idea how to make it do so.
>>>
>>> I have a big set of (empirical) data which was obtained by my colleagues
>>> and stored in an inconvenient way: all of the data is in two cells of an
>>> Excel table. An example of the data is in the attached file (the link):
>>>
>>> https://drive.google.com/file/d/0B64YMbf_hh5BS2tzVE9WVmV3bFU/view?usp=sharing
>>>
>>> So the first column has a number and the second has a whole vector (I
>>> guess it is) which looks like "some words in Cyrillic (the length varies)"
>>> and then the set of numbers "12*23 34*45" (another problem is that
>>> sometimes it is "12*23, 34*56").
>>>
>>> And the number of rows is about 3000, so it is impossible to do manually.
>>>
>>> What I need at the end is to have it separately in different Excel cells:
>>> - what is written in words - | 12 | 23 | 34 | 45 |
>>>
>>> Do you think it is possible to do so using R (or something else)?
>>>
>>> Thank you very much in advance, and sorry for asking for help with such a
>>> stupid question. The problem is that I am trying, and yet haven't even
>>> managed to install openSUSE onto my laptop - only Ubuntu! :)
>>>
>>> Thank you very much!
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
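
For anyone who wants to try the idea without the spreadsheet, here is a minimal base-R sketch on two made-up sample strings (these are not the poster's data). It assumes every string holds some text followed by exactly four numbers, and it uses only base functions rather than XLConnect or stringi. The names x, txt, digits and nums are just for illustration.

```r
## Made-up sample strings standing in for the second spreadsheet column
## (assumption: text first, then exactly four numbers per string)
x <- c("some words 12*23 34*45",
       "other text 1*2, 3*4")

## Text part: everything before the first digit, surrounding spaces removed
txt <- trimws(sub("^([^0-9]*).*$", "\\1", x))

## Number part: every run of digits, one character vector per input string
digits <- regmatches(x, gregexpr("[0-9]+", x))

## Stack the per-string vectors into a numeric matrix, one row per string
nums <- do.call(rbind, lapply(digits, as.numeric))

## Combine the text and the numbers into one data frame
extracted <- data.frame(text = txt, nums)
print(extracted)
```

Pulling out the digit runs directly means the separator between the numbers ("*", a space, or ", ") does not matter, which covers both variants described in the original question. Rows that do not contain exactly four numbers would need to be filtered out first, since rbind would otherwise recycle values.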