This seems unnecessarily complex. Or rather, it pushes the complexity into an arcane notation. What we really want is something that says "here is a string, here is a pattern, give me all the substrings that match." What we're given is a function that tells us where those substrings are.
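To make that concrete, here is a small sketch of what gregexpr actually hands back for one of the formulas from the original post. The pattern and the variable names are only for illustration.

```r
# Illustration only: inspect the raw return value of gregexpr().
text <- "CCl3F"
match.info <- gregexpr("[A-Z][a-z]*[0-9]*", text)

# gregexpr returns a list with one element per input string.
# Each element is an integer vector of 1-based start positions
# carrying a "match.length" attribute of the same length.
starts  <- as.vector(match.info[[1]])            # drop the attributes
lengths <- attr(match.info[[1]], "match.length")

print(starts)    # where each match begins
print(lengths)   # how long each match is
```

So the caller gets positions and lengths, and still has to turn them into substrings with substr or something like it.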
# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in. Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.

greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text) > 1) stop("text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr reports "no match" as a single start position of -1
    if (isTRUE(starts[1] == -1)) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    sapply(seq(along=starts), function (i) {
        substr(text, starts[i], stops[i])
    })
}

Given greg.matches, we can do the rest with very simple and easily comprehended regular expressions.

# parse.chemical(formula)
# takes a simple chemical formula "<element><count>..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts -- number -- the counts (missing counts taken as 1).
# BEWARE. This does not handle formulas like "CH(OH)3".

parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}

> parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
 [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
 [1] 1 2 1 2 1 1 4 1 4 1

On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help <r-help@r-project.org> wrote:

> Dear List members,
>
> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K", "2", "Rb", "Ca", "3")))
> # [1] NA NA NA 2 NA NA 3
> The above requires the use of the suppressWarnings function. Are there
> any better ways?
>
> I was working to extract chemical elements from a formula, something
> like this:
> split.symbol.character = function(x, rm.digits = TRUE) {
>     # Perl is partly broken in R 4.3, but this works:
>     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>     # stringi::stri_split(x, regex = regex);
>     s = strsplit(x, regex, perl = TRUE);
>     if(rm.digits) {
>         s = lapply(s, function(s) {
>             isNotD = is.na(suppressWarnings(as.numeric(s)));
>             s = s[isNotD];
>         });
>     }
>     return(s);
> }
>
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>
>
> Sincerely,
>
>
> Leonard
>
>
> Note:
> # works:
> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
>
> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>