Leonard,

Since speed/efficiency now seems to be one of your main considerations, maybe a 
step back might help.

Are there simplifying assumptions that are valid or can you make it simpler, 
such as converting everything to the same case?

Your sample data was this and I assume your actual data is similar and far 
longer.

c("Li", "Na", "K",  "2", "Rb", "Ca", "3")

So rather than use complex and costly regular expressions, or other full 
searches, can you just assume all entries start with either an uppercase letter 
or a numeral and test for those using something simple like
> substr(c("Li", "Na", "K",  "2", "Rb", "Ca", "3"), 1, 1)
[1] "L" "N" "K" "2" "R" "C" "3"

If you save that in a variable you can check if that is greater than or equal 
to "A" or perhaps "0" and also perhaps if it is less than or equal to "Z" or 
perhaps "9" and see if such a test is faster.

orig <- c("Li", "Na", "K",  "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)
elements_bool <- initial >= "A" & initial <= "Z"

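The last line yields a logical vector you can use to index your original and 
toss away the entries that start with a digit. Entries starting with a lower 
case letter or some other Unicode symbol are tossed as well, provided your 
locale sorts them outside the range "A" to "Z"; string comparison in R follows 
the locale's collation order, so check this on your system.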

orig_elements <- orig[elements_bool]

> orig
[1] "Li" "Na" "K"  "2"  "Rb" "Ca" "3" 
> orig_elements
[1] "Li" "Na" "K"  "Rb" "Ca"
> orig[!elements_bool]
[1] "2" "3"
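
A possible variation, offered only as a sketch and assuming the entries are 
plain ASCII: because >= and <= on strings follow the collation order of the 
current locale, a set-membership test against the built-in LETTERS constant 
may behave more predictably. The names is_upper, is_digit and big are made up 
here for illustration, and the timings will differ from machine to machine.

```r
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
initial <- substr(orig, 1, 1)

# Set membership does not depend on the locale's collation order
is_upper <- initial %in% LETTERS
is_digit <- initial %in% as.character(0:9)

orig[is_upper]   # "Li" "Na" "K"  "Rb" "Ca"
orig[is_digit]   # "2" "3"

# Rough timing on a longer vector (hypothetical test data)
big <- rep(orig, 1e5)
system.time(substr(big, 1, 1) %in% LETTERS)
system.time(grepl("^[[:upper:]]", big))
```

Whether the substr() route actually beats the regular expression is something 
to measure on your real data rather than assume.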

Other approaches you might consider, depending on your needs, include storing 
your data as a column in a data.frame, tibble or similar construct and 
generating additional columns along the way, which keeps your information 
consolidated. That can be efficient, especially if you shift some of your 
logic to faster compiled functionality, perhaps through packages that fit your 
needs better, such as data.table, dplyr and other parts of the tidyverse. And 
note that if you use pipelines, for many purposes the native pipe (|>) built 
into R since version 4.1 may be faster.
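
As a minimal sketch of that idea in base R only (the column names symbol, 
initial and is_digit are hypothetical, chosen just for this example), the 
derived values can sit next to the original entries and be filtered with the 
native pipe:

```r
orig <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")

# Keep the derived columns next to the original values
df <- data.frame(symbol = orig, stringsAsFactors = FALSE)
df$initial  <- substr(df$symbol, 1, 1)
df$is_digit <- df$initial %in% as.character(0:9)

# Native pipe (needs R >= 4.1.0) feeding base subset()
elements <- df |> subset(!is_digit, select = symbol)
elements$symbol   # "Li" "Na" "K"  "Rb" "Ca"
```

The same steps could be written with data.table or dplyr; base R is used here 
only to keep the example free of extra dependencies.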


-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Leonard Mada via R-help
Sent: Wednesday, October 18, 2023 10:59 AM
To: R-help Mailing List <r-help@r-project.org>
Subject: [R] Best way to test for numeric digits?

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there 
any better ways?

I was working to extract chemical elements from a formula, something 
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
     # Perl is partly broken in R 4.3, but this works:
     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
     # stringi::stri_split(x, regex = regex);
     s = strsplit(x, regex, perl = TRUE);
     if(rm.digits) {
         s = lapply(s, function(s) {
             isNotD = is.na(suppressWarnings(as.numeric(s)));
             s = s[isNotD];
         });
     }
     return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
