Re: [R] Best way to test for numeric digits?

Rui Barradas Wed, 18 Oct 2023 10:45:51 -0700

At 17:24 on 18/10/2023, Leonard Mada wrote:

Dear Rui,


Thank you for your reply.

I actually do have access to the chemical symbols: I have started to refactor and enhance the Rpdb package, see Rpdb::elements:

https://github.com/discoleo/Rpdb

However, the regex that you have constructed is quite heavy, as it needs to iterate through all chemical symbols (in decreasing nchar). Elements like C, and especially O, P or S, appear late in the regex, but are quite common in chemistry.

The alternative regex is (in this respect) simpler. It actually works (once you know about the workaround).
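
For comparison, a pattern that does not enumerate the symbols at all could look like the sketch below. It is not the regex from this thread but a swapped-in generic pattern, and it assumes that every symbol is one uppercase letter followed by zero or more lowercase letters, so it does not check that a token is a real element. The name extract_symbols is hypothetical.

```r
# Sketch: extract element-like tokens with one generic pattern (base R only).
# Assumption: a symbol is an uppercase letter followed by lowercase letters;
# nothing here validates the tokens against the periodic table.
extract_symbols <- function(x) {
  regmatches(x, gregexpr("[A-Z][a-z]*", x, perl = TRUE))
}

extract_symbols(c("CCl3F", "Li4Al4H16"))
# [[1]]
# [1] "C"  "Cl" "F"
#
# [[2]]
# [1] "Li" "Al" "H"
```

Digits never match the pattern, so no separate digit-removal step is needed in this variant.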

Q: My question was whether there is anything like is.numeric that tests each element of a vector.
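
One way to get an element-wise test without coercion (and therefore without warnings) is a vectorised regex check. The sketch below is only an illustration: is_digits is a hypothetical helper, and it recognises unsigned integers only, not decimals, signs or exponents.

```r
# Sketch: TRUE for elements that consist only of digits, FALSE otherwise.
# No coercion to numeric takes place, so no warning is raised.
is_digits <- function(x) grepl("^[0-9]+$", x)

tokens <- c("Li", "Na", "K", "2", "Rb", "Ca", "3")
is_digits(tokens)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

# Keep only the non-digit tokens.
tokens[!is_digits(tokens)]
# [1] "Li" "Na" "K"  "Rb" "Ca"
```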


Sincerely,


Leonard


On 10/18/2023 6:53 PM, Rui Barradas wrote:

At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there
any better ways?

I was working to extract chemical elements from a formula, something
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
    # Perl is partly broken in R 4.3, but this works:
    regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
    # stringi::stri_split(x, regex = regex);
    s = strsplit(x, regex, perl = TRUE);
    if(rm.digits) {
        s = lapply(s, function(s) {
            isNotD = is.na(suppressWarnings(as.numeric(s)));
            s = s[isNotD];
        });
    }
    return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
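
A possible explanation for the difference, offered as an assumption rather than a confirmed diagnosis: strsplit consumes its input piece by piece, and an empty match at the very start of what remains is handled by splitting off a single character. Every alternative in the working regex begins with a lookbehind, so it cannot match at that position, whereas the bare (?=[A-Z]) can. The sketch below runs both patterns on one formula so the results can be compared; rx_ok and rx_bare are names introduced here.

```r
rx_ok   <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
rx_bare <- "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])"

# Every alternative needs a preceding character, so the split points
# fall strictly inside the string.
strsplit("CCl3F", rx_ok, perl = TRUE)[[1]]
# [1] "C"  "Cl" "3"  "F"

# The bare lookahead can match before the first remaining character;
# print the result and compare it with the one above.
strsplit("CCl3F", rx_bare, perl = TRUE)[[1]]
```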

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://eu01.z.antigena.com/l/boS9jwics77ZHEe0yO-Lt8AIDZm9-s6afEH4ulMO3sMyE9mLHNAR603_eeHQG2-_t0N2KsFVQRcldL-XDy~dLMhLtJWX69QR9Y0E8BCSopItW8RqG76PPj7ejTkm7UOsLQcy9PUV0-uTjKs2zeC_oxUOrjaFUWIhk8xuDJWb
PLEASE do read the posting guide
https://eu01.z.antigena.com/l/rUSt2cEKjOO0HrIFcEgHH_NROfU9g5sZ8MaK28fnBl9G6CrCrrQyqd~_vNxLYzQ7Ruvlxfq~P_77QvT1BngSg~NLk7joNyC4dSEagQsiroWozpyhR~tbGOGCRg5cGlOszZLsmq2~w6qHO5T~8b5z8ZBTJkCZ8CBDi5KYD33-OK
and provide commented, minimal, self-contained, reproducible code.

Hello,

If you want to extract chemical element symbols, the following might work.

It uses the periodic table from the GitHub package chemr and a function from
the stringr package.


devtools::install_github("paleolimbot/chemr")



split_chem_elements <- function(x) {
    data(pt, package = "chemr", envir = environment())
    el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
    pat <- paste(el, collapse = "|")
    stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"


It is also possible to rewrite the function without calls to non-base
packages, but that will take some more work.
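
A base R variant might look roughly like the sketch below. It is not tested against the full periodic table: symbols is a small hypothetical subset standing in for a real table such as pt$symbol from chemr, and split_chem_elements_base is a name introduced here.

```r
# Sketch: base R variant of the alternation approach.
# `symbols` is a small hypothetical subset; a complete symbol table
# would be substituted in real use.
symbols <- c("H", "C", "O", "F", "P", "Li", "Al", "Si", "Cl")

split_chem_elements_base <- function(x, symbols) {
  # Longer symbols first: with perl = TRUE the alternation takes the first
  # alternative that matches, so "Cl" must be tried before "C".
  el <- symbols[order(nchar(symbols), decreasing = TRUE)]
  pat <- paste(el, collapse = "|")
  regmatches(x, gregexpr(pat, x, perl = TRUE))
}

split_chem_elements_base(c("CCl3F", "Li4Al4H16"), symbols)
# [[1]]
# [1] "C"  "Cl" "F"
#
# [[2]]
# [1] "Li" "Al" "H"
```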

Hope this helps,

Rui Barradas

Hello,

You and Avi are right, my function's performance is terrible. The following is much faster.

As for how to not have digits throw warnings, the lapply in the version of your function below solves it by setting the grep argument invert = TRUE. This returns all strings in which no digits occur.
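
The invert = TRUE idiom can be tried in isolation on a single vector of tokens. This is a minimal sketch; tokens and keep are names chosen for the illustration.

```r
tokens <- c("C", "Cl", "3", "F")

# invert = TRUE returns the indices of the elements that do NOT match.
keep <- grep("[[:digit:]]+", tokens, invert = TRUE)
tokens[keep]
# [1] "C"  "Cl" "F"

# Equivalent logical form.
tokens[!grepl("[[:digit:]]", tokens)]
# [1] "C"  "Cl" "F"
```

Neither form coerces anything to numeric, so no warning is raised for non-numeric strings.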




split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
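    # mark every split point with "#", then split on "#" or on any digit
    # (use the argument x here, not the global mol)
    stringr::str_replace_all(x, regex, "#") |>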
      strsplit("#|[[:digit:]]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
split.symbol.character(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

mol10000 <- rep(mol, 10000)

system.time(
  split_chem_elements(mol10000)
)
#>    user  system elapsed
#>    0.01    0.00    0.02
system.time(
  split.symbol.character(mol10000)
)
#>    user  system elapsed
#>    0.35    0.07    0.47



Hope this helps,

Rui Barradas


