At 19:35 on 18/10/2023, Leonard Mada wrote:
Dear Rui,

On 10/18/2023 8:45 PM, Rui Barradas wrote:
split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(mol, regex, "#") |>
      strsplit("#|[[:digit:]]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}

There is a glitch in the code of the first function: mol is hardcoded where x should be used. After correcting that glitch, the times are similar.

Note:
- grep("[[:digit:]]", ...) is almost twice as slow as grep("[0-9]", ...)!
- corrected code and timings are below.
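
A minimal sketch for checking the character-class timing locally. The token list is made up for illustration, and the timings will vary with machine and R version, so no expected numbers are shown:

```r
# Hypothetical input: already-split tokens, repeated to get measurable timings.
tokens <- list(c("C", "Cl", "3", "F"), c("Li", "4", "Al", "4", "H", "16"))
s <- rep(tokens, 10000)

# Same filtering step as in split.symbol.character, once per pattern.
t_posix <- system.time(
  r1 <- lapply(s, \(x) x[grep("[[:digit:]]", x, invert = TRUE)])
)
t_range <- system.time(
  r2 <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
)

# For ASCII input both patterns should keep exactly the same tokens.
stopifnot(identical(r1, r2))

print(t_posix)
print(t_range)
```

Only the relative difference between the two timings is of interest here.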

Sincerely,

Leonard
#######

split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     stringr::str_replace_all(x, regex, "#") |>
       strsplit("#|[[:digit:]]") |>
       lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}

split.symbol.character = function(x, rm.digits = TRUE) {
   # Perl is partly broken in R 4.3, but this works:
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   s <- strsplit(x, regex, perl = TRUE)
   if(rm.digits) {
     s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
   }
   s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol10000 <- rep(mol, 10000)

system.time(
   split_chem_elements(mol10000)
)
#   user  system elapsed
#   0.58    0.00    0.58

system.time(
   split.symbol.character(mol10000)
)
#   user  system elapsed
#   0.67    0.00    0.67
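
For readers following the thread, here is a small sketch of what the three alternatives in the regex are meant to do, using base R only. The function name split_symbols_demo is hypothetical and only mirrors split.symbol.character above; the comments are my reading of the pattern, not an authoritative description:

```r
split_symbols_demo <- function(x, rm.digits = TRUE) {
  regex <- paste(
    "(?<=[A-Z])(?![a-z]|$)",  # after an uppercase letter not followed by lowercase or end
    "(?<=.)(?=[A-Z])",        # before any uppercase letter that is not at the start
    "(?<=[a-z])(?=[^a-z])",   # after a lowercase letter followed by a non-lowercase character
    sep = "|"
  )
  s <- strsplit(x, regex, perl = TRUE)
  if (rm.digits) {
    # Drop the tokens that contain digits (the counts).
    s <- lapply(s, \(p) p[grep("[0-9]", p, invert = TRUE)])
  }
  s
}

split_symbols_demo("CCl3F")[[1]]
# expected: "C" "Cl" "F"

split_symbols_demo("Li4Al4H16", rm.digits = FALSE)[[1]]
# expected: "Li" "4" "Al" "4" "H" "16"
```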

Hello,

You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all with stringi::stri_replace_all_regex, and the improvement is significant.


split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringi::stri_replace_all_regex(x, regex, "#") |>
      strsplit("#|[0-9]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

# system.time(
#   split_chem_elements(mol10000)
# )
#  user  system elapsed
#  0.06    0.00    0.09
# system.time(
#   split.symbol.character(mol10000)
# )
#  user  system elapsed
#  0.25    0.00    0.28
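
Since the speed-up only matters if the output is unchanged, a quick sanity check may be useful. This is a sketch, assuming the stringi package is installed (the check is skipped otherwise); split_elements_stringi is a hypothetical helper name that mirrors the rm.digits = TRUE branch:

```r
regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"

split_elements_stringi <- function(x) {
  # stri_replace_all_regex takes (str, pattern, replacement), in that order.
  marked <- stringi::stri_replace_all_regex(x, regex, "#")
  # Split on the marker or on single digits, then drop the empty pieces.
  lapply(strsplit(marked, "#|[0-9]"), \(p) p[nchar(p) > 0L])
}

# Element symbols expected for two of the test molecules.
expected <- list(c("C", "Cl", "F"), c("Li", "Al", "H"))

if (requireNamespace("stringi", quietly = TRUE)) {
  res <- split_elements_stringi(c("CCl3F", "Li4Al4H16"))
  stopifnot(identical(res, expected))
}
```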



Hope this helps,

Rui Barradas





______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
