Hello,
If speed is important, and following the previous discussion and Bert's
tests, here are two other alternatives, both faster.
1. Bert2 is Bert's original but with scan(., sep = "_") substituted for
unlist/strsplit.
2. A package data.table solution. These are always fast, many times the
fastest. But have the inconvenience of coercing the data to class
"data.table" and the rest of the code needs to be adapted to handle
data.tables. Namely, the second index in dt[i, j] is no longer a column
index.
Unlike Bert, I time my first code, the one with package tidyr and its
performance clearly beats the second one.
I define a test function, running several input sizes. It doesn't take
much time to complete, only several minutes. The times' differences are
not as impressive as Bert's, probably due to be on a different OS. I'm
running R 4.0.2 on Ubuntu 20.04, sessionInfo at the end.
Also, I find X$Y1 <- as.integer(grepl("_", X$text)) more readable than
coercion to numeric with +grepl(.).
library(data.table)
library(microbenchmark)
library(ggplot2)
Rui1 <- function(X){
#X$Y1 <- as.integer(grepl("_", X$text))
tidyr::separate(X, text, into = c("X1", "X2"), sep = "_", fill = "right")
}
Bert <- function(X){
## which are the non "_" indices?
wh <- grep("_",X$text, fixed = TRUE, invert = TRUE)
## paste "_." to these
X[wh,"text"] <- paste(X[wh,"text"],".",sep = "_")
## Now strsplit() and unlist() them to get a vector
z <- unlist(strsplit(X$text, "_"))
## now cbind() to the data frame
cbind(X, matrix(z, ncol = 2, byrow = TRUE))
}
Bert2 <- function(X){
wh <- grep("_",X$text, fixed = TRUE, invert = TRUE)
X[wh,"text"] <- paste(X[wh,"text"],".",sep = "_")
z <- scan(what = character(), text = X$text, sep = "_")
cbind(X, matrix(z, ncol = 2, byrow = TRUE))
}
DT <- function(X){
Y <- as.data.table(X)
Y[, c("X1", "X2") := tstrsplit(text, "_", fixed = TRUE)]
}
testSeparate <- function(X, size = 0:6, times = 10){
row_nums <- seq_len(nrow(X))
res <- lapply(size, function(s){
Y <- X[rep(row_nums, 10^s), ]
mb <- microbenchmark(
Rui = Rui1(Y),
Bert = Bert(Y),
Bert2 = Bert2(Y),
DT = DT(Y),
times = times
)
mb$size <- s
mb
})
# return median times
res <- do.call(rbind, res)
aggregate(time ~ size + expr, res, median)
}
F1 <- read.table(text="ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15 ",header=TRUE,stringsAsFactors=F)
agg <- testSeparate(F1, times = 5)
ggplot(agg, aes(size, time, color = expr)) +
geom_line() + geom_point() +
scale_y_continuous(trans = "log10") +
xlab(expression(log[10] ~ "(size)")) +
ylab(expression(log[10] ~ "(time)"))
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=pt_PT.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_PT.UTF-8 LC_COLLATE=pt_PT.UTF-8
[5] LC_MONETARY=pt_PT.UTF-8 LC_MESSAGES=pt_PT.UTF-8
[7] LC_PAPER=pt_PT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.3.2 microbenchmark_1.4-7 data.table_1.12.8
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 magrittr_1.5 tidyselect_1.1.0 munsell_0.5.0
[5] colorspace_1.4-1 R6_2.4.1 rlang_0.4.7 dplyr_1.0.2
[9] tools_4.0.2 grid_4.0.2 gtable_0.3.0 withr_2.2.0
[13] ellipsis_0.3.1 digest_0.6.25 tibble_3.0.3 lifecycle_0.2.0
[17] crayon_1.3.4 purrr_0.3.4 farver_2.0.3 tidyr_1.0.2
[21] vctrs_0.3.4 glue_1.4.2 labeling_0.3 stringi_1.4.6
[25] compiler_4.0.2 pillar_1.4.6 generics_0.0.2 scales_1.1.0
[29] pkgconfig_2.0.3
Hope this helps,
Rui Barradas
Às 02:47 de 23/09/20, Bert Gunter escreveu:
That was still slower and doesn't quite give what was requested:
> cbind(F1,utils::strcapture("([^_]*)_(.*)", F1$text,
proto=data.frame(Before_=character(), After_=character())))
ID1 ID2 text Before_ After_
1 A1 B1 NONE <NA> <NA>
2 A1 B1 cf_12 cf 12
3 A1 B1 NONE <NA> <NA>
4 A2 B2 X2_25 X2 25
5 A2 B3 fd_15 fd 15
> system.time({
+ cbind(F2,utils::strcapture("([^_]*)_(.*)", F2$text,
proto=data.frame(Before_=character(), After_=character())))
+ }
+ )
user system elapsed
32.712 0.736 33.587
Cheers,
Bert
On Tue, Sep 22, 2020 at 5:45 PM Bill Dunlap <williamwdun...@gmail.com
<mailto:williamwdun...@gmail.com>> wrote:
Another way to make columns out of the stuff before and after the
underscore, with NAs if there is no underscore, is
utils::strcapture("([^_]*)_(.*)", F1$text,
proto=data.frame(Before_=character(), After_=character()))
-Bill
On Tue, Sep 22, 2020 at 4:25 PM Bert Gunter <bgunter.4...@gmail.com
<mailto:bgunter.4...@gmail.com>> wrote:
To be clear, I think Rui's solution is perfectly fine and
probably better
than what I offer below. But just for fun, I wanted to do it
without the
lapply(). Here is one way. I think my comments suffice to explain.
> ## which are the non "_" indices?
> wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> ## paste "_." to these
> F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> ## Now strsplit() and unlist() them to get a vector
> z <- unlist(strsplit(F1$text, "_"))
> ## now cbind() to the data frame
> F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> F1
ID1 ID2 text 1 2
1 A1 B1 NONE_. NONE .
2 A1 B1 cf_12 cf 12
3 A1 B1 NONE_. NONE .
4 A2 B2 X2_25 X2 25
5 A2 B3 fd_15 fd 15
>## You can change the names of the 2 columns yourself
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming
along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas
<ruipbarra...@sapo.pt <mailto:ruipbarra...@sapo.pt>> wrote:
> Hello,
>
> A base R solution with strsplit, like in your code.
>
> F1$Y1 <- +grepl("_", F1$text)
>
> tmp <- strsplit(as.character(F1$text), "_")
> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".")
else x)
> tmp <- do.call(rbind, tmp)
> colnames(tmp) <- c("X1", "X2")
> F1 <- cbind(F1[-3], tmp) # remove the original column
> rm(tmp)
>
> F1
> # ID1 ID2 Y1 X1 X2
> #1 A1 B1 0 NONE .
> #2 A1 B1 1 cf 12
> #3 A1 B1 0 NONE .
> #4 A2 B2 1 X2 25
> #5 A2 B3 1 fd 15
>
>
> Note that cbind dispatches on F1, an object of class
"data.frame".
> Therefore it's the method cbind.data.frame that is called and
the result
> is also a df, though tmp is a "matrix".
>
>
> Hope this helps,
>
> Rui Barradas
>
>
> Às 20:07 de 22/09/20, Rui Barradas escreveu:
> > Hello,
> >
> > Something like this?
> >
> >
> > F1$Y1 <- +grepl("_", F1$text)
> > F1 <- F1[c(1, 2, 4, 3)]
> > F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep =
"_", fill =
> > "right")
> > F1
> >
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> > Às 19:55 de 22/09/20, Val escreveu:
> >> HI All,
> >>
> >> I am trying to create new columns based on another
column string
> >> content. First I want to identify rows that contain a
particular
> >> string. If it contains, I want to split the string and
create two
> >> variables.
> >>
> >> Here is my sample of data.
> >> F1<-read.table(text="ID1 ID2 text
> >> A1 B1 NONE
> >> A1 B1 cf_12
> >> A1 B1 NONE
> >> A2 B2 X2_25
> >> A2 B3 fd_15 ",header=TRUE,stringsAsFactors=F)
> >> If the variable "text" contains this "_" I want to create
an indicator
> >> variable as shown below
> >>
> >> F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> >>
> >>
> >> Then I want to split that string in to two, before "_" and
after "_"
> >> and create two variables as shown below
> >> x1= strsplit(as.character(F1$text),'_',2)
> >>
> >> My problem is how to combine this with the original data
frame. The
> >> desired output is shown below,
> >>
> >>
> >> ID1 ID2 Y1 X1 X2
> >> A1 B1 0 NONE .
> >> A1 B1 1 cf 12
> >> A1 B1 0 NONE .
> >> A2 B2 1 X2 25
> >> A2 B3 1 fd 15
> >>
> >> Any help?
> >> Thank you.
> >>
> >> ______________________________________________
> >> R-help@r-project.org <mailto:R-help@r-project.org> mailing
list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained,
reproducible code.
> >>
> >
> > ______________________________________________
> > R-help@r-project.org <mailto:R-help@r-project.org> mailing
list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained,
reproducible code.
>
> ______________________________________________
> R-help@r-project.org <mailto:R-help@r-project.org> mailing
list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible
code.
>
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org <mailto:R-help@r-project.org> mailing list
-- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.