Re: [R] Split String in regex while Keeping Delimiter

Greg Snow Thu, 13 Apr 2023 09:09:41 -0700

Since any space that follows 2 or 3 + signs (or - signs) also follows
a single + (or -), this can be done with a positive look-behind on a
single character, which may be a little simpler:


x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
strsplit(x, "(?<=[+-])\\s+", perl=TRUE)
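
As a quick check of the call above, here is a small sketch that compares
its result with the pieces asked for in the original question. The names
`parts` and `expected` are only illustrative, and `expected` is typed with
plain ASCII + and - characters.

```r
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)

# Split on whitespace that is immediately preceded by a single + or -.
# The look-behind is zero-width, so the + or - stays with the piece on its left.
parts <- strsplit(x, "(?<=[+-])\\s+", perl = TRUE)

# The pieces requested in the question (illustrative name).
expected <- list(
  c("leucocyten +", "gramnegatieve staven +++", "grampositieve staven ++"),
  c("leucocyten -", "grampositieve coccen +")
)

stopifnot(identical(parts, expected))
```

The space in front of each run of + or - is not a split point because it
follows a letter, not a + or -.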

An alternative is to use the strapply family of functions in the gsubfn
package, which focus on what you want to keep for each piece rather
than on what to split on.

Here is an example that says to keep a sequence of characters that are
not + or -, followed by 1 to 3 + or - characters:

library(gsubfn)
strapplyc(x, "[^+-]+[+-]{1,3}")

This keeps the leading space on every returned string after the first.
A couple of options that drop those spaces as well:

strapply(x, "([^+-]+[+-]{1,3}) *", backref = -1)
strapply(x, "[^ +-][^+-]+[+-]{1,3}")
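
If installing gsubfn is not an option, the same "describe what to keep"
idea can probably be sketched in base R with gregexpr() and regmatches().
This is an alternative I am adding for illustration, not something from
the gsubfn documentation. It reuses the second pattern above, and the
name `kept` is only illustrative.

```r
x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)

# Find every stretch that starts with a character that is not a space, + or -,
# continues with characters that are not + or -, and ends with 1 to 3 + or - signs.
kept <- regmatches(x, gregexpr("[^ +-][^+-]+[+-]{1,3}", x))

stopifnot(identical(
  kept,
  list(
    c("leucocyten +", "gramnegatieve staven +++", "grampositieve staven ++"),
    c("leucocyten -", "grampositieve coccen +")
  )
))
```

As with the strapply version, the pattern needs at least two characters
before the signs, so a one-letter item would be missed. Changing the
middle `+` quantifier to `*` would relax that.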

On Wed, Apr 12, 2023 at 11:54 AM Ivan Krylov <krylov.r...@gmail.com> wrote:
>
> On Wed, 12 Apr 2023 08:29:50 +0000
> Emily Bakker <emilybak...@outlook.com> wrote:
>
> > Some example data:
> > “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> > “leucocyten – grampositieve coccen +”
> >
> > I want to split the strings such that I get the following result:
> > c(“leucocyten +”,  “gramnegatieve staven +++”,
> >  “grampositieve staven ++”)
> > c(“leucocyten –“, “grampositieve coccen +”)
> >
> > I have tried strsplit with a regular expression with a positive
> > lookahead, but I am not able to achieve the results that I want.
>
> It sounds like you need positive look-behind, not look-ahead: split on
> spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
> repetition quantifiers like {n,m} or + are not directly supported in
> look-behind expressions (nor in Perl itself). As a special case, you
> can use \K, where anything to the left of \K is a zero-width positive
> match:
>
> x <- c(
>  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
>  'leucocyten - grampositieve coccen +'
> )
> strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
> # [[1]]
> # [1] "leucocyten +"             "gramnegatieve staven +++"
> #     "grampositieve staven ++"
> #
> # [[2]]
> # [1] "leucocyten -"           "grampositieve coccen +"
>
> --
> Best regards,
> Ivan
>
> P.S. It looks like your e-mail client has transformed every quote
> character into typographically-correct Unicode quotes “” and every
> minus into an en dash, which makes it slightly harder to work with your
> code, since typographically correct Unicode quotes are not R string
> delimiters. Is it really – that you'd like to split upon, or is it -?
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
