Since any space that follows 2 or 3 + signs (or - signs) also follows a single + (or -), this can be done with a positive look-behind, which may be a little simpler:

x <- c(
  'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
  'leucocyten - grampositieve coccen +'
)
strsplit(x, "(?<=[+-])\\s+", perl=TRUE)

An alternative is to use the strapply function(s) in the gsubfn package,
which focus on what you want to keep for each piece rather than what to
split on. Here is an example that says to keep a sequence of characters
that are not + or -, followed by 1 to 3 + or - characters:

library(gsubfn)
strapplyc(x, "[^+-]+[+-]{1,3}")

This includes the spaces at the beginning of the returned strings after
the first. A couple of options that drop these spaces as well are:

strapply(x, "([^+-]+[+-]{1,3}) *", backref = -1)
strapply(x, "[^ +-][^+-]+[+-]{1,3}")

On Wed, Apr 12, 2023 at 11:54 AM Ivan Krylov <krylov.r...@gmail.com> wrote:
>
> On Wed, 12 Apr 2023 08:29:50 +0000
> Emily Bakker <emilybak...@outlook.com> wrote:
>
> > Some example data:
> > “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> > “leucocyten – grampositieve coccen +”
> >
> > I want to split the strings such that I get the following result:
> > c(“leucocyten +”, “gramnegatieve staven +++”,
> > “grampositieve staven ++”)
> > c(“leucocyten –“, “grampositieve coccen +”)
> >
> > I have tried strsplit with a regular expression with a positive
> > lookahead, but I am not able to achieve the results that I want.
>
> It sounds like you need positive look-behind, not look-ahead: split on
> spaces only if they _follow_ one to three of '+' or '-'. Unfortunately,
> repetition quantifiers like {n,m} or + are not directly supported in
> look-behind expressions (nor in Perl itself).
> As a special case, you can use \K, where anything to the left of \K is
> a zero-width positive match:
>
> x <- c(
> 'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
> 'leucocyten - grampositieve coccen +'
> )
> strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
> # [[1]]
> # [1] "leucocyten +" "gramnegatieve staven +++"
> # "grampositieve staven ++"
> #
> # [[2]]
> # [1] "leucocyten -" "grampositieve coccen +"
>
> --
> Best regards,
> Ivan
>
> P.S. It looks like your e-mail client has transformed every quote
> character into typographically-correct Unicode quotes “” and every
> minus into an en dash, which makes it slightly harder to work with your
> code, since typographically correct Unicode quotes are not R string
> delimiters. Is it really – that you'd like to split upon, or is it -?
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
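
A note for readers who do not have gsubfn installed: the same "keep what you want" idea can probably be written in base R with gregexpr() and regmatches(). This is only a sketch and was not part of the original thread; the names m and pieces are illustrative, and trimws() is used here to drop the leading spaces instead of changing the pattern.

```r
x <- c(
  "leucocyten + gramnegatieve staven +++ grampositieve staven ++",
  "leucocyten - grampositieve coccen +"
)

# Locate every run of characters that are not + or -,
# followed by 1 to 3 + or - characters (same pattern as with strapplyc).
m <- gregexpr("[^+-]+[+-]{1,3}", x)

# Extract the matched pieces for each input string, then strip the
# leading space that every piece after the first carries.
pieces <- lapply(regmatches(x, m), trimws)
pieces
```

As with strapplyc(), the result is a list with one character vector per element of x.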