Primarily for my own amusement, here is a way to do what I think you wanted without lookaheads or lookbehinds:
strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
[[1]]
 [1] "a"    "bc"   ","    "def"  ","    "adef" ","    "x"    ";"
[10] ","    ","    "gh"

I certainly would *not* claim that it is in any way superior to anything
that has already been suggested -- indeed, probably the contrary. But it's
simple (as am I).

Cheers,
Bert

On Fri, May 5, 2023 at 2:54 PM Leonard Mada via R-help
<r-help@r-project.org> wrote:

> Dear Avi,
>
> Punctuation marks are used in various NLP language models. Preserving
> the "," is therefore useful in such scenarios and Regex are useful to
> accomplish this (especially if you have sufficient experience with such
> expressions).
>
> I observed only an odd behaviour using strsplit: the example string is
> constructed; but it is always wise to test a Regex expression against
> various scenarios. It is usually hard to predict what special cases will
> occur in a specific corpus.
>
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> # Expected:
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
> # see 2nd instance of stringi::stri_split
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 5/5/2023 11:20 PM, avi.e.gr...@gmail.com wrote:
> > Leonard,
> >
> > It can be helpful to spell out your intent in English or some of us
> > have to go back to the documentation to remember what some of the
> > operators do.
> >
> > Your text being searched seems to be an example of items between commas
> > with an optional space after some commas and in one case, nothing
> > between commas.
> >
> > So what is your goal for the example, and in general?
> > You mention a bit unclearly at the end some of what you expect and I
> > think it would be clearer if you also showed exactly the output you
> > would want.
> >
> > I saw some other replies that addressed what you wanted and am going to
> > reply in another direction.
> >
> > Why do things the hard way using things like lookahead or look behind?
> > Would several steps get you the result way more clearly?
> >
> > For the sake of argument, you either want what reading in a CSV file
> > would supply, or something else. Since you are not simply splitting on
> > commas, it sounds like something else. But what exactly else? Something
> > as simple as this on just a comma produces results including empty
> > strings and embedded leading or trailing spaces:
> >
> > strsplit("a bc,def, adef ,,gh", ",")
> > [[1]]
> > [1] "a bc"   "def"    " adef " ""       "gh"
> >
> > That can of course be handled by, for example, trimming the result
> > after unlisting the odd way strsplit returns results:
> >
> > library("stringr")
> > str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
> >
> > [1] "a bc" "def"  "adef" ""     "gh"
> >
> > Now do you want the empty string to be something else, such as an NA?
> > That can be done too with another step.
> >
> > And a completely different variant can be used to read in your one-line
> > CSV as text using standard overkill tools:
> >
> >> read.table(text="a bc,def, adef ,,gh", sep=",")
> >     V1  V2    V3 V4 V5
> > 1 a bc def adef  NA gh
> >
> > The above is a vector of texts. But if you simply want to reassemble
> > your initial string cleaned up a bit, you can use paste to put back
> > commas, as in a variation of the earlier example:
> >
> >> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
> > [1] "a bc,def,adef,,gh"
> >
> > So my question is whether using advanced methods is really necessary
> > for your case, or even particularly efficient.
> > If efficiency matters, it is often better to use tools without regular
> > expressions, such as paste0(), when they meet your needs.
> >
> > Of course, unless I know what you are actually trying to do, my remarks
> > may not be useful.
> >
> >
> > -----Original Message-----
> > From: R-help <r-help-boun...@r-project.org> On Behalf Of Leonard Mada via R-help
> > Sent: Thursday, May 4, 2023 5:00 PM
> > To: R-help Mailing List <r-help@r-project.org>
> > Subject: [R] Regex Split?
> >
> > Dear R-Users,
> >
> > I tried the following 3 Regex expressions in R 4.3:
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
> >
> >
> > I feel that:
> > - none should return (after "def"): ",", "";
> > - the first one could also return "", "," (but probably not; not fully
> >   sure about this);
> >
> >
> > Sincerely,
> >
> >
> > Leonard
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://eu01.z.antigena.com/l/boS91wizs77ZHrpn6fDgE-TZu7JxUnjyNg_9mZDUsLWLylcL-dhQytfeUHheLHZnKJw-VwwfCd_W4XdAukyKenqYPFzSJmP5FrWmF_wepejCrBByUVa66jUF7wKGiA8LnqB49ZUVq-urjKs272Rl-mj-SE1q7--Xj1UXRol3
> > PLEASE do read the posting guide
> > https://eu01.z.antigena.com/l/rUS82cEKjOa3tTqQ7yTAXLpuOWG1NttoMdEKDQkk3EZhrLW63rsvJ77vuFxoc44Nwo7BGuQyBzF3bNlYLccamhXBk0shpe_1ZhOeonqIbTm59I58PKOPwwqUt6gLF2fLg3OmstDk7ueraKARO4qpUToOguMdYKyE2_LZnBk7QR
> > and provide commented, minimal, self-contained, reproducible code.
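
For anyone who wants to try the two lookaround-free approaches from this thread side by side, here is a minimal sketch in base R. The function names split_keep_punct() and split_fields() are hypothetical and do not appear in the thread. The first wraps the gsub()/strsplit() idea from the top of this message. The second follows the multi-step idea from Avi's message, but swaps in base trimws() for stringr::str_squish(), so unlike str_squish() it does not collapse repeated spaces inside a field. Both assume a single input string.

```r
# Hypothetical helper (name is not from the thread): pad every punctuation
# character with spaces, then split on runs of spaces, so that punctuation
# marks survive as separate tokens.
split_keep_punct <- function(x) {
  padded <- gsub("([[:punct:]])", " \\1 ", x)
  # trimws() avoids a leading "" when the string starts with punctuation
  tokens <- strsplit(trimws(padded), " +")[[1]]
  tokens[nzchar(tokens)]  # drop any empty strings that are left over
}

# Hypothetical helper: split on a literal separator, trim each field, and
# turn empty fields into NA (the "another step" mentioned in the thread).
# Assumption: trimws() is used here instead of stringr::str_squish().
split_fields <- function(x, sep = ",") {
  fields <- trimws(unlist(strsplit(x, sep, fixed = TRUE)))
  fields[!nzchar(fields)] <- NA_character_
  fields
}

res_tokens  <- split_keep_punct("a bc,def, adef ,,gh")
res_tokens2 <- split_keep_punct("a bc,def, adef,x; ,,gh")
res_fields  <- split_fields("a bc,def, adef ,,gh")

print(res_tokens)
print(res_tokens2)
print(res_fields)
```

On the example string from the original question, split_keep_punct() should give the nine tokens listed under "# Expected:" above. Note that it treats every punctuation character as a token, not just the comma, which may or may not be what a given corpus needs.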