Dear Bert,

Thank you for the suggestion. Indeed, there are various solutions and workarounds. However, there is still a bug in strsplit.

2.) gsub
I would try to avoid gsub on a Wikipedia-sized corpus: using strsplit directly should be far more efficient.

3.) Punctuation marks
Abbreviations and "word1-word2" may be a problem:
gsub("(?<ThePunct>[[:punct:]])", "\\1 ", "A.B.C.", perl=T)
# "A. B. C. "

I do not yet have an intuition if the spaces in "A. B. C. " would adversely affect the language model. But this goes off-topic.

Sincerely,

Leonard


On 5/6/2023 1:35 AM, Bert Gunter wrote:
> Primarily for my own amusement, here is a way to do what I think you
> wanted without look-aheads/behinds
>
> strsplit(gsub("([[:punct:]])"," \\1 ","a bc,def, adef,x; ,,gh"), " +")
> [[1]]
>  [1] "a" "bc" "," "def" "," "adef" "," "x" ";"
> [10] "," "," "gh"
>
> I certainly would *not* claim that it is in any way superior to
> anything that has already been suggested -- indeed, probably the
> contrary. But it's simple (as am I).
>
> Cheers,
> Bert
>
> On Fri, May 5, 2023 at 2:54 PM Leonard Mada via R-help
> <r-help@r-project.org> wrote:
>
> Dear Avi,
>
> Punctuation marks are used in various NLP language models. Preserving
> the "," is therefore useful in such scenarios and Regex expressions are
> useful to accomplish this (especially if you have sufficient experience
> with such expressions).
>
> I observed only an odd behaviour using strsplit: the example string is
> constructed; but it is always wise to test a Regex expression against
> various scenarios. It is usually hard to predict what special cases
> will occur in a specific corpus.
>
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "" "," "," "gh"
>
> stringi::stri_split("a bc,def, adef ,,gh", regex=" |(?<! )(?=,)|(?<=,)(?![ ])")
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> # Expected:
> # "a" "bc" "," "def" "," "adef" "," "," "gh"
> # see 2nd instance of stringi::stri_split
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 5/5/2023 11:20 PM, avi.e.gr...@gmail.com wrote:
> > Leonard,
> >
> > It can be helpful to spell out your intent in English or some of
> > us have to go back to the documentation to remember what some of
> > the operators do.
> >
> > Your text being searched seems to be an example of items between
> > commas with an optional space after some commas and in one case,
> > nothing between commas.
> >
> > So what is your goal for the example, and in general? You
> > mention a bit unclearly at the end some of what you expect and I
> > think it would be clearer if you also showed exactly the output
> > you would want.
> >
> > I saw some other replies that addressed what you wanted and am
> > going to reply in another direction.
> >
> > Why do things the hard way using things like lookahead or look
> > behind? Would several steps get you the result way more clearly?
> >
> > For the sake of argument, you either want what reading in a CSV
> > file would supply, or something else. Since you are not simply
> > splitting on commas, it sounds like something else. But what
> > exactly else? Something as simple as this on just a comma produces
> > results including empty strings and embedded leading or trailing
> > spaces:
> >
> > strsplit("a bc,def, adef ,,gh", ",")
> > [[1]]
> > [1] "a bc" "def" " adef " "" "gh"
> >
> > That can of course be handled by, for example, trimming the
> > result after unlisting the odd way strsplit returns results:
> >
> > library("stringr")
> > str_squish(unlist(strsplit("a bc,def, adef ,,gh", ",")))
> >
> > [1] "a bc" "def" "adef" "" "gh"
> >
> > Now do you want the empty string to be something else, such as
> > an NA? That can be done too with another step.
> >
> > And a completely different variant can be used to read in your
> > one-line CSV as text using standard overkill tools:
> >
> >> read.table(text="a bc,def, adef ,,gh", sep=",")
> > V1 V2 V3 V4 V5
> > 1 a bc def adef NA gh
> >
> > The above is a vector of texts. But if you simply want to
> > reassemble your initial string cleaned up a bit, you can use paste
> > to put back commas, as in a variation of the earlier example:
> >
> >> paste(str_squish(unlist(strsplit("a bc,def, adef ,,gh", ","))), collapse=",")
> > [1] "a bc,def,adef,,gh"
> >
> > So my question is whether using advanced methods is really
> > necessary for your case, or even particularly efficient. If
> > efficiency matters, it is often better to use tools without
> > regular expressions, such as paste0(), when they meet your needs.
> >
> > Of course, unless I know what you are actually trying to do, my
> > remarks may not be useful.
> >
> >
> >
> > -----Original Message-----
> > From: R-help <r-help-boun...@r-project.org> On Behalf Of Leonard Mada via R-help
> > Sent: Thursday, May 4, 2023 5:00 PM
> > To: R-help Mailing List <r-help@r-project.org>
> > Subject: [R] Regex Split?
> >
> > Dear R-Users,
> >
> > I tried the following 3 Regex expressions in R 4.3:
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
> >
> >
> > I feel that:
> > - none should return (after "def"): ",", "";
> > - the first one could also return "", "," (but probably not; not fully
> >   sure about this);
> >
> >
> > Sincerely,
> >
> >
> > Leonard

	[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.