Controlling the match pointer in R is going to be very different from Perl, since the R functions are vectorized rather than focused on a single string.
Here is one approach that will give all the matches and lengths (for the original problem at least):

> mystr <- paste(rep("1122", 10), collapse="")
> n <- nchar(mystr)
>
> mystr2 <- substr(rep(mystr,n), 1:n, n)
>
> tmp <- regexpr("^11221122", mystr2)
> (tmp + 1:n - 1)[tmp>0]
[1]  1  5  9 13 17 21 25 29 33
> attr(tmp,"match.length")[tmp>0]
[1] 8 8 8 8 8 8 8 8 8

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:waclaw.marcin.kusnierc...@idi.ntnu.no]
> Sent: Saturday, December 13, 2008 5:27 PM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Wacek,
> >
> > I am curious as to why Brian and I (and possibly other responders)
> > are held to a higher standard than the original poster.
> >
>
> (we have just had an offline communication, it should be fine to keep it
> that way)
>
> > My first question was a real question. There are 2 main ways to do
> > regular expression matching (possibly others as well), you describe one
> > method with the pointer moving through the string to be matched, the
> > other way moves the pointer through the pattern looking for all
> > possible matches in the string that match so far (one method is DFA the
> > other NFA, but I don't remember which is which and my copy of Friedl is
> > at work and I'm not). Perl and PCRE use the method that you describe,
> > but the other method may be more efficient for finding all overlapping
> > matches in some cases, but as far as I know (and there are plenty of
> > things I don't know, including all the changes/new programs since I
> > last read about this) the only programs that use the other type of
> > matching don't return the actual matches, just a yes/no on at least one
> > match. So if the original poster has formed his opinion based on
> > another program that uses that type of matching, I would be interested
> > to know about it.
> > It would teach me something and also make it easier
> > for all of us to better help the original poster in the future if we
> > understand where he is coming from.
>
> that's right, there are the dfa and the nfa approaches (and more), and
> the latter is the regex-driven approach of perl and many other
> implementations. they work differently, and also differ in what you can
> do with them.
>
> i was talking about updating the string pointer to the position where
> the next match in a global (iterative) matching process can start.
> while the nfa approach moves, in general, back and forth through the
> string, and the dfa approach steps through the string linearly keeping a
> list of possible matches, the end result is the same: if there is a
> match, the next match will not start earlier than after the end of the
> current match. what a dfa and an nfa engine will actually match given a
> text and a pattern may differ:
>
>     perl -e 'print "ab" =~ /a|ab/g'
>     # a
>
>     echo ab | egrep -o 'a|ab'
>     # ab
>
> but none of them will report overlapping matches:
>
>     perl -e 'print "aaaaa" =~ /aa/g'
>     # 2 matches
>
>     echo aaaaa | egrep -o aa
>     # 2 matches
>
> (afaik, egrep uses a posix-compliant dfa engine)
>
> to achieve the effect of overlapping matches, you either need to
> manually move the pointer while the global match proceeds, or
> sequentially perform single matches on successive substrings of the
> input string (which can give you the same match more than once,
> though). it appears that my earlier suggestion was flawed, the
> following is a bit cleaner:
>
>     $string = # some string
>     $pattern = # some pattern
>
>     @matches = ();
>     while ($string =~ /$pattern/g) {
>         push @matches, [$-[0], $&];
>         pos($string) -= (length($&) - 1) }
>
> after each successful match, it moves to the position right after the
> start of the successful match.
>
> the following will capture all possible matches for all alternatives in
> the regex (or so it seems):
>
>     $string =~ /(?:$pattern)(??{ push @matches, [$-[0], $&] })/g
>
> so that "aabb" =~ /(?:a|abb?)(??{ push @matches, [$-[0], $&] })/g will
> give 4 matches.
>
> again, not sure if this can be done within r.
>
> vQ
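
For reference, the substring approach at the top of this message can be wrapped in a small function. This is only a sketch under some assumptions: the name overlap_matches is made up for illustration (it is not a base R function), it handles a single input string, and it assumes the pattern is valid with perl = TRUE. Each suffix of the string is matched against the pattern anchored at its start, so every starting position is tried once.

```r
# Hypothetical helper (not part of base R): report every match of `pattern`
# in a single string `x`, including matches that overlap.
overlap_matches <- function(pattern, x) {
  n <- nchar(x)
  if (n == 0L) {
    return(data.frame(start = integer(0), length = integer(0)))
  }
  starts <- seq_len(n)
  # All suffixes of x: the i-th element begins at character i.
  suffixes <- substring(x, starts, n)
  # Anchor the whole pattern so a suffix can only match at its first
  # character; the (?: ) group keeps "^" applied to every alternative.
  m <- regexpr(paste0("^(?:", pattern, ")"), suffixes, perl = TRUE)
  hit <- as.vector(m) > 0
  data.frame(start = starts[hit], length = attr(m, "match.length")[hit])
}

mystr <- paste(rep("1122", 10), collapse = "")
res <- overlap_matches("11221122", mystr)
res$start    # 1 5 9 13 17 21 25 29 33
res$length   # 8 for every match

aa <- overlap_matches("aa", "aaaaa")
aa$start     # 1 2 3 4
```

Note that this reports at most one match per starting position (whichever one the regex engine picks there), so it does not enumerate every alternative the way the (??{ }) construct quoted above is meant to.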