On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann <guent...@rudersport.de> wrote:
> On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
>> Okay, was tinkering with the code below but the zero-width lookahead is
>> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
>> is bogus (you can run this and see what I mean).
>>
>> What am I doing wrong?
>
> You are using an overly complex and fugly test case. ;)  Seriously, a
> stripped down test string does not require more than about 4 instances
> of plain chars and HTML entities. Much easier on the eye.
>
>
>>  my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
>
> That RE is a single, straight-forward alternation with two alternatives.
>
> The first one translates to a single char in a given, specific range.
> Basically, anything but the ampersand. The second alternative is an
> ampersand, that is not followed by #xDDDD.
>
> The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> width means, it does not consume what it matches. Thus, the second
> alternation ultimately will match a single ampersand only. The /g global
> matching then continues where it left of after the last matching
> attempt. In the case of that ampersand followed by #xDDDD, that still is
> right after the ampersand.
>
>  line: Th&#x0435; R
>  matches: T,h,#,x,0,4,3,5,;, ,R

Okay, so what I was trying to do is skip any ampersand followed by
#xDDDD; as part of the matched text (but include ampersands not followed
by #xDDDD; as part of the match).

So that if I had the text:

  This that & thos&#x0435;.

The first @match would be counted as $chars:

  T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.

and the 2nd @match would be:

  &#x0435;

counting as $uchars.

So in the first case, the &#x0435; would be skipped over as part of the
capture.

What’s the opposite of a zero-width lookahead? I.e. a match that advances
the cursor but doesn’t copy the matching text into the capture buffer?

>
> The offending ampersand part of the HTML entity encoding correctly is
> not matched. The following chars do match the "anything but an
> ampersand" first alternative.
>
>
> I am unsure what you are trying to achieve. If you want to compare the
> number of HTML entities with the number of regular chars, wouldn't it be
> easier to simply drop them flat?
>
>  $data =~ s/&#x[0-9A-F]{4};//g;
>
> Or just plain match and count?
>
>  @matches = $data =~ /&#x[0-9A-F]{4};/g;
>
>
> -- 
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
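
One way to get the "advance the cursor without capturing" behaviour might be to make the whole entity its own alternative, tried first, so that the regex consumes it as a unit while only the plain-character alternative fills a capture group. What follows is a minimal sketch, not a tested fix from the thread. It assumes entities always look like &#xDDDD; with uppercase hex digits, count_chars is a hypothetical helper name, and the character class mirrors the original range except that it now includes the ampersand:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count plain characters and
# &#xDDDD; entities in a single pass over the string.
sub count_chars {
    my ($line) = @_;
    my ($chars, $uchars) = (0, 0);

    # Alternation order matters: the full entity is tried first and, if it
    # matches, is consumed as one unit. Otherwise a single 7-bit character
    # (now including a bare ampersand) is captured in $1.
    while ($line =~ /&#x[0-9A-F]{4};|([\001-\177])/g) {
        if (defined $1) {
            $chars++;     # plain character, captured in $1
        }
        else {
            $uchars++;    # entity consumed, nothing captured
        }
    }
    return ($chars, $uchars);
}

my ($chars, $uchars) = count_chars('This that & thos&#x0435;.');
print "chars=$chars uchars=$uchars\n";
```

Because the entity alternative has no capture group of its own, the match position moves past it while $1 stays undefined, which is what tells the two cases apart inside the loop.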