On Jul 2, 2014, at 5:16 PM, Karsten Bräckelmann <guent...@rudersport.de> wrote:

> On Wed, 2014-07-02 at 14:44 -0600, Philip Prindeville wrote:
>> Okay, was tinkering with the code below but the zero-width lookahead is
>> not disqualifying ampersand followed by #x[0-9A-F]{4}; so the output
>> is bogus (you can run this and see what I mean).
>> 
>> What am I doing wrong?
> 
> You are using an overly complex and fugly test case. ;)  Seriously, a
> stripped down test string does not require more than about 4 instances
> of plain chars and HTML entities. Much easier on the eye.
> 
> 
>>    my @matches = m/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/g;
> 
> That RE is a single, straight-forward alternation with two alternatives.
> 
> The first one translates to a single char in a given, specific range.
> Basically, anything but the ampersand. The second alternative is an
> ampersand, that is not followed by #xDDDD.
> 
> The (?!pattern) is a zero-width negative look-ahead assertion. A zero
> width means, it does not consume what it matches. Thus, the second
> alternative ultimately will match a single ampersand only. The /g global
> matching then continues where it left off after the last matching
> attempt. In the case of that ampersand followed by #xDDDD, that still is
> right after the ampersand.
> 
>  line: Th&#x0435; R
>  matches: T,h,#,x,0,4,3,5,;, ,R
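
The behaviour described above can be reproduced with a stripped-down
script. This is only a sketch: the regex is the one from the original
post, while the sub name original_matches is made up for illustration.

```perl
use strict;
use warnings;

# The regex from the original post, unchanged.
my $re = qr/[\001-\045\047-\177]|&(?!#x[0-9A-F]{4};)/;

# Hypothetical helper: return every /g match for one line.
sub original_matches {
    my ($line) = @_;
    return $line =~ /$re/g;
}

# At the "&" both alternatives fail, so the engine simply retries one
# position further on, where "#" matches the character class.
print join(',', original_matches('Th&#x0435; R')), "\n";
# prints: T,h,#,x,0,4,3,5,;, ,R
```

The look-ahead only rejects the ampersand itself; nothing tells the
engine to skip the rest of the entity.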

Okay, so what I was trying to do is skip over any ampersand followed by
#xDDDD; instead of including it in the matched text (while still matching
ampersands that are not followed by #xDDDD;).

So that if I had the text:

This that & thos&#x0065;.

The first @match would be counted as $chars:

T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.

and the 2nd @match would be:

&#x0065;

counting as $uchars.

So in the first case, the &#x0065; would be skipped over instead of being captured.

What’s the opposite of a zero-width lookahead?  I.e. a match that advances the 
cursor but doesn’t copy the matching text into the capture buffer?
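
One way to get "advance the cursor but keep the text out of the plain
matches" is to let the regex consume the whole entity as its own
alternative and sort the pieces afterwards. The following is a sketch
under that assumption, not a tested answer from the thread; the names
split_chars and plain_only are hypothetical. The second sub uses the
(*SKIP)(*FAIL) backtracking verbs, which need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@plain, \@entities) for one string.
sub split_chars {
    my ($data) = @_;
    my (@plain, @entities);
    # The entity alternative comes first, so a whole "&#xDDDD;" is consumed
    # in one step; anything else, including a bare "&", falls through to
    # the single-character alternative.
    while ($data =~ /(&#x[0-9A-F]{4};)|([\001-\177])/g) {
        if (defined $1) { push @entities, $1 }
        else            { push @plain,    $2 }
    }
    return (\@plain, \@entities);
}

# Alternative sketch: match the entity, then discard it and resume after it.
sub plain_only {
    my ($data) = @_;
    return $data =~ /&#x[0-9A-F]{4};(*SKIP)(*FAIL)|[\001-\177]/g;
}

my ($plain, $entities) = split_chars('This that & thos&#x0065;.');
print join(',', @$plain), "\n";     # T,h,i,s, ,t,h,a,t, ,&, ,t,h,o,s,.
print join(',', @$entities), "\n";  # &#x0065;
print scalar(@$plain), " chars, ", scalar(@$entities), " uchars\n";
```

With the example text this gives 17 plain characters and 1 entity.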


> 
> The offending ampersand part of the HTML entity encoding correctly is
> not matched. The following chars do match the "anything but an
> ampersand" first alternative.
> 
> 
> I am unsure what you are trying to achieve. If you want to compare the
> number of HTML entities with the number of regular chars, wouldn't it be
> easier to simply drop them flat?
> 
>  $data =~ s/&#x[0-9A-F]{4};//g;
> 
> Or just plain match and count?
> 
>  @matches = $data =~ /&#x[0-9A-F]{4};/g;
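
Put together, the two suggestions above could yield both counts along
these lines. A minimal sketch: the sub name count_chars is made up, and
the sample string is the one used elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (plain-char count, entity count).
sub count_chars {
    my ($data) = @_;

    # Count entities: assigning the match to an empty list forces list
    # context, and the outer scalar assignment yields the number of matches.
    my $uchars = () = $data =~ /&#x[0-9A-F]{4};/g;

    # Drop the entities from a copy; whatever is left is plain characters.
    (my $plain = $data) =~ s/&#x[0-9A-F]{4};//g;
    my $chars = length $plain;

    return ($chars, $uchars);
}

my ($chars, $uchars) = count_chars('This that & thos&#x0065;.');
print "chars=$chars uchars=$uchars\n";   # chars=17 uchars=1
```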
> 
> 
> -- 
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
> 
