Sean

I have realized that the method I proposed can be expressed entirely within the 
Zinc system, which may make it a bit neater and easier to follow. There 
probably is no completely general solution, but there is a completely general 
way of finding a solution for your problem domain.

It is important to realize that String>>urlEncoded is defined as:
        ZnPercentEncoder new encode: self.
ZnPercentEncoder does not attempt to parse the input string as a url. It scans 
the entire string, and percent encodes any character that is not in its safe 
set (see the comment in ZnPercentEncoder>>encode:). Sven has made the default a 
minimal safe set, which does not include the slash, but there is a setter 
method (ZnPercentEncoder>>safeSet:) to redefine the safe set.
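
The difference is easy to see in a playground (the results in comments are 
what I should expect with the current Zinc defaults):

        'a/b' urlEncoded.
                "=> 'a%2Fb' - the slash is not in the default safe set"
        ZnPercentEncoder new
                safeSet: '/', ZnPercentEncoder rfc3986UnreservedCharacters;
                encode: 'a/b'.
                "=> 'a/b' - the slash is now left alone"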

So the general way to find a solution for your domain is to collect a 
representative set of the url strings, apply String>>urlEncoded to each, and 
work out which characters have been percent encoded wrongly for your domain. 
For most urls this is likely to include ':/?#', as well as '%' if the input 
includes things already percent encoded, but there may be others specific to 
your domain. Now construct an instance of ZnPercentEncoder with the safe set 
extended to include these characters - note that the default safe set is given 
by the class-side method ZnPercentEncoder class>>rfc3986UnreservedCharacters. Apply 
this instance to encode all your test incoming url strings and verify that they 
work. Iterate, extending the safe set, until everything passes.
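
As a sketch of the first step, assuming 'urls' holds your representative 
collection of incoming url strings (a name I have invented here), one could 
collect every character that urlEncoded alters and then judge which of them 
should have been left alone:

        | changed |
        changed := Set new.
        urls do: [ :url |
                url do: [ :ch |
                        ch asString urlEncoded = ch asString
                                ifFalse: [ changed add: ch ] ] ].
        changed  "inspect, and decide which of these belong in the safe set"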

If you want to keep the neatness of being able to write something like 
'incomingString urlEncoded asZnUrl', you can add a method to String; for the 
case of the common url characters mentioned above:

String>>urlEncodedMyWay
        "As urlEncoded, but with the safe set extended to include characters 
        commonly found in a url"

        ^ ZnPercentEncoder new
                safeSet: ':/?#%', ZnPercentEncoder rfc3986UnreservedCharacters;
                encode: self

This works in much the same way as the snippet I posted originally, because my 
code simply reproduces the essentials of ZnPercentEncoder>>encode:.
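
With that method in place, the example url from earlier in this thread should 
go through in one line:

        'https://fr.wiktionary.org/wiki/péripétie' urlEncodedMyWay asZnUrl
                "=> a ZnUrl for https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie"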

I seem to be trying to monopolize this thread, so I shall shut up now.

HTH

Peter Kenny

-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
Sent: 24 March 2019 15:36
To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics

Well it didn't take long to find a potential problem in what I wrote, at least 
as a general solution. If the input string contains something which has already 
been percent encoded, it will re-encode the percent signs. In this case, 
decoding will recover the once-encoded version, but we need to decode twice to 
recover the original text. Any web site receiving this version will almost 
certainly decode once only, and so will not see the right details.
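
The double-encoding problem shows up directly in a playground (results as I 
expect them from the default encoder):

        'é' urlEncoded.
                "=> '%C3%A9'"
        '%C3%A9' urlEncoded.
                "=> '%25C3%25A9' - the percent signs are re-encoded"
        '%25C3%25A9' urlDecoded.
                "=> '%C3%A9' - one decode does not recover 'é'"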

The solution is simple - just include the percent sign in the list of excluded 
characters in the third line, so it becomes:
        url asString do: [ :ch|(':/?%' includes: ch )

-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
Sent: 24 March 2019 12:11
To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics

Sean, Sven

Thinking about this, I have found a simple (maybe too simple) way round it. The 
obvious first approach is to apply 'urlEncoded' to the received url string, but 
this fails because it also encodes the slashes and other segment dividers. A 
simple-minded approach is to scan the received string, copy the slashes and 
other segment dividers unchanged, and percent encode everything else. I cobbled 
together the following in a playground, but it could easily be turned into a 
method on String.

urlEncodedSegments := [ :url | | outStream |
        outStream := String new writeStream.
        url asString do: [ :ch |
                (':/?' includes: ch)
                        ifTrue: [ outStream nextPut: ch ]
                        ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
        outStream contents ].

urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
=> https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie

This may fail if a slash can occur in a url other than as a segment divider. I 
am not sure if this is possible - could there be some sort of escaped slash 
within a segment? Anyway, if the received url strings are well-behaved, apart 
from the diacritics, this approach could be used as a hack for Sean's problem.

HTH

Peter Kenny

Note to Sven: The comment to String>>urlEncoded says: ' This is an encoding 
where characters that are illegal in a URL are escaped.' Slashes are escaped 
but are quite legal. Should the comment be changed, or the method?



-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
Caekenberghe
Sent: 23 March 2019 20:03
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics



> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
> 
> Peter Kenny wrote
>> And when I inspect the result, it is the address of a non-existent 
>> file in my image directory.
> 
> Ah, no. I see the same result. By "worked" I meant that it created a 
> URL that safari accepted, but I see now it's not the same as correctly 
> parsing it.
> 
> 
> Peter Kenny wrote
>> Incidentally, I tried the other trick Sven cites in the same thread. 
>> The same url as above can be written:
>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
> 
> Yes, this works if you are assembling the URL, but several people 
> presented the use case of processing URLs from elsewhere, leaving one 
> in a chicken-and-egg situation where one can't parse due to the 
> diacritics and can't escape the diacritics (i.e. without incorrectly 
> escaping other things) without parsing :/

Yes, that is pretty close to a catch 22. Strictly speaking, such URLs are 
incorrect and can't be parsed.

I do understand that sometimes these URLs occur in the wild, but again, 
strictly speaking they are in error.

The fact that browser search boxes accept them is a service on top of the 
strict URL syntax, I am not 100% sure how they do it, but it probably involves 
a lot of heuristics and trial and error.

The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing 
somebody from making a new ZnLoseUrlParser, but it won't be easy.

> -----
> Cheers,
> Sean
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
> 




