I would use a variant of your original transformation.

The issue (the error in the URL) is that all kinds of non-ASCII characters 
occur unencoded. We can reasonably assume that other special/reserved ASCII 
characters _are_ properly encoded (so we do not need to handle them).

So I would literally patch/fix the problem, like this:

| bogusUrl fixedUrl |
bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
fixedUrl := String streamContents: [ :out |
        bogusUrl do: [ :each |
                (each codePoint < 127 and: [ each ~= $ ])
                        ifTrue: [ out nextPut: each ]
                        ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
fixedUrl asUrl retrieveContents.

I made an extra case for the space character; it works either way in the 
example given, but a space cannot occur unencoded in a URL.
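For the record, after running the snippet above the fixed URL should have 
the space as %20 and the diacritics as their UTF-8 percent escapes, 
something like:

fixedUrl.
        "=> 'https://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika'"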

> On 26 Mar 2019, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Sean
> 
> I have realized that the method I proposed can be expressed entirely within 
> the Zinc system, which may make it a bit neater and easier to follow. There 
> probably is no completely general solution, but there is a completely general 
> way of finding a solution for your problem domain.
> 
> It is important to realize that String>>urlEncoded is defined as:
>       ZnPercentEncoder new encode: self.
> ZnPercentEncoder does not attempt to parse the input string as a url. It 
> scans the entire string, and percent encodes any character that is not in its 
> safe set (see the comment to ZnPercentEncoder>>encode:). Sven has given as 
> default a minimum safe set, which does not include slash, but there is a 
> setter method to redefine the safe set.
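> To illustrate the problem with the default safe set, applying urlEncoded to a 
> whole url also encodes the separators (a quick sketch, untested here):
> 
>       'https://fr.wiktionary.org/wiki/péripétie' urlEncoded.
>       "=> 'https%3A%2F%2Ffr.wiktionary.org%2Fwiki%2Fp%C3%A9rip%C3%A9tie'"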
> 
> So the general way to find a solution for your domain is to collect a 
> representative set of the url strings, apply String>>urlEncoded to each, and 
> work out which characters have been percent encoded wrongly for your domain. 
> For typical url cases this is likely to include ':/?#', as well as '%' if the 
> input includes things already percent encoded, but there may be others specific 
> to your domain. Now construct an instance of ZnPercentEncoder with the safe set 
> extended to include these characters - note that the default safe set is 
> given by the class method ZnPercentEncoder class>>rfc3986UnreservedCharacters. 
> Apply this instance to encode all your incoming test url strings and verify 
> that they work. Iterate, extending the safe set, until everything passes.
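> A minimal sketch of such an encoder (assuming the common separators listed 
> above are safe for your domain):
> 
>       | encoder |
>       encoder := ZnPercentEncoder new.
>       encoder safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters.
>       encoder encode: 'https://fr.wiktionary.org/wiki/péripétie'.
>       "=> 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'"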
> 
> If you want to keep the neatness of being able to write something like 
> 'incomingString urlEncoded asZnUrl', you can add a method to String; for the 
> case of the common url characters mentioned above:
> 
> String >> urlEncodedMyWay
>       "As urlEncoded, but with the safe set extended to include 
>       characters commonly found in a url"
> 
>       ^ ZnPercentEncoder new
>               safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters;
>               encode: self
> 
> This works in much the same way as the snippet I posted originally, because 
> my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
> 
> I seem to be trying to monopolize this thread, so I shall shut up now.
> 
> HTH
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
> PBKResearch
> Sent: 24 March 2019 15:36
> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> Well it didn't take long to find a potential problem in what I wrote, at 
> least as a general solution. If the input string contains something which has 
> already been percent encoded, it will re-encode the percent signs. In this 
> case, decoding will recover the once-encoded version, but we need to decode 
> twice to recover the original text. Any web site receiving this version will 
> almost certainly decode once only, and so will not see the right details.
> 
> The solution is simple - just include the percent sign in the list of 
> excluded characters in the third line, so it becomes:
>       url asString do: [ :ch | (':/?%' includes: ch)
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
> PBKResearch
> Sent: 24 March 2019 12:11
> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> Sean, Sven
> 
> Thinking about this, I have found a simple (maybe too simple) way round it. 
> The obvious first approach is to apply 'urlEncoded' to the received url 
> string, but this fails because it also encodes the slashes and other segment 
> dividers. A simple-minded approach is to scan the received string, copy the 
> slashes and other segment dividers unchanged and percent encode everything 
> else. I cobbled together the following in a playground, but it could easily 
> be turned into a method in String class.
> 
> urlEncodedSegments := [ :url | | outStream |
>       outStream := String new writeStream.
>       url asString do: [ :ch |
>               (':/?' includes: ch)
>                       ifTrue: [ outStream nextPut: ch ]
>                       ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>       outStream contents ].
> 
> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
> => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie
> 
> This may fail if a slash can occur in a url other than as a segment divider. 
> I am not sure if this is possible - could there be some sort of escaped slash 
> within a segment? Anyway, if the received url strings are well-behaved, apart 
> from the diacritics, this approach could be used as a hack for Sean's problem.
> 
> HTH
> 
> Peter Kenny
> 
> Note to Sven: The comment to String>>urlEncoded says: ' This is an encoding 
> where characters that are illegal in a URL are escaped.' Slashes are escaped 
> but are quite legal. Should the comment be changed, or the method?
> 
> 
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
> Caekenberghe
> Sent: 23 March 2019 20:03
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> 
> 
>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
>> 
>> Peter Kenny wrote
>>> And when I inspect the result, it is the address of a non-existent 
>>> file in my image directory.
>> 
>> Ah, no. I see the same result. By "worked" I meant that it created a 
>> URL that safari accepted, but I see now it's not the same as correctly 
>> parsing it.
>> 
>> 
>> Peter Kenny wrote
>>> Incidentally, I tried the other trick Sven cites in the same thread. 
>>> The same url as above can be written:
>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>> 
>> Yes, this works if you are assembling the URL, but several people 
>> presented the use case of processing URLs from elsewhere, leaving one 
>> in a chicken-and-egg situation where one can't parse due to the 
>> diacritics and can't escape the diacritics (i.e. without incorrectly 
>> escaping other things) without parsing :/
> 
> Yes, that is pretty close to a catch 22. Strictly speaking, such URLs are 
> incorrect and can't be parsed.
> 
> I do understand that sometimes these URLs occur in the wild, but again, 
> strictly speaking they are in error.
> 
> The fact that browser search boxes accept them is a service on top of the 
> strict URL syntax, I am not 100% sure how they do it, but it probably 
> involves a lot of heuristics and trial and error.
> 
> The parser of ZnUrl is just 3 or 4 methods. There is nothing preventing 
> somebody from making a new, more lenient ZnLooseUrlParser, but it won't be easy.
> 
>> -----
>> Cheers,
>> Sean
>> --
>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>> 