Peter,

It *is* a bogus URL; please go and read some RFCs.

A browser's address/search box is an entirely different thing that adds 
convenience features, such as the issue we are discussing here.

Sven

> On 26 Mar 2019, at 16:02, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Sven
> 
> That would certainly work, and represents the most liberal possible approach. 
> An equivalent, keeping entirely within Zinc, would be to use a 
> special-purpose instance of ZnPercentEncoder, in which the safe set is 
> defined as all characters between code points 33 and 126 inclusive. (Starting 
> at 33 fixes your space point.)
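> A minimal sketch of what I mean, using Zinc's #safeSet: setter (the encoder 
> then leaves every printable ASCII character alone and percent encodes the 
> rest, including spaces):
> 
>       | encoder |
>       encoder := ZnPercentEncoder new.
>       encoder safeSet: (String withAll: ((33 to: 126) collect: [ :cp | Character value: cp ])).
>       encoder encode: 'https://en.wikipedia.org/wiki/Česká republika'.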
> 
> Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up 
> French and German words in Wiktionary all the time, and I am building a Pharo 
> app to do it for me. The version of the url with the accented characters will 
> not work in Zinc until I have urlEncoded it, but it works perfectly well in a 
> browser and is much easier to read.
> 
> Peter Kenny
> 
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
> Caekenberghe
> Sent: 26 March 2019 12:26
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> I would use a variant of your original transformation.
> 
> The issue (the error in the URL) is that all kinds of non-ASCII characters 
> occur unencoded. We should/could assume that other special/reserved ASCII 
> characters _are_ properly encoded (so we do not need to handle them).
> 
> So I would literally patch/fix the problem, like this:
> 
> | bogusUrl fixedUrl url |
> bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
> fixedUrl := String streamContents: [ :out |
>       bogusUrl do: [ :each |
>               (each codePoint < 127 and: [ each ~= $ ])
>                       ifTrue: [ out nextPut: each ]
>               ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
> fixedUrl asUrl retrieveContents.
> 
> I made an extra case for the space character; it works either way in the 
> example given, but a space cannot occur freely in a URL.
> 
>> On 26 Mar 2019, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>> 
>> Sean
>> 
>> I have realized that the method I proposed can be expressed entirely within 
>> the Zinc system, which may make it a bit neater and easier to follow. There 
>> probably is no completely general solution, but there is a completely 
>> general way of finding a solution for your problem domain.
>> 
>> It is important to realize that String>>urlEncoded is defined as:
>>      ZnPercentEncoder new encode: self.
>> ZnPercentEncoder does not attempt to parse the input string as a url. It 
>> scans the entire string, and percent encodes any character that is not in 
>> its safe set (see the comment to ZnPercentEncoder>>encode:). Sven has given 
>> as default a minimum safe set, which does not include slash, but there is a 
>> setter method to redefine the safe set.
>> 
>> So the general way to find a solution for your domain is to collect a 
>> representative set of the url strings, apply String>>urlEncoded to each, and 
>> work out which characters have been percent encoded wrongly for your domain. 
>> For any url cases this is likely to include ':/?#', as well as '%' if it 
>> includes things already percent encoded, but there may be others specific to 
>> your domain. Now construct an instance of ZnPercentEncoder with the safe set 
>> extended to include these characters - note that the default safe set is 
>> given by the class method ZnPercentEncoder class>>rfc3986UnreservedCharacters. 
>> Apply this instance to encode all the incoming url strings in your test set 
>> and verify that they work. Iterate, extending the safe set, until everything 
>> passes.
>> 
>> If you want to keep the neatness being able to write something like 
>> 'incomingString urlEncoded asZnUrl', you can add a method to String; for the 
>> case of the common url characters mentioned above:
>> 
>> String >> urlEncodedMyWay
>> 
>>      "As urlEncoded, but with the safe set extended to include characters 
>>      commonly found in a url"
>> 
>>      ^ ZnPercentEncoder new
>>              safeSet: ':/?#%', (ZnPercentEncoder rfc3986UnreservedCharacters);
>>              encode: self
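>> For example, assuming the method above is installed in String:
>> 
>>      'https://fr.wiktionary.org/wiki/péripétie' urlEncodedMyWay asZnUrl retrieveContents.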
>> 
>> This works in much the same way as the snippet I posted originally, because 
>> my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
>> 
>> I seem to be trying to monopolize this thread, so I shall shut up now.
>> 
>> HTH
>> 
>> Peter Kenny
>> 
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
>> PBKResearch
>> Sent: 24 March 2019 15:36
>> To: 'Any question about pharo is welcome' 
>> <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>> 
>> Well it didn't take long to find a potential problem in what I wrote, at 
>> least as a general solution. If the input string contains something which 
>> has already been percent encoded, it will re-encode the percent signs. In 
>> this case, decoding will recover the once-encoded version, but we need to 
>> decode twice to recover the original text. Any web site receiving this 
>> version will almost certainly decode once only, and so will not see the 
>> right details.
>> 
>> The solution is simple - just include the percent sign in the list of 
>> excluded characters in the third line, so it becomes:
>>      url asString do: [ :ch | (':/?%' includes: ch)
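>> For completeness, the whole corrected snippet (the same as my original post 
>> below, with '%' added to the pass-through characters):
>> 
>>      urlEncodedSegments := [ :url | | outStream |
>>              outStream := String new writeStream.
>>              url asString do: [ :ch |
>>                      (':/?%' includes: ch)
>>                              ifTrue: [ outStream nextPut: ch ]
>>                              ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>>              outStream contents ].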
>> 
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
>> PBKResearch
>> Sent: 24 March 2019 12:11
>> To: 'Any question about pharo is welcome' 
>> <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>> 
>> Sean, Sven
>> 
>> Thinking about this, I have found a simple (maybe too simple) way round it. 
>> The obvious first approach is to apply 'urlEncoded' to the received url 
>> string, but this fails because it also encodes the slashes and other segment 
>> dividers. A simple-minded approach is to scan the received string, copy the 
>> slashes and other segment dividers unchanged and percent encode everything 
>> else. I cobbled together the following in a playground, but it could easily 
>> be turned into a method in String class.
>> 
>> urlEncodedSegments := [ :url | | outStream |
>>      outStream := String new writeStream.
>>      url asString do: [ :ch |
>>              (':/?' includes: ch)
>>                      ifTrue: [ outStream nextPut: ch ]
>>                      ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>>      outStream contents ].
>> 
>> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
>> => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie
>> 
>> This may fail if a slash can occur in a url other than as a segment divider. 
>> I am not sure if this is possible - could there be some sort of escaped 
>> slash within a segment? Anyway, if the received url strings are 
>> well-behaved, apart from the diacritics, this approach could be used as a 
>> hack for Sean's problem.
>> 
>> HTH
>> 
>> Peter Kenny
>> 
>> Note to Sven: The comment to String>>urlEncoded says: ' This is an encoding 
>> where characters that are illegal in a URL are escaped.' Slashes are escaped 
>> but are quite legal. Should the comment be changed, or the method?
>> 
>> 
>> 
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
>> Sven Van Caekenberghe
>> Sent: 23 March 2019 20:03
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>> 
>> 
>> 
>>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
>>> 
>>> Peter Kenny wrote
>>>> And when I inspect the result, it is the address of a non-existent 
>>>> file in my image directory.
>>> 
>>> Ah, no. I see the same result. By "worked" I meant that it created a 
>>> URL that safari accepted, but I see now it's not the same as 
>>> correctly parsing it.
>>> 
>>> 
>>> Peter Kenny wrote
>>>> Incidentally, I tried the other trick Sven cites in the same thread. 
>>>> The same url as above can be written:
>>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>>> 
>>> Yes, this works if you are assembling the URL, but several people 
>>> presented the use case of processing URLs from elsewhere, leaving one 
>>> in a chicken-and-egg situation where one can't parse due to the 
>>> diacritics and can't escape the diacritics (i.e. without incorrectly 
>>> escaping other things) without parsing :/
>> 
>> Yes, that is pretty close to a Catch-22. Strictly speaking, such URLs are 
>> incorrect and can't be parsed.
>> 
>> I do understand that sometimes these URLs occur in the wild, but again, 
>> strictly speaking they are in error.
>> 
>> The fact that browser search boxes accept them is a service on top of the 
>> strict URL syntax, I am not 100% sure how they do it, but it probably 
>> involves a lot of heuristics and trial and error.
>> 
>> The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing 
>> somebody from making a new ZnLoseUrlParser, but it won't be easy.
>> 
>>> -----
>>> Cheers,
>>> Sean
>>> --
>>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 