Sven

That would certainly work, and it represents the most liberal possible 
approach. An equivalent, staying entirely within Zinc, would be to use a 
special-purpose instance of ZnPercentEncoder whose safe set is defined as all 
characters between code points 33 and 126 inclusive. (Starting at 33, rather 
than at 32, also covers your point about the space character.)
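
A minimal sketch of such an encoder (the safe-set construction here is my own 
illustration; safeSet: is the setter mentioned in my earlier message below):

	| encoder safeSet |
	"Treat all printable ASCII characters except the space as safe."
	safeSet := String withAll: ((33 to: 126) collect: [ :cp | Character value: cp ]).
	encoder := ZnPercentEncoder new safeSet: safeSet; yourself.
	encoder encode: 'https://fr.wiktionary.org/wiki/péripétie'.
	"=> 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'"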

Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up 
French and German words in Wiktionary all the time, and I am building a Pharo 
app to do it for me. The version of the URL with the accented characters will 
not work in Zinc until I have urlEncoded it, but it works perfectly well in a 
browser and is much easier to read.

Peter Kenny


-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
Caekenberghe
Sent: 26 March 2019 12:26
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics

I would use a variant of your original transformation.

The issue (the error in the URL) is that all kinds of non-ASCII characters 
occur unencoded. We should/could assume that other special/reserved ASCII 
characters _are_ properly encoded (so we do not need to handle them).

So I would literally patch/fix the problem, like this:

| bogusUrl fixedUrl |
bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
fixedUrl := String streamContents: [ :out |
	bogusUrl do: [ :each |
		"Pass plain ASCII (except the space) through; percent encode everything else."
		(each codePoint < 127 and: [ each ~= Character space ])
			ifTrue: [ out nextPut: each ]
			ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
fixedUrl asUrl retrieveContents.
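
For the example above this yields, assuming the default UTF-8 based percent 
encoding:

https://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika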

I made an extra case for the space character; it works either way in the 
example given, but a space cannot occur freely in a URL.

> On 26 Mar 2019, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Sean
> 
> I have realized that the method I proposed can be expressed entirely within 
> the Zinc system, which may make it a bit neater and easier to follow. There 
> probably is no completely general solution, but there is a completely general 
> way of finding a solution for your problem domain.
> 
> It is important to realize that String>>urlEncoded is defined as:
>       ZnPercentEncoder new encode: self.
> ZnPercentEncoder does not attempt to parse the input string as a URL. It 
> scans the entire string and percent encodes any character that is not in its 
> safe set (see the comment in ZnPercentEncoder>>encode:). Sven has provided a 
> minimal safe set as the default, which does not include the slash, but there 
> is a setter method (safeSet:) for redefining it.
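> 
> For instance, with the default safe set (a minimal illustration; the result 
> assumes the default UTF-8 based percent encoding):
> 
>       'wiki/péripétie' urlEncoded.
>       "=> 'wiki%2Fp%C3%A9rip%C3%A9tie'"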
> 
> So the general way to find a solution for your domain is to collect a 
> representative set of the URL strings, apply String>>urlEncoded to each, and 
> work out which characters have been percent encoded wrongly for your domain. 
> For almost any URL this is likely to include ':/?#', as well as '%' if the 
> input includes things already percent encoded, but there may be others 
> specific to your domain. Now construct an instance of ZnPercentEncoder with 
> the safe set extended to include these characters; note that the default 
> safe set is given by the class method 
> ZnPercentEncoder class>>rfc3986UnreservedCharacters. Apply this instance to 
> encode all your incoming test URL strings and verify that they work. 
> Iterate, extending the safe set, until everything passes.
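> 
> As a sketch of one round of that process (sampleUrls here is a hypothetical 
> test collection, and the safe set shown is just a plausible first guess):
> 
>       | encoder sampleUrls |
>       sampleUrls := #('https://fr.wiktionary.org/wiki/péripétie').
>       encoder := ZnPercentEncoder new
>               safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters;
>               yourself.
>       "Each encoded string should parse cleanly; if a test fails, extend the safe set and repeat."
>       sampleUrls collect: [ :each | (encoder encode: each) asUrl ].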
> 
> If you want to keep the neatness of being able to write something like 
> 'incomingString urlEncoded asZnUrl', you can add a method to String; for the 
> case of the common URL characters mentioned above:
> 
> String>>urlEncodedMyWay
>       "As urlEncoded, but with the safe set extended to include characters 
>       commonly found in a URL"
> 
>       ^ ZnPercentEncoder new
>               safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters;
>               encode: self
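> 
> The incoming string can then be handled in a single expression, e.g. (a 
> usage sketch):
> 
>       'https://fr.wiktionary.org/wiki/péripétie' urlEncodedMyWay asZnUrl.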
> 
> This works in much the same way as the snippet I posted originally, because 
> my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
> 
> I seem to be trying to monopolize this thread, so I shall shut up now.
> 
> HTH
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
> PBKResearch
> Sent: 24 March 2019 15:36
> To: 'Any question about pharo is welcome' 
> <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> Well, it didn't take long to find a potential problem in what I wrote, at 
> least as a general solution. If the input string contains something which has 
> already been percent encoded, it will re-encode the percent signs. In that 
> case decoding will recover the once-encoded version, but we need to decode 
> twice to recover the original text. Any web site receiving this version will 
> almost certainly decode once only, and so will not see the right details.
> 
> The solution is simple: just include the percent sign in the list of 
> excluded characters in the third line, so it becomes:
>       url asString do: [ :ch | (':/?%' includes: ch)
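> 
> To see the problem and the fix at work (the input here is a hypothetical 
> already-encoded URL):
> 
>       urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'.
>       "with ':/?'  => https://fr.wiktionary.org/wiki/p%25C3%25A9rip%25C3%25A9tie"
>       "with ':/?%' => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie (unchanged)"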
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
> PBKResearch
> Sent: 24 March 2019 12:11
> To: 'Any question about pharo is welcome' 
> <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> Sean, Sven
> 
> Thinking about this, I have found a simple (maybe too simple) way round it. 
> The obvious first approach is to apply 'urlEncoded' to the received URL 
> string, but this fails because it also encodes the slashes and other segment 
> dividers. A simple-minded approach is to scan the received string, copy the 
> slashes and other segment dividers unchanged, and percent encode everything 
> else. I cobbled together the following in a playground, but it could easily 
> be turned into a method in class String.
> 
> urlEncodedSegments := [ :url | | outStream |
>       outStream := String new writeStream.
>       url asString do: [ :ch |
>               (':/?' includes: ch)
>                       ifTrue: [ outStream nextPut: ch ]
>                       ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>       outStream contents ].
> 
> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
> => https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie
> 
> This may fail if a slash can occur in a URL other than as a segment divider. 
> I am not sure whether this is possible; could there be some sort of escaped 
> slash within a segment? Anyway, if the received URL strings are well behaved, 
> apart from the diacritics, this approach could be used as a hack for Sean's 
> problem.
> 
> HTH
> 
> Peter Kenny
> 
> Note to Sven: The comment to String>>urlEncoded says: ' This is an encoding 
> where characters that are illegal in a URL are escaped.' Slashes are escaped 
> but are quite legal. Should the comment be changed, or the method?
> 
> 
> 
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of 
> Sven Van Caekenberghe
> Sent: 23 March 2019 20:03
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
> 
> 
> 
>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
>> 
>> Peter Kenny wrote
>>> And when I inspect the result, it is the address of a non-existent 
>>> file in my image directory.
>> 
>> Ah, no. I see the same result. By "worked" I meant that it created a 
>> URL that Safari accepted, but I see now that it's not the same as 
>> correctly parsing it.
>> 
>> 
>> Peter Kenny wrote
>>> Incidentally, I tried the other trick Sven cites in the same thread. 
>>> The same URL as above can be written:
>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>> 
>> Yes, this works if you are assembling the URL, but several people 
>> presented the use case of processing URLs from elsewhere, leaving one 
>> in a chicken-and-egg situation where one can't parse due to the 
>> diacritics and can't escape the diacritics (i.e. without incorrectly 
>> escaping other things) without parsing :/
> 
> Yes, that is pretty close to a catch-22. Strictly speaking, such URLs are 
> incorrect and can't be parsed.
> 
> I do understand that sometimes these URLs occur in the wild, but again, 
> strictly speaking they are in error.
> 
> The fact that browser search boxes accept them is a service on top of the 
> strict URL syntax. I am not 100% sure how they do it, but it probably 
> involves a lot of heuristics and trial and error.
> 
> The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing 
> somebody from making a new ZnLooseUrlParser, but it won't be easy.
> 
>> -----
>> Cheers,
>> Sean
>> --
>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
>> 