Sean, Sven

Thinking about this, I have found a simple (maybe too simple) way round it. The
obvious first approach is to send #urlEncoded to the received URL string, but
this fails because it also encodes the slashes and the other segment dividers. A
simple-minded alternative is to scan the received string, copying the slashes and
other segment dividers unchanged and percent-encoding everything else. I cobbled
together the following in a playground, but it could easily be turned into an
extension method on String (see the sketch after the example below).

urlEncodedSegments := [ :url | | outStream |
        outStream := String new writeStream.
        url asString do: [ :ch |
                (':/?' includes: ch)
                        ifTrue: [ outStream nextPut: ch ]
                        ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
        outStream contents ].

urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
=> https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie
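And here is roughly how it might look as an extension method on String. The
selector is just my own choice (no such method exists in Pharo or Zinc), and I
have only tried the block version above:

String >> urlEncodedSegments
        "Answer a copy of the receiver with everything percent-encoded
        except the segment dividers :, / and ?."
        ^ String streamContents: [ :outStream |
                self do: [ :ch |
                        (':/?' includes: ch)
                                ifTrue: [ outStream nextPut: ch ]
                                ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ] ]

With that in place, something like
'https://fr.wiktionary.org/wiki/péripétie' urlEncodedSegments asUrl
ought to give Sean a parsed ZnUrl, though I have not actually tried it that way.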

This may fail if a slash can occur in a URL other than as a segment divider. I
am not sure whether this is possible - could there be some sort of escaped slash
within a segment? Anyway, if the received URL strings are well behaved apart
from the diacritics, this approach could serve as a hack for Sean's problem.

HTH

Peter Kenny

Note to Sven: the comment of String>>urlEncoded says: 'This is an encoding
where characters that are illegal in a URL are escaped.' Yet slashes are escaped
even though they are quite legal. Should the comment be changed, or the method?
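To make the point concrete, the whole-string encoding also escapes the dividers,
so the slashes come out as %2F as well as the é being encoded:

        'https://fr.wiktionary.org/wiki/péripétie' urlEncoded.
        "escapes the slashes (and more) as well as the accented characters"

If the method is to stay as it is, an alternative to my block might be to use
ZnPercentEncoder directly - as far as I can see that is the class doing the work
behind #urlEncoded - with an enlarged safe set. I am writing the selectors from
memory and have not tested this, so treat it as a sketch only:

        ZnPercentEncoder new
                safeSet: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?';
                encode: 'https://fr.wiktionary.org/wiki/péripétie'.
        "should answer 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie' if the safe set is respected"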



-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van 
Caekenberghe
Sent: 23 March 2019 20:03
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics



> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
> 
> Peter Kenny wrote
>> And when I inspect the result, it is the address of a non-existent 
>> file in my image directory.
> 
> Ah, no. I see the same result. By "worked" I meant that it created a 
> URL that safari accepted, but I see now it's not the same as correctly 
> parsing it.
> 
> 
> Peter Kenny wrote
>> Incidentally, I tried the other trick Sven cites in the same thread. 
>> The same url as above can be written:
>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
> 
> Yes, this works if you are assembling the URL, but several people 
> presented the use case of processing URLs from elsewhere, leaving one 
> in a chicken-and-egg situation where one can't parse due to the 
> diacritics and can't escape the diacritics (i.e. without incorrectly 
> escaping other things) without parsing :/

Yes, that is pretty close to a catch 22. Strictly speaking, such URLs are 
incorrect and can't be parsed.

I do understand that sometimes these URLs occur in the wild, but again, 
strictly speaking they are in error.

The fact that browser search boxes accept them is a service on top of the 
strict URL syntax, I am not 100% sure how they do it, but it probably involves 
a lot of heuristics and trial and error.

The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing 
somebody from making a new ZnLoseUrlParser, but it won't be easy.

> -----
> Cheers,
> Sean
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
> 


