I would use a variant of your original transformation. The issue (the error in the URL) is that all kinds of non-ASCII characters occur unencoded. We should/could assume that other special/reserved ASCII characters _are_ properly encoded (so we do not need to handle them).
So I would literally patch/fix the problem, like this:

| bogusUrl fixedUrl |
bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
fixedUrl := String streamContents: [ :out |
    bogusUrl do: [ :each |
        (each codePoint < 127 and: [ each ~= $ ])
            ifTrue: [ out nextPut: each ]
            ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
fixedUrl asUrl retrieveContents.

I made an extra case for the space character; it works either way in the example given, but a space cannot occur freely in a URL.

> On 26 Mar 2019, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Sean
>
> I have realized that the method I proposed can be expressed entirely within
> the Zinc system, which may make it a bit neater and easier to follow. There
> probably is no completely general solution, but there is a completely general
> way of finding a solution for your problem domain.
>
> It is important to realize that String>>urlEncoded is defined as:
>
>     ZnPercentEncoder new encode: self
>
> ZnPercentEncoder does not attempt to parse the input string as a url. It
> scans the entire string and percent encodes any character that is not in its
> safe set (see the comment in ZnPercentEncoder>>encode:). Sven has given a
> minimal safe set as the default, which does not include slash, but there is
> a setter method to redefine the safe set.
>
> So the general way to find a solution for your domain is: collect a
> representative set of the url strings, apply String>>urlEncoded to each, and
> work out which characters have been percent encoded wrongly for your domain.
> For most urls this is likely to include ':/?#', as well as '%' if the input
> includes things already percent encoded, but there may be others specific to
> your domain. Now construct an instance of ZnPercentEncoder with the safe set
> extended to include these characters - note that the default safe set is
> given by the class method ZnPercentEncoder class>>rfc3986UnreservedCharacters.
> Apply this instance to encode all your test incoming url strings and verify
> that they work. Iterate, extending the safe set, until everything passes.
>
> If you want to keep the neatness of being able to write something like
> 'incomingString urlEncoded asZnUrl', you can add a method to String; for the
> case of the common url characters mentioned above:
>
> String>>urlEncodedMyWay
>     "As urlEncoded, but with the safe set extended to include characters
>     commonly found in a url"
>     ^ ZnPercentEncoder new
>         safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters;
>         encode: self
>
> This works in much the same way as the snippet I posted originally, because
> my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
>
> I seem to be trying to monopolize this thread, so I shall shut up now.
>
> HTH
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
> Sent: 24 March 2019 15:36
> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
> Well, it didn't take long to find a potential problem in what I wrote, at
> least as a general solution. If the input string contains something which
> has already been percent encoded, it will re-encode the percent signs. In
> this case, decoding will recover the once-encoded version, but we need to
> decode twice to recover the original text. Any web site receiving this
> version will almost certainly decode once only, and so will not see the
> right details.
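> To make that concrete, a quick playground illustration (this assumes
> Zinc's String>>urlEncoded and String>>urlDecoded extensions; the results
> are what I would expect, I have not verified every one):
>
>     '%C3%A9' urlEncoded.      "=> '%25C3%25A9' - the percent sign gets re-encoded"
>     '%25C3%25A9' urlDecoded.  "=> '%C3%A9' - one decode recovers only the once-encoded form"
>     '%C3%A9' urlDecoded.      "=> 'é' - a second decode is needed to get the original text"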
> The solution is simple - just include the percent sign in the list of
> excluded characters in the third line, so it becomes:
>
>     url asString do: [ :ch | (':/?%' includes: ch)
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
> Sent: 24 March 2019 12:11
> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
> Sean, Sven
>
> Thinking about this, I have found a simple (maybe too simple) way round it.
> The obvious first approach is to apply urlEncoded to the received url
> string, but this fails because it also encodes the slashes and other
> segment dividers. A simple-minded approach is to scan the received string,
> copy the slashes and other segment dividers unchanged, and percent encode
> everything else. I cobbled together the following in a playground, but it
> could easily be turned into a method on String.
>
> urlEncodedSegments := [ :url | | outStream |
>     outStream := String new writeStream.
>     url asString do: [ :ch | (':/?' includes: ch)
>         ifTrue: [ outStream nextPut: ch ]
>         ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>     outStream contents ].
>
> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
> => 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'
>
> This may fail if a slash can occur in a url other than as a segment
> divider. I am not sure if this is possible - could there be some sort of
> escaped slash within a segment? Anyway, if the received url strings are
> well behaved apart from the diacritics, this approach could be used as a
> hack for Sean's problem.
>
> HTH
>
> Peter Kenny
>
> Note to Sven: The comment of String>>urlEncoded says: 'This is an encoding
> where characters that are illegal in a URL are escaped.' Slashes are
> escaped but are quite legal. Should the comment be changed, or the method?
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van Caekenberghe
> Sent: 23 March 2019 20:03
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
>>
>> Peter Kenny wrote
>>> And when I inspect the result, it is the address of a non-existent
>>> file in my image directory.
>>
>> Ah, no. I see the same result. By "worked" I meant that it created a
>> URL that Safari accepted, but I see now it's not the same as correctly
>> parsing it.
>>
>> Peter Kenny wrote
>>> Incidentally, I tried the other trick Sven cites in the same thread.
>>> The same url as above can be written:
>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>>
>> Yes, this works if you are assembling the URL, but several people
>> presented the use case of processing URLs from elsewhere, leaving one
>> in a chicken-and-egg situation where one can't parse due to the
>> diacritics and can't escape the diacritics (i.e. without incorrectly
>> escaping other things) without parsing :/
>
> Yes, that is pretty close to a catch-22. Strictly speaking, such URLs are
> incorrect and can't be parsed.
>
> I do understand that sometimes these URLs occur in the wild, but again,
> strictly speaking they are in error.
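> (Assembling does avoid the problem: if I remember the API correctly,
> ZnUrl stores path segments in decoded form and percent encodes them on
> output, so the construction Sean quotes should retrieve fine:
>
>     ('https://fr.wiktionary.org/wiki' asUrl / 'péripétie') retrieveContents.
>
> It is only the already-flattened string form that has no strict parse.)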
> The fact that browser search boxes accept them is a service on top of the
> strict URL syntax. I am not 100% sure how they do it, but it probably
> involves a lot of heuristics and trial and error.
>
> The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing
> somebody from making a new ZnLoseUrlParser, but it won't be easy.
>
>> -----
>> Cheers,
>> Sean
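For the problem at hand, though, a full lenient parser is not needed: the patch at the top of this mail can be packaged as a String extension and composed with the strict parser. A sketch only - #urlEncodedLoosely is a made-up selector, and the code is untested:

String>>urlEncodedLoosely
    "Percent encode every non-ASCII character and the space, passing
    everything else through unchanged, on the assumption that reserved
    ASCII characters are already properly encoded."
    ^ String streamContents: [ :out |
        self do: [ :each |
            (each codePoint < 127 and: [ each ~= Character space ])
                ifTrue: [ out nextPut: each ]
                ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ]

With that in place, the example at the top becomes:

    'https://en.wikipedia.org/wiki/Česká republika' urlEncodedLoosely asUrl retrieveContents.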