Sven

Well, RFCs are unreadable - I know, because I looked at RFC 3986 while looking at this question - but OK, I get your point. I suppose I should be looking for something that makes it easier to provide similar convenience features in Pharo. As you say, if this issue is cracked, that is a step on the way.
Peter

-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van Caekenberghe
Sent: 26 March 2019 15:08
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics

Peter,

It *is* a bogus URL, please go and read some RFCs. A browser's address/search box is an entirely different thing that adds convenience features, such as the issue we are discussing here.

Sven

> On 26 Mar 2019, at 16:02, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Sven
>
> That would certainly work, and represents the most liberal possible approach. An equivalent, keeping entirely within Zinc, would be to use a special-purpose instance of ZnPercentEncoder, in which the safe set is defined as all characters between code points 33 and 126 inclusive. (Starting at 33 fixes your space point.)
>
> Using 'bogusUrl' as a variable name seems a bit pejorative. I am looking up French and German words in Wiktionary all the time, and I am building a Pharo app to do it for me. The version of the URL with the accented characters will not work in Zinc until I have urlEncoded it, but it works perfectly well in a browser and is much easier to read.
>
> Peter Kenny
>
>
> -----Original Message-----
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van Caekenberghe
> Sent: 26 March 2019 12:26
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>
> I would use a variant of your original transformation.
>
> The issue (the error in the URL) is that all kinds of non-ASCII characters occur unencoded. We should/could assume that other special/reserved ASCII characters _are_ properly encoded (so we do not need to handle them).
>
> So I would literally patch/fix the problem, like this:
>
> | bogusUrl fixedUrl |
> bogusUrl := 'https://en.wikipedia.org/wiki/Česká republika'.
> fixedUrl := String streamContents: [ :out |
>     bogusUrl do: [ :each |
>         (each codePoint < 127 and: [ each ~= $ ])
>             ifTrue: [ out nextPut: each ]
>             ifFalse: [ out nextPutAll: each asString urlEncoded ] ] ].
> fixedUrl asUrl retrieveContents.
>
> I made an extra case for the space character; it works either way in the example given, but a space cannot occur freely.
>
>> On 26 Mar 2019, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>
>> Sean
>>
>> I have realized that the method I proposed can be expressed entirely within the Zinc system, which may make it a bit neater and easier to follow. There probably is no completely general solution, but there is a completely general way of finding a solution for your problem domain.
>>
>> It is important to realize that String>>urlEncoded is defined as:
>>
>>     ZnPercentEncoder new encode: self.
>>
>> ZnPercentEncoder does not attempt to parse the input string as a URL. It scans the entire string and percent-encodes any character that is not in its safe set (see the comment on ZnPercentEncoder>>encode:). Sven has given as default a minimum safe set, which does not include slash, but there is a setter method to redefine the safe set.
>>
>> So the general way to find a solution for your domain is to collect a representative set of the URL strings, apply String>>urlEncoded to each, and work out which characters have been percent-encoded wrongly for your domain. For any URL cases this is likely to include ':/?#', as well as '%' if it includes things already percent-encoded, but there may be others specific to your domain.
>> Now construct an instance of ZnPercentEncoder with the safe set extended to include these characters - note that the default safe set is given by the class method ZnPercentEncoder class>>rfc3986UnreservedCharacters. Apply this instance to encode all your test incoming URL strings and verify that they work. Iterate, extending the safe set, until everything passes.
>>
>> If you want to keep the neatness of being able to write something like 'incomingString urlEncoded asZnUrl', you can add a method to String; for the case of the common URL characters mentioned above:
>>
>> String >> urlEncodedMyWay
>>     "As urlEncoded, but with the safe set extended to include characters commonly found in a URL"
>>     ^ ZnPercentEncoder new
>>         safeSet: ':/?#%' , ZnPercentEncoder rfc3986UnreservedCharacters;
>>         encode: self
>>
>> This works in much the same way as the snippet I posted originally, because my code simply reproduces the essentials of ZnPercentEncoder>>encode:.
>>
>> I seem to be trying to monopolize this thread, so I shall shut up now.
>>
>> HTH
>>
>> Peter Kenny
>>
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
>> Sent: 24 March 2019 15:36
>> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>> Well it didn't take long to find a potential problem in what I wrote, at least as a general solution. If the input string contains something which has already been percent-encoded, it will re-encode the percent signs. In this case, decoding will recover the once-encoded version, but we need to decode twice to recover the original text. Any web site receiving this version will almost certainly decode once only, and so will not see the right details.
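[Editor's note: the double-encoding pitfall described above is easy to reproduce outside Pharo. A minimal sketch using Python's `urllib.parse` (the function and variable names are illustrative, not from the thread; `quote`'s UTF-8 percent-encoding matches Zinc's default behaviour):]

```python
from urllib.parse import quote, unquote

# 'péripétie' after one round of UTF-8 percent-encoding:
already_encoded = 'p%C3%A9rip%C3%A9tie'

# Naively re-encoding escapes the '%' signs themselves ('%' -> '%25'):
double_encoded = quote(already_encoded, safe='')

# A single decode now yields only the once-encoded form;
# a second decode is needed to recover the original text.
decoded_once = unquote(double_encoded)    # 'p%C3%A9rip%C3%A9tie'
decoded_twice = unquote(decoded_once)     # 'péripétie'
```

A web server will decode exactly once, so the doubly encoded form looks up the wrong resource, which is why '%' must join the pass-through characters.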
>>
>> The solution is simple - just include the percent sign in the list of excluded characters in the third line, so it becomes:
>>
>>     url asString do: [ :ch | (':/?%' includes: ch)
>>
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of PBKResearch
>> Sent: 24 March 2019 12:11
>> To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>> Sean, Sven
>>
>> Thinking about this, I have found a simple (maybe too simple) way round it. The obvious first approach is to apply 'urlEncoded' to the received URL string, but this fails because it also encodes the slashes and other segment dividers. A simple-minded approach is to scan the received string, copy the slashes and other segment dividers unchanged and percent-encode everything else. I cobbled together the following in a playground, but it could easily be turned into a method on String.
>>
>> urlEncodedSegments := [ :url | | outStream |
>>     outStream := String new writeStream.
>>     url asString do: [ :ch |
>>         (':/?' includes: ch)
>>             ifTrue: [ outStream nextPut: ch ]
>>             ifFalse: [ outStream nextPutAll: ch asString urlEncoded ] ].
>>     outStream contents ].
>>
>> urlEncodedSegments value: 'https://fr.wiktionary.org/wiki/péripétie'
>> => 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'
>>
>> This may fail if a slash can occur in a URL other than as a segment divider. I am not sure if this is possible - could there be some sort of escaped slash within a segment? Anyway, if the received URL strings are well-behaved, apart from the diacritics, this approach could be used as a hack for Sean's problem.
>>
>> HTH
>>
>> Peter Kenny
>>
>> Note to Sven: The comment on String>>urlEncoded says: 'This is an encoding where characters that are illegal in a URL are escaped.' Slashes are escaped but are quite legal. Should the comment be changed, or the method?
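[Editor's note: Peter's urlEncodedSegments block translates almost mechanically to other languages. A sketch in Python's `urllib.parse` for comparison; the function name is the editor's, and per-character `quote` with `safe=''` stands in for Zinc's character-by-character urlEncoded:]

```python
from urllib.parse import quote

def url_encoded_segments(url):
    # Copy the segment dividers ':', '/', '?' through unchanged and
    # percent-encode (UTF-8) every other character individually,
    # mirroring the playground block quoted above.
    return ''.join(
        ch if ch in ':/?' else quote(ch, safe='')
        for ch in url
    )

encoded = url_encoded_segments('https://fr.wiktionary.org/wiki/péripétie')
# 'https://fr.wiktionary.org/wiki/p%C3%A9rip%C3%A9tie'
```

Note that `quote` never encodes the RFC 3986 unreserved characters (letters, digits, `-._~`), so ordinary URL text passes through untouched, just as with Zinc's default safe set.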
>>
>> -----Original Message-----
>> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Sven Van Caekenberghe
>> Sent: 23 March 2019 20:03
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: Re: [Pharo-users] ZnURL and parsing URL with diacritics
>>
>>> On 23 Mar 2019, at 20:53, Sean P. DeNigris <s...@clipperadams.com> wrote:
>>>
>>> Peter Kenny wrote
>>>> And when I inspect the result, it is the address of a non-existent file in my image directory.
>>>
>>> Ah, no. I see the same result. By "worked" I meant that it created a URL that Safari accepted, but I see now it's not the same as correctly parsing it.
>>>
>>> Peter Kenny wrote
>>>> Incidentally, I tried the other trick Sven cites in the same thread. The same URL as above can be written:
>>>> 'https://fr.wiktionary.org/wiki' asUrl / 'péripétie'.
>>>
>>> Yes, this works if you are assembling the URL, but several people presented the use case of processing URLs from elsewhere, leaving one in a chicken-and-egg situation where one can't parse due to the diacritics and can't escape the diacritics (i.e. without incorrectly escaping other things) without parsing :/
>>
>> Yes, that is pretty close to a catch-22. Strictly speaking, such URLs are incorrect and can't be parsed.
>>
>> I do understand that sometimes these URLs occur in the wild, but again, strictly speaking they are in error.
>>
>> The fact that browser search boxes accept them is a service on top of the strict URL syntax. I am not 100% sure how they do it, but it probably involves a lot of heuristics and trial and error.
>>
>> The parser of ZnUrl is just 3 to 4 methods. There is nothing preventing somebody from making a new ZnLoseUrlParser, but it won't be easy.
>>
>>> -----
>>> Cheers,
>>> Sean
>>> --
>>> Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
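[Editor's note: Peter's safe-set extension (urlEncodedMyWay) also has a direct analogue in Python, where `quote`'s `safe` parameter plays the role of Zinc's safeSet:. A sketch under that assumption; the function name is the editor's:]

```python
from urllib.parse import quote

def url_encoded_my_way(s):
    # quote() always passes RFC 3986 unreserved characters through;
    # extending `safe` with ':/?#%' keeps scheme and segment dividers,
    # fragments, and already-percent-encoded runs intact.
    return quote(s, safe=':/?#%')

fixed = url_encoded_my_way('https://en.wikipedia.org/wiki/Česká republika')
# 'https://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika'
```

As in the Pharo version, iterating on the contents of `safe` against a representative sample of incoming URLs is the practical way to tune this for a given domain.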