Re: [Pharo-users] ZnURL and parsing URL with diacritics

Sven Van Caekenberghe Mon, 10 Sep 2018 04:27:02 -0700

Hi,

> On 10 Sep 2018, at 12:53, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Hi Petr
>  
> I have used #urlEncoded in the past, with success, to deal with German 
> umlauts. The secret is to urlEncode just the part containing the diacritics. 
> If you encode the whole url, the slashes are encoded, and this confuses Zinc, 
> which segments the url before decoding.
>  
> So I would expect you to be able to read your file with:
>  
> ZnEasy get: 'http://domain.com/', 'ěščýž.html' urlEncoded.
>  
> However, this also fails with ‘ASCII character expected’, and I can’t 
> understand why. The debug trace has too many levels for me to understand. 
> Zinc is evidently getting in a mess trying to decode the urlEncoded string, 
> but if we try:
>  
> 'ěščýž.html' urlEncoded urlDecoded
>  
> as a separate operation, it works OK.
>  
> I think only Sven can explain this for you.


The external representation of a URL with special characters is not the same as 
what an address bar or browser search field accepts. The latter is quite 
intelligent and accepts much broader input.

ZnUrl parses the official external representation according to the spec.

Internally, ZnUrl represents all components as resolved (decoded) strings. The 
solution is to construct difficult/special URLs by hand, component by 
component, instead of parsing them from a single string.

Here is an example: let's say we want to access the English Wikipedia page of 
the Czech Republic (the country) using its native name 'Česká republika' (which 
is not only non-ASCII, but non-Latin1 as well, so it needs a WideString and 
UTF-8 encoding).

Here is one way to construct such a URL.

ZnUrl new 
  scheme: #http; 
  host: 'en.wikipedia.org'; 
  addPathSegment: 'wiki'; 
  addPathSegment: 'Česká republika';
  yourself.

Which gives a URL with the following external representation:

  http://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika

This can be parsed without problems.

  'http://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika' asUrl.
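
To see the two representations side by side, here is a small sketch, assuming 
a Pharo image with Zinc loaded (the variable name url is only for 
illustration). Parsing the external form yields decoded path segments, and 
printing the URL gives the percent-encoded form back.

```smalltalk
| url |
url := 'http://en.wikipedia.org/wiki/%C4%8Cesk%C3%A1%20republika' asUrl.

"Internally the path segments are held as decoded strings."
url pathSegments last.   "the decoded segment, 'Česká republika'"

"Printing answers the external, percent-encoded representation."
url printString.
```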

You can send #retrieveContents to a URL to actually fetch it.

ZnUrl new 
  scheme: #http; 
  host: 'en.wikipedia.org'; 
  addPathSegment: 'wiki'; 
  addPathSegment: 'Česká republika'; 
  retrieveContents.

Or you could use the url in a ZnClient object.
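
A minimal sketch of that, assuming network access and a Pharo image with Zinc 
loaded (the variable names url and client are only for illustration):

```smalltalk
| url client |
"Build the URL from unencoded components, as above."
url := ZnUrl new
  scheme: #http;
  host: 'en.wikipedia.org';
  addPathSegment: 'wiki';
  addPathSegment: 'Česká republika';
  yourself.

client := ZnClient new.
client url: url.       "accepts a ZnUrl as well as a string"
client get.            "performs the request and answers the contents"
client isSuccess.      "whether the response status indicates success"
client close.
```

Holding on to the client lets you inspect the response afterwards instead of 
only getting the contents back.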

BTW, there are many ways to construct URLs. I would maybe do the following.

  'https://en.wikipedia.org/wiki' asUrl
    addPathSegment: 'Česká republika';
    yourself.
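
Applied to the URL from the original question, the same approach might look 
like the sketch below. domain.com is just the placeholder host from that 
question, so this only builds and prints the URL without fetching anything.

```smalltalk
| url |
"Parse only the plain ASCII part, then add the segment with diacritics unencoded."
url := 'http://domain.com' asUrl
  addPathSegment: 'ěščýž.html';
  yourself.

"The external representation percent-encodes the UTF-8 bytes of the segment."
url printString.   "should be 'http://domain.com/%C4%9B%C5%A1%C4%8D%C3%BD%C5%BE.html'"
```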

Or something like

ZnClient new
  url: 'https://en.wikipedia.org/wiki';
  addPathSegment: 'Česká republika';
  get.

HTH,

Sven

> HTH
>  
> Peter Kenny
>  
>  
> From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Petr 
> Fischer via Pharo-users
> Sent: 10 September 2018 10:07
> To: pharo-users@lists.pharo.org
> Cc: Petr Fischer <petr.fisc...@me.com>
> Subject: [Pharo-users] ZnURL and parsing URL with diacritics
>  
> Hello, 
>  
> when I try to parse this URL asUrl, error "ZnCharacterEncodingError: ASCII 
> character expected" occurs:
>  
> 'http://domain.com/ěščýž.html' asUrl.
>  
> this also does not work:
>  
> ZnEasy get: 'http://domain.com/ěščýž.html'
>  
> How to solve this? In the web browser, URL with diacritics is OK. 
>  
> I tried also this:
>  
> ZnEasy get: 'http://domain.com/ěščýž.html' urlEncoded.
>  
> but this cripples the whole URL.
>  
> Thanks! Petr Fischer
