On 25 Sep 2014, at 8:55 , Alain Rastoul <alf.mmm....@gmail.com> wrote:
> On 25/09/2014 07:23, Sven Van Caekenberghe wrote:
>>
>> On 25 Sep 2014, at 01:04, Alain Rastoul <alf.mmm....@gmail.com> wrote:
>>
>>> On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
>>>> Alain,
>>>
>>>> The character encoding situation in Pharo is pretty good actually. The
>>>> only problem is that there is some old school code left that encodes
>>>> strings into strings, but today you can easily write much better and
>>>> conceptually correct code.
>>>>
>>>> You could have a look at this draft chapter of the upcoming 'Enterprise
>>>> Pharo' book that I am currently writing:
>>>>
>>>> http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/
>>>>
>>>> Concerning file system paths, FilePathEncoder and FilePluginPrimitives
>>>> already do the right thing.
>>>>
>>>> Now, your idea about using UTF-8 to represent internal Strings is
>>>> something that has been discussed before and in many other languages as
>>>> well. The short answer is that due to it being variable length, the
>>>> inefficiency is (probably) just too high. Simple indexed access becomes a
>>>> problem, let alone more complex string manipulations. I am not saying that
>>>> it cannot be done, I think it is just not worth the trouble. The current
>>>> solution in Pharo with ByteString and WideString is quite nice (check the
>>>> chapter I mentioned before).
>>>>
>>>> Sven
>>>>
>>> Very interesting!
>>> It seems that most of what I was saying is already here :)
>>> I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it
>>> is a standard, but I find the variable length encoding very weird), I was
>>> rather talking of using WideString in UTF-16 or UTF-32, and that's done.
>>> I saw asWideString but didn't know about automatic conversion or codepoint
>>> selector and internal wide string support.
>>> Does it mean that Pharo Greek users (for example) use WideString for
>>> Strings without having to specify it or make explicit conversions (except
>>> of course when dealing with bytes if they want to)?
>>> If yes, very good, job is almost done :)
>>> (Personally I would also deprecate ByteString, and get rid of it, just my
>>> opinion.)
>>> Thanks for the link, another good chapter.
>>>
>>> Regards,
>>>
>>> Alain
>>
>> ByteString is important because it is an optimization of the most common
>> case.
>
> I understand the point here, memory/data footprint, CPU cache and so on (not
> talking of encoding/decoding).
> I think that's why Microsoft chose UTF-16 (old UCS-2) as a middle solution
> because it covers most character sets with 2 bytes.

It used to be a middle solution, back when UCS-2 could encode the entire
defined Unicode set.
Nowadays it's just the worst of both worlds; you waste memory for most normal
text, *and* you don't have constant-time indexed code point access.

The duality we have in Pharo is an attempt to achieve the *best* of both
worlds, wasting little memory for the "normal" case (Latin-1), and maintaining
constant-time indexed access in all cases.

The ultimate solution for this approach would have a trio of string classes
with slot sizes 8 - 16 - 32, expanding / contracting as needed, but we don't
have classes with variable short slots (currently, they're planned in the new
Cog, if I've understood Eliot's new object format correctly).

Cheers,
Henry
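
For readers who want to see the 8 - 16 - 32 idea in concrete terms, here is a minimal Python sketch. It is not how Pharo implements ByteString or WideString; the class name `AdaptiveString`, the helper `_tier_for` and the `slot_bits` property are all hypothetical names made up for this illustration. It only shows the principle: every slot in a string has the same width, so indexed access stays constant time, and the storage widens when a character no longer fits. Contracting back to a narrower width is left out.

```python
from array import array

# (nominal slot width in bits, array typecode, largest code point that fits)
# "L" is at least 32 bits wide; on some platforms it is wider.
_TIERS = [(8, "B", 0xFF), (16, "H", 0xFFFF), (32, "L", 0x10FFFF)]


def _tier_for(code_point):
    """Return the index of the narrowest tier able to hold code_point."""
    for index, (_, _, limit) in enumerate(_TIERS):
        if code_point <= limit:
            return index
    raise ValueError("not a Unicode code point: %r" % code_point)


class AdaptiveString:
    """Hypothetical fixed-width code point storage that widens on demand."""

    def __init__(self, text=""):
        points = [ord(ch) for ch in text]
        # Pick the narrowest tier that fits the widest character present.
        self._tier = _tier_for(max(points, default=0))
        self._slots = array(_TIERS[self._tier][1], points)

    @property
    def slot_bits(self):
        """Nominal slot width currently in use: 8, 16 or 32."""
        return _TIERS[self._tier][0]

    def __len__(self):
        return len(self._slots)

    def __getitem__(self, index):
        # All slots share one width, so this is a plain indexed read.
        return chr(self._slots[index])

    def __setitem__(self, index, ch):
        needed = _tier_for(ord(ch))
        if needed > self._tier:
            # Widen once by copying every slot into the larger tier.
            self._slots = array(_TIERS[needed][1], self._slots.tolist())
            self._tier = needed
        self._slots[index] = ord(ch)

    def __str__(self):
        return "".join(chr(point) for point in self._slots)
```

A string built from "abc" starts with 8-bit slots; storing a Greek lambda at some index moves it to 16-bit slots, and storing a character outside the Basic Multilingual Plane moves it to 32-bit slots, while reads by index never need to scan.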