On Mon, Jun 06, 2016 at 08:34:40PM +0200, Sven Van Caekenberghe wrote:
> 
> > On 06 Jun 2016, at 17:22, Sabine Manaa <manaa.sab...@gmail.com> wrote:
> > 
> > why ByteArray?
> 
> http://www.unicode.org/faq/utf_bom.html
> 
> A Unicode transformation format (UTF) is an algorithmic mapping from every 
> Unicode code point (except surrogate code points) to a unique byte sequence.
> 
> https://en.wikipedia.org/wiki/UTF-8
> 
> UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code 
> space (1,114,112 code points minus 2,048 surrogate code points) using one to 
> four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode 
> Standard).
> 
> In Pharo
> 
> https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
> 
> Of course, given a ByteArray, whose values are all between 0 and 255 by
> definition, you can convert it to a ByteString. That String is not a correct
> (Pharo) String anymore; it is like converting a PNG or JPEG to a String: you
> can do it, but it is just wrong.
> 
> When talking to the outside world, be it over a network connection, or via 
> primitive calls, anything but pure ASCII strings needs an encoding. This has
> to be agreed upon by both parties. If the receiving party wants UTF-8 forced 
> into a (kind of) String, that is (still) possible.
> 
> Your initial solution seems to indicate that this is expected. This (ugly) 
> conversion should be done at as low a level as possible, IMHO.
> 

Hi Sven,

Thanks for this concise summary. I think perhaps what is conceptually
a problem in my OSProcess implementation is that I allow command arguments
to be given in the form of Strings, then pass the byte array contents of
those Squeak/Pharo Strings to a Unix shell or to an exec() system call.
This is convenient from my point of view, because strings are very easy
to use, but it does not account for the different ways a String can be
mapped to a byte array. It is the byte array that is actually used in
the calls to the operating system such as:

  UnixOSProcessAccessor>>primForkExec: executableFile
        stdIn: inputFileHandle
        stdOut: outputFileHandle
        stdErr: errorFileHandle
        argBuf: argVec
        argOffsets: argOffsets
        envBuf: envVec
        envOffsets: envOffsets
        workingDir: pathString

At this point, the argVec is composed of "strings" in the C sense of the
word, which really means that it contains byte array data from the Strings.
And of course, if the string encodings in the Squeak/Pharo strings do not
happen to match the string encodings of the operating system, then indeed
the byte arrays do not match and we get a "file not found" kind of problem.
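
To make the mismatch concrete, here is a minimal sketch in Python (not
OSProcess code; the file name and the helper encodings_of are made up for
illustration). It assumes the image side holds one byte per character, which
for code points below 256 coincides with Latin-1, while the operating system
expects UTF-8:

```python
def encodings_of(name):
    """Return (latin1_bytes, utf8_bytes) for the same text.

    Hypothetical helper for illustration only.
    """
    # One byte per character, as in a byte-per-character string.
    latin1_bytes = name.encode("latin-1")
    # What a UTF-8 shell or file system expects for the same text.
    utf8_bytes = name.encode("utf-8")
    return latin1_bytes, utf8_bytes


# Hypothetical file name containing two non-ASCII characters
# (U+00E9 and U+00E8), written with escapes to keep this source ASCII.
name = "\u00e9l\u00e8ve.txt"
latin1_bytes, utf8_bytes = encodings_of(name)

print(list(latin1_bytes))  # 9 bytes, one per character
print(list(utf8_bytes))    # 11 bytes, two for each accented character

# Same text, different byte arrays: handing one to exec() where the
# other is expected names a different (probably nonexistent) file.
print(latin1_bytes == utf8_bytes)
```

For a pure ASCII name the two byte arrays are identical, which is why the
problem only shows up with accented or other non-ASCII characters.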

My hope is that Mariano's assessment is correct, and that we can treat
this as the right way to handle the encoding mismatch issues:

On Mon, Jun 06, 2016 at 01:59:21PM -0300, Mariano Martinez Peck wrote:
> Hi Dave, Sabine, Norbert et al.,
>
> A few weeks (months?) ago I was also reviewing this topic of encoding in
> OS(Sub)Process. After surfing the web a bit, I found out the simplest
> and most accurate answer/solution was indeed to set the correct locale
> and/or text encoding in the computer in question. Anyway...more answers below.

This certainly sounds like the Right Thing To Do if only it works :-)

Dave

