Hi Przemek,
[ Well, sorry in advance if I seem to contradict myself
along the way below. While answering I came to understand
your idea better, and that is reflected in my reactions.
Also, I didn't always answer in "descending" order, but
it's too late now to restart the whole mail. Brainstorm
style. ]
We do not know how users have used character items with
FREAD()/FWRITE(), and this is only one example; there are many
more similar functions.
I was thinking about this problem, and the only solution I can
see which will not break backward compatibility is introducing
two types of string items, with byte and Unicode representations,
and functions to convert between the two formats using a given
CP (or the default HVM CP when none is specified). Otherwise
we will lose backward compatibility. It will have to be bound
to a C API which hides the internal HVM representation from C code.
Yes, this may be a solution, although IMO it pushes the
language toward a less elegant style. It would be nice to
keep it simple and have only one kind of string type, if
at all possible. The two-type approach would need heavy
code modifications to make it possible to actually use
Unicode. My feeling is that switching to Unicode cannot
be made fully two-way compatible anyway; this means that
programs will have to be modified in one way or another
to make use of it. These required modifications will most
probably make the given apps incompatible with Clipper
anyway. For this reason the question is rather how to
make the transition as painless, natural, and transparent
as possible (easy-to-identify modifications, with a clear
"upgrade path").
I'd expect that, e.g., CHR() returning - say - 14783 for
an accented char is okay, if the user has chosen to
enable Unicode, and it's easier to prepare the app for this
case than changing all 'ValType( s ) == "C"' expressions
to 'ValType( s ) $ "CU"' (given that "U" is the type code
of a Unicode string). With two types the developer will
always have to keep in mind which string is of which kind
from the first assignment. I'd think that a compiler
switch to enable Unicode for a given program is a nicer
way to deal with it, at least as far as the end result
goes. I know there are lots of issues, like macros, lib
code, mixed code, C legacy code. Probably the C API
will have to be duplicated in any case.
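To make the ValType() point above concrete, here's a minimal sketch, assuming a hypothetical "U" type code for a separate Unicode string type (not an existing feature):

```
// with two string types, every type check in existing code needs review:
IF ValType( s ) == "C"     // matches byte strings only; Unicode items missed
   // ... old code path ...
ENDIF
IF ValType( s ) $ "CU"     // would have to match both ("U" is hypothetical)
   // ... revised code path ...
ENDIF
// with a single string type switched to Unicode by a compiler flag,
// the original ValType( s ) == "C" test keeps working unchanged
```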
So, generically speaking, all these I/O points need to be
resolved, taking into account CP/binary-raw flagging, just
like we do now with GTs.
Not only I/O. What about functions like CHR() or ASC()?
Should they operate on Unicode values or CP indexes?
What about [IL]2BIN()? Should it return a Unicode string or
a byte string? What about LEN()? Should it return the number
of bytes or the number of characters?
Chars, definitely. At the .prg level, IMO the programmer is
happiest if he doesn't even have to know what a byte is for
such a basic string function. Bytes don't carry much meaning
there, as the count could be 'n' if the internal
representation is UTF-8 and 'm' if it's UTF-16; these
details should be fully abstracted away at the .prg level.
[I|L]2BIN() is an interesting example, it could return
8 bytes, 4 chars in the 0x0000-0x00FF range and explicitly
ask for some kind of conversion when written to disk.
Or mark the string as binary (internally) and let the system
handle the issue automatically if that's possible. It's
important to consider that Clipper/Harbour is mainly aimed
for normal business app development, and as such, 99% of
string operations will be _real string operations_. The rest
is such exceptions as *2BIN()/BIN2*(), and IMO it's important
to handle them both in some ways, but we should first and
foremost help the mainstream usage rather than being too
much carried away with these exceptions. So we may even
think about a special API set dealing with non-Unicode
chars (= raw bytes), to allow using such low-level constructs
as *2BIN() and BIN2*(). It will be much easier to rewrite the
low-level parts than having to review whole app code. We
may even introduce such binary API in advance to help
transition in time. The API could even be implemented as
a stub for Clipper, so two-way code transition stays
possible. At this point I might say the same as you:
different string types in practice. But I'd rather say we'd
need a new binary type (and use it in the appropriate
functions), plus simply convert the existing string type to
Unicode as is.
As for CHR()/ASC() I think they should access/return
a number between 0-65535 vs. a Unicode char.
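A small sketch of what this could mean in practice; the hb_byte*() names below are purely illustrative placeholders for the proposed raw-byte API, nothing like them exists yet:

```
// Unicode-enabled CHR()/ASC() (proposed behavior, not implemented):
? Asc( "ő" )          // would return the Unicode code point 337,
                      // not a CP-852/CP-1250 byte index
? Chr( 337 )          // would return "ő" as a one-character string

// hypothetical raw-byte companions for *2BIN()/BIN2*()-style code:
? hb_byteChr( 241 )   // one raw byte 0xF1, with no CP interpretation
? hb_byteAsc( cBin )  // 0-255 value of the first byte of cBin
```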
Each time we change something to operate unconditionally on
Unicode, I can create an example of existing Clipper code
which will be broken.
If these points are clearly identifiable, and/or are
rather exceptions than the rule in normal app code, I
think we could just as well live with them. Unicode is
a brand new concept (just like MT), and a one-push-of-a-button
transition while keeping full compatibility is a technical
utopia anyway.
We may also introduce compatibility level command line
switch or #pragma to carefully let old concepts go,
and/or turn on new ones like this one, if these old
concepts are in the way of useful core additions.
In any case we need to identify as many problem
(compatibility) points in advance as possible. Then maybe
we'll better see the right direction, and whether there
is one generic solution to solve them all, or we need to
decide one by one.
Even if we've agreed on the internals, of course we'd also
need to work out how to keep everything Clipper compatible.
And this is the most important problem. We are not creating
a new language where we can define everything from scratch.
It means that we will have to introduce Unicode items as
an option, so users can use them if their existing .prg code
is ready for such a representation.
Yes, or define a new API in advance to lay out a clear route
for the transition; then, when Harbour is ready to switch,
maybe a compiler switch will suffice as the last step for
developers.
Yes, but that's attached to the RDD (== app) collation, which
unfortunately makes it extremely dangerous to use. Maybe we
should (or we may already) have functions to just do:
hb_strUpper/Lower( <string>, <CP> )
hb_strIsAlpha( <char>, <CP> )
[names tentative]
We do not have such .prg functions, but we can add them.
They can also be written as .prg code, e.g.:
func UpperCP( cVal, cCPID )
local cOldCP := hb_setCodepage( cCPID ), cResult
cResult := Upper( cVal )
hb_setCodepage( cOldCP )
return cResult
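One caveat with the sketch above: if Upper() raises a runtime error, the old codepage is never restored. A guarded variant (same tentative function names, using Harbour's BEGIN SEQUENCE ... ALWAYS extension):

```
func UpperCP( cVal, cCPID )
   local cOldCP := hb_setCodepage( cCPID ), cResult
   begin sequence
      cResult := Upper( cVal )
   always
      hb_setCodepage( cOldCP )   // restore even if Upper() failed
   end sequence
   return cResult
```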
Good idea.
We seem to have the C level functions for these, but not the
.prg ones, AFAIK.
Yes.
[ And maybe I don't even need them, I'll reexamine my code
regarding these issues. I definitely need proper conversions
though (852 -> Win, Win -> 852, Win/852 -> utf8), it would be
much nicer to use Harbour for these tasks. ]
If you are interested only in Hungarian national characters,
then CODEPAGE in the RDD will make such conversions for you.
If you want conversions for all other characters, then we do
not have such functionality now. UTF-8 translation introduces
yet another problem for fixed-size fields: variable string
length. So far none of our RDDs support it. I wanted to add
such translations, but I will have to introduce a meta Harbour
CP called UTF8, which can be used as the CODEPAGE parameter so
the RDD can recognize it and use it only for translation,
without touching collation.
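The fixed-size field problem is easy to illustrate: in UTF-8 the byte count and the character count of the same text diverge, so a DBF field sized in bytes can no longer hold a fixed number of characters. A sketch, assuming a hypothetical byte-length counterpart to LEN():

```
local cVal := "árvíztűrő"   // 9 characters (Hungarian test word)
? Len( cVal )               // 9 in any single-byte CP
? hb_byteLen( cVal )        // 13 when stored as UTF-8: á, í, ű and ő
                            // take two bytes each (hypothetical function)
// i.e. the same text overflows a fixed-size C(10) field byte-wise
```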
In general we have a serious problem with the naming convention.
Now we are using a two-letter country ID and then our own
identifier for the CP. Probably we should unify it and use
something like <lang_ID>_<country_ID>.<encoding> - the standard
way used by locales. It will help eliminate the repeated MSG*
modules with different encodings. We will need only one, and all
others can be created dynamically by simple on-the-fly
translation. So lang modules will use <lang_ID>_<country_ID> as
the identifier, with an internal <encoding> field used for
on-the-fly translations. We can introduce the same for the i18n
modules.
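A quick sketch of how such a combined identifier could be split at runtime (format per the proposal above; the ID value is just an example):

```
local cID       := "hu_HU.IBM852"            // <lang_ID>_<country_ID>.<encoding>
local nDot      := At( ".", cID )
local cLocale   := Left( cID, nDot - 1 )     // "hu_HU"  -> selects the lang module
local cEncoding := SubStr( cID, nDot + 1 )   // "IBM852" -> drives the on-the-fly
                                             //             message translation
```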
I fully agree. The current system was largely the result
of the 8.3 naming convention and keeping one unified ID
throughout the filename and internally. <encoding> should
also be standardized, as we have quite some variations
beyond the official standard names (WINM, "Clipper"). Also,
"852" is officially "IBM852" (with some aliases, but we
should stick with some kind of standard here too; that's
IBM*, Windows-* and ISO-8859-* for most cases). The
question is how pedantic we want to be.
BTW, in theory I need not just the HU CP and collation,
but I'll revise all these matters, as things were already
very inflexible in Clipper times; it looks like it's
time to rethink the concept a bit and strike a good balance
between features and manageability (and possibilities).
What do you think are the practical chances of using an
existing external lib - like iconv - to deal with low-level
CP issues? It's a quite ugly GNU monster; I personally
don't see much in it.
BTW, why isn't hb_setcodepage() a simple Set()? Shouldn't we
convert it to one? IMO we should. What do you think?
I've already added _SET_CODEPAGE with the MT modifications:

2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
[...]
* harbour/include/set.ch
* harbour/include/hbset.h
  + added _SET_CODEPAGE which works like _SET_LANGUAGE, giving a
    common interface

You can use SET( _SET_CODEPAGE [, <cNewVal> ] ) and
SET( _SET_LANGUAGE [, <cNewVal> ] ) instead of hb_setCodepage() /
hb_langSelect().
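For the record, a minimal usage sketch of the new common interface (the CP/lang ID values are illustrative):

```
#include "set.ch"
// instead of hb_setCodepage() / hb_langSelect():
cOldCP   := Set( _SET_CODEPAGE, "HU852" )   // returns the previous setting
cOldLang := Set( _SET_LANGUAGE, "HU" )
```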
Thanks a lot. I was banging on open doors - I missed that in
the midst of all the other things.
Brgds,
Viktor
_______________________________________________
Harbour mailing list
Harbour@harbour-project.org
http://lists.harbour-project.org/mailman/listinfo/harbour