On Mon, 03 Nov 2008, Szakáts Viktor wrote:

Hi Viktor,

> I'm probably not able to see all the implications of binary
> data, but to me, binary data is just a bunch of 0x00-0xFF bytes
> and that's it (sorry for my simplistic and ignorant POV).
> Usually you don't sort binary data, and you don't do UPPER()
> on it; if someone does, it's his problem. Anyhow, FRead()/FWrite()
> are exactly such I/O points where CP conversion may occur,
> and we'd need some ways to control CPs at this point, even if
> we'd only handle text. If we allow setting up a CP, one such
> special CP could be 'raw data', meaning, there is no need to
> touch/convert it at all, handle it as is. I'm not sure if this
> fact needs to be attached to the strings to be used at later
> times.

We do not know how users have used character items with
FREAD()/FWRITE(), and this is only an example; there are many
more similar functions. I was thinking about this problem and
the only solution I can see which will not break backward
compatibility is introducing two types of string items, with
byte and Unicode representations, and functions to make
conversions between both formats using a given CP or, when
none is specified, the default HVM CP. Otherwise we will lose
backward compatibility. It will have to be bound to a C API
which hides the internal HVM representation from C code.
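
A purely hypothetical sketch of how such dual string items could
look from the .prg level - the hb_ByteToUni()/hb_UniToByte()
names and the separate Unicode item type are invented here just
for illustration, nothing like this exists yet:

   #include "fileio.ch"

   proc Main()
      local hFile  := FOpen( "data.bin", FO_READWRITE )
      local cBytes := Space( 512 ), uText
      FRead( hFile, @cBytes, 512 )                     // raw byte string
      // proposed conversions (invented names, illustration only):
      uText := hb_ByteToUni( cBytes, "PL852" )         // via a given CP
      ? Len( hb_ByteToUni( cBytes ) )                  // default HVM CP
      FWrite( hFile, hb_UniToByte( uText, "PLISO" ) )  // back to bytes
      FClose( hFile )
   return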

> That of course could also mean that a special field type needs
> to be reserved for such binary data in .dbf (or other) tables,
> to help handle that problem. [ Reading "normal" text fields
> will automatically be converted between the DB CP and internal
> Unicode, while we need some convenient ways to prevent that for
> some special - binary - data. ]

The DBF format already has one bit in each field definition
stored in the DBF header which is used to mark the field as
binary. Unfortunately Clipper does not correctly initialize
header fields which were not used by Clipper DBF* RDDs, so they
contain random values in DBF files created by Clipper. Anyhow,
we can resolve it by adding an additional signature to the DBF
header, so it will work at least for DBFs created by Harbour
(and by VFP, for which we already read extended field
attributes, because such DBFs use a different signature so we
can easily recognize them).
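
For reference, a minimal sketch of inspecting that flag byte by
hand, assuming the standard DBF layout (32-byte header followed
by 32-byte field descriptors terminated by Chr( 13 )) and the
VFP convention that bit 0x04 of the flags byte at descriptor
offset 18 marks a binary field:

   proc ShowBinaryFlags( cDbfName )
      local hFile := FOpen( cDbfName )
      local cDesc := Space( 32 )
      local cName, cType, nFlags

      FSeek( hFile, 32 )            // field descriptors start at offset 32
      do while FRead( hFile, @cDesc, 32 ) == 32 .and. ;
               !( Left( cDesc, 1 ) == Chr( 13 ) )
         cName  := Trim( StrTran( Left( cDesc, 11 ), Chr( 0 ), " " ) )
         cType  := SubStr( cDesc, 12, 1 )
         nFlags := Asc( SubStr( cDesc, 19, 1 ) )   // VFP field flags byte
         ? cName, cType, iif( nFlags % 8 >= 4, "binary", "" )
      enddo
      FClose( hFile )
   return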

> And that instantly explains the reason why BLOB fields were
> born. In Clipper this looked like superfluous stuff, and not
> really necessary, since every field is "BLOB", even a simple
> memo will do. But in case of such Unicode trickery, it's
> starting to have a point.

BLOB fields may imply other things as well, though usually a
BLOB is simply a binary MEMO field.

> So, generically speaking, all these I/O points need to be
> resolved, taking into account CP/binary-raw flagging. Just
> like we do now with GTs.

Not only I/O. What about functions like CHR() or ASC()?
Should they operate on Unicode values or on CP indexes?
What about [IL]2BIN()? Should they return a Unicode string or
a byte string? What about LEN()? Should it return the number
of bytes or the number of characters?
Each time we change something to unconditionally operate on
Unicode, I can create an example of existing Clipper code which
will be broken.
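
For instance, a trivial piece of Clipper-style code which
silently assumes byte semantics - plain fixed-offset record
handling, nothing exotic:

   proc Main()
      // fixed-width record: offsets and lengths count BYTES
      local cRec := PadR( "Kovács", 10 ) + PadR( "Budapest", 15 )
      ? SubStr( cRec, 1, 10 )    // name field: bytes 1..10
      ? SubStr( cRec, 11, 15 )   // city field: bytes 11..25
      ? Len( cRec )              // 25 - the record size FWrite() needs
   return

If LEN() suddenly counted Unicode characters, PadR() and the
SubStr() offsets above would no longer match the on-disk record
layout for any string containing multi-byte characters.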

> Even if we've agreed on the internals, of course we'd also
> need to work out how to keep everything Clipper compatible.

And this is the most important problem. We are not creating a
new language where we can define everything from scratch.
It means that we will have to introduce Unicode items as an
option, so users can enable them once their existing .prg code
is ready for such a representation.

> Yes, but that's attached to the RDD (== app) collation, which makes
> it extremely dangerous to use, unfortunately. Maybe we should
> (or we may already) have functions to just do:
> hb_strUpper/Lower( <string>, <CP> )
> hb_strIsAlpha( <char>, <CP> )
> [names tentative]

We do not have such .prg functions, but we can add them.
They can also be written in .prg code, e.g.:

   func UpperCP( cVal, cCPID )
      // temporarily switch the HVM CP, remembering the previous one
      local cOldCP := hb_setCodepage( cCPID ), cResult
      cResult := Upper( cVal )
      hb_setCodepage( cOldCP )   // restore the previous CP
   return cResult
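
For example (assuming the Hungarian "HU852" CP module is linked
in with REQUEST HB_CODEPAGE_HU852; any other CP ID works the
same way):

   ? UpperCP( "árvíztűrő", "HU852" )   // --> "ÁRVÍZTŰRŐ"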

> We seem to have the C level functions for these, but not the
> .prg AFAIK.

Yes.

> [ And maybe I don't even need them, I'll reexamine my code
> regarding these issues. I definitely need proper conversions
> though (852 -> Win, Win -> 852, Win/852 -> UTF-8); it would be
> much nicer to use Harbour for these tasks. ]

If you are interested only in Hungarian national characters,
then the CODEPAGE in the RDD will make such conversions for
you. If you want to make conversions for all other characters,
then we do not have such functionality now. UTF-8 translation
introduces yet another problem for fixed-size fields: variable
string length. So far none of our RDDs support it. I wanted to
add such translations, but I will have to introduce a meta
Harbour CP called UTF8 which can be used as the CODEPAGE
parameter, so the RDD can recognize it and use it only for
translation without touching the collation.
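
A minimal sketch of the intended usage once such a meta CP
exists (the "UTF8" CODEPAGE value below is the proposal
described above, not something which works today):

   // open a table translating field data between UTF-8 and the HVM CP
   dbUseArea( .T., "DBFCDX", "customers.dbf", "cust", .T., .F., "UTF8" )
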
In general we have a serious problem with the naming
convention. Now we are using a two-letter country ID and then
our own identifier for the CP. Probably we should unify it and
use something like <lang_ID>_<country_ID>.<encoding> - the
standard form used by locales. It will help to eliminate the
repeated MSG* modules with different encodings. We will need
only one, and all the others can be created dynamically by
simple on-the-fly translation. So lang modules will use
<lang_ID>_<country_ID> as the identifier, with an internal
<encoding> field used for on-the-fly translations. We can
introduce the same in the i18n modules.
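
To illustrate, with hypothetical identifiers (the current
Harbour lang and CP IDs do not follow this scheme yet):

   // one "pl_PL" lang module would serve all encodings; the
   // <encoding> part would select the on-the-fly translation
   hb_langSelect( "pl_PL.ISO8859-2" )
   hb_langSelect( "pl_PL.UTF-8" )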

> BTW, why isn't hb_setcodepage() a simple Set()? Shouldn't we
> convert it to one? IMO we should. What do you think?

I've already added _SET_CODEPAGE together with the MT modifications:

   2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
   [...]
     * harbour/include/set.ch
     * harbour/include/hbset.h
       + added _SET_CODEPAGE which works like _SET_LANGUAGE giving common
         interface

You can use SET( _SET_CODEPAGE [, <cNewVal> ] ) and
SET( _SET_LANGUAGE [, <cNewVal> ] ) instead of hb_setCodepage()/
hb_langSelect().
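
For example (a minimal sketch; "PLISO" stands for any CP ID
whose module is linked into the application):

   #include "set.ch"

   REQUEST HB_CODEPAGE_PLISO   // link the PLISO CP module

   proc Main()
      local cOldCP := Set( _SET_CODEPAGE, "PLISO" )   // set new, get old
      ? Set( _SET_CODEPAGE )                          // query current CP
      Set( _SET_CODEPAGE, cOldCP )                    // restore previous
   return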

best regards,
Przemek