On Mon, 03 Nov 2008, Szakáts Viktor wrote:

Hi Viktor,

> I'm probably not able to see all the implications of binary
> data, but to me, binary data is just a bunch of 0x00-0xFF bytes
> and that's it (sorry for my simplistic and ignorant POV).
> Usually you don't sort binary data, and you don't do UPPER()
> on it, if someone does, it's his problem. Anyhow, FRead()/FWrite()
> are exactly such I/O points where CP conversion may occur,
> and we'd need some ways to control CPs at this point, even if
> we'd only handle text. If we allow setting up a CP, one such
> special CP could be 'raw data', meaning, there is no need to
> touch/convert it at all, handle it as is. I'm not sure if this
> fact needs to be attached to the strings to be used at later
> times.

We do not know how users have used character items with
FREAD()/FWRITE(), and this is only one example. There are many more
similar functions.
I was thinking about this problem, and the only solution I can see
that will not break backward compatibility is to introduce two types
of string items, with byte and Unicode representations, plus functions
to convert between the two formats using a given CP or, when none is
specified, the default HVM CP. Otherwise we will lose backward
compatibility. It will have to be bound to a C API which hides the
internal HVM representation from C code.

> That of course could also mean that special field type needs
> to be reserved for such binary data in .dbf (or other) tables,
> to help handling that problem. [ Reading "normal" text fields
> will automatically be converted between DB CP and internal Unicode,
> while we need some convenient ways to prevent that for some
> special - binary - data. ]

The DBF format already has one bit in each field definition stored in
the DBF header which marks the field as binary. Unfortunately Clipper
does not correctly fill any header fields which were not used directly
by the Clipper DBF* RDDs, so they are random in DBF files created by
Clipper.
Anyhow, we can resolve it by adding an additional signature to the DBF
header, so it will work at least for DBFs created by Harbour (and by
VFP, for which we already read extended field attributes, because such
DBFs use a different signature so we can easily recognize them).

> And that instantly explains the reason why BLOB fields were
> born. In Clipper this looked like superfluous stuff, and not
> really necessary, since every field is "BLOB", even a simple
> memo will do. But in case of such Unicode trickery, it's
> starting to have a point.

BLOB fields may also imply other things, though usually a BLOB is
a binary MEMO field.

> So, generically speaking all these I/O points need to be
> resolved, taking into account CP/binary-raw flagging. Just
> like we do now with GTs.

Not only I/O. What about functions like CHR() or ASC()? Should they
operate on Unicode values or on CP indexes? What about [IL]2BIN()?
Should it return a Unicode string or a byte string? What about LEN()?
Should it return the number of bytes or the number of characters?
Each time we change something to operate unconditionally on Unicode,
I can create an example of existing Clipper code which will be broken.

> Even if we've agreed on the internals, of course we'd also
> need to solve to keep everything Clipper compatible.

And this is the most important problem. We are not creating a new
language where we can define everything from scratch. It means that
we will have to introduce Unicode items as an option, so users can
use them if their existing .prg code is ready for such
a representation.

> Yes, but that's attached to the RDD (== app) collation, which makes
> it extremely dangerous to use unfortunately. Maybe we should
> (or we may already do) have function to just do:
> hb_strUpper/Lower( <string>, <CP> )
> hb_strIsAlpha( <char>, <CP> )
> [names tentative]

We do not have such .prg functions but we can add them.
They can also be written as .prg code, e.g.:

   func UpperCP( cVal, cCPID )
      // switch to the requested CP, remembering the previous one
      local cOldCP := hb_setCodepage( cCPID ), cResult
      cResult := Upper( cVal )
      // restore the previous CP
      hb_setCodepage( cOldCP )
   return cResult

> We seem to have the C level functions for these, but not the
> .prg AFAIK.

Yes.

> [ And maybe I don't even need them, I'll reexamine my code
> regarding these issues. I definitely need proper conversions
> though (852 -> Win, Win -> 852, Win/852 -> utf8), it would be
> much nicer to use Harbour for these tasks. ]

If you are interested only in Hungarian national characters, then
CODEPAGE in the RDD will make such conversions for you. If you want
to convert all other characters as well, then we do not have such
functionality now. UTF8 translation introduces yet another problem
for fixed size fields: variable string length. So far none of our
RDDs supports it. I wanted to add such translations, but I will have
to introduce a meta Harbour CP called UTF8 which can be used as the
CODEPAGE parameter, so the RDD can recognize it and use it only for
translation without touching collation.

In general we have a serious problem with the naming convention. Now
we are using a two-letter country ID followed by our own identifier
for the CP. Probably we should unify it and use something like:
   <lang_ID>_<country_ID>.<encoding>
- it's the standard way used by locale. It will help to eliminate
repeated MSG* modules with different encodings. We will need only one,
and all the others can be created dynamically by simple online
translation. So lang modules will use <lang_ID>_<country_ID> as the
identifier and an internal <encoding> field used for online
translations. We can introduce the same for i18n modules.

> BTW, why isn't hb_setcodepage() a simple Set()? Shouldn't we
> convert it to one? IMO we should. What do you think?

I've already added _SET_CODEPAGE with the MT modifications:

2008-09-13 18:49 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
[...]
  * harbour/include/set.ch
  * harbour/include/hbset.h
    + added _SET_CODEPAGE which works like _SET_LANGUAGE giving common
      interface

You can use SET( _SET_CODEPAGE [, <cNewVal> ] ) and
SET( _SET_LANGUAGE [, <cNewVal> ] ) instead of
hb_setCodepage()/hb_langSelect().

best regards,
Przemek
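
As an illustration of that last point, the UpperCP() helper shown earlier in this message could probably be written with SET() instead of hb_setCodepage(). This is only a sketch under some assumptions: that SET( _SET_CODEPAGE, <cNewVal> ) returns the previous codepage ID the way other SET() switches return their previous value, and that the "HU852" codepage module is linked into the application (the REQUEST line and the Main() procedure are added here purely for the example).

```harbour
#include "set.ch"

// Assumption: the HU852 codepage module is available and gets linked in.
REQUEST HB_CODEPAGE_HU852

procedure Main()
   // Upper-case a string using the HU852 codepage rules
   ? UpperCP( "abc", "HU852" )
   return

// Upper-case cVal using codepage cCPID, then restore the previous codepage.
function UpperCP( cVal, cCPID )
   // Assumption: Set() returns the previously active codepage ID
   local cOldCP := Set( _SET_CODEPAGE, cCPID )
   local cResult := Upper( cVal )

   // Put the original codepage back so the caller is not affected
   Set( _SET_CODEPAGE, cOldCP )
   return cResult
```

The save/switch/restore pattern is the same as in the hb_setCodepage() version; only the interface used to change the codepage differs.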