[SOLVED] Re: [fpc-pascal] ord() of a string index returns wrong value.

Pew (piffle.the.cat) Sat, 29 Jan 2011 18:30:59 -0800

Hi Jonas,

[ and cross-posted to lazarus-list ]


On 01/30/2011 04:59 AM, Jonas Maebe wrote:


On 29 Jan 2011, at 18:34, Pew (piffle.the.cat) wrote:

On 01/30/2011 03:13 AM, Jonas Maebe wrote:


On 29 Jan 2011, at 17:05, Pew (piffle.the.cat) wrote:

I have a problem where ord() of a character (a single string index) returns the 
wrong value. The character is an 'o', which has the value 111, but ord() of it 
returns 121 into an integer. What am I doing wrong?


Is that Lazarus code? If so, the string will be utf-8 encoded and you cannot 
assume that str[i] corresponds to the i'th character of the string. Even if 
it's not Lazarus code, it could still be utf-8 encoded depending on what the 
source of the string is and/or the locale settings of the system.


Yes, it is Lazarus code. Okay so I think that we have found the problem. Now 
how do I fix it?


If you want to access individual characters, it's probably the easiest to 
convert it to a unicodestring first:

var
   utxt: unicodestring;
begin
   ..
   utxt:=utf8decode(txt);
   { now perform all operations on utxt instead of on txt }
   ..
end;
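
To make the difference concrete, here is a minimal sketch of byte-wise versus decoded indexing. The helper name DecodedOrd and the sample string are assumptions made up for illustration; only UTF8Decode comes from the RTL.

```pascal
program Utf8IndexDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: ordinal of the index-th UTF-16 code unit of a
// UTF-8 encoded byte string (1-based, like normal string indexing).
function DecodedOrd(const utf8: AnsiString; index: Integer): Integer;
var
  utxt: UnicodeString;
begin
  utxt := UTF8Decode(utf8);
  Result := Ord(utxt[index]);
end;

var
  txt: AnsiString;
  i: Integer;
begin
  // e-acute (U+00E9, bytes C3 A9 in UTF-8) followed by 'o', built byte by
  // byte so the demo does not depend on the encoding of this source file.
  SetLength(txt, 3);
  txt[1] := Chr($C3);
  txt[2] := Chr($A9);
  txt[3] := 'o';

  // Byte-wise view: three elements, and txt[2] is a UTF-8 byte, not 'o'.
  for i := 1 to Length(txt) do
    WriteLn('txt[', i, '] = ', Ord(txt[i]));

  // Decoded view: the second element is 'o' (ordinal 111).
  WriteLn('decoded[2] = ', DecodedOrd(txt, 2));
end.
```

The byte-wise loop reports three elements for a two-character string, which is the kind of mismatch described above.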

I will use this code (listed above) in preference to the latter code, as you state the latter code has a higher probability of data loss.


Note:
1) even in UTF-16 (which is the encoding of a unicodestring), a single character may take 
up more than one code unit (a surrogate pair) or more than one code point (a decomposed 
character), so this is not 100% safe yet either. If you want a guarantee that string[i] 
corresponds to exactly one "character", you
a) have to normalize the unicode string to remove decomposed characters, and 
then
b) convert it to a UTF-32 string. You can use this routine for the 
unicodestring to UTF-32 conversion: 
http://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html 
(note that UCS4String is a dynamic array, not a string type)

I don't know whether Lazarus contains platform-independent wrappers for a). 
FPC itself at least doesn't at this time.

2) you will have to make sure that your "Rects_low" and "LastCharacterDefined" are 
defined in terms of UTF-16. Unless they are all plain ASCII characters (i.e., with an ordinal 
value <= 127), using a simple range is unlikely to work correctly.
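
As a rough sketch of points 1b) and 2), the program below converts a UTF-8 string to code points and checks each one against a glyph range before it would be used as an index. The helper names Utf8ToCodePoints and HasGlyph, the sample string, and the upper bound of 127 are assumptions for illustration. Normalization (step a) is not done here.

```pascal
program CodePointRangeDemo;

{$mode objfpc}{$H+}

const
  Rects_low = 0;  // same value as in the unit quoted in this thread

// Hypothetical helper: code points of a UTF-8 encoded byte string.
// No normalization happens here, so step a) remains the caller's job.
function Utf8ToCodePoints(const utf8: AnsiString): UCS4String;
begin
  Result := UnicodeStringToUCS4String(UTF8Decode(utf8));
end;

// Hypothetical helper: True when the ordinal has a slot in the glyph table.
function HasGlyph(code: Integer; lastDefined: Byte): Boolean;
begin
  Result := (code >= Rects_low) and (code <= lastDefined);
end;

var
  txt: AnsiString;
  cps: UCS4String;
  i: Integer;
begin
  // e-acute (U+00E9, bytes C3 A9 in UTF-8) followed by 'o'.
  SetLength(txt, 3);
  txt[1] := Chr($C3);
  txt[2] := Chr($A9);
  txt[3] := 'o';

  cps := Utf8ToCodePoints(txt);

  // UCS4String is a zero-based dynamic array. The RTL versions I am aware
  // of append a terminating 0 element, which is skipped here.
  for i := 0 to High(cps) do
    if cps[i] <> 0 then
    begin
      // 127 stands in for LastCharacterDefined (an assumed value).
      if HasGlyph(cps[i], 127) then
        WriteLn(cps[i], ': has a glyph')
      else
        WriteLn(cps[i], ': no glyph, draw a placeholder instead');
    end;
end.
```

Checking the ordinal before indexing keeps a code point outside the table from reading past the defined glyphs. A code point outside the Basic Multilingual Plane would take two UTF-16 code units but only one element of the UCS4String.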


This is how Rects_low (=0) and LastCharacterDefined (byte) are declared.

unit uFixedWidthFonts;

{$MODE Delphi}

interface

uses
  graphics, ExtCtrls, SysUtils, Dialogs, SDL;

const
  Rects_low = 0;
  Rects_high = 255;

type
  PFixedFont = ^TFixedFont;
  TFixedFont = object
  private
    Image: TBitmap;
// [ALVAROGP] Rects redefined as array of class TSDL_Rect
{
    Rects: array[Rects_low..Rects_high] of PSDL_Rect;
}
    Rects: array[Rects_low..Rects_high] of TSDL_Rect;
  public
    LastCharacterDefined: byte;
    TransparentColor,
      TextColor,
      BackgroundColor: Tcolor;
    // this is used only when UseTransparentBackground is false
    Char_width,
      Char_height: byte; // default is 8x8
    ReverseVideo: Boolean; // default =false
    HorizontalGap: byte;
    // Horizontal Gap between characters in Pixels -- default =1
    UseTransparentBackground: Boolean;

    constructor Initialize;
    procedure LoadFont(const Fontfile: string);
    procedure FreeUpAll;
    destructor Finalize;
    procedure WriteText2(x, y: integer; Txt: string;
      var PaintBox1: TPaintBox);
  end;

var
  Font1: PFixedFont;



A simpler alternative, with fairly high chances of data loss, is something like 
this:

var
   mytxt: ansistring;
begin
   mytxt:=utf8decode(txt);
   { now perform all operations on mytxt instead of on txt }
   ...
end;

This will first decode the UTF-8 encoded string to a UTF-16 encoded unicodestring, and 
then convert this unicodestring to a plain ansistring. Data loss can happen in case the 
string contains characters that cannot be represented using the "ansi" (~ 
default) code page of the system the program is running on. Such non-representable 
characters will be replaced by '?'.

I could have used your above code, as unknown characters appearing as '?' characters is acceptable.


In summary, unless you are an expert at working with unicode, you should not 
work with such strings at the character/code point level, and use higher level 
helpers instead to achieve what you want to do. You may want to ask for help 
about that on the Lazarus mailing list (subscription information at 
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus), by describing 
what exactly it is you want to do rather than showing how you are currently 
doing it.


I am not a unicode expert, but I am interested to learn new skills.

Yes, I am a member of the Lazarus list. For some reason I got Lazarus and Free Pascal confused and perhaps wrongly assumed that this question was better suited to the FP list.



Jonas


Thank you.

Peter / pew

fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal

