I forgot to mention: LRM (U+200E) and RLM (U+200F) are classified as General
Punctuation, in the range U+2000..206F; see
http://www.unicode.org/charts/PDF/U2000.pdf. Being above U+0800, they take
three bytes each in UTF-8.
Another character in this range that is important for Hebrew is U+201E,
DOUBLE LOW-9 QUOTATION MARK, which should be used as the opening double
quotation mark in Hebrew (see my pages below).
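
For anyone who wants to check these byte counts, here is a minimal sketch,
assuming Python 3. The SAMPLES list and the helper names utf8_len and describe
are only illustrative; they are not taken from any library or from my pages.

```python
# Sketch: report how many bytes UTF-8 needs for a few characters
# mentioned in this thread. SAMPLES and the helper names are illustrative.
import unicodedata

# HEBREW LETTER ALEF, LRM, RLM, DOUBLE LOW-9 QUOTATION MARK
SAMPLES = [0x05D0, 0x200E, 0x200F, 0x201E]


def utf8_len(code_point):
    """Return the number of bytes in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))


def describe(code_point):
    """Format one line: code point, Unicode character name, UTF-8 length."""
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return "U+%04X %s: %d bytes" % (code_point, name, utf8_len(code_point))


if __name__ == "__main__":
    for cp in SAMPLES:
        print(describe(cp))
```

The Hebrew letter should come out as two bytes, and LRM, RLM and U+201E as
three bytes each.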

On Tue, 22 Jan 2002, Zvi Har'El wrote:

> On Tue, 22 Jan 2002, Eli Marmor wrote:
>
> > Zvi Har'El wrote:
> > > > umm, isn't UTF-8 8 bit with occasional 16? :)
> > >
> > > UTF-8 is one, two or three bytes per character. In the Hebrew case, a Hebrew
> > > character is two bytes.
> >
> > Of course.
> > But there are some special Hebrew characters (such as RLM/LRM, etc.)
> > that are 3.
>
> To be precise, Hebrew characters are those with Unicode representation
> U+0590..05FF (see http://www.unicode.org/charts/PDF/U0590.pdf), and they all
> occupy two bytes in the UTF-8 encoding. Only characters U+0800 and above need
> three bytes (see utf-8(7) for details).  LRM and RLM, the left-to-right mark
> and right-to-left mark, are not Hebrew, even according to the most liberal HOK
> HASHEVUT (law of return).
>
> > And theoretically, UTF8 can handle up to 5 bytes.
> >
>
> Six, to be precise, under the full UCS-4, which uses a 31-bit code space.
> However, Unicode 3.0 uses only a 16-bit code space (UCS-2) and thus can be
> encoded into 3 bytes. Again, read utf-8(7) for details. BTW, in Java,
> everything is internally Unicode, and char is two bytes long. In C,
> wchar_t is 4 bytes long (with glibc, at least).
>
>

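The length rule quoted above can be written down directly from the code point
ranges. This is a rough sketch in Python 3, assuming the original 31-bit UTF-8
scheme as tabulated in utf-8(7); the names LIMITS and utf8_length are
hypothetical, chosen just for this illustration.

```python
# Sketch: UTF-8 sequence length from the code point value alone, using the
# original 31-bit scheme (one to six bytes). Names are illustrative only.

# Inclusive upper bound of each length class: 1 byte, 2 bytes, ..., 6 bytes.
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length(code_point):
    """Return how many bytes the 31-bit UTF-8 scheme uses for code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for n_bytes, limit in enumerate(LIMITS, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("code point does not fit in 31 bits")


if __name__ == "__main__":
    # ASCII, Hebrew alef, LRM, first code point above 16 bits, top of UCS-4
    for cp in (0x41, 0x05D0, 0x200E, 0x10000, 0x7FFFFFFF):
        print("U+%04X -> %d" % (cp, utf8_length(cp)))
```

Everything up to U+07FF, which includes the whole Hebrew block, fits in two
bytes; the three-byte class starts at U+0800 and covers the rest of the 16-bit
range.
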
-- 
Dr. Zvi Har'El     mailto:[EMAIL PROTECTED]     Department of Mathematics
tel:+972-54-227607                   Technion - Israel Institute of Technology
fax:+972-4-8324654 http://www.math.technion.ac.il/~rl/     Haifa 32000, ISRAEL
"If you can't say somethin' nice, don't say nothin' at all." -- Thumper (1942)
                              Tuesday, 9 Shevat 5762, 22 January 2002,  1:25PM

