On Friday 09 March 2007, Avraham Rosenberg wrote:
> On Fri, Mar 09, 2007 at 12:18:27PM +0200, Shlomi Fish wrote:
> > On Friday 09 March 2007, Avraham Rosenberg wrote:
> > > Hi,
> > > I very often use the traditional Unix tools (mainly sed and tr)
> > > for text processing from the command line. They work very well as
> > > long as the input is plain text in English.
> > > Question: Are there ways (pointers, please) to use such tools with
> > > Hebrew text, HTML English text and maybe mixed Hebrew-English
> > > text and HTML files?
> >
> > For everything that involves character sets, encoding and/or HTML, I tend
> > to always resort to using Perl. perl-5.8.x has excellent support for
> > character sets and encodings, and has very nice modules for parsing HTML,
> > and other document formats.
> >
> > With regard to your question, from what I recall the tr command has
> > non-existent support for Unicode. Perl's equivalent command (tr/// or
> > y///) supports Unicode very well. I don't know about sed.
> >
> > Aside from Perl you may also wish to look at Python, and possibly other
> > languages.
>
> Thanks Shlomi. I never summoned the necessary motivation to learn
> something as big as perl. Actually, after hearing your two
> lectures at telux, my conclusion was that I won't try to
> learn it until I really need it. Otherwise, I'll forget it
> soon after putting a lot of effort into learning it. Maybe now is
> the opportunity. Its ability to deal with a variety of formats is
> very appealing.
> As to your remarks: With tr you can represent the characters by
> their octal codes, thus circumventing the problem of the
> encoding. 

Perhaps. However, I believe GNU tr still assumes that a character is
represented by a single byte, which doesn't work properly for multi-byte
encodings (like UTF-8).
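
To illustrate the point above, here is a rough sketch in Python (not tr
itself, and not a statement about how any particular tr is implemented).
The helper names translate_chars and translate_bytes are made up for this
example. It assumes UTF-8 input, where each Hebrew letter occupies two
bytes and the first of them is 0xD7 (octal 327). Translating byte by byte,
as one would when feeding octal codes to a byte-oriented tool, hits that
shared lead byte in every Hebrew letter, while translating decoded
characters touches only the letter that was asked for.

```python
def translate_chars(text, src, dst):
    """Character-aware translation: works on decoded Unicode code points."""
    return text.translate(str.maketrans(src, dst))


def translate_bytes(data, src, dst):
    """Byte-wise translation: every byte is treated as its own 'character'."""
    return data.translate(bytes.maketrans(src, dst))


if __name__ == "__main__":
    # "shalom world": four Hebrew letters, a space, five Latin letters
    hebrew = "\u05e9\u05dc\u05d5\u05dd world"
    raw = hebrew.encode("utf-8")

    # The character count and the byte count differ for multi-byte text
    print(len(hebrew), "characters,", len(raw), "bytes")

    # Replace only the letter shin (U+05E9) with 's'
    print(ascii(translate_chars(hebrew, "\u05e9", "s")))

    # Byte-wise attempt: shin is the two bytes D7 A9 (octal 327 251),
    # but D7 is also the first byte of the other Hebrew letters here
    mangled = translate_bytes(raw, b"\xd7\xa9", b"ss")
    print(mangled)
    try:
        mangled.decode("utf-8")
    except UnicodeDecodeError as err:
        print("byte-wise result is no longer valid UTF-8:", err)
```

The same idea applies in Perl: decode the input first, then let tr///
work on characters rather than bytes.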

Regards,

        Shlomi Fish 

---------------------------------------------------------------------
Shlomi Fish      [EMAIL PROTECTED]
Homepage:        http://www.shlomifish.org/

Chuck Norris wrote a complete Perl 6 implementation in a day but then
destroyed all evidence with his bare hands, so no one will know his secrets.

