On Fri, Mar 09, 2007 at 12:18:27PM +0200, Shlomi Fish wrote:
> On Friday 09 March 2007, Avraham Rosenberg wrote:
> > Hi,
> > I very often use the traditional Unix tools (mainly sed and tr)
> > for text processing from the command line. They work very well as
> > long as the input is plain text in English.
> > Question: Are there ways (pointers, please) to use such tools with
> > Hebrew text, English HTML text, and maybe mixed Hebrew-English
> > text and HTML files?
> 
> For everything that involves character sets, encoding and/or HTML, I tend to 
> always resort to using Perl. perl-5.8.x has excellent support for character 
> sets and encodings, and has very nice modules for parsing HTML, and other 
> document formats. 
> 
> With regard to your question, from what I recall the tr command has 
> non-existent support for Unicode. Perl's equivalent command (tr/// or y///) 
> supports Unicode very well. I don't know about sed.
> 
> Aside from Perl you may also wish to look at Python, and possibly other 
> languages.
> 
Thanks Shlomi. I never summoned the necessary motivation to learn
something as big as Perl. Actually, after hearing your two
lectures at telux, my conclusion was that I won't try to
learn it until I really need it. Otherwise, I'll forget it
soon after putting a lot of effort into learning it. Maybe now is
the opportunity. Its ability to deal with a variety of formats is
very appealing.
As to your remarks: With tr you can represent the characters by
their octal codes, thus circumventing the problem of the
encoding. I don't know about sed, either. What about dealing with
right-to-left direction in Perl? Is there direct support for it,
or should one filter the input through bidiv?
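
To make the octal-code idea concrete, here is a rough sketch of the
same byte-level translation, written in Python rather than tr. It
assumes the input is in a single-byte encoding such as ISO-8859-8,
where alef, bet and gimel are the bytes 340, 341 and 342 (octal).
The mapping to "a", "b", "g" and the name translate_bytes are made
up for illustration only.

```python
# Byte-for-byte translation, similar in spirit to: tr '\340\341\342' 'abg'
# Assumption: the data is ISO-8859-8 (one byte per Hebrew letter).

# Hypothetical mapping: alef -> a, bet -> b, gimel -> g
TABLE = bytes.maketrans(b"\340\341\342", b"abg")


def translate_bytes(data: bytes) -> bytes:
    """Replace each mapped byte; all other bytes pass through unchanged."""
    return data.translate(TABLE)


# Three Hebrew letters followed by plain ASCII, encoded as ISO-8859-8
sample = "\u05d0\u05d1\u05d2 abc".encode("iso-8859-8")
result = translate_bytes(sample)
print(result)
```

This only works when every character is a single byte. In UTF-8 each
Hebrew letter takes two bytes, so a byte-wise mapping like this one
would not apply as written.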

For the problem at hand, Geoff's remark seems to point to the best way to go.
Thanks and cheers, Avraham
