On Fri, Mar 09, 2007 at 12:18:27PM +0200, Shlomi Fish wrote:
> On Friday 09 March 2007, Avraham Rosenberg wrote:
> > Hi,
> > I use very often the traditional Unix tools (mainly sed and tr)
> > for text processing from the command line. They work very well as
> > long as the input is plain text in English.
> > Question: Are there ways (please pointers) to use such tools with
> > Hebrew text, html English text and maybe mixed Hebrew-English
> > text and html files ?
>
> For everything that involves character sets, encoding and/or HTML, I tend to
> always resort to using Perl. perl-5.8.x has excellent support for character
> sets and encodings, and has very nice modules for parsing HTML, and other
> document formats.
>
> In regards to your question, from what I recall the tr command has
> non-existent support for Unicode. Perl's equivalent command (tr/// or y///)
> support Unicode very well. I don't know about sed.
>
> Aside from Perl you may also wish to look at Python, and possibly other
> languages.
>

Thanks Shlomi.
I never summoned the necessary motivation to learn something as big as
Perl. Actually, after hearing your two lectures at telux, my conclusion
was that I won't try to learn it until I really need it. Otherwise,
I'll forget it soon after putting a lot of effort into learning it.
Maybe now is the opportunity. Its ability to deal with a variety of
formats is very appealing.

As to your remarks: With tr you can represent the characters by their
octal codes, thus circumventing the problem of the encoding. I don't
know about sed, either.

What about dealing with right-to-left direction in Perl? Is there direct
support for it, or should one filter the input through bidiv?
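
To make the octal-code remark concrete, here is a rough sketch in Python (which Shlomi also suggested) rather than in tr itself. It assumes the input is in ISO-8859-8, where the Hebrew letters alef to tav occupy octal 340 to 372, so a byte-level delete behaves roughly like `tr -d '\340-\372'`. The function names and the sample word are made up for illustration. The second function works on decoded text instead, which is closer to what Perl's tr/// does, and the byte table does not carry over to UTF-8, where each Hebrew letter takes two bytes.

```python
# Hypothetical illustration: the names and the sample text are assumptions,
# not something taken from the thread.

# Byte-level approach, analogous to tr with octal codes.
# Assumes ISO-8859-8: alef..tav are the single bytes 0o340..0o372.
HEBREW_BYTES = bytes(range(0o340, 0o372 + 1))


def strip_hebrew_bytes(data: bytes) -> bytes:
    # Delete every byte in the ISO-8859-8 Hebrew letter range.
    return data.translate(None, HEBREW_BYTES)


# Character-level approach on decoded text, independent of the byte encoding.
# Maps the five final letter forms (kaf, mem, nun, pe, tsadi) to regular forms.
FINAL_TO_REGULAR = str.maketrans(
    "\u05da\u05dd\u05df\u05e3\u05e5",
    "\u05db\u05de\u05e0\u05e4\u05e6",
)


def normalize_finals(text: str) -> str:
    # Replace final-form Hebrew letters with their regular forms.
    return text.translate(FINAL_TO_REGULAR)


if __name__ == "__main__":
    sample = "shalom \u05e9\u05dc\u05d5\u05dd"
    print(strip_hebrew_bytes(sample.encode("iso-8859-8")))
    print(normalize_finals(sample))
```

If the same bytes were UTF-8, strip_hebrew_bytes would leave the Hebrew untouched, because none of the UTF-8 bytes for these letters fall in the 340 to 372 range. So the octal trick seems to depend on knowing that the file is in a single-byte encoding.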

For the problem in hand, Geoff's remark seems to point to the best way to go.

Thanks and cheers,
Avraham