On Friday 09 March 2007, Avraham Rosenberg wrote:
> On Fri, Mar 09, 2007 at 12:18:27PM +0200, Shlomi Fish wrote:
> > On Friday 09 March 2007, Avraham Rosenberg wrote:
> > > Hi,
> > > I very often use the traditional Unix tools (mainly sed and tr)
> > > for text processing from the command line. They work very well as
> > > long as the input is plain text in English.
> > > Question: Are there ways (pointers, please) to use such tools with
> > > Hebrew text, English HTML text, and maybe mixed Hebrew-English
> > > text and HTML files?
> >
> > For everything that involves character sets, encodings and/or HTML, I
> > tend to always resort to using Perl. perl-5.8.x has excellent support
> > for character sets and encodings, and has very nice modules for parsing
> > HTML and other document formats.
> >
> > In regards to your question, from what I recall the tr command has
> > non-existent support for Unicode. Perl's equivalent command (tr/// or
> > y///) supports Unicode very well. I don't know about sed.
> >
> > Aside from Perl you may also wish to look at Python, and possibly other
> > languages.
>
> Thanks Shlomi. I never summoned the necessary motivation to learn
> something as big as perl. Actually, after hearing your two
> lectures at telux, my conclusion was that I won't try to
> learn it until I really need it. Otherwise, I'll forget it
> soon after putting a lot of effort into learning it. Maybe now is
> the opportunity. Its ability to deal with a variety of formats is
> very appealing.
> As to your remarks: With tr you can represent the characters by
> their octal codes, thus circumventing the problem of the
> encoding.

Perhaps. However, I believe GNU tr still assumes that a character is
represented by a single byte, which doesn't work properly for multi-byte
encodings (like UTF-8).

Regards,

	Shlomi Fish

---------------------------------------------------------------------
Shlomi Fish      [EMAIL PROTECTED]
Homepage:        http://www.shlomifish.org/

Chuck Norris wrote a complete Perl 6 implementation in a day but then
destroyed all evidence with his bare hands, so no one will know his secrets.
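
To make the single-byte point concrete, here is a minimal sketch in Python (one of the alternatives mentioned above). It is not how GNU tr is implemented; it only simulates a byte-for-byte translation next to a character-for-character one, roughly in the spirit of Perl's tr/// on decoded text. The helper names translate_bytes and translate_chars are made up for this illustration, and the input is assumed to be UTF-8.

```python
# Illustration only: hypothetical helpers, not GNU tr's or Perl's actual code.

def translate_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Map each byte in src to the byte at the same position in dst."""
    return data.translate(bytes.maketrans(src, dst))


def translate_chars(text: str, src: str, dst: str) -> str:
    """Map each character in src to the character at the same position in dst."""
    return text.translate(str.maketrans(src, dst))


# The Hebrew word "shalom": shin, lamed, vav, final mem (4 characters).
word = "\u05e9\u05dc\u05d5\u05dd"
encoded = word.encode("utf-8")  # 8 bytes; every letter here starts with 0xD7

# Character-wise: only the shin is replaced.
by_chars = translate_chars(word, "\u05e9", "X")

# Byte-wise: the two bytes of shin (0xD7 0xA9) are treated as two separate
# "characters", so the shared lead byte 0xD7 is rewritten in every letter.
by_bytes = translate_bytes(encoded, "\u05e9".encode("utf-8"), b"XX")

print(len(word), len(encoded))  # 4 8
print(by_bytes)                 # b'XXX\x9cX\x95X\x9d'

try:
    by_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    # The leftover continuation bytes no longer form valid UTF-8.
    print("byte-wise result is not valid UTF-8:", exc.reason)
```

The same thing happens when the bytes are written as octal escapes: the escapes name individual bytes, so a two-byte letter still cannot be matched as a unit by a tool that translates one byte at a time.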