On Friday 09 March 2007, Avraham Rosenberg wrote:
> Hi,
> I use very often the traditional Unix tools (mainly sed and tr)
> for text processing from the command line. They work very well as
> long as the input is plain text in English.
> Question: Are there ways (please pointers) to use such tools with
> Hebrew text, html English text and maybe mixed Hebrew-English
> text and html files ?

For everything that involves character sets, encodings and/or HTML, I
tend to resort to Perl. perl-5.8.x has excellent support for character
sets and encodings, and has very nice modules for parsing HTML and
other document formats.

With regard to your question: from what I recall, the tr command has no
support for Unicode. Perl's equivalent operator (tr/// or y///)
supports Unicode very well. I don't know about sed.

Aside from Perl, you may also wish to look at Python, and possibly
other languages.

>
> Last problem in hand: To extract e-mail addresses and names from
> a word document and create a two-column list. The first column
> containing the names of the addresses and the second links to
> their mail address.
> The list will contain 2000-5000 items.
> As you may guess, the friend which I am trying to help uses
> Microsoft Windows, and will, in the end build an excell file, but
> the MS tools we know are not adequate to deal with such a task.
> Maybe he will realize that Linux offers, after all, possibilities
> which Windows does not.

I should note that you can install Linux-like command line tools on
Windows using Cygwin, and there are native Win32 ports of many Unixish
tools.

Regards,

	Shlomi Fish

---------------------------------------------------------------------
Shlomi Fish      [EMAIL PROTECTED]
Homepage:        http://www.shlomifish.org/

Chuck Norris wrote a complete Perl 6 implementation in a day but then
destroyed all evidence with his bare hands, so no one will know his secrets.
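
P.S. To make the two points above a bit more concrete, here is a rough
sketch in Python 3 (one of the alternatives mentioned above), not a
finished tool. It assumes the Word document has first been saved as
plain text, and that each entry looks like "Name <address>"; both are
assumptions on my part, since I have not seen the actual file. The
function names (normalize_finals, extract_contacts, to_tsv) are made up
for illustration. The first part is a Unicode-aware analogue of tr that
maps Hebrew final-form letters to their regular forms; the second part
builds the two-column list of names and mailto: links.

```python
import re

# Unicode-aware analogue of `tr`: str.translate works on code points,
# not bytes, so Hebrew letters are handled like any other character.
# Maps final kaf, mem, nun, pe, tsadi to their regular forms.
FINAL_TO_REGULAR = str.maketrans(
    "\u05da\u05dd\u05df\u05e3\u05e5",
    "\u05db\u05de\u05e0\u05e4\u05e6",
)


def normalize_finals(text):
    """Replace Hebrew final-form letters with their regular forms."""
    return text.translate(FINAL_TO_REGULAR)


# Assumed entry format: Name <user@host.tld>, separated by commas,
# semicolons or newlines. Adjust the pattern to the real data.
CONTACT_RE = re.compile(
    r"([^<>\n,;]+?)\s*<([\w.+-]+@[\w-]+(?:\.[\w-]+)+)>"
)


def extract_contacts(text):
    """Return a list of (name, mailto-link) pairs found in the text."""
    return [
        (match.group(1).strip(), "mailto:" + match.group(2))
        for match in CONTACT_RE.finditer(text)
    ]


def to_tsv(rows):
    """Format the pairs as tab-separated lines, one contact per line."""
    return "\n".join(name + "\t" + link for name, link in rows)
```

Writing the result of to_tsv() to a file opened with encoding="utf-8"
should give something a spreadsheet program can import as two columns,
though I have not tested that with Excel.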