On Friday 09 March 2007, Avraham Rosenberg wrote:
> Hi,
> I use very often the traditional Unix tools (mainly sed and tr)
> for text processing from the command line. They work very well as
> long as the input is plain text in English.
> Question: Are there ways (please pointers) to use such tools with
> Hebrew text, html English text and maybe mixed Hebrew-English
> text and html files ?

For anything that involves character sets, encodings and/or HTML, I tend to
resort to Perl. Perl 5.8.x has excellent support for character sets and
encodings, and has very nice modules for parsing HTML and other document
formats.

With regard to your question: as far as I recall, the tr command has no real
support for Unicode. Perl's equivalent operator (tr/// or y///) supports
Unicode very well. I don't know about sed.

Aside from Perl you may also wish to look at Python, and possibly other 
languages.
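
To illustrate what a Unicode-aware "tr" looks like outside the shell, here is
a minimal sketch in Python 3. The helper name tr() and the choice of mapping
Hebrew final-form letters to their regular forms are only assumptions made
for the example; str.maketrans() and str.translate() are the standard calls.
The letters are written as \u escapes so the code does not depend on how your
editor or mail client displays right-to-left text.

```python
def tr(text, src, dst):
    """Rough analogue of tr: map src[i] to dst[i] for every character."""
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same length")
    return text.translate(str.maketrans(src, dst))


# Hebrew final forms (kaf, mem, nun, pe, tsadi) and their regular forms.
FINAL = "\u05da\u05dd\u05df\u05e3\u05e5"
REGULAR = "\u05db\u05de\u05e0\u05e4\u05e6"

if __name__ == "__main__":
    # "shalom" in Hebrew, ending in a final mem, inside an English sentence.
    sample = "Hello, \u05e9\u05dc\u05d5\u05dd!"
    print(tr(sample, FINAL, REGULAR))
```

When reading real files, the encoding has to be named explicitly, for example
open(path, encoding="cp1255") for older Hebrew Windows text or
encoding="utf-8" otherwise; which one applies depends on how the file was
saved.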

>
> Last problem in hand: To extract e-mail addresses and names from
> a word document and create a two-column list. The first column
> containing the names of the addresses and the second links to
> their mail address.
> The list will contain 2000-5000 items.
> As you may guess, the friend which I am trying to help uses
> Microsoft Windows, and will, in the end build an excell file, but
> the MS tools we know are not adequate to deal with such a task.
> Maybe he will realize that Linux offers, after all, possibilities
> which Windows does not.

I should note that you can install Linux-like command-line tools on Windows
using Cygwin, and there are native Win32 ports of many Unix-like tools.
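
As for the address list itself, here is a rough sketch in Python 3 (which
runs on Windows as well). It assumes the Word document has first been saved
as plain text with one person per line, each line holding a name and an
address in some order. That layout, the function names extract_pairs() and
to_tsv(), and the deliberately simple address pattern are all assumptions of
mine, not something taken from the actual document.

```python
import csv
import io
import re

# Simplified pattern: good enough for typical addresses, not a full
# RFC 5322 parser. re.ASCII keeps Hebrew letters out of the match.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", re.ASCII)


def extract_pairs(lines):
    """Return (name, mailto-link) tuples for lines containing an address."""
    pairs = []
    for line in lines:
        match = EMAIL_RE.search(line)
        if match is None:
            continue  # no address on this line
        address = match.group(0)
        # Whatever is left on the line, minus punctuation, is the name.
        rest = line[:match.start()] + line[match.end():]
        name = rest.strip(" \t\r\n<>,;:()\"")
        pairs.append((name, "mailto:" + address))
    return pairs


def to_tsv(pairs):
    """Render the pairs as tab-separated text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        "Dana Levi <dana@example.org>",
        "a line with no address",
        "\u05d9\u05d5\u05e1\u05d9 yossi@example.co.il",
    ]
    print(to_tsv(extract_pairs(sample)), end="")
```

The tab-separated output can be written to a file with encoding="utf-8" and
then imported into a spreadsheet as two columns.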

Regards,

        Shlomi Fish

---------------------------------------------------------------------
Shlomi Fish      [EMAIL PROTECTED]
Homepage:        http://www.shlomifish.org/

Chuck Norris wrote a complete Perl 6 implementation in a day but then
destroyed all evidence with his bare hands, so no one will know his secrets.
