On Thu, 5 Nov 2009 14:50:56 +1030 David Purton <dcpur...@marshwiggle.net> wrote:
> Can anyone suggest a simple way to strip vowels out of UTF-8 encoded
> Hebrew text, leaving just the consonants?
>
> i.e., given something like בָָּ֟֟רָא, pipe it through something so that the
> output is ברא. The Unicode characters <U+0591> to <U+05C7> ideally
> should be stripped. This includes accents, etc.

#!/usr/bin/perl -w
use strict;
use Encode;

while (<>) {
    # Decode the raw UTF-8 bytes into Perl characters
    $_ = Encode::decode('utf-8', $_);
    # Remove everything from U+0591 to U+05C7 (accents, vowel points, etc.)
    s/[\x{0591}-\x{05C7}]//g;
    # Re-encode as UTF-8 for output
    print Encode::encode('utf-8', $_);
}

This works (tested on your example, and on a sample from
http://www.mechon-mamre.org/c/ct/c0101.htm).

Celejar
--
foffl.sourceforge.net - Feeds OFFLine, an offline RSS/Atom aggregator
mailmin.sourceforge.net - remote access via secure (OpenPGP) email
ssuds.sourceforge.net - A Simple Sudoku Solver and Generator
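
For anyone who would rather not call Encode by hand, here is a rough sketch of the same substitution using PerlIO encoding layers. It is a variant, not the script above: the helper name strip_hebrew_marks is made up for illustration, and it reads standard input only (no file arguments), which should be enough for the "pipe it through something" case. Note that the range U+0591 to U+05C7 also covers a few punctuation characters such as maqaf (U+05BE), so words joined by a maqaf will run together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Let the I/O layers decode input and encode output as UTF-8,
# so the substitution below works on characters, not bytes.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (name is an assumption): takes an already-decoded
# string and returns it with everything in U+0591..U+05C7 removed.
sub strip_hebrew_marks {
    my ($text) = @_;
    $text =~ s/[\x{0591}-\x{05C7}]//g;
    return $text;
}

# Filter standard input line by line.
while (my $line = <STDIN>) {
    print strip_hebrew_marks($line);
}
```

Saved under a name of your choosing (say strip.pl, also hypothetical), it would be used as a filter: perl strip.pl < pointed.txt > consonants.txt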