Re: [sword-devel] Character Frequency

Peter von Kaehne Thu, 07 Jul 2011 14:40:48 -0700

On 03/07/11 18:43, Greg Hellings wrote:

> http://dl.thehellings.com/count.py


What one really needs, though (and what all solutions mentioned so far
lack), is a character counter which disregards OSIS tags and attributes.

A Latin "c" in the file of a Cyrillic Bible can either be perfectly
innocent (e.g. as part of the "chapter" tag) or it might stand in place
of a "с" (\u0441) in the text itself, in which case it causes a mess.
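
To make the distinction concrete, here is a minimal Python sketch of a
counter that looks only at text nodes. It is not one of the scripts from
this thread; the function name text_char_counts and the sample string are
made up for illustration, and it assumes the file parses as well-formed
XML.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter


def text_char_counts(xml_source: str) -> Counter:
    """Count characters found in text nodes only.

    Element names, attribute names and attribute values are never
    visited, so the "c" in <chapter> does not show up in the result.
    """
    root = ET.fromstring(xml_source)
    counts = Counter()
    # itertext() yields the text and tail strings of every element.
    for chunk in root.itertext():
        counts.update(chunk)
    return counts


# Hypothetical sample: the first word is spelled with Cyrillic U+0441,
# the second one wrongly starts with a Latin "c".
SAMPLE = (
    '<chapter osisID="Gen.1">'
    '<verse osisID="Gen.1.1">\u0441\u0432\u0435\u0442 c\u0432\u0435\u0442</verse>'
    '</chapter>'
)

counts = text_char_counts(SAMPLE)
for ch, n in counts.most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?'):<30} {n}")
```

In the output the stray Latin "c" appears as its own entry next to the
Cyrillic letters, while the letters of the tag and attribute names do
not appear at all.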

Similarly with numbers: a common problem in Arabic script texts we
receive is that the references in xrefs are in Western numbers. Again,
such numbers are a normal part of OSIS attributes.

I have just now committed a couple of scripts to sword-tools to assist
with this:

1) charmap.pl takes an OSIS file (or rather any XML file) and returns a
character map similar to those discussed, but solely for text nodes

2) osis_tr.pl does a "tr" job - replacing one set of characters with
another, but again only in text nodes

3) numbers.pl fixes the numbers problem above. I wrote this first,
before I generalised it into the osis_tr.pl script, but I think it
still has value, as the problem is so common.
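
The "tr on text nodes only" idea behind 2) and 3) could be sketched in
Python roughly as follows. This is not the Perl code of osis_tr.pl or
numbers.pl; the function name translate_text_nodes, the sample fragment,
and the choice of Arabic-Indic digits (U+0660 to U+0669) as the target
are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: Western digits in the visible text should become
# Arabic-Indic digits; attribute values must stay untouched.
WESTERN_TO_ARABIC_INDIC = str.maketrans(
    "0123456789",
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
)


def translate_text_nodes(xml_source: str, table: dict) -> str:
    """Apply a str.translate() table to text nodes only."""
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        # elem.text is the text before the first child,
        # elem.tail is the text following the element's end tag.
        if elem.text:
            elem.text = elem.text.translate(table)
        if elem.tail:
            elem.tail = elem.tail.translate(table)
    return ET.tostring(root, encoding="unicode")


# Hypothetical cross-reference note with Western digits in its text.
SAMPLE = (
    '<note type="crossReference">'
    '<reference osisRef="Gen.1.1">1:1</reference> 2'
    '</note>'
)

fixed = translate_text_nodes(SAMPLE, WESTERN_TO_ARABIC_INDIC)
print(fixed)
```

The osisRef value keeps its Western digits, while the reference text and
the trailing text are converted. One caveat: ElementTree rewrites
namespace prefixes when serialising, so for real OSIS files one would
probably register the namespace first or use a different XML library.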

Peter

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
