[sword-devel] Improvements in dictionary collation. was Re: AbbottSmith module question

DM Smith Fri, 15 Jan 2016 06:39:06 -0800

> On Jan 15, 2016, at 8:01 AM, Jonathan Morgan <jonmmor...@gmail.com> wrote:
> 
> Hi DM,
> 
> On Fri, Jan 15, 2016 at 1:40 AM, DM Smith <dmsm...@crosswire.org> wrote:
> I’ve been trawling through the code. Seems that there is support for Strong’s 
> Numbers that are not padded. If a module contains Strong’s Numbers that are 
> not padded, it is to use StrongsPadding=false. (Actually any value other than 
> “true” will be false. TRUE is false.) This module does not have it.
> 
> Not having StrongsPadding in a conf is the same as StrongsPadding=true. 
> There’s a note in the wiki that says that we’ll probably reverse that in the 
> future. I doubt it. We still have LZSS as the default compression though no 
> module has used it for years (other than experimental modules).
> 
> I’m not sure how a Bible with a reference to G0001 will find G1 as it doesn’t 
> unpad the user’s input. But at least the dictionary should work. BTW, there’s 
> a missing "if (strongsPadding)” in rawLD. It is present in zLD. I think this 
> is a bug. Need to verify, report and submit a patch for it. (BTW, I don’t 
> have write permissions either on the main repo, but I’m not discouraged in 
> contributing and submitting patches.)
> 
> Sorry if I'm missing something, but surely keys without padding wouldn't 
> appear in the correct (numeric) order in the dictionary?
> 
> Jon


Jon,

Right. They will be in collation order, not numerical order. For that reason 
it doesn’t work as a SWORD module, which was my primary motivation for moving 
it to the Experimental repository. The tei2mod program needs to add support 
for Strong’s numbers, as imp2ld has. At present it doesn’t pad the values as 
it puts them into the module.

The ordering problem is a more general problem. Our collation order is good for 
ASCII. It is not good for Latin-1, as the byte values for accented letters are 
not adjacent to those of their unaccented counterparts.
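
As a rough illustration of both effects (a minimal Python sketch, not SWORD 
code; the sample keys and words are made up for the example):

```python
# Unpadded Strong's-style keys, sorted by plain string comparison.
unpadded = ["G1", "G2", "G10", "G100"]
unpadded_sorted = sorted(unpadded)
# "G10" and "G100" land before "G2" because '1' < '2' character by character.

# Keys zero padded to one width sort the same way numerically and textually.
padded = ["G0001", "G0002", "G0010", "G0100"]
padded_sorted = sorted(padded)

# In Latin-1, 'é' is byte 0xE9, which is greater than 'z' (0x7A), so a word
# starting with 'é' sorts after every word starting with an unaccented letter.
words = ["eta", "zeta", "été"]
by_bytes = sorted(words, key=lambda w: w.encode("latin-1"))

print(unpadded_sorted)
print(padded_sorted)
print(by_bytes)
```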

Each language/script combination has its own collation order. Some languages 
use multiple glyphs for a single letter. This was noted earlier this month on 
this mailing list.

In a past job, I had to implement a sort routine that would account for numbers 
occurring anywhere in a string. What I discovered in the process of doing this 
was that there is a need for an internal representation that differs from an 
external representation and routines that would normalize an external 
representation to an internal representation. Basically that routine would look 
at a string as an alternating sequence of numbers and non-numbers. The routine 
external2internal would create a string where numbers were zero padded to 10 
digits. (It also did other things like strip noise words from the string, 
normalize dotted acronyms, normalize casing, …).
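
The number-padding part of such a routine might look roughly like the 
following Python sketch. This is a reconstruction from the description above, 
not the original code: the name external2internal and the 10-digit width come 
from the text, while the regular expression, the width parameter and the use of 
casefold() for case normalization are assumptions. Noise-word stripping and 
acronym handling are left out.

```python
import re

# Split a string into alternating runs of ASCII digits and non-digits.
_RUNS = re.compile(r"[0-9]+|[^0-9]+")

def external2internal(text, width=10):
    """Return a sort key string with every number zero padded to `width` digits."""
    parts = []
    for run in _RUNS.findall(text):
        if run[0] in "0123456789":
            # Numeric run: pad on the left so text order matches numeric order.
            # Runs already longer than `width` are left unchanged.
            parts.append(run.zfill(width))
        else:
            # Non-numeric run: normalize casing only.
            parts.append(run.casefold())
    return "".join(parts)

keys = ["G10", "G2", "G1", "G100"]
print(sorted(keys, key=external2internal))
```

The internal form is only used for comparison. The external form is what 
is shown to the user.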


Also in an earlier posting this month, I mentioned that ICU has collation 
routines that are language and script sensitive. The collation values that 
these produce are good for byte-order sorting, but are not intended for 
external use.

What we need is a dictionary that stores case-insensitive keys, which the 
frontend can then collate as it sees fit. That collation order would be used to 
sort and show the case-insensitive keys. Basically another layer of indirection 
with a mapping from external presentation to the internal storage of the module.
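
One way to picture that extra layer, as a hypothetical Python sketch only: the 
Lexicon class, its methods and the sample entries are invented for illustration 
and do not correspond to anything in the SWORD API.

```python
class Lexicon:
    """Hypothetical store: lookup is case-insensitive, ordering is the caller's."""

    def __init__(self, entries):
        # Internal storage: casefolded key -> (display key, entry text).
        self._store = {key.casefold(): (key, text) for key, text in entries.items()}

    def lookup(self, key):
        # Normalize the external key to the internal form before looking it up.
        return self._store[key.casefold()][1]

    def display_keys(self, collate):
        # The frontend supplies the collation function; storage order is irrelevant.
        return sorted((display for display, _ in self._store.values()), key=collate)

lex = Lexicon({"G10": "ten", "G2": "two", "G1": "one"})

# A frontend that wants numeric order for Strong's-style keys.
numeric_order = lex.display_keys(collate=lambda k: int(k[1:]))
print(numeric_order)
print(lex.lookup("g2"))
```

A different frontend could pass a locale-aware collation function instead 
without the stored module changing.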

We’ve talked about this before. I think Troy suggested a mechanism.

I’m going to survey the lexdict modules in all the repos in the Master list 
(and a few others) to see where we stand with those modules and the 
StrongsPadding flag. If any key starts with a number and isn’t zero padded, the 
module will have difficulty unless StrongsPadding=false is in the conf. If a 
module has some keys that are zero padded and others that are not, that is also 
a problem.
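
A check along those lines could be sketched in Python as below. It is only a 
heuristic under stated assumptions: the function name classify_keys and its 
result labels are invented, an optional single-letter prefix (such as G or H) 
is allowed before the number, and the list of keys is assumed to have been 
extracted from the module by some other means.

```python
import re

# Optional one-letter prefix, then the leading run of ASCII digits.
_LEADING_NUMBER = re.compile(r"^[A-Za-z]?([0-9]+)")

def classify_keys(keys):
    """Classify how the numeric keys in a module appear to be padded."""
    widths = set()
    has_leading_zero = False
    for key in keys:
        match = _LEADING_NUMBER.match(key)
        if match is None:
            continue  # not a numeric key; ignore it
        digits = match.group(1)
        widths.add(len(digits))
        if len(digits) > 1 and digits[0] == "0":
            has_leading_zero = True
    if not widths:
        return "no numeric keys"
    if len(widths) == 1:
        return "uniform width"  # sorts the same textually and numerically
    if has_leading_zero:
        return "mixed"          # some keys padded, others not
    return "unpadded"           # would need StrongsPadding=false

print(classify_keys(["00001", "00010", "00100"]))
print(classify_keys(["1", "10", "100"]))
print(classify_keys(["00001", "10", "100"]))
```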

DM

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
