L.Allan-pbio wrote:
While not simpler, there are ways to compress the files that don't
use stream compression such that each verse can be handled
independently.

I'd be interested in how to do this, and still get decent compression.
The basic, greatly simplified technique is to build a mapping from the characters seen on input to the bit patterns written on output. A streaming (de)compressor starts with a default dictionary and adjusts it as characters are seen. Rather than analyzing one character at a time, the analysis is done over a "window" of input; IIRC, a larger window gives better compression. Since the dictionary at any point depends on everything that came before, it is not possible to synchronize in the middle of the stream.
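
To make the "cannot synchronize mid-stream" point concrete, here is a toy sketch of a windowed scheme (my own illustration for this mail, in the style of LZ77, not code from zlib or anything SWORD uses). The output tokens refer back into text that was decoded earlier, so there is nothing to resolve them against if you jump into the middle of the stream:

    def lz_compress(data, window=4096, min_match=3):
        # Toy LZ77-style compressor: each token is either a literal byte or a
        # (distance, length) back-reference into the window of text already seen.
        out, i = [], 0
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(max(0, i - window), i):      # brute-force window search
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 255):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_match:
                out.append(("ref", best_dist, best_len))  # depends on prior output
                i += best_len
            else:
                out.append(("lit", data[i]))
                i += 1
        return out

    def lz_decompress(tokens):
        # Decoding has to start at the beginning: a ("ref", dist, len) token is
        # meaningless without the bytes already reconstructed in front of it,
        # which is why an adaptive/windowed stream cannot be entered part-way.
        buf = bytearray()
        for tok in tokens:
            if tok[0] == "lit":
                buf.append(tok[1])
            else:
                _, dist, length = tok
                for _ in range(length):
                    buf.append(buf[-dist])
        return bytes(buf)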

This kind of compression is not optimal. If it were known in advance what was being compressed (i.e. a two-pass approach or static knowledge of the text), a more appropriate dictionary could be built.

It has been a number of years since I have examined compression techniques, so there may be some recent advances...

IIRC, Huffman encoding produces an optimal compression of this kind. The basic idea is to build a tree (a trie) in which the most frequent patterns get the shortest paths. The algorithms that I saw did this on input assumed to be in a single-byte character encoding such as ASCII or Latin-1, but it is readily adaptable to UTF-8 by considering bytes rather than characters.
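
For concreteness, a minimal sketch of the tree construction over raw bytes (my own illustration, written for this mail; any real SWORD code would look different):

    import heapq
    from collections import Counter

    def build_huffman(data):
        # Build the code tree from byte frequencies.  Working on raw bytes means
        # UTF-8 text needs no special handling.  Returns (codes, root) where a
        # tree node is either a byte value (leaf) or a (left, right) pair.
        freq = Counter(data)
        heap = [(f, n, b) for n, (b, f) in enumerate(freq.items())]
        if not heap:                                    # empty input
            return {}, None
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)             # two least frequent trees...
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, n, (t1, t2)))  # ...merge under a new node
            n += 1
        root = heap[0][2]
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"             # frequent bytes = short codes
        walk(root, "")
        return codes, root

Encoding is then just concatenating codes[b] for each byte of the input.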

Further compression can be achieved by analyzing "word" frequency and prioritizing words in the dictionary. E.g. the word "the" probably occurs more frequently than the letter "Z". If encoding "t", "h", "e" separately costs more (the sum of the lengths of their paths, plus the time spent processing them) than encoding "the" as a single entry, then it makes sense to put "the" in the dictionary. (Greatly simplified! The algorithms I saw defined a word as a sequence of letters whose maximum length is bounded.)
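
Very roughly, that trade-off could be estimated like this (a back-of-the-envelope sketch of my own, not one of the bounded-word algorithms I mentioned):

    import re
    from math import log2
    from collections import Counter

    def candidate_words(text, max_len=8, top_n=50):
        # Heuristic illustration only: a "word" is a bounded-length run of
        # letters; it earns a dictionary entry of its own if coding it as one
        # symbol would plausibly save bits over coding its letters one at a time.
        words = re.findall(rf"[A-Za-z]{{2,{max_len}}}", text)
        char_freq, word_freq = Counter(text), Counter(words)
        total = sum(char_freq.values())

        def char_bits(c):
            # ideal code length for a character, from its probability
            return -log2(char_freq[c] / total)

        savings = {}
        for w, n in word_freq.items():
            letter_cost = sum(char_bits(c) for c in w)   # bits per occurrence today
            word_cost = -log2(n / total)                 # rough cost as its own symbol
            savings[w] = n * (letter_cost - word_cost)   # estimated total bits saved
        return sorted(savings, key=savings.get, reverse=True)[:top_n]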

Another gain can come from using a smaller, substitute representation for the markup language. E.g. <hi type="italic">...</hi> could become <e>...</e>, where the substitution chosen is highly compressible. (Google: xml compression)
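
Something along these lines (a naive sketch; the short tags are made up, and a real scheme would have to guarantee the short forms never collide with actual content):

    TAG_MAP = {
        '<hi type="italic">': "<e>",
        "</hi>": "</e>",
        '<note type="crossReference">': "<x>",
        "</note>": "</x>",
    }

    def shrink_markup(text):
        # applied before compression
        for long_form, short_form in TAG_MAP.items():
            text = text.replace(long_form, short_form)
        return text

    def expand_markup(text):
        # applied after decompression, restoring the original markup
        for long_form, short_form in TAG_MAP.items():
            text = text.replace(short_form, long_form)
        return text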

The nature of Huffman decompression is that one can start anywhere in the bit stream. If the bits at the current position do not resolve to an entry in the dictionary, bits are skipped until a position is found that does resolve. From that point forward the stream can be understood.
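
In code the idea looks something like this (again just a sketch, reusing the tree from the Huffman sketch above):

    def decode(bits, root):
        # Walk the code tree bit by bit ('root' as returned by build_huffman).
        # Returns the decoded byte values, or None if the stream does not end
        # exactly on a symbol boundary.
        out, node = [], root
        for b in bits:
            node = node[0] if b == "0" else node[1]
            if not isinstance(node, tuple):          # reached a leaf
                out.append(node)
                node = root
        return out if node is root else None

    def decode_from(bits, root, start):
        # Picking up mid-stream: skip bit positions until the remainder of the
        # stream resolves cleanly against the dictionary.  (A sketch of the
        # idea; whether the first few recovered symbols are the right ones
        # depends on where the real symbol boundaries fall.)
        for offset in range(start, len(bits)):
            result = decode(bits[offset:], root)
            if result is not None:
                return offset, result
        return None, []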

Building the dictionary would take a single pass over the input to do the statistical analysis. The dictionary would be written as part of the module, probably as a separate file.
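
Persisting the statistics could be as simple as the following (file name and format are purely hypothetical; the point is only that the reader rebuilds the identical code tree from the same frequency table gathered in the analysis pass):

    import json
    from collections import Counter

    def write_dictionary(freq, path):
        # 'freq' is the byte-frequency Counter from the single analysis pass.
        with open(path, "w") as f:
            json.dump({str(k): v for k, v in freq.items()}, f)

    def read_dictionary(path):
        with open(path) as f:
            return Counter({int(k): v for k, v in json.load(f).items()})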

Compressing the module would probably be best done on atomic units, say verses, since decompression will be done on those units.
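
Tying it together, the per-verse layer might look like this (names are hypothetical, building on the earlier sketches):

    def compress_verse(verse, codes):
        # Encode a single verse with the module-wide code table.  Because the
        # table is shared and static, every verse decompresses on its own --
        # no stream state carries over from the previous verse.
        return "".join(codes[b] for b in verse)

    # Sketch of use; 'whole_module' and 'verses' are made-up names:
    #   codes, root = build_huffman(whole_module)                 # one analysis pass
    #   packed = [compress_verse(v, codes) for v in verses]       # one unit per verse
    #   middle = decode(packed[len(packed) // 2], root)           # random access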

I am not aware of any available code to do this. It might exist. But it probably would need to be written.

Is it worth the effort? I don't think so at this point in time. My take on it is that there is enough to do that this gets pushed further down my list of things to do (it is on my todo list). And unless it makes sense in the SWORD world as a contribution, it would only be an academic exercise for me (which I love doing).

I think that in the LCDBible world, it would make lots of sense.



