L.Allan-pbio wrote:
While not simpler, there are ways to compress the files that don't
use stream compression such that each verse can be handled
independently.

I'd be interested in how to do this, and still get decent compression.
The basic, greatly simplified technique is to build a mapping from the characters seen on input to the bit patterns written on output. A streaming (de)compressor starts with a default dictionary and adjusts it as characters are seen. Rather than analyzing one character at a time, the analysis is done over a "window" of input; IIRC, a larger window gives better compression. Since the dictionary at any point depends on everything that came before, it is not possible to synchronize in the middle of the stream.
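
To make the "cannot synchronize mid-stream" point concrete, here is a toy sketch of a windowed scheme (my own illustration for this mail, in the style of LZ77, not code from zlib or anything SWORD uses). The output tokens refer back into text that was decoded earlier, so there is nothing to resolve them against if you jump into the middle of the stream:

    def lz_compress(data, window=4096, min_match=3):
        # Toy LZ77-style compressor: each token is either a literal byte or a
        # (distance, length) back-reference into the window of text already seen.
        out, i = [], 0
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(max(0, i - window), i):      # brute-force window search
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 255):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_match:
                out.append(("ref", best_dist, best_len))  # depends on prior output
                i += best_len
            else:
                out.append(("lit", data[i]))
                i += 1
        return out

    def lz_decompress(tokens):
        # Decoding has to start at the beginning: a ("ref", dist, len) token is
        # meaningless without the bytes already reconstructed in front of it,
        # which is why an adaptive/windowed stream cannot be entered part-way.
        buf = bytearray()
        for tok in tokens:
            if tok[0] == "lit":
                buf.append(tok[1])
            else:
                _, dist, length = tok
                for _ in range(length):
                    buf.append(buf[-dist])
        return bytes(buf)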

This kind of compression is not optimal. If it were known in advance what was being compressed (i.e. a two-pass approach or static knowledge of the text), a more appropriate dictionary could be built.

It has been a number of years since I have examined compression techniques, so there may be some recent advances...

IIRC, Huffman encoding produces an optimal compression of this kind. The basic idea is to build a tree (a trie) in which the most frequent patterns get the shortest paths. The algorithms that I saw did this on input assumed to be in a single-byte character encoding such as ASCII or Latin-1, but it is readily adaptable to UTF-8 by considering bytes rather than characters.
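
For concreteness, a minimal sketch of the tree construction over raw bytes (my own illustration, written for this mail; any real SWORD code would look different):

    import heapq
    from collections import Counter

    def build_huffman(data):
        # Build the code tree from byte frequencies.  Working on raw bytes means
        # UTF-8 text needs no special handling.  Returns (codes, root) where a
        # tree node is either a byte value (leaf) or a (left, right) pair.
        freq = Counter(data)
        heap = [(f, n, b) for n, (b, f) in enumerate(freq.items())]
        if not heap:                                    # empty input
            return {}, None
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)             # two least frequent trees...
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, n, (t1, t2)))  # ...merge under a new node
            n += 1
        root = heap[0][2]
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"             # frequent bytes = short codes
        walk(root, "")
        return codes, root

Encoding is then just concatenating codes[b] for each byte of the input.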

Further compression can be achieved by analyzing "word" frequency and prioritizing words in the dictionary. E.g. the word "the" probably occurs more frequently than the letter "Z". If encoding "t", "h", "e" separately costs more (the sum of the lengths of their paths, plus the time spent processing them) than encoding "the" as a single entry, then it makes sense to put "the" in the dictionary. (Greatly simplified! The algorithms I saw defined a word as a sequence of letters whose maximum length is bounded.)
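
Very roughly, that trade-off could be estimated like this (a back-of-the-envelope sketch of my own, not one of the bounded-word algorithms I mentioned):

    import re
    from math import log2
    from collections import Counter

    def candidate_words(text, max_len=8, top_n=50):
        # Heuristic illustration only: a "word" is a bounded-length run of
        # letters; it earns a dictionary entry of its own if coding it as one
        # symbol would plausibly save bits over coding its letters one at a time.
        words = re.findall(rf"[A-Za-z]{{2,{max_len}}}", text)
        char_freq, word_freq = Counter(text), Counter(words)
        total = sum(char_freq.values())

        def char_bits(c):
            # ideal code length for a character, from its probability
            return -log2(char_freq[c] / total)

        savings = {}
        for w, n in word_freq.items():
            letter_cost = sum(char_bits(c) for c in w)   # bits per occurrence today
            word_cost = -log2(n / total)                 # rough cost as its own symbol
            savings[w] = n * (letter_cost - word_cost)   # estimated total bits saved
        return sorted(savings, key=savings.get, reverse=True)[:top_n]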

Another gain can come from using a smaller, substitute representation for the markup language. E.g. <hi type="italic">...</hi> could become <e>...</e>, where the substitution chosen is highly compressible. (Google: xml compression)
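
Something along these lines (a naive sketch; the short tags are made up, and a real scheme would have to guarantee the short forms never collide with actual content):

    TAG_MAP = {
        '<hi type="italic">': "<e>",
        "</hi>": "</e>",
        '<note type="crossReference">': "<x>",
        "</note>": "</x>",
    }

    def shrink_markup(text):
        # applied before compression
        for long_form, short_form in TAG_MAP.items():
            text = text.replace(long_form, short_form)
        return text

    def expand_markup(text):
        # applied after decompression, restoring the original markup
        for long_form, short_form in TAG_MAP.items():
            text = text.replace(short_form, long_form)
        return text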

The nature of Huffman decompression is that one can start anywhere in the bit stream. If the bits at the current position do not resolve to an entry in the dictionary, bits are skipped until a position is found that does resolve. From that point forward the stream can be understood.
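
In code the idea looks something like this (again just a sketch, reusing the tree from the Huffman sketch above):

    def decode(bits, root):
        # Walk the code tree bit by bit ('root' as returned by build_huffman).
        # Returns the decoded byte values, or None if the stream does not end
        # exactly on a symbol boundary.
        out, node = [], root
        for b in bits:
            node = node[0] if b == "0" else node[1]
            if not isinstance(node, tuple):          # reached a leaf
                out.append(node)
                node = root
        return out if node is root else None

    def decode_from(bits, root, start):
        # Picking up mid-stream: skip bit positions until the remainder of the
        # stream resolves cleanly against the dictionary.  (A sketch of the
        # idea; whether the first few recovered symbols are the right ones
        # depends on where the real symbol boundaries fall.)
        for offset in range(start, len(bits)):
            result = decode(bits[offset:], root)
            if result is not None:
                return offset, result
        return None, []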

Building the dictionary would take a single pass over the input to do the statistical analysis. The dictionary would be written as part of the module, probably as a separate file.
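
Persisting the statistics could be as simple as the following (file name and format are purely hypothetical; the point is only that the reader rebuilds the identical code tree from the same frequency table gathered in the analysis pass):

    import json
    from collections import Counter

    def write_dictionary(freq, path):
        # 'freq' is the byte-frequency Counter from the single analysis pass.
        with open(path, "w") as f:
            json.dump({str(k): v for k, v in freq.items()}, f)

    def read_dictionary(path):
        with open(path) as f:
            return Counter({int(k): v for k, v in json.load(f).items()})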

Compressing the module would probably be best done on atomic units, say verses, since decompression will be done on those units.
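
Tying it together, the per-verse layer might look like this (names are hypothetical, building on the earlier sketches):

    def compress_verse(verse, codes):
        # Encode a single verse with the module-wide code table.  Because the
        # table is shared and static, every verse decompresses on its own --
        # no stream state carries over from the previous verse.
        return "".join(codes[b] for b in verse)

    # Sketch of use; 'whole_module' and 'verses' are made-up names:
    #   codes, root = build_huffman(whole_module)                 # one analysis pass
    #   packed = [compress_verse(v, codes) for v in verses]       # one unit per verse
    #   middle = decode(packed[len(packed) // 2], root)           # random access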

I am not aware of any available code to do this. It might exist. But it probably would need to be written.

Is it worth the effort? I don't think so at this point in time. My take on it is that there is enough to do that this gets pushed further down my list of things to do (it is on my todo list). And unless it makes sense in the SWORD world as a contribution, it would only be an academic exercise for me (which I love doing).

I think that in the LCDBible world, it would make lots of sense.



