Regarding the algorithm itself, I am now running it at the byte level instead of 
the line level, so it works on any kind of data.

The algorithm's runtime is indeed linear in the input length, which still 
amazes me. How is compression in linear time even possible? Yet it is. The 
basic trick is that combining repeated pairs into a new symbol roughly halves 
the input length in each round, so the total work is about n + n/2 + n/4 + ... 
≈ 2n, i.e. one amortized operation per input byte - while still finding 
structures of any length.
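In case the pair-combining mechanism is unclear, here is a minimal sketch of one such round in Java. The class and method names are my own illustrative choices, not the actual implementation: it counts adjacent symbol pairs, picks the most frequent one, and rewrites every occurrence as a fresh symbol. The real linear-time bound presumably relies on more careful incremental bookkeeping than this naive standalone round.

```java
import java.util.*;

// Illustrative sketch of one pair-combining round (assumed mechanism,
// not the author's actual code). Symbols are ints: 0..255 are raw
// bytes, 256+ are symbols created by earlier rounds.
public class PairCompress {
    static int[] combineMostFrequentPair(int[] in, int freshSymbol) {
        // Count every adjacent pair of symbols.
        Map<Long, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < in.length; i++)
            counts.merge(((long) in[i] << 32) | in[i + 1], 1, Integer::sum);

        // Pick the pair that occurs most often (must occur at least twice).
        long best = -1;
        int bestCount = 1;
        for (Map.Entry<Long, Integer> e : counts.entrySet())
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        if (best == -1) return in;  // nothing repeats: this round is a no-op

        // Rewrite every occurrence of the best pair as the fresh symbol.
        int a = (int) (best >>> 32), b = (int) best;
        int[] out = new int[in.length];
        int n = 0;
        for (int i = 0; i < in.length; i++) {
            if (i + 1 < in.length && in[i] == a && in[i + 1] == b) {
                out[n++] = freshSymbol;
                i++;  // consume both halves of the pair
            } else {
                out[n++] = in[i];
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] src = "abababab".getBytes();
        int[] data = new int[src.length];
        for (int i = 0; i < src.length; i++) data[i] = src[i] & 0xff;

        int[] once = combineMostFrequentPair(data, 256);   // "ab" -> 256
        int[] twice = combineMostFrequentPair(once, 257);  // 256,256 -> 257
        System.out.println(once.length + " " + twice.length);  // prints "4 2"
    }
}
```

On "abababab" the length shrinks 8 → 4 → 2 symbols, which is the halving effect described above: each round costs time proportional to the current length, and those lengths form a geometric series.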

Just to clarify: this complexity class, O(n), is the lowest conceivable for 
any compression algorithm whatsoever. You simply can't do less than look at 
each input byte once.

Sadly, despite this theoretically excellent linear complexity, the algorithm is 
not that fast in practical terms. Each byte takes 5 to 7 microseconds to 
process, so we are looking at a maximum of ~200 kB/sec. 
(Benchmark: <http://code.botcompany.de/1028520>)

These results are for a single core; the algorithm is not easy to parallelize, 
although I suppose it could be done.

Memory-wise, the optimized Java implementation consumes about 100 bytes of 
memory per input byte, which I think is pretty decent.
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-M9ee73b5390fb07a0c885d8b8