Your idea is similar to Compression by induction of hierarchical grammars by Craig G. Nevill-Manning, Ian H . Witten & David L. Maulsby, October, 1993
It looks promising but I don't know of any practical implementations. You have to make multiple passes to find the most frequent pair of symbols, but it is faster (but suboptimal) to find many frequent pairs at once in fewer passes. On Thu, May 28, 2020, 8:58 AM stefan.reich.maker.of.eye via AGI < [email protected]> wrote: > I think I got the theoretical runtime down to O(n) if we skip sorting the > literal lines in the output file (which doesn't make any difference to the > core idea). > > So the algorithm basically compresses a list of ints to a compact binary > graph in O(n) time. It does so like this: > > while true: > (i, j) = most often (and at least twice!) occurring adjacent pair in > list; end loop if none found > k = index of new or existing table entry (i, j) > replace all occurrences of (i, j) in list with (k) > > Each step of the loop can be done in constant time by clever use of hash > maps and linked lists. That almost sounds impossible, but there we are, > just finished the code. In practical terms, for enwik8, it's just about as > fast than the last version that was more O(n log n)-ish. > > I'm truly wondering if this algorithm has been discovered before. > *Artificial General Intelligence List <https://agi.topicbox.com/latest>* > / AGI / see discussions <https://agi.topicbox.com/groups/agi> + > participants <https://agi.topicbox.com/groups/agi/members> + delivery > options <https://agi.topicbox.com/groups/agi/subscription> Permalink > <https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-Ma69c283a13a3bf51eb532bef> > ------------------------------------------ Artificial General Intelligence List: AGI Permalink: https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-M8cdf4daa6a5627de0e2fb67e Delivery options: https://agi.topicbox.com/groups/agi/subscription
