I think I got the theoretical runtime down to O(n) if we skip sorting the literal lines in the output file (which doesn't make any difference to the core idea).
So the algorithm basically compresses a list of ints to a compact binary graph in O(n) time. It does so like this: while true: (i, j) = most often (and at least twice!) occurring adjacent pair in list; end loop if none found k = index of new or existing table entry (i, j) replace all occurrences of (i, j) in list with (k) Each step of the loop can be done in constant time by clever use of hash maps and linked lists. That almost sounds impossible, but there we are, just finished the code. In practical terms, for enwik8, it's just about as fast than the last version that was more O(n log n)-ish. I'm truly wondering if this algorithm has been discovered before. ------------------------------------------ Artificial General Intelligence List: AGI Permalink: https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-Ma69c283a13a3bf51eb532bef Delivery options: https://agi.topicbox.com/groups/agi/subscription
