Hi all,

In preparation for creating the index for my book, I wrote a Ruby program to list every word in a file (in this case the .lyx file).
Now of course a plain word list could be produced with a simple one-liner using sed and sort -u, but my program lists the words in two different orders: first in alphabetical order, which the one-liner could also do, and then in descending order of occurrence, which that one-liner can't.

The occurrence order is very handy because the most frequent words are usually garbage such as "the", "he", "a" and the like, so you can scan and delete those words very quickly. The words used only once or twice make up the majority of the list, and because they're used so rarely they're typically not important, so you can scan them very quickly too. The words in the middle typically include many that are useful in constructing an index, and should be perused more carefully.

My program is licensed under the GNU GPL version 2, and is included below. Have fun with it.

SteveT

#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

# Punctuation characters stripped from the start and end of each word
$punct=Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[", "]", "{", "}", "|", "`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "+"])

# Sort comparator for [word, count] pairs:
# descending count first, then ascending word
def by_freq_then_name(a, b)
  if a[1] < b[1]
    return 1
  elsif a[1] > b[1]
    return -1
  elsif a[0] > b[0]
    return 1
  elsif a[0] < b[0]
    return -1
  else
    return 0
  end
end

word_hash = Hash.new()

STDIN.each do |line|
  line.chomp!
  line.strip!
  temparr = line.split(/\s+/)
  temparr.each do |word|
    # Strip leading punctuation
    while word.length > 0 and $punct.include?(word[0].chr)
      word = word[1..-1]
    end
    # Strip trailing punctuation
    while word.length > 0 and $punct.include?(word[-1].chr)
      word = word[0..-2]
    end
    # Skip tokens that consisted of nothing but punctuation
    next if word.empty?
    if word_hash.has_key?(word)
      word_hash[word] += 1
    else
      word_hash[word] = 1
    end
  end
end

puts "================================================="
puts "=============== ALPHA ORDER ====================="
puts "================================================="
keys = word_hash.keys.sort
keys.each do |key|
  printf "%24s %6d\n", key, word_hash[key]
end

puts "================================================="
puts "============ OCCURRENCE ORDER ==================="
puts "================================================="
temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do |word_freq|
  printf "%7d %s\n", word_freq[1], word_freq[0]
end
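
The scan-and-delete step described above (drop the very frequent words, drop the words used only once or twice, and read what is left) could also be done mechanically. What follows is only a minimal sketch, not part of the program itself: index_candidates, min_count and max_count are hypothetical names, the thresholds are arbitrary, and the sample counts are made up. It assumes Ruby 1.9 or later, where Hash#select returns a Hash.

```ruby
# Hypothetical helper, not part of the program above: keep only the words
# whose count lies between min_count and max_count (inclusive), ordered by
# descending count and then alphabetically, like the OCCURRENCE ORDER list.
def index_candidates(word_hash, min_count, max_count)
  middle = word_hash.select { |_word, count| count >= min_count && count <= max_count }
  # Negating the count sorts high counts first; ties fall back to the word.
  middle.sort_by { |word, count| [-count, word] }
end

# Made-up counts, for illustration only.
sample = { "the" => 120, "index" => 9, "lyx" => 7, "ruby" => 9, "zebra" => 1 }

index_candidates(sample, 3, 50).each do |word, count|
  printf "%7d %s\n", count, word
end
```

With the sample data this lists index and ruby (9 each) followed by lyx (7); "the" and "zebra" fall outside the band. Sensible thresholds depend on the length of the book, so they are best chosen after a look at the full occurrence listing.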