Re: [Patch] gimmick: show word frequencies in new buffer

Helge Hafting Mon, 07 Feb 2005 04:35:41 -0800

Andreas Vox wrote:

Hi!
The attached patch realizes a function to display a sorted word list
of the current document in a new buffer.
It is based on the code for Tools->Count Words and my first
experiments with DocIterator & Co. :-)
Why? Well, for LyX 1.5 I'd like to contribute some functions for index generation. The first step is to get a list of all words in the text (*). The other reason is that I wanted to get a feeling for DocIterator & Co., and this function looked simple enough ...

So, send me comments!


This seems useful for several purposes, but perhaps a warning about
automatic index generation is in order.  The frequency is a useful statistic,
but a high frequency has no bearing on whether to index the word.  The obvious
example here should be the top word on your list. :-)  We index what "people
might want to look up", not "all we have".

Also, make sure this thing does not get in the
way of indexing whole phrases, math expressions, images and other stuff that
doesn't show up in a wordlist.  Well, I guess it doesn't, but still.

Finally, and most important: autoindexing every occurrence of some index-worthy word often yields a useless index. Perhaps there are cases where such indexing is mandatory. But for an ordinary book the requirement is not to index every occurrence of a word, but the 1, 2 or 3 most important places where the word occurs. Few people want to mess around with "word, 1,2,6,8,12-16, 14, 18,19,22, 25-31,36" only to discover that "word" is thoroughly explained on pages 14 and 26-28, and that all the other references merely mention "word" briefly.

I'm sure there is a simpler way to generate LyXText. And I need a method to insert insets as well. LCursor.insertInset() crashed when I used it with larger documents.

At least now we know that the English UserGuide.lyx not only contains 31640 words, but that only 3501 of them are distinct. The most frequent are:
the 1893
to 842
a 819
One of the most infrequent ones is:
unlucky 1

Good, don't want too many of that one. :-)
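
For anyone who wants to reproduce this kind of list outside LyX, here is a minimal sketch in Python (not the patch's C++ code, and not using DocIterator). The function name word_frequencies and the rule for what counts as a word are my own assumptions, so the totals will not necessarily match what Tools->Count Words reports.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first, ties alphabetical.

    Assumption: a "word" is a run of letters, optionally followed by an
    apostrophe and more letters; case is ignored.
    """
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

if __name__ == "__main__":
    sample = "The cat and the dog. The end."
    for word, count in word_frequencies(sample):
        print(word, count)
```

Sorting on (-count, word) keeps the output stable when several words share the same count.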

(*) The idea is that LyX provides a wordlist, removes from it any words in a stoplist and collates the rest by stemming (see http://snowball.tartarus.org/). Then the user can delete any words he/she doesn't want indexed. Another LyX function takes the remaining words and inserts IndexInsets after each occurrence of them in the original text.
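
The stoplist and stemming steps described above could be sketched roughly as follows, in Python. Everything named here is hypothetical: STOPLIST is a tiny made-up list, and toy_stem is a crude suffix stripper standing in for a real Snowball stemmer, which does far more than this.

```python
from collections import defaultdict

# Hypothetical stoplist; a real one would be much longer and per-language.
STOPLIST = {"the", "to", "a", "of", "and", "is", "in"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common English suffix."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def candidate_index_terms(frequencies):
    """Drop stoplisted words, then group the remaining words by stem.

    frequencies maps word -> count. The result maps stem -> a dict holding
    the summed count and the set of surface forms seen for that stem.
    """
    groups = defaultdict(lambda: {"count": 0, "forms": set()})
    for word, count in frequencies.items():
        if word in STOPLIST:
            continue
        group = groups[toy_stem(word)]
        group["count"] += count
        group["forms"].add(word)
    return dict(groups)
```

The user would then prune this candidate list by hand before any IndexInsets are inserted.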

I also think a function which jumps from one IndexInset to the next with the same key could be useful.

This one is very necessary with such an approach to indexing, because the author will usually want to delete all non-important references to the word in question. Example: I wrote a book about algorithms, and the word "algorithm" is used in every chapter, almost on every other page. But there is only one index entry for "algorithm", which points to the definition in the preface. Perhaps the program shouldn't add the entries at all, just move from word to word and ask whether to add an entry at that point?
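
To make the "move from word to word and ask" idea concrete, here is a small sketch in Python working on plain text rather than on a LyX buffer. The function choose_index_positions and its decide callback are hypothetical names; in LyX the callback would be a dialog and the returned positions would be where IndexInsets go.

```python
import re

def choose_index_positions(text, word, decide, context_width=30):
    """Visit each whole-word occurrence of `word` and ask whether to index it.

    decide(position, context) is called once per occurrence and returns
    True to index it. The returned list holds the offset just after each
    accepted occurrence, i.e. where an index entry would be inserted.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    chosen = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context_width)
        context = text[start : match.end() + context_width]
        if decide(match.start(), context):
            chosen.append(match.end())
    return chosen

if __name__ == "__main__":
    text = "An algorithm is a recipe. Every algorithm here terminates."
    first = text.index("algorithm")
    # Accept only the defining occurrence, as in the book example above.
    print(choose_index_positions(text, "algorithm", lambda pos, ctx: pos == first))
```

Nothing is inserted unless the author says so, which avoids having to delete unimportant entries afterwards.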

The ultimate goal would be to edit the index in LyX with all the bells and whistles makeindex provides:
* subitems and subsubitems
* page ranges


You'll find that page ranges are partially supported already: an index entry
that is repeated on several consecutive pages is automatically coalesced
to a range. :-)
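
The coalescing itself is simple; a sketch in Python of the general idea follows. The function name coalesce_pages is made up, and this version merges any run of two or more consecutive pages, which is an assumption on my part and may differ from makeindex's exact rules.

```python
def coalesce_pages(pages):
    """Collapse runs of consecutive page numbers into 'first-last' ranges."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Page directly follows the current run: extend it.
            ranges[-1][1] = page
        else:
            # Gap (or first page): start a new run.
            ranges.append([page, page])
    return ", ".join(
        str(first) if first == last else f"{first}-{last}"
        for first, last in ranges
    )

if __name__ == "__main__":
    print(coalesce_pages([26, 14, 27, 28]))
```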

* special markup


Now that'd be something - the ability to use advanced indexing without having
to type LaTeX, or watch out for specials like "_" and so on.

Helge Hafting
