Andreas Vox wrote:
Hi!
The attached patch implements a function that displays a sorted word list of the current document in a new buffer. It is based on the code for Tools->Count Words and my first experiments with DocIterator & Co. :-)
Why?
Well, for LyX 1.5 I'd like to contribute some functions for index generation.
The first step is to get a list of all words in the text (*)
The other reason is that I wanted to get a feeling for DocIterator & Co.
and this function looked simple enough ...
So, send me comments!
This seems useful for several purposes, but perhaps a warning about automatic index generation is in order. The frequency is a useful statistic, but a high frequency has no bearing on whether to index the word. The obvious example here should be the top word on your list. :-) We index what "people might want to look up", not "all we have".
Also, make sure this thing does not get in the way of indexing whole phrases, math expressions, images and other stuff that doesn't show up in a word list. Well, I guess it doesn't, but still.
Finally, and most important: Autoindexing every occurrence of some
index-worthy word often yields a useless index. Perhaps there are
cases where such indexing is mandatory. But for an ordinary book the
requirement is not to index every occurrence of some word, but the 1, 2 or 3
most important places where the word occurs. Few people want to mess around
with "word, 1,2,6,8,12-16, 14, 18,19,22, 25-31,36" only to discover that
"word" is thoroughly explained on pages 14 and 26-28, and all the other references
merely mention "word" briefly.
I'm sure there is a simpler way to generate LyXText. And I need a method
to insert insets as well. LCursor.insertInset() crashed when I used it with
larger documents.
At least now we know that the English UserGuide.lyx contains 31640 words,
of which only 3501 are distinct. The most frequent are:
the 1893
to 842
a 819
One of the most infrequent ones is:
unlucky 1
Good, don't want too many of that one. :-)
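For anyone who wants to play with such statistics outside LyX, here is a minimal sketch in Python. It is not the code from the patch. The function name `word_frequencies` and the definition of a "word" (a run of letters, possibly with inner apostrophes) are assumptions made for illustration only.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its occurrence count."""
    # Assumption: a word is a run of letters, optionally with inner apostrophes.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(words)

sample = "The cat sat on the mat. The mat was flat."
freq = word_frequencies(sample)
total = sum(freq.values())   # number of words, counting repeats
distinct = len(freq)         # number of distinct words

# Most frequent words first, as in the list above.
for word, count in freq.most_common(3):
    print(word, count)
```

The total and the distinct count correspond to the two figures quoted for the UserGuide, and `most_common` gives the sorted list.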
(*)
The idea is that LyX provides a word list, removes any words found in a stoplist,
and collates the rest by stemming (see http://snowball.tartarus.org/).
Then the user can delete any words he/she doesn't want indexed.
Another LyX function takes the remaining words and inserts IndexInsets
after each occurrence in the original text.
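The stoplist and stemming steps could look roughly like the sketch below. Everything in it is an assumption for illustration: `STOPLIST` is a tiny made-up list, `crude_stem` is a hypothetical stand-in for a real Snowball stemmer (it only strips a few English suffixes), and the counts for "inset", "insets" and "cursor" are invented.

```python
from collections import defaultdict

# Illustrative stoplist only; a real one would be much longer.
STOPLIST = {"the", "to", "a", "of", "and", "is"}

def crude_stem(word):
    """Hypothetical stand-in for a Snowball stemmer: strip a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_candidates(frequencies):
    """Drop stop words, then group the remaining words under their stem."""
    groups = defaultdict(dict)
    for word, count in frequencies.items():
        if word not in STOPLIST:
            groups[crude_stem(word)][word] = count
    return dict(groups)

# "the" and "to" use the counts quoted above; the other counts are made up.
freq = {"the": 1893, "to": 842, "inset": 12, "insets": 7, "cursor": 5}
candidates = index_candidates(freq)
print(candidates)
```

The user would then prune `candidates` by hand before anything is inserted into the document.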
I also think a function which jumps from one IndexInset to the next with the same
key could be useful.
This one is very necessary with such an approach to indexing, because the author
will usually want to delete all non-important references to the word in question.
Example: I wrote a book about algorithms, and the word "algorithm" is used in every
chapter, almost on every other page. But there is only one index entry for
"algorithm", which points out the definition in the preface.
Perhaps the program shouldn't add the entries at all, just move from word
to word and ask whether to add an entry at that point?
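A rough sketch of that "ask at each occurrence" idea, in Python rather than against LyX's own classes. The function `choose_index_entries` and its `ask` callback are hypothetical. In LyX the callback would be a dialog and the offsets would be cursor positions.

```python
import re

def choose_index_entries(text, key, ask):
    """Visit every occurrence of `key`; keep the offsets the author approves.

    `ask` is a hypothetical callback (offset, context) -> bool that stands
    in for an interactive prompt.
    """
    chosen = []
    pattern = re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Show the author a little text on either side of the word.
        start = max(0, match.start() - 20)
        context = text[start : match.end() + 20]
        if ask(match.start(), context):
            chosen.append(match.start())
    return chosen

text = "An algorithm is a finite procedure. Every chapter uses an algorithm."
# Stand-in for the author's judgment: accept only the defining occurrence.
picked = choose_index_entries(text, "algorithm", lambda pos, ctx: "finite" in ctx)
print(picked)
```

With this approach nothing is added unless the author says yes, so there is no later cleanup of unimportant references.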
The ultimate goal would be to edit the index in LyX with all the bells and whistles
makeindex provides:
* subitems and subsubitems
* page ranges
You'll find that page ranges are partially supported already: an index entry that is repeated on several consecutive pages is automatically coalesced into a range. :-)
* special markup
Now that'd be something: the ability to use advanced indexing without having to type LaTeX, or watch out for special characters like "_" and so on.
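To illustrate the coalescing of consecutive pages mentioned under "page ranges", here is a small Python sketch. It is not makeindex's actual algorithm (which has its own rules for when a run becomes a range), and the names `coalesce_pages` and `format_ranges` are made up for this example.

```python
def coalesce_pages(pages):
    """Collapse runs of consecutive page numbers into (first, last) pairs."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Page continues the current run: extend its end.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap in the numbering: start a new run.
            ranges.append((page, page))
    return ranges

def format_ranges(ranges):
    """Render single pages as "14" and runs as "26-28"."""
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

print(format_ranges(coalesce_pages([14, 26, 27, 28, 36])))  # prints "14, 26-28, 36"
```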
Helge Hafting