On Mon, May 2, 2011 at 2:22 AM, Seth Vidal <skvi...@gmail.com> wrote:
> I was reading the knothole today and Grant was talking about an index
> to the Rivendell Readers. I've got most of the readers on pdf as a
> solstice present a couple of years back.
>
> So I was noodling around a bit and here's what I did:
>
> 1. split all of the pdfs out into per-page output
> 2. converted all the per-page pdfs to text files.
> 3. wrote a python script to do some relatively naive word indexing
> 4. enhanced the naivete a bit to avoid really common words and pretty
> much anything that appears more than 500 times.
> 5. dumped all of this to a series of text files.
>
>
> Limits of its use:
> a. it's word-separated not 'phrase' so 'sam' is separate from 'hillborne'
> b. the first 10-20 RR on pdf appear to be ocr'd in. So the text is
> occasionally garbled which results in 'odd' things.
> c. a lot of 'grantisms' in use - so when he says 'pillar' and means
> 'hunqapillar' well - that's under 'p' not under 'h'
> d. if you look for 'rivendell' or 'bike' you're not going to find it
> b/c, well, that seemed silly to include for fairly obvious reasons, I
> hope. :)
>
>
> If anyone has 36-40 in a pdf I can run this across them too.
>
> It's not a proper index, of course, but it is a heck of a start for
> anyone who wants to refine it down.
>
> Neat facts:
> the first time the word 'atlantis' appears (RR18 - pg 0011).
>
> romulus appears 25 times in total.
>
> that something like 'rambouillet' appears in a variety of interesting
> spellings throughout.
>
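
For anyone who wants to try something similar, steps 3 and 4 above boil down to roughly the sketch below. This is not the actual script, just a guess at the general shape. The function name `build_index`, the stop-word list, and the (issue, page) keys are all made up for illustration; the only number taken from the post is the 500-hit cutoff.

```python
import re
from collections import defaultdict

# Hypothetical stop list; the real script's list is not shown in the post.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "rivendell", "bike"}
MAX_HITS = 500  # from the post: skip anything that appears more than 500 times

WORD_RE = re.compile(r"[a-z]+")


def build_index(pages, stop_words=STOP_WORDS, max_hits=MAX_HITS):
    """Build a naive word index.

    pages: dict mapping (issue, page_number) -> text of that page.
    Returns a dict mapping word -> sorted list of (issue, page_number).
    """
    locations = defaultdict(set)  # word -> pages it appears on
    counts = defaultdict(int)     # word -> total occurrences across all pages

    for key, text in pages.items():
        # Lowercase and split on anything that is not a letter, so
        # 'Sam Hillborne' becomes two separate entries: 'sam', 'hillborne'.
        for word in WORD_RE.findall(text.lower()):
            if word in stop_words:
                continue
            counts[word] += 1
            locations[word].add(key)

    # Drop words that are too common to be useful in an index.
    return {
        word: sorted(locs)
        for word, locs in locations.items()
        if counts[word] <= max_hits
    }


# Tiny made-up sample standing in for the per-page text files.
sample_pages = {
    ("RR18", 11): "The Atlantis is a touring bike.",
    ("RR20", 3): "Romulus and Atlantis in the same issue.",
}
sample_index = build_index(sample_pages)
print(sample_index["atlantis"])
```

Because it works one word at a time, it has the same limits listed above: no phrases, and OCR garbage or odd spellings end up as their own entries.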
might be nice if I sent a link to the results huh?

http://sethdot.org/~skvidal/misc/RR-index/

-sv