Nrrrrrrrrrrrrdz. Was going to do the same but had no PDFs.

On May 2, 2:23 am, Seth Vidal <skvi...@gmail.com> wrote:
> On Mon, May 2, 2011 at 2:22 AM, Seth Vidal <skvi...@gmail.com> wrote:
> > I was reading the knothole today and Grant was talking about an index
> > to the Rivendell Readers. I've got most of the readers on pdf as a
> > solstice present a couple of years back.
> >
> > So I was noodling around a bit and here's what I did:
> >
> > 1. split all of the pdfs out into per-page output
> > 2. converted all the per-page pdfs to text files.
> > 3. wrote a python script to do some relatively naive word indexing
> > 4. enhanced the naivete a bit to avoid really common words and pretty
> > much anything that appears more than 500 times.
> > 5. dumped all of this to a series of text files.
> >
> > Limits of its use:
> > a. it's word-separated not 'phrase' so 'sam' is separate from 'hillborne'
> > b. the first 10-20 RR on pdf appear to be ocr'd in. So the text is
> > occasionally garbled which results in 'odd' things.
> > c. a lot of 'grantisms' in use - so when he says 'pillar and means
> > 'hunqapillar' well - that's under 'p' not under 'h'
> > d. if you look for 'rivendell' or 'bike' you're not going to find it
> > b/c, well, that seemed silly to include for fairly obvious reasons, I
> > hope. :)
> >
> > If anyone has 36-40 in a pdf I can run this across them too.
> >
> > It's not a proper index, of course, but it is a heck of a start for
> > anyone who wants to refine it down.
> >
> > Neat facts:
> > the first time the word 'atlantis' appears ( RR18 - pg 0011).
> >
> > romulus appears 25 times in total.
> >
> > that something like 'rambouillet' appears in a variety of interesting
> > spellings through out.
>
> might be nice if I sent a link to the results huh?
>
> http://sethdot.org/~skvidal/misc/RR-index/
>
> -sv
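
For anyone curious what steps 3 and 4 of the quoted post might look like, here is a rough sketch of a naive word index. It is not Seth's actual script, which was not posted in this message. The page labels, the `build_index` name, and the stop-word list are all made up for illustration; only the 500-occurrence cut-off and the exclusion of 'rivendell' and 'bike' come from the post. It assumes the per-page text has already been extracted (steps 1 and 2).

```python
import re
from collections import defaultdict

# Hypothetical stop list: the post only says "really common words",
# plus 'rivendell' and 'bike', so the rest of this set is a guess.
STOP_WORDS = {"the", "and", "of", "to", "in", "is", "it", "rivendell", "bike"}

# Cut-off taken from the post: drop anything appearing more than 500 times.
MAX_OCCURRENCES = 500

# Lowercase letters and apostrophes only, so 'sam' and 'hillborne'
# end up as separate entries, as the post describes.
WORD_RE = re.compile(r"[a-z']+")


def build_index(pages, stop_words=STOP_WORDS, max_occurrences=MAX_OCCURRENCES):
    """Map each word to the page labels it appears on.

    pages: iterable of (label, text) pairs, e.g. ("RR18 - pg 0011", "...").
    Labels are kept in first-seen order, one entry per page.
    """
    counts = defaultdict(int)
    locations = defaultdict(list)
    for label, text in pages:
        for word in WORD_RE.findall(text.lower()):
            word = word.strip("'")
            if len(word) < 2 or word in stop_words:
                continue
            counts[word] += 1
            # Record each page once per word, even if the word repeats on it.
            if not locations[word] or locations[word][-1] != label:
                locations[word].append(label)
    # Drop words that are too frequent to be useful index entries.
    return {w: locs for w, locs in locations.items()
            if counts[w] <= max_occurrences}


# Tiny made-up sample; real input would be the per-page text files.
sample = [
    ("RR18 - pg 0011", "The Atlantis is a bike."),
    ("RR18 - pg 0012", "Atlantis and Romulus, atlantis again."),
]
for word, labels in sorted(build_index(sample).items()):
    print(word, "->", ", ".join(labels))
```

Because it splits on single words, this has the same limits listed in the post: no phrases, and OCR garbling or nicknames land under whatever spelling appears on the page.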