Nrrrrrrrrrrrrdz. Was going to do the same but had no PDFs.

On May 2, 2:23 am, Seth Vidal <skvi...@gmail.com> wrote:
> On Mon, May 2, 2011 at 2:22 AM, Seth Vidal <skvi...@gmail.com> wrote:
> > I was reading the knothole today and Grant was talking about an index
> > to the Rivendell Readers. I got most of the Readers on pdf as a
> > solstice present a couple of years back.
>
> > So I was noodling around a bit and here's what I did:
>
> > 1. split all of the pdfs out into per-page output
> > 2. converted all the per-page pdfs to text files.
> > 3. wrote a python script to do some relatively naive word indexing
> > 4. enhanced the naivete a bit to avoid really common words and pretty
> > much anything that appears more than 500 times.
> > 5. dumped all of this to a series of text files.
>
> > Limits of its use:
> >  a. it's word-separated not 'phrase' so 'sam' is separate from 'hillborne'
> >  b. the first 10-20 RR on pdf appear to be ocr'd in. So the text is
> > occasionally garbled which results in 'odd' things.
> >  c. a lot of 'grantisms' in use - so when he says 'pillar' and means
> > 'hunqapillar', well - that's under 'p' not under 'h'
> >  d. if you look for 'rivendell' or 'bike' you're not going to find it
> > b/c, well, that seemed silly to include for fairly obvious reasons, I
> > hope. :)
>
> > If anyone has 36-40 in a pdf I can run this across them too.
>
> > It's not a proper index, of course, but it is a heck of a start for
> > anyone who wants to refine it down.
>
> > Neat facts:
> >  the word 'atlantis' first appears in RR18, pg 0011.
>
> >  romulus appears 25 times in total.
>
> >  something like 'rambouillet' appears in a variety of interesting
> > spellings throughout.
>
> might be nice if I sent a link to the results huh?
>
> http://sethdot.org/~skvidal/misc/RR-index/
>
> -sv
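
For anyone who wants to try something similar, steps 3 to 5 in Seth's list could be sketched roughly like this. This is not his script, which isn't shown in the thread; it is a minimal guess at the approach. The stop list, the function names, the (issue, page) keys and the output format are all assumptions made for illustration. Only the 500-occurrence cutoff and the skipping of 'rivendell' and 'bike' come from the post.

```python
import re
from collections import defaultdict

# Hypothetical stop list: the post only says "really common words",
# plus 'rivendell' and 'bike', were left out.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "rivendell", "bike"}
MAX_OCCURRENCES = 500  # cutoff mentioned in the post

WORD_RE = re.compile(r"[a-z']+")


def build_index(pages, stopwords=STOPWORDS, max_occurrences=MAX_OCCURRENCES):
    """Build a naive word index.

    pages maps (issue, page_number) -> text of that page.
    Returns {word: sorted list of (issue, page_number)}.
    """
    locations = defaultdict(set)
    counts = defaultdict(int)
    for key, text in pages.items():
        for word in WORD_RE.findall(text.lower()):
            word = word.strip("'")  # "'pillar" becomes "pillar"
            if not word or word in stopwords:
                continue
            counts[word] += 1
            locations[word].add(key)
    # Drop anything that appears more than max_occurrences times in total.
    return {
        word: sorted(locs)
        for word, locs in locations.items()
        if counts[word] <= max_occurrences
    }


def format_entry(word, locs):
    """Render one index line, e.g. 'atlantis: RR18 pg 0011'."""
    refs = ", ".join(f"RR{issue} pg {page:04d}" for issue, page in locs)
    return f"{word}: {refs}"


# Made-up sample pages, not real Reader text.
sample_pages = {
    (18, 11): "The Atlantis is a bike.",
    (19, 3): "Atlantis and Romulus",
}
index = build_index(sample_pages)
print(format_entry("atlantis", index["atlantis"]))
# prints: atlantis: RR18 pg 0011, RR19 pg 0003
```

Because the locations are sorted, the first entry for a word is its earliest appearance, which is how a "first time 'atlantis' appears" fact would fall out. Like the original, this is word-by-word only, so 'sam' and 'hillborne' still end up as separate entries.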

-- 
You received this message because you are subscribed to the Google Groups "RBW 
Owners Bunch" group.
To post to this group, send email to rbw-owners-bunch@googlegroups.com.
To unsubscribe from this group, send email to 
rbw-owners-bunch+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/rbw-owners-bunch?hl=en.
