On Jan 1, 5:38 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Tue, 01 Jan 2008 04:21:29 -0800, Shriphani wrote:
> > On Jan 1, 4:28 pm, Piet van Oostrum <[EMAIL PROTECTED]> wrote:
> >> >>>>> Shriphani <[EMAIL PROTECTED]> (S) wrote:
> >> >S> I tried pyPdf for this and decided to get the pagelinks. The trouble
> >> >S> is that I don't know how to determine whether a particular page is the
> >> >S> first page of a chapter. Can someone tell me how to do this ?
> >
> >> AFAIK PDF doesn't have the concept of "Chapter". If the document has an
> >> outline, you could try to use the first level of that hierarchy as the
> >> chapter starting points. But you don't have a guarantee that they really
> >> are chapters.
> >
> > How would a pdf to html conversion work ? I've seen Google's search
> > engine do it loads of times. Just that running a 500odd page ebook
> > through one of those scripts might not be such a good idea.
>
> Heuristics? Neither PDF nor HTML know "chapters". So it might be
> guesswork or just in your head.
>
> Ciao,
> Marc 'BlackJack' Rintsch

I could parse the HTML and check for the words "unit" or "chapter" at the
beginning of each page. I am using pdftohtml on Debian, and it seems to
generate the HTML versions of PDFs quite quickly. I have yet to run a
500-page PDF through it, though.

Regards,
Shriphani
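
P.S. A rough sketch of the check I have in mind. It assumes the text of each page has already been pulled out of the converted HTML into a plain string, one string per page; that extraction step is not shown. The names is_chapter_start and chapter_start_pages are made up for illustration, and the rule itself (first non-blank line begins with "chapter" or "unit") is only a heuristic, so it will miss chapters titled differently and may flag pages that merely open with one of those words.

```python
import re

# Heuristic only: a page "starts a chapter" if its first non-blank line
# begins with the whole word "chapter" or "unit" (case-insensitive).
CHAPTER_START = re.compile(r"^\s*(chapter|unit)\b", re.IGNORECASE)


def is_chapter_start(page_text):
    """Return True if the first non-blank line starts with 'chapter' or 'unit'."""
    for line in page_text.splitlines():
        if line.strip():
            return bool(CHAPTER_START.match(line))
    return False  # page has no text at all


def chapter_start_pages(pages):
    """Return the 1-based numbers of pages that look like chapter starts.

    `pages` is assumed to be a list of plain-text strings, one per page.
    """
    return [n for n, text in enumerate(pages, start=1) if is_chapter_start(text)]
```

The word boundary in the pattern keeps words such as "Unitarian" from matching, and only the first non-blank line is inspected, so a mention of "chapter" further down a page is ignored.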