On Thu, Aug 25, 2011 at 15:45, Bob Proulx <b...@proulx.com> wrote:
> RiverWind wrote:
>> The idea was to concat a large html file and then convert it to
>> text. The pdf can be converted to text, and it so far seems like a
>> pretty viable translation.
>
> If I were going to do that for myself I would convert each individual
> html file to text first and then concatenate the individual text
> files. The reason being that the individual html files are at that
> moment completely consistent. Individually they should be able to
> convert to text cleanly with no problems. And then the text can be
> concatenated. But once you concatenate the html then you have created
> a Frankenstein html file that is almost certainly going to be
> problematic to convert to text.
>
> Also, my naive experience with this is that converting html to text is
> a lot easier than converting pdf to text. With html it is already a
> text type. The mime type is "text/html" after all. But pdf has been
> less accessible for conversions for me. The mime type is
> "application/pdf" and isn't a text type. That introduces more room
> for error to be introduced.
>
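The convert-first, concatenate-second order Bob describes can be sketched roughly as below. This is only an illustration in Python using the standard library's html.parser; the names TextExtractor, html_to_text and convert_then_concatenate are made up for the example, and the list of tags treated as line breaks is an assumption, not something taken from the actual pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed set of tags that should start a new line in the text output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Convert one complete html document to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def convert_then_concatenate(pages):
    """Convert each html document on its own, then join the text results."""
    return "\n\n".join(html_to_text(page) for page in pages)
```

Each page is parsed while it is still a complete, self-consistent document, so no parser ever sees a second <html> or <body> in the middle of its input. A real converter (lynx -dump, html2text and the like) will do a better job with tables and links; the point here is only the order of the two steps.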
Yes, converting html to text is easier than converting pdf to text. In its native format pdf is nice, but once you get into extracting stuff it's a pain: pdf is not text. You can break its elements into a DOM-like structure, but html's DOM and pdf's "DOM" aren't the same. A pdf element carries an absolute x/y position where it is to be displayed, and the element can be binary data (e.g. a picture).

That said, I don't think there will be any accessibility issues with that pdf, and it might even convert cleanly (one has a lot to do with the other). So I would just go with the pdf and be done with it.

However, if you are hell-bent on converting it to something, I would use a format that keeps some of the formatting; LaTeX or POD come to mind. Maybe consider this:

http://cpan.uwinnipeg.ca/htdocs/Pod-HTML2Pod/Pod/HTML2Pod.html

The LaTeX route looks pretty simple too (though I have minimal experience with TeX):

http://www.iwriteiam.nl/html2tex.html

As for parsing those html files to figure out the chapters, I'd personally use perl: search each file for its chapter and section, build up a hash of that info and the file that contains it, sort, and go from there.

It does not seem that there is an easy way to go from pdf -> latex (as I suspected).
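A rough sketch of the chapter/section lookup, written in Python rather than perl (a dict plays the role of the hash). The heading patterns "Chapter N" and "Section N.M" are assumptions about what the pages contain, and build_index and reading_order are names invented for the example; adjust the regexes to the real headings.

```python
import re

# Assumed heading patterns; the real pages may word these differently.
CHAPTER_RE = re.compile(r"Chapter\s+(\d+)", re.IGNORECASE)
SECTION_RE = re.compile(r"Section\s+\d+\.(\d+)", re.IGNORECASE)


def build_index(files):
    """Map (chapter, section) -> file name.

    `files` is a dict of file name -> html source. A file with a chapter
    heading but no section heading is filed under section 0, so it sorts
    ahead of that chapter's sections.
    """
    index = {}
    for name, html in files.items():
        chapter = CHAPTER_RE.search(html)
        if chapter is None:
            continue  # no chapter heading found; leave it out of the ordering
        section = SECTION_RE.search(html)
        key = (int(chapter.group(1)), int(section.group(1)) if section else 0)
        index[key] = name
    return index


def reading_order(files):
    """Return the file names sorted by chapter, then section."""
    index = build_index(files)
    return [index[key] for key in sorted(index)]
```

The numbers are converted to integers before sorting so that chapter 10 lands after chapter 2 instead of before it, which is what a plain string sort would do. In practice `files` would be filled by reading the html files from disk.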