Re: eLyXer for Document Parsing

Rob Oakes Sat, 04 Feb 2012 13:01:38 -0800

Hi Steve,

> Not only possible but easy if you do things the Steve Litt way. eLyXer
> quickly punches out HTML that's clean enough to read with an XML
> parser, I think. So, eLyXer converts to HTML, and then your program's
> DOMbuilder module converts that HTML to in-memory DOM. No muss, no
> fuss, no bother, no picking apart eLyXer code (it's big and not
> immediately obvious, not a single weekend task).


Thanks for the recommendations. I'll need to look into this further. It's 
definitely the easiest way to go, and easy is usually the best. So says the Zen 
of Python (sort of):

If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

I was hoping for a slightly more direct route, though. That would allow me to 
maintain some of the internal data, such as cross-links. But, as I don't have 
months to implement, easy is always better than hard.
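
To make the easy route concrete, here is a minimal sketch of the "HTML in, DOM out" step. It assumes the eLyXer output is well-formed enough for an XML parser (as Steve suspects, but I haven't verified it), and it uses Python's standard xml.dom.minidom. The SAMPLE string and the headings() helper are made up for illustration; real eLyXer output will look different.

```python
from xml.dom import minidom

# Hypothetical stand-in for eLyXer output, not real eLyXer markup.
SAMPLE = """<html><body>
<h1>My Book</h1>
<h2>Introduction</h2>
<p>First paragraph.</p>
<h2>Methods</h2>
</body></html>"""


def text_of(node):
    """Concatenate all text beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def headings(xhtml):
    """Parse the whole document into memory and list headings in order."""
    dom = minidom.parseString(xhtml)
    found = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                if child.tagName in ("h1", "h2", "h3"):
                    found.append((child.tagName, text_of(child).strip()))
                walk(child)

    walk(dom.documentElement)
    return found
```

The whole tree sits in memory here, which is exactly the trade-off raised in the next question.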

> One more question: You sure you want to go in-memory? What happens if a
> guy has a 1200 page book with 100 chapters each containing 10 sections,
> each containing 10 subsections, and tries to parse it on a machine with
> 512 MB RAM?

I pity this poor man's decision to convert the whole mess to Word, rather than 
splitting it out into individual chapters.

But I appreciate the voice of reason, sanity and best practice. Short 
answer: no, I'm not convinced that I want to go in-memory. My first pass was 
just to become comfortable with eLyXer and see if it might meet my needs. I'm 
still trying to get comfortable with the structure of LyX documents and .docx 
documents. I've found a nice little Python library with support for basic docx 
features and was going to try to refine that into something slightly more usable.

> You in a heap of trouble son. He'll be swapped half way into the next
> century. If instead you used an event parser (e.g. SAX) with a few
> stacks, it will probably be slower, and it will be much more hard to
> write, but for practical purposes there won't be an upper limit on
> input file size.

Good points. The Python library makes use of lxml, which supports SAX. After 
I've got a better handle on my constraints, I'll spend the time required to 
design something more robust. 
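
For reference, a rough sketch of the event-parser-plus-stack idea. It uses the standard library's xml.sax rather than lxml, purely to keep the example self-contained; the OutlineHandler class, the HEADINGS set and the sample input are all hypothetical. The stack tracks the currently open elements, so only the headings are kept, never the whole document.

```python
import xml.sax

HEADINGS = {"h1", "h2", "h3"}  # assumed heading tags, adjust to the real markup


class OutlineHandler(xml.sax.ContentHandler):
    """Collect (depth, tag, text) for each heading without building a DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.outline = []   # collected headings
        self._buf = None    # text buffer, active only inside a heading

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name in HEADINGS:
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name in HEADINGS and self._buf is not None:
            text = "".join(self._buf).strip()
            self.outline.append((len(self.stack), name, text))
            self._buf = None
        self.stack.pop()


def outline_of(xml_bytes):
    handler = OutlineHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.outline


# Hypothetical input, not real eLyXer output.
SAX_SAMPLE = (b"<html><body><h1>My Book</h1>"
              b"<div><h2>Intro</h2><p>text</p></div></body></html>")
```

For a large file, xml.sax.parse() with a file name or file object would stream the input instead of taking it all as one bytes string.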

Cheers,

Rob
