Re: HTML Parser which allows low-keyed local changes (upon serialization)

Robert Mon, 01 Feb 2010 05:42:46 -0800

Stefan Behnel wrote:

Robert, 31.01.2010 20:57:

I tried lxml, but after walking and making changes in the element tree,
I'm forced to do a full serialization of the whole document
(etree.tostring(tree)) - which destroys the "human edited" format of the
original HTML code and makes it rather unreadable.


What do you mean? Could you give an example? lxml certainly does not
destroy anything it parsed, unless you tell it to do so.


Of course it does not destroy anything during parsing(?).

I mean: I want to walk through the parsed HTML tree with a Python script and modify things here and there (automatic alt attributes from a DB or similar, link corrections, text sections/translated sentences ... as a result of HTML code and content checks).

Then I want to output the changed tree, but as close to the original format as possible. No changes to my whitespace, indentation, etc. Only local changes, where tags were really changed.

That's similar to what a good HTML editor does: after you have made small changes, it doesn't reformat/re-spit-out your whole code layout from the tree/attribute logic alone. You get local changes only. But a simple HTML editor like the one in Mozilla SeaMonkey outputs a whole new HTML document, produces the HTML from the logical tree only (in its own (ugly) style), and destroys my whitespace layout and much more, forgetting everything about the original layout.

Such a "good HTML editor" must somehow track the original positions of the tags in the file. And with each logical change in the tree it must track the resulting file position changes/offsets. That seems to be missing in lxml and BeautifulSoup, which are what I have tried so far.
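
A minimal sketch of that position-tracking idea, under several assumptions: it uses the html.parser module from the Python 3 standard library rather than lxml or BeautifulSoup, it handles only one kind of change (adding a missing alt attribute to img tags), and the names ImgAltPatcher and add_missing_alt are made up for illustration. The parser is used only to find where each start tag sits in the original text; the changes are then spliced into the original string, so everything outside the touched tags stays byte-for-byte as written.

```python
from html import escape
from html.parser import HTMLParser


class ImgAltPatcher(HTMLParser):
    """Record where an alt attribute should be inserted (hypothetical helper)."""

    def __init__(self, source, default_alt):
        super().__init__()
        self.source = source
        self.default_alt = default_alt
        self.inserts = []  # list of (offset into source, text to insert)
        # Offset of the first character of every line, so that the
        # (line, column) pair from getpos() can be turned into an index.
        # Only "\n" is counted, which matches how HTMLParser counts lines.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)

    def handle_starttag(self, tag, attrs):
        if tag != "img" or any(name == "alt" for name, _ in attrs):
            return
        line, col = self.getpos()        # line is 1-based, col is 0-based
        start = self.line_starts[line - 1] + col
        raw = self.get_starttag_text()   # the tag exactly as written
        end = start + len(raw)
        # Insert just before the closing ">" or "/>", skipping back over
        # any whitespace so the original spacing is not doubled.
        pos = end - 2 if raw.endswith("/>") else end - 1
        while self.source[pos - 1].isspace():
            pos -= 1
        text = ' alt="%s"' % escape(self.default_alt, quote=True)
        self.inserts.append((pos, text))


def add_missing_alt(source, default_alt=""):
    """Return source with alt added to img tags that lack it."""
    patcher = ImgAltPatcher(source, default_alt)
    patcher.feed(source)
    patcher.close()
    result = source
    # Apply the edits from the end of the file backwards, so that the
    # offsets of the earlier edits are still valid.
    for pos, text in sorted(patcher.inserts, reverse=True):
        result = result[:pos] + text + result[pos:]
    return result


page = '<p>\n   <img src="a.png">\n\t<img  src="b.png"   alt="ok" />\n</p>\n'
print(add_missing_alt(page, "photo"))
```

Applying the edits back to front is the simplest way to avoid recomputing offsets after each change; an editor that applies changes in arbitrary order would have to keep a running offset table instead. Unusual markup (for example unquoted attribute values ending in a slash) is not handled by this sketch.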


This is a need I have frequently. Does nobody else?

It seems I need to write my own parser or patch BeautifulSoup to do that extra tracking?


Robert
--
http://mail.python.org/mailman/listinfo/python-list
