On Wed, 28 Nov 2012, samir pendharkar wrote:
> 1) When a header/footer is extracted as text, it also includes what looks like formatting information/metadata. For example, "? DATE \@ "MM/dd/yy" ?09/16/12?" appears in the extracted text, while the actual document shows only "09/16/12" in the footer.
Looks like some date formatting logic is needed in the parser. Can you open a JIRA for that, attach a simple sample file showing it, and ideally a patch if you can?
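
As a rough illustration of the kind of logic meant here (a sketch only, not the Tika parser's actual code): in the binary Word format a field is bracketed by three control characters, 0x13 (field begin), 0x14 (separator between the field instruction and its displayed result) and 0x15 (field end). Assuming the "?" characters in the extracted text above are those control characters, a post-processing step could keep only the displayed result. The function name strip_field_codes is hypothetical.

```python
import re

# Control characters that bracket a field in the binary Word format.
FIELD_BEGIN = "\x13"      # starts the field instruction, e.g. DATE \@ "MM/dd/yy"
FIELD_SEPARATOR = "\x14"  # ends the instruction, starts the displayed result
FIELD_END = "\x15"        # ends the field

# Matches one innermost field: instruction, optional separator + result, end mark.
_FIELD = re.compile(
    FIELD_BEGIN
    + r"[^\x13\x14\x15]*"
    + "(?:" + FIELD_SEPARATOR + r"([^\x13\x14\x15]*))?"
    + FIELD_END
)


def strip_field_codes(text):
    """Replace each field with its displayed result (hypothetical helper).

    Fields without a result are removed entirely. The loop repeats so that
    nested fields are resolved from the inside out.
    """
    previous = None
    while previous != text:
        previous = text
        text = _FIELD.sub(lambda m: m.group(1) or "", text)
    return text


raw = '\x13 DATE \\@ "MM/dd/yy" \x1409/16/12\x15'
print(strip_field_codes(raw))
```

A real fix would belong in the parser itself, where the field boundaries are known exactly, rather than in a pass over the extracted text.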
> 2) Usually, headers/footers are wrapped in a <div class="header"> or <div class="footer"> tag as appropriate (and hence can be suppressed if required). For some documents, though, no such wrapper was observed: the header and footer text came out as normal <p> tags, which makes it impossible to suppress. How can this be fixed?
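
For context, the suppression described above, in the case where the wrapper is present, might look like the following minimal sketch. It assumes the XHTML output is available as a well-formed string; the function name drop_header_footer and the sample markup are made up for illustration. It also shows why the unwrapped case cannot be handled this way: a header or footer emitted as a plain <p> has nothing to match on.

```python
import xml.etree.ElementTree as ET


def drop_header_footer(xhtml, classes=("header", "footer")):
    """Remove <div class="header"> / <div class="footer"> elements.

    Hypothetical helper: works on any well-formed XHTML string, with or
    without the XHTML namespace. Text that directly follows a removed
    div (its "tail") is dropped along with it.
    """
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so walk parents and inspect children.
    for parent in list(root.iter()):
        for child in list(parent):
            local_name = child.tag.rsplit("}", 1)[-1]  # strip "{namespace}" prefix
            if local_name == "div" and child.get("class") in classes:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<html><body>"
    '<div class="header"><p>Quarterly report</p></div>'
    "<p>Body text</p>"
    "<p>09/16/12</p>"  # footer text emitted as a plain paragraph: not removable
    "</body></html>"
)
print(drop_header_footer(sample))
```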
Is the text really tagged as header/footer in the underlying file?

Nick
