>
> 1) When header/footer gets extracted as text, it also includes what seems
> like formatting information/metadata. Example -
> "? DATE \@ "MM/dd/yy" ?09/16/12?" extracted in the text
> Actual document only shows "09/16/12" in the footer
>

> Looks like some date formatting logic is needed in the parser. Can you
> open a JIRA for that, attach a simple sample file showing it, and ideally a
> patch if you can?

Actually, the problem is not limited to date formats. I also see
"? PAGE ?5?" in the extracted text, whereas the actual document shows only
the page number "5" in the footer. I think the problem is in how header/footer
template fields (page number, total pages, time, date, etc.) are handled.
Can you provide some pointers on how this could be fixed? I will try to
create a patch.
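
To illustrate what I mean, here is a minimal sketch (in Python, for
illustration only, not parser code) that post-processes the extracted text and
keeps only the field result. It assumes the residue always looks like
"? INSTRUCTION ?RESULT?", as in the two examples above. The function name and
the list of field names are my own assumptions, and a real fix would belong in
the parser's field handling rather than in a regex over its output.

```python
import re

# Field instructions to recognise. This list is an assumption based on the
# fields mentioned above (date, time, page number, total pages).
FIELD_NAMES = ("DATE", "TIME", "PAGE", "NUMPAGES")

# Matches "? <FIELD> <optional switches> ?<result>?" and captures the result.
FIELD_RESIDUE = re.compile(
    r"\?\s*(?:" + "|".join(FIELD_NAMES) + r")\b[^?]*\?([^?]*)\?"
)

def strip_field_residue(text):
    """Replace each field residue with just the field's displayed result."""
    return FIELD_RESIDUE.sub(r"\1", text)

print(strip_field_residue('? DATE \\@ "MM/dd/yy" ?09/16/12?'))  # prints 09/16/12
print(strip_field_residue("? PAGE ?5?"))                        # prints 5
```

This would also mangle any genuine text that happens to match the same
pattern, so it is only a stopgap.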


> 2) Usually, header/footer are included in <div class="header"> tag or <div
> class="footer"> tag as appropriate (and hence can be suppressed if
> required). But for some documents, no such header/footer enclosing was
> observed. Header footer were included as normal <p> tags, which makes it
> impossible to suppress it. How to fix this?
>

> Is the text really tagged as header/footer in the underlying file?

When opened in LibreOffice or MS Office, the text is shown as a header/footer.
When I open the RTF file in a raw text editor, I see:

\par }}{\footerf \pard\plain
\s19\qj\widctlpar\tqc\tx4680\tqr\tx9360\adjustright \cgrid {\fs16 ::a\\b\\c
\par }\pard \s19\qc\widctlpar\tqc\tx4680\tqr\tx9360\adjustright {

I assume that this does denote a footer. I can see similar headerf and
footer structures too.
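
For checking other files, here is a rough sketch (again Python, illustration
only) that scans raw RTF for header/footer destination groups and returns each
brace-balanced group. The control words are the header/footer destinations
from the RTF specification. The function name is hypothetical, and the brace
matching is simplified: it skips escaped characters but does not handle \bin
data.

```python
import re

# Header/footer destination control words defined by the RTF specification.
HF_WORDS = ("headerl", "headerr", "headerf", "header",
            "footerl", "footerr", "footerf", "footer")

# A destination group starts with "{\word" followed by a non-letter.
HF_START = re.compile(r"\{\\(" + "|".join(HF_WORDS) + r")(?![a-z])")

def find_header_footer_groups(rtf):
    """Return (control_word, group_text) for each header/footer group."""
    found = []
    for m in HF_START.finditer(rtf):
        depth = 0
        i = m.start()
        while i < len(rtf):
            ch = rtf[i]
            if ch == "\\":
                i += 2  # skip the escaped character, e.g. \\, \{ or \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:  # the group opened at m.start() is closed
                    found.append((m.group(1), rtf[m.start():i + 1]))
                    break
            i += 1
    return found
```

If this reports a footerf group for a file whose footer text still comes out
as plain <p> tags, that would confirm the text is tagged in the file and lost
during parsing.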

Thanks for the quick reply.
