Hi,

I think there is a bug in the RTF parser. When parsing RTF (generated by 
dragging an Outlook MSG file to the desktop and read in using POI's 
MAPIMessage.getRtfBody()), it seems I get an endElement(qName='title') AFTER 
endElement(qName='head'). It should come before the head element is closed.

Note: I'm using XmlBeans Sax2Dom to process, as my goal is conversion of RTF to 
HTML.

Example Code:
        String rtf = ...; // bunch of RTF from the Outlook MSG
        Sax2Dom sax2dom = new Sax2Dom();
        Metadata metadata = new Metadata();
        // XHTMLContentHandler forwards its SAX events to sax2dom
        ContentHandler handler = new XHTMLContentHandler(sax2dom, metadata);
        new RTFParser().parse(new StringInputStream(rtf), handler, metadata,
                new ParseContext());
        Node html = sax2dom.getDOM();

Resulting html is malformed, for example:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>
            <title></title>
        </title>
        <body></body> <!-- BAD FORMATTING! Should be </head><body>!! -->
        <p>Message body text</p>
        <p>&nbsp;</p>
        <p>.etc.</p>
    </head> <!-- BAD -- should be </body> -->
</html>

I can 'fix' this issue by creating a wrapper class that ignores all 
startElement / endElement calls with qName='title', but that's not a real 
solution :D
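
For reference, the wrapper workaround could look roughly like the sketch below. It is only an illustration, not a fix for the parser: the class name TitleSkippingHandler is made up, and it relies on nothing beyond the standard org.xml.sax API (XMLFilterImpl forwards every event it does not override to the wrapped handler).

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical wrapper: forwards every SAX event except start/end of title. */
class TitleSkippingHandler extends XMLFilterImpl {

    TitleSkippingHandler(ContentHandler delegate) {
        // XMLFilterImpl passes all events it receives on to this handler.
        setContentHandler(delegate);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("title".equals(qName)) {
            return; // drop the start of <title>
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("title".equals(qName)) {
            return; // drop the (misplaced) end of <title>
        }
        super.endElement(uri, localName, qName);
    }
}
```

In the example code above, the wrapper would presumably sit between the two handlers, i.e. new XHTMLContentHandler(new TitleSkippingHandler(sax2dom), metadata). Character events are passed through untouched, so any title text would end up directly inside the head element.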

Another issue is that embedded <img> tags are not emitted from the RTF, such 
as this one:
        {\*\htmltag84 <img width=142 height=59 id="Picture_x0020_1" 
src="cid:[email protected]" alt="Entertainment Information">}

I could upload an example Java class, but I'm not sure whether attachments 
are allowed on this mailing list.

Thanks!

David Van Camp  | Software Engineer III | 40 Media Drive, Queensbury, NY  12804
Toll Free: 800.833-9581  Ext  2145 | Web: TribuneMediaServices.com | Email: 
[email protected] 
 
