On Mon, 16 Apr 2012, babug wrote:
I have attached an Outlook template file (Ticket_Diary.oft). I need to
parse files of this type and get the actual HTML content. I have tested
with the following code, but the parser returns <p> tags instead of <table>
or <Div> tags. How do I exclude from the SAFE_ELEMENTS map?
It might not be stored as html - Outlook often stores "html" content of
emails as RTF.
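
As a rough illustration of that point: when Outlook wraps HTML inside RTF, the RTF header normally carries the \fromhtml control word, so a quick check on the RTF text can tell you which case you are in. The sketch below is plain Java with no library calls; the class name, the method name and the 512-character header window are all made up for this example.

```java
final class RtfHtmlCheck {
    /**
     * Returns true if the given RTF text appears to wrap HTML.
     * Encapsulated HTML is marked by the \fromhtml control word in the header.
     */
    static boolean looksLikeEncapsulatedHtml(String rtf) {
        if (rtf == null || !rtf.startsWith("{\\rtf")) {
            return false;
        }
        // Only look near the start, since the marker belongs to the RTF header.
        // The 512-character window is an arbitrary choice for this sketch.
        String header = rtf.substring(0, Math.min(rtf.length(), 512));
        return header.contains("\\fromhtml");
    }

    public static void main(String[] args) {
        String wrapped = "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0{\\fonttbl}}";
        String plain = "{\\rtf1\\ansi\\deff0 Hello}";
        System.out.println(looksLikeEncapsulatedHtml(wrapped)); // true
        System.out.println(looksLikeEncapsulatedHtml(plain));   // false
    }
}
```

If the check comes back true, the original HTML still has to be unwrapped from the RTF; this only tells you where to look.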
Also...
String msgfile = "/home/test/Desktop/EmailParse/Ticket Diary.oft";
InputStream stream = new FileInputStream(msgfile);
StringWriter sw = new StringWriter();
Parser parser = new OfficeParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
This looks like Tika code. If you want to do this with Tika, you
should probably ask on the Tika list. Alternatively, you can use HSMF from
Apache POI to access the file directly and get at exactly the parts of it
you need. I'd suggest looking at the HSMF text extractor in POI and
OutlookExtractor in Apache Tika as good examples of how to go about
using HSMF.
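
A minimal sketch of the direct HSMF route, assuming a reasonably recent POI release with the poi-scratchpad jar on the classpath, and assuming the .oft opens the same way a .msg does. It asks MAPIMessage for the HTML, RTF and plain-text bodies and reports which one is actually present. The class name OftBodyDump and the describe helper are invented for this example, and the file path is taken from the command line.

```java
import java.io.IOException;
import org.apache.poi.hsmf.MAPIMessage;
import org.apache.poi.hsmf.exceptions.ChunkNotFoundException;

final class OftBodyDump {
    /** Names the richest body that is present (hypothetical helper). */
    static String describe(String html, String rtf, String text) {
        if (html != null && !html.isEmpty()) return "HTML";
        if (rtf != null && !rtf.isEmpty()) return "RTF";
        if (text != null && !text.isEmpty()) return "TEXT";
        return "NONE";
    }

    public static void main(String[] args) throws IOException, ChunkNotFoundException {
        String path = args[0]; // path to the .oft or .msg file
        try (MAPIMessage msg = new MAPIMessage(path)) {
            // Return null for a missing body instead of throwing.
            msg.setReturnNullOnMissingChunk(true);

            String html = msg.getHtmlBody();
            String rtf = msg.getRtfBody();   // decompressed RTF, if stored
            String text = msg.getTextBody();

            String kind = describe(html, rtf, text);
            System.out.println("Body stored as: " + kind);
            if (kind.equals("HTML")) {
                System.out.println(html);
            } else if (kind.equals("RTF")) {
                System.out.println(rtf);
            } else if (kind.equals("TEXT")) {
                System.out.println(text);
            }
        }
    }
}
```

If this reports RTF rather than HTML, the <table> and <div> markup was never stored as HTML in the file, which would explain why no mapper setting brings it back.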
Nick